Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-28 Thread Merlin Moncure
 
 It certainly makes quite a difference as I measure it:
 
 doing select count(1) from a 181000 page table (completely uncached) on my
PIII:
 
 8.0 : 32 s
 8.1 : 25 s
 
 Note that the 'fastcount()' function takes 21 s in both cases - so all
 the improvement seems to be from the count overhead reduction.

Are you running windows?  There is a big performance improvement in
count(*) from pg 8.0 to 8.1 on win32 that is not relevant to this debate...

Merlin



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-28 Thread Mark Kirkwood

Merlin Moncure wrote:

It certainly makes quite a difference as I measure it:

doing select count(1) from a 181000 page table (completely uncached) on my


PIII:


8.0 : 32 s
8.1 : 25 s

Note that the 'fastcount()' function takes 21 s in both cases - so all
the improvement seems to be from the count overhead reduction.



Are you running windows?  There is a big performance improvement in
count(*) from pg 8.0 to 8.1 on win32 that is not relevant to this debate...



No - FreeBSD 6.0 on a dual PIII 1GHz. The slow CPU means that the 8.1 
improvements are very noticeable!


A point of interest - applying Neil's palloc-avoiding changes to 
nodeAgg.c and int8.c in 8.0 changes those results to:


8.0 + palloc avoiding patch : 27 s

(I am guessing the remaining 2 s could be shaved off if I backported 
8.1's virtual tuples - however that looked like a lot of work)


Cheers

Mark



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-28 Thread Brendan Duddridge
Forgive my ignorance, but what is MPP? Is that part of Bizgres? Is it  
possible to upgrade from Postgres 8.1 to Bizgres?


Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Nov 28, 2005, at 3:05 PM, Luke Lonergan wrote:


Mark,

On 11/28/05 1:45 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:


8.0 : 32 s
8.1 : 25 s


A 22% reduction.

select count(1) on 12,900MB = 1617125 pages fully cached:

MPP based on 8.0 : 6.06s
MPP based on 8.1 : 4.45s

A 26% reduction.

I'll take it!

I am looking to back-port Tom's pre-8.2 changes and test again, maybe
tonight.

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-28 Thread David Boreham

Brendan Duddridge wrote:

Thanks for your reply. So how is that different than something like  
Slony2 or pgcluster with multi-master replication? Is it similar  
technology? We're currently looking for a good clustering solution  
that will work on our Apple Xserves and Xserve RAIDs.


I think you need to be more specific about what you're trying to do.
'clustering' encompasses so many things that it means almost nothing by 
itself.


slony provides facilities for replicating data. Its primary purpose is
to improve reliability. MPP distributes both data and queries. Its
primary purpose is to improve performance for a subset of all query types.





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-28 Thread David Lang

On Mon, 28 Nov 2005, Brendan Duddridge wrote:


Hi David,

Thanks for your reply. So how is that different than something like Slony2 or 
pgcluster with multi-master replication? Is it similar technology? We're 
currently looking for a good clustering solution that will work on our Apple 
Xserves and Xserve RAIDs.


MPP doesn't just split up the data, it splits up the processing as well, 
so if you have a 5-machine cluster, each machine holds 1/5 of your data 
(plus a backup for one of the other machines), and when you do a query MPP 
slices and dices the query to send a subset of the query to each machine, 
then gets the responses from all the machines and combines them.


If you have to do a full table scan, for example, each machine would only 
have to go through 20% of the data.


A Slony or pgcluster setup has each machine with a full copy of all the 
data; only one machine can work on a given query at a time, and if you 
have to do a full table scan one machine needs to read 100% of the data.


In many ways this is the holy grail of databases. Almost all other areas 
of computing can now be scaled by throwing more machines at the problem in 
a cluster, with each machine just working on its piece of the problem, 
but databases have had serious trouble doing the same and so have been 
ruled by the 'big monster machine'. Oracle has been selling Oracle RAC for 
a few years, and reports from people who have used it range drastically 
(from 'it works great' to 'it's a total disaster'), in part depending on 
the types of queries that have been made.


Greenplum thinks that they have licked the problems for the more general 
case (and that commodity networks are now fast enough to match disk speeds 
in processing the data). If they are right, then when they hit full release 
with the new version they should be cracking a lot of the 
price/performance records on the big database benchmarks (TPC and 
similar), and if their pricing is reasonable, they may be breaking them by 
an order of magnitude or more (it's not unusual for the top machines to 
spend more than $1,000,000 on just their disk arrays for those 
systems; MPP could conceivably put together a cluster of $5K machines 
that runs rings around them, and probably will for at least some of the 
subtests - the big question is if they can sweep the board and take the top 
spots outright).


they have more details (and marketing stuff) on their site at 
http://www.greenplum.com/prod_deepgreen_cluster.html


Don't get me wrong, I am very impressed with their stuff, but (having 
ranted a little here on the list about them) I think MPP and its 
performance is a bit off topic for the postgres performance list (at least 
until the postgres project itself starts implementing similar features :-)


David Lang


Thanks,


Brendan Duddridge | CTO | 403-277-5591 x24 |  [EMAIL PROTECTED]

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB  T2G 0V9

http://www.clickspace.com

On Nov 27, 2005, at 8:09 PM, David Lang wrote:


On Mon, 28 Nov 2005, Brendan Duddridge wrote:

Forgive my ignorance, but what is MPP? Is that part of Bizgres? Is it 
possible to upgrade from Postgres 8.1 to Bizgres?


MPP is the Greenplum proprietary extension to postgres that spreads the 
data over multiple machines (RAID, but with entire machines, not just 
drives, complete with data replication within the cluster to survive a 
machine failing).


For some types of queries they can definitely scale linearly with the 
number of machines (other queries are far more difficult and the overhead 
of coordinating the machines shows more; this is one of the key things that 
the new version they recently announced the beta for is supposed to be 
drastically improving).


Early in the year when I first looked at them their prices were exorbitant, 
but Luke says I'm wildly mistaken on their current prices, so call them for 
details.


It uses the same interfaces as postgres, so it should be a drop-in 
replacement, letting you swap a single server for a cluster.


It's fascinating technology to read about.

I seem to remember reading that one of the other postgres companies is also 
producing a clustered version of postgres, but I don't remember who and 
know nothing about them.


David Lang








Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-27 Thread Luke Lonergan
Have you factored in how long it takes to build an index on 5TB?  And the index 
size?

Really, it's a whole different world at multi-TB, everything has to scale.

Btw, we don't just scan in parallel, we do it all in parallel - check the sort 
number on this thread.  MPP is for the god box too.

And your price is wrong - but if you want free then you'll have to find another 
way to get your work done.

- Luke
--
Sent from my BlackBerry Wireless Device


-Original Message-
From: David Lang [EMAIL PROTECTED]
To: Luke Lonergan [EMAIL PROTECTED]
CC: pgsql-performance@postgresql.org pgsql-performance@postgresql.org
Sent: Sat Nov 26 14:34:14 2005
Subject: Re: [PERFORM] Hardware/OS recommendations for large databases (

On Sun, 27 Nov 2005, Luke Lonergan wrote:

 For data warehousing it's pretty well open and shut.  To use all CPUs and 
 io channels on each query you will need MPP.

 Has anyone done the math on the original post?  5TB takes how long to 
 scan once?  If you want to wait less than a couple of days just for a 
 seq scan, you'd better be in the multi-GB per second range.

If you truly need to scan the entire database then you are right; however, 
indexes should be able to cut the amount you need to scan drastically.

David Lang




Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-27 Thread Stephan Szabo
On Sun, 27 Nov 2005, Luke Lonergan wrote:

 Has anyone done the math on the original post?  5TB takes how long to
 scan once?  If you want to wait less than a couple of days just for a
 seq scan, you'd better be in the multi-GB per second range.

Err, I get about 31 megabytes/second to do 5TB in 170,000 seconds. I think
perhaps you were exaggerating a bit or adding additional overhead not
obvious from the above. ;)

---

At 1 gigabyte per second, 1 terabyte should take about 1000 seconds
(between 16 and 17 minutes).  The impressive 3.2 gigabytes per second
listed before (if it actually scans consistently at that rate) puts it at
a little over 5 minutes I believe for 1TB, so about 26 minutes for 5 terabytes.
The 200 megabyte per second number puts it at about 7 hours for 5
terabytes AFAICS.
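
For reference, the same back-of-the-envelope arithmetic (decimal units, 5TB =
5x10^12 bytes) can be reproduced in a psql session; the figures below are just
the calculations from this paragraph, not new measurements:

SELECT 5e12 / 170000 / 1e6 AS mbps_to_finish_in_170000s, -- ~29 MB/s (the ~31 above uses binary TB)
       5e12 / 1e9          AS secs_at_1gbps,             -- 5,000 s  (~83 min for 5TB)
       5e12 / 3.2e9        AS secs_at_3_2gbps,           -- ~1,563 s (~26 min)
       5e12 / 200e6        AS secs_at_200mbps;           -- 25,000 s (~7 h)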



Re: [PERFORM] Hardware/OS recommendations for large databases

2005-11-27 Thread Ron

At 01:18 AM 11/27/2005, Luke Lonergan wrote:
For data warehousing it's pretty well open and shut.  To use all CPUs 
and io channels on each query you will need MPP.


Has anyone done the math on the original post?  5TB takes how long 
to scan once?  If you want to wait less than a couple of days just 
for a seq scan, you'd better be in the multi-GB per second range.

More than a bit of hyperbole there Luke.

Some common RW scenarios:

Dual 1GbE NICs = 200MB/s = 5TB in 5x10^12 / 2x10^8 = 25,000 secs = 
~6 hrs 57 mins.  Network stuff like re-transmits of dropped packets can 
increase this, so network SLAs are critical.


Dual 10GbE NICs = ~1.6GB/s (10GbE NICs can't yet do over ~800MB/s 
apiece) = 5x10^12 / 1.6x10^9 = 3,125 secs = ~52 mins.  SLAs are even 
more critical here.


If you are pushing 5TB around on a regular basis, you are not wasting 
your time and money on commodity (<= 300MB/s) RAID HW.  You'll be using 
800MB/s and 1600MB/s high end stuff, which means you'll need ~1-2 hrs 
to sequentially scan 5TB on physical media.


Clever use of RAM can get a 5TB sequential scan down to ~17mins.

Yes, it's a lot of data.  But sequential scan times should be in the 
mins or low single digit hours, not days.  Particularly if you use 
RAM to maximum advantage.
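
A quick cross-check of these figures (again decimal units, and again only the
arithmetic above rather than anything measured):

SELECT 5e12 / 2e8   / 3600    AS hours_dual_1gbe,    -- ~6.9 h
       5e12 / 1.6e9 / 60      AS mins_dual_10gbe,    -- ~52 min
       5e12 / 8e8   / 3600    AS hours_at_800mbps,   -- ~1.7 h
       5e12 / 1.6e9 / 3600    AS hours_at_1600mbps,  -- ~0.9 h
       5e12 / (17 * 60) / 1e9 AS gbps_for_17min;     -- ~4.9 GB/s

So the ~17 min figure implies an effective scan rate of roughly 5GB/s.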


Ron





Re: [PERFORM] Hardware/OS recommendations for large databases

2005-11-27 Thread Luke Lonergan
Ron,

On 11/27/05 9:10 AM, Ron [EMAIL PROTECTED] wrote:

 Clever use of RAM can get a 5TB sequential scan down to ~17mins.
 
 Yes, it's a lot of data.  But sequential scan times should be in the
 mins or low single digit hours, not days.  Particularly if you use
 RAM to maximum advantage.

Unfortunately, RAM doesn't help with scanning from disk at all.

WRT using network interfaces to help - it's interesting, but I think what
you'd want to connect to is other machines with storage on them.

- Luke 





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-27 Thread Luke Lonergan
Stephan,

On 11/27/05 7:48 AM, Stephan Szabo [EMAIL PROTECTED] wrote:

 On Sun, 27 Nov 2005, Luke Lonergan wrote:
 
 Has anyone done the math on the original post?  5TB takes how long to
 scan once?  If you want to wait less than a couple of days just for a
 seq scan, you'd better be in the multi-GB per second range.
 
 Err, I get about 31 megabytes/second to do 5TB in 170,000 seconds. I think
 perhaps you were exaggerating a bit or adding additional overhead not
 obvious from the above. ;)

Thanks - the calculator on my blackberry was broken ;-)
 
 At 1 gigabyte per second, 1 terabyte should take about 1000 seconds
 (between 16 and 17 minutes).  The impressive 3.2 gigabytes per second
 listed before (if it actually scans consistently at that rate) puts it at
 a little over 5 minutes I believe for 1TB, so about 26 minutes for 5 terabytes.
 The 200 megabyte per second number puts it at about 7 hours for 5
 terabytes AFAICS.

7 hours, days, same thing ;-)

On the reality of sustained scan rates like that:

We're getting 2.5GB/s sustained on a 2 year old machine with 16 hosts and 96
disks.  We run them in RAID0, which is only OK because MPP has built-in host
to host mirroring for fault management.

We just purchased a 4-way cluster with 8 drives each using the 3Ware 9550SX.
Our thought was to try the simplest approach first, which is a single RAID5,
which gets us 7 drives worth of capacity and performance.  As I posted
earlier, we get about 400MB/s seq scan rate on the RAID, but the Postgres
8.0 current scan rate limit is 64% of 400MB/s or 256MB/s per host.  The 8.1
mods (thanks Qingqing and Tom!) may increase that significantly toward the
400 max - we've already merged the 8.1 codebase into MPP so we'll also
feature the same enhancements.

Our next approach is to run these machines in a split RAID0 configuration,
or RAID0 on 4 and 4 drives.  We then run an MPP segment instance bound to
each CPU and I/O channel.  At that point, we'll have all 8 drives of
performance and capacity per host and we should get 333MB/s with current MPP
and perhaps over 400MB/s with MPP/8.1.  That would get us up to the 3.2GB/s
for 8 hosts.

Even better, all operators are executed on all CPUs for each query, so
sorting, hashing, agg, etc etc are run on all CPUs in the cluster.

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases

2005-11-27 Thread Ron

At 02:11 PM 11/27/2005, Luke Lonergan wrote:

Ron,

On 11/27/05 9:10 AM, Ron [EMAIL PROTECTED] wrote:

 Clever use of RAM can get a 5TB sequential scan down to ~17mins.

 Yes, it's a lot of data.  But sequential scan times should be in the
 mins or low single digit hours, not days.  Particularly if you use
 RAM to maximum advantage.

Unfortunately, RAM doesn't help with scanning from disk at all.

I agree with you if you are scanning a table cold, having never 
loaded it before, or if the system is not (or can't be) set up 
properly with appropriate buffers.


However, outside of those 2 cases there are often tricks you can use 
with enough RAM (and no, you don't need RAM equal to the size of the 
item(s) being scanned) to substantially speed things up.  Best case, 
you can get performance approximately equal to that of a RAM resident scan.




WRT using network interfaces to help - it's interesting, but I think what
you'd want to connect to is other machines with storage on them.

Maybe.  Or maybe you want to concentrate your storage in a farm that 
is connected by network or Fibre Channel to the rest of your 
HW.  That's what a NAS or SAN is, after all.


The rest of your HW nowadays is often a cluster of RAM rich 
hosts.  Assuming 64GB per host, 5TB can be split across ~79 hosts if 
you want to make it all RAM resident.


Most don't have that kind of budget, but thankfully it is not usually 
necessary to make all of the data RAM resident in order to obtain most, 
if not all, of the performance benefits you'd get if all of the data were.


Ron





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-27 Thread Stephan Szabo

On Sun, 27 Nov 2005, Luke Lonergan wrote:

 Stephan,

 On 11/27/05 7:48 AM, Stephan Szabo [EMAIL PROTECTED] wrote:

  On Sun, 27 Nov 2005, Luke Lonergan wrote:
 
  Has anyone done the math on the original post?  5TB takes how long to
  scan once?  If you want to wait less than a couple of days just for a
  seq scan, you'd better be in the multi-GB per second range.
 
  Err, I get about 31 megabytes/second to do 5TB in 170,000 seconds. I think
  perhaps you were exaggerating a bit or adding additional overhead not
  obvious from the above. ;)

 Thanks - the calculator on my blackberry was broken ;-)

Well, it was suspiciously close to a factor of 60 off, which when working
in time could have just been a simple math error.

  At 1 gigabyte per second, 1 terabyte should take about 1000 seconds
  (between 16 and 17 minutes).  The impressive 3.2 gigabytes per second
  listed before (if it actually scans consistently at that rate) puts it at
  a little over 5 minutes I believe for 1TB, so about 26 minutes for 5 terabytes.
  The 200 megabyte per second number puts it at about 7 hours for 5
  terabytes AFAICS.

 7 hours, days, same thing ;-)

 On the reality of sustained scan rates like that:

Well, the reason I asked was that IIRC the 3.2 used earlier in the
discussion was exactly multiplying scanners and base rate (ie, no
additional overhead).  I couldn't tell if that was back of the envelope or
if the overhead was in fact negligible.  (Or I could be misremembering the
conversation).  I don't doubt that it's possible to get the rate, just
wasn't sure if the rate was actually applicable to the ongoing discussion
of the comparison.



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-26 Thread David Lang

Another thought - I priced out a maxed out machine with 16 cores and
128GB of RAM and 1.5TB of usable disk - $71,000.

You could instead buy 8 machines that total 16 cores, 128GB RAM and 28TB
of disk for $48,000, and it would be 16 times faster in scan rate, which
is the most important factor for large databases.  The size would be 16
rack units instead of 5, and you'd have to add a GigE switch for $1500.

Scan rate for above SMP: 200MB/s

Scan rate for above cluster: 3,200MB/s

You could even go dual core and double the memory on the cluster and
you'd about match the price of the god box.

- Luke


Luke, I assume you are talking about using the Greenplum MPP for this 
(otherwise I don't know how you are combining all the different systems).


If you are, then you are overlooking one very significant factor: the cost 
of the MPP software.  At $10K/cpu the cluster has an extra $160K in software 
costs, which is double the hardware costs.


If money is no object then go for it, but if it is, then your comparison 
would be (ignoring software maintenance costs) the 16-core, 128GB RAM system 
vs ~3 small systems totaling 6 cores and 48GB RAM.


Yes, if scan speed is the bottleneck you still win with the small systems, 
but for most other uses the large system would win easily.  And in any case 
it's not the open-and-shut case that you keep presenting it as.


David Lang



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-26 Thread Luke Lonergan
For data warehousing it's pretty well open and shut.  To use all CPUs and io 
channels on each query you will need MPP.

Has anyone done the math on the original post?  5TB takes how long to scan 
once?  If you want to wait less than a couple of days just for a seq scan, 
you'd better be in the multi-GB per second range.

- Luke
--
Sent from my BlackBerry Wireless Device


-Original Message-
From: [EMAIL PROTECTED] [EMAIL PROTECTED]
To: pgsql-performance@postgresql.org pgsql-performance@postgresql.org
Sent: Sat Nov 26 13:51:18 2005
Subject: Re: [PERFORM] Hardware/OS recommendations for large databases (

Another thought - I priced out a maxed out machine with 16 cores and
128GB of RAM and 1.5TB of usable disk - $71,000.

You could instead buy 8 machines that total 16 cores, 128GB RAM and 28TB
of disk for $48,000, and it would be 16 times faster in scan rate, which
is the most important factor for large databases.  The size would be 16
rack units instead of 5, and you'd have to add a GigE switch for $1500.

Scan rate for above SMP: 200MB/s

Scan rate for above cluster: 3,200MB/s

You could even go dual core and double the memory on the cluster and
you'd about match the price of the god box.

- Luke

Luke, I assume you are talking about using the Greenplum MPP for this 
(otherwise I don't know how you are combining all the different systems).

If you are, then you are overlooking one very significant factor: the cost 
of the MPP software.  At $10K/cpu the cluster has an extra $160K in software 
costs, which is double the hardware costs.

If money is no object then go for it, but if it is, then your comparison 
would be (ignoring software maintenance costs) the 16-core, 128GB RAM system 
vs ~3 small systems totaling 6 cores and 48GB RAM.

Yes, if scan speed is the bottleneck you still win with the small systems, 
but for most other uses the large system would win easily.  And in any case 
it's not the open-and-shut case that you keep presenting it as.

David Lang





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Luke Lonergan
Mark,

See the results below and analysis - the pure HeapScan gets 94.1% of the max
available read bandwidth (cool!).  Nothing wrong with heapscan in the
presence of large readahead, which is good news.

That says it's something else in the path.  As you probably know there is a
page lock taken, a copy of the tuple from the page, lock removed, count
incremented for every iteration of the agg node on a count(*).  Is the same
true of a count(1)?

I recall that the profile is full of memcpy and memory context calls.

It would be nice to put some tracers into the executor and see where the
time is going.  I'm also curious about the impact of the new 8.1 virtual
tuples in reducing the executor overhead.  In this case my bet's on the agg
node itself, what do you think?

- Luke

On 11/21/05 9:10 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:

 Luke Lonergan wrote:
 
 So that leaves the question - why not more than 64% of the I/O scan rate?
 And why is it a flat 64% as the I/O subsystem increases in speed from
 333-400MB/s?
 
 
 It might be interesting to see what effect reducing the cpu consumption
   entailed by the count aggregation has - by (say) writing a little bit
 of code to heap scan the desired relation (sample attached).

OK - here are results for a slightly smaller (still bigger than RAM)
lineitem on the same machine, using the same xfs filesystem that achieved
407MB/s:


12.9GB of DBT-3 data from the lineitem table

llonergan=# select relpages from pg_class where relname='lineitem';
 relpages 
--
  1579270
(1 row)

1579270*8192/1000000
12937 Million Bytes or 12.9GB

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 197870.105 ms
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 49912.164 ms
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 49218.739 ms

llonergan=# select fastcount('lineitem');
 fastcount 
---
  59986052
(1 row)

Time: 33752.778 ms
llonergan=# select fastcount('lineitem');
 fastcount 
---
  59986052
(1 row)

Time: 34543.646 ms
llonergan=# select fastcount('lineitem');
 fastcount 
---
  59986052
(1 row)

Time: 34528.053 ms


Analysis:

            Bandwidth   Percent of max
dd Read     407MB/s     100%
Count(1)    263MB/s     64.6%
HeapScan    383MB/s     94.1%

Wow - looks like the HeapScan gets almost all of the available bandwidth!

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Mark Kirkwood

Luke Lonergan wrote:



12.9GB of DBT-3 data from the lineitem table

llonergan=# select relpages from pg_class where relname='lineitem';
 relpages 
--

  1579270
(1 row)

1579270*8192/1000000
12937 Million Bytes or 12.9GB

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
  count   
--

 59986052
(1 row)

Time: 197870.105 ms


So 198 seconds is the uncached read time with count (Just for clarity, 
did you clear the Pg and filesystem caches or unmount / remount the 
filesystem?)



llonergan=# select count(1) from lineitem;
  count   
--

 59986052
(1 row)

Time: 49912.164 ms
llonergan=# select count(1) from lineitem;
  count   
--

 59986052
(1 row)

Time: 49218.739 ms



and ~50 seconds is the (partially) cached read time with count


llonergan=# select fastcount('lineitem');
 fastcount 
---

  59986052
(1 row)

Time: 33752.778 ms
llonergan=# select fastcount('lineitem');
 fastcount 
---

  59986052
(1 row)

Time: 34543.646 ms
llonergan=# select fastcount('lineitem');
 fastcount 
---

  59986052
(1 row)

Time: 34528.053 ms



so ~34 seconds is the (partially) cached read time for fastcount -
I calculate this to give ~362MB/s effective IO rate (I'm doing / by 
1024*1024 not 1000*1000) FWIW.


While this is interesting, you probably want to stop Pg, unmount the 
filesystem, and restart Pg to get the uncached time for fastcount too 
(and how does this compare to uncached read with dd using the same block 
size?).


But at this stage it certainly looks like the heapscan code is pretty 
efficient - great!


Oh - and do you want to try out 32K block size? I'm interested to see 
what level of improvement you get (as my system is hopelessly cpu bound...)!




Analysis:

            Bandwidth   Percent of max
dd Read     407MB/s     100%
Count(1)    263MB/s     64.6%
HeapScan    383MB/s     94.1%



Cheers

Mark



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Mark Kirkwood

Luke Lonergan wrote:

Mark,


It would be nice to put some tracers into the executor and see where the
time is going.  I'm also curious about the impact of the new 8.1 virtual
tuples in reducing the executor overhead.  In this case my bet's on the agg
node itself, what do you think?



Yeah - it's pretty clear that the count aggregate is fairly expensive 
wrt cpu - However, I am not sure if all agg nodes suffer this way (guess 
we could try a trivial aggregate that does nothing for all tuples bar 
the last and just reports the final value it sees).


Cheers

Mark




Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Luke Lonergan
Mark,

 Time: 197870.105 ms
 
 So 198 seconds is the uncached read time with count (Just for clarity,
 did you clear the Pg and filesystem caches or unmount / remount the
 filesystem?)

Nope - the longer time is due to the 'second write' known issue with
Postgres - it writes the data to the table, but all of the pages are marked
dirty, so on the first scan after loading they are always written again.
This is clear as you watch vmstat - the pattern on the first seq scan is
half read / half write.
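
As an aside (not something measured in this thread), that first-scan rewrite
can be paid at load time instead by vacuuming the freshly loaded table, so
the first real query is pure read; lineitem below is just the table from the
timings above:

VACUUM ANALYZE lineitem;        -- absorbs the page rewrites once, right after the load
SELECT count(1) FROM lineitem;  -- the first scan afterwards is then read-only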

 Time: 49218.739 ms
 
 
 and ~50 seconds is the (partially) cached read time with count

Again - the pattern here is pure read and completely non-cached.  You see a
very nearly constant I/O rate when watching vmstat for the entire scan.

 Time: 34528.053 ms

 so ~34 seconds is the (partially) cached read time for fastcount -
 I calculate this to give ~362Mb/s effective IO rate (I'm doing / by
 1024*1024 not 1000*1000) FWIW.

The dd number uses 1000*1000, so I maintained it for the percentage of max.
 
 While this is interesting, you probably want to stop Pg, unmount the
 filesystem, and restart Pg to get the uncached time for fastcount too
 (and how does this compare to uncached read with dd using the same block
 size?).

I'll do it again sometime, but I've already deleted the file.  I've done the
following in the past to validate this though:

- Reboot machine
- Rerun scan

And we get identical results.
  
 But at this stage it certainly looks the the heapscan code is pretty
 efficient - great!

Yep.
 
 Oh - and do you want to try out 32K block size, I'm interested to see
 what level of improvement you get (as my system is hopelessly cpu bound...)!

Yah - done so in the past and not seen any - was waiting for Alan to post
his results.
 
 
 Analysis:
 
             Bandwidth   Percent of max
 dd Read     407MB/s     100%
 Count(1)    263MB/s     64.6%
 HeapScan    383MB/s     94.1%

Note these are all in consistent 1000x1000 units.

Thanks for the test - neat trick!  We'll use it to do some more profiling
some time soon...

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Mark Kirkwood

Luke Lonergan wrote:


That says it's something else in the path.  As you probably know there is a
page lock taken, a copy of the tuple from the page, lock removed, count
incremented for every iteration of the agg node on a count(*).  Is the same
true of a count(1)?



Sorry Luke - message 3 - I seem to be suffering from a very small 
working memory buffer myself right now, I think it's after a day of 
working with DB2 ... :-)


Anyway, as I read src/backend/parser/gram.y:6542 - count(*) is 
transformed into count(1), so these two are identical.
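
A quick way to convince yourself of this from psql (using the same lineitem
table as elsewhere in the thread): if the rewrite really happens in the
grammar, the two statements should show essentially identical plans and
timings.

\timing
SELECT count(*) FROM lineitem;
SELECT count(1) FROM lineitem;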


Cheers (last time tonight, promise!)

Mark



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Mark Kirkwood

Luke Lonergan wrote:

Mark,



Time: 197870.105 ms


So 198 seconds is the uncached read time with count (Just for clarity,
did you clear the Pg and filesystem caches or unmount / remount the
filesystem?)



Nope - the longer time is due to the 'second write' known issue with
Postgres - it writes the data to the table, but all of the pages are marked
dirty, so on the first scan after loading they are always written again.
This is clear as you watch vmstat - the pattern on the first seq scan is
half read / half write.



Ah - indeed - first access after a COPY no? I should have thought of 
that, sorry!





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Greg Stark
Mark Kirkwood [EMAIL PROTECTED] writes:

 Yeah - it's pretty clear that the count aggregate is fairly expensive wrt cpu -
 However, I am not sure if all agg nodes suffer this way (guess we could try a
 trivial aggregate that does nothing for all tuples bar the last and just
 reports the final value it sees).

As you mention count(*) and count(1) are the same thing.

Last I heard the reason count(*) was so expensive was because its state
variable was a bigint. That means it doesn't fit in a Datum and has to be
alloced and stored as a pointer. And because of the Aggregate API that means
it has to be allocated and freed for every tuple processed.

There was some talk of having a special case API for count(*) and maybe
sum(...) to avoid having to do this.

There was also some talk of making Datum 8 bytes wide on platforms where that
was natural (I guess AMD64, Sparc64, Alpha, Itanic).

Afaik none of these items have happened but I don't know for sure.

-- 
greg




Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Last I heard the reason count(*) was so expensive was because its state
 variable was a bigint. That means it doesn't fit in a Datum and has to be
 alloced and stored as a pointer. And because of the Aggregate API that means
 it has to be allocated and freed for every tuple processed.

There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
Conway IIRC).

regards, tom lane



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  Last I heard the reason count(*) was so expensive was because its state
  variable was a bigint. That means it doesn't fit in a Datum and has to be
  alloced and stored as a pointer. And because of the Aggregate API that means
  it has to be allocated and freed for every tuple processed.
 
 There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
 Conway IIRC).

ah, cool, missed that.

-- 
greg




Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Luke Lonergan
The same 12.9GB distributed across 4 machines using Bizgres MPP fits into
I/O cache.  The interesting result is that the query select count(1) is
limited in speed to 280 MB/s per CPU when run on the lineitem table.  So
when I run it spread over 4 machines, one CPU per machine I get this:

==
Bizgres MPP, 4 data segments, 1 per 2 CPUs
==
llonergan=# explain select count(1) from lineitem;
  QUERY PLAN

--
 Aggregate  (cost=582452.00..582452.00 rows=1 width=0)
   ->  Gather Motion  (cost=582452.00..582452.00 rows=1 width=0)
         ->  Aggregate  (cost=582452.00..582452.00 rows=1 width=0)
               ->  Seq Scan on lineitem  (cost=0.00..544945.00 rows=15002800
width=0)
(4 rows)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 12191.435 ms
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 11986.109 ms
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 11448.941 ms
==

That's 12,937 MB in 11.45 seconds, or 1,130 MB/s.  When you divide out the
number of Postgres instances (4), that's 283MB/s per Postgres instance.

To verify that this has nothing to do with MPP, I ran it in a special
internal mode on one instance and got the same result.

So - we should be able to double this rate by running one segment per CPU,
or two per host:

==
Bizgres MPP, 8 data segments, 1 per CPU
==
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 6484.594 ms
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 6156.729 ms
llonergan=# select count(1) from lineitem;
  count   
--
 59986052
(1 row)

Time: 6063.416 ms
==
That's 12,937 MB in 6.06 seconds, or 2,134 MB/s.  When you divide out the
number of Postgres instances (8), that's 267MB/s per Postgres instance.

So, if you want to select count(1), using more CPUs is a good idea!  For
most complex queries, having lots of CPUs + MPP is a good combo.
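
For reference, the per-instance arithmetic (these SELECTs only reproduce the
numbers quoted above, they are not additional measurements):

SELECT 12937 / 11.45     AS mbps_total_4_segments,   -- ~1,130 MB/s
       12937 / 11.45 / 4 AS mbps_per_instance_of_4,  -- ~283 MB/s
       12937 / 6.06      AS mbps_total_8_segments,   -- ~2,134 MB/s
       12937 / 6.06 / 8  AS mbps_per_instance_of_8;  -- ~267 MB/s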

Here is an example of a sorting plan - this should probably be done with a
hash aggregation, but using 8 CPUs makes it go 8x faster:


- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-24 Thread Mark Kirkwood

Tom Lane wrote:

Greg Stark [EMAIL PROTECTED] writes:


Last I heard the reason count(*) was so expensive was because its state
variable was a bigint. That means it doesn't fit in a Datum and has to be
alloced and stored as a pointer. And because of the Aggregate API that means
it has to be allocated and freed for every tuple processed.



There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
Conway IIRC).



It certainly makes quite a difference as I measure it:

doing select count(1) from a 181000 page table (completely uncached) on my PIII:

8.0 : 32 s
8.1 : 25 s

Note that the 'fastcount()' function takes 21 s in both cases - so all 
the improvement seems to be from the count overhead reduction.


Cheers

Mark










Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-23 Thread Luke Lonergan
Bruce,

On 11/22/05 4:13 PM, Bruce Momjian pgman@candle.pha.pa.us wrote:

 Perfect summary.  We have a background writer now.  Ideally we would
 have a background reader, that reads-ahead blocks into the buffer cache.
 The problem is that while there is a relatively long time between a
 buffer being dirtied and the time it must be on disk (checkpoint time),
 the read-ahead time is much shorter, requiring some kind of quick
 create a thread approach that could easily bog us down as outlined
 above.

Yes, the question is how much read-ahead buffer is needed to equate to the
38% of I/O wait time in the current executor profile?

The idea of asynchronous buffering would seem appropriate if the executor
would use the 38% of time as useful work.

A background reader is an interesting approach - it would require admin
management of buffers where AIO would leave that in the kernel.  The
advantage over AIO would be more universal platform support I suppose?

 Right now the file system will do read-ahead for a heap scan (but not an
 index scan), but even then, there is time required to get that kernel
 block into the PostgreSQL shared buffers, backing up Luke's observation
 of heavy memcpy() usage.

As evidenced by the 16MB readahead setting still resulting in only 36% IO
wait.

 So what are our options?  mmap()?  I have no idea.  Seems larger page
 size does help.

Not sure about that, we used to run with 32KB page size and I didn't see a
benefit on seq scan at all.  I haven't seen tests in this thread that
compare 8K to 32K. 

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-23 Thread Luke Lonergan



Alan,

Why not contribute something - put up proof of your stated 8KB versus 32KB page size improvement.

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-23 Thread Alan Stange

Luke Lonergan wrote:
Why not contribute something - put up proof of your stated 8KB versus 
32KB page size improvement.


I did observe that 32KB block sizes were a significant win for our 
usage patterns.   It might be a win for any of the following reasons:


0) The preliminaries:   ~300GB database with about ~50GB daily 
turnover.   Our data is fairly reasonably grouped.  If we're getting one 
item on a page we're usually looking at the other items as well.


1) we can live with a smaller FSM size.  We were often leaking pages 
with a 10M page FSM setting.  With 32K pages, a 10M FSM size is 
sufficient.   Yes, the solution to this is to run vacuum more often, but 
when the vacuum was taking 10 hours at a time, that was hard to do.


2) The typical datum size in our largest table is about 2.8KB, which is 
more than 1/4 page size thus resulting in the use of a toast table.   
Switching to 32KB pages allows us to get a decent storage of this data 
into the main tables, thus avoiding another table and associated large 
index.   Not having the extra index in memory for a table with 90M rows 
is probably beneficial.


3) vacuum time has been substantially reduced.  Vacuum analyze now run 
in the 2 to 3 hour range depending on load.


4) less cpu time spent in the kernel.  We're basically doing 1/4 as many 
system calls.  

Overall the system has now been working well.  We used to see the 
database being a bottleneck at times, but now it's keeping up nicely.


Hope this helps.

Happy Thanksgiving!

-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-23 Thread Luke Lonergan
Alan,

On 11/23/05 2:00 PM, Alan Stange [EMAIL PROTECTED] wrote:

 Luke Lonergan wrote:
 Why not contribute something - put up proof of your stated 8KB versus
 32KB page size improvement.
 
 I did observe that 32KB block sizes were a significant win for our
 usage patterns.   It might be a win for any of the following reasons:
 (* big snip *)

Though all of what you relate is interesting, it seems irrelevant to your
earlier statement here:

 Alan Stange [EMAIL PROTECTED] writes:
 If your goal is sequential IO, then one must use larger block sizes.
 No one would use 8KB IO for achieving high sequential IO rates.   Simply
 put, read() is about the slowest way to get 8KB of data. Switching
 to 32KB blocks reduces all the system call overhead by a large margin.
 Larger blocks would be better still, up to the stripe size of your
 mirror.   (Of course, you're using a mirror and not raid5 if you care
 about performance.)

And I am interested in seeing if your statement is correct.  Do you have any
proof of this to share?

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-23 Thread Luke Lonergan



Mark,

This is an excellent idea - unfortunately I'm in Maui right now (Mahalo!) and I'm not getting to testing with this. My first try was with 8.0.3 and it's an 8.1 function I presume.

Not to be lazy - but any hint as to how to do the same thing for 8.0?

- Luke


On 11/21/05 9:10 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:

Luke Lonergan wrote:

 So that leaves the question - why not more than 64% of the I/O scan rate?
 And why is it a flat 64% as the I/O subsystem increases in speed from
 333-400MB/s?


It might be interesting to see what effect reducing the cpu consumption
entailed by the count aggregation has - by (say) writing a little bit
of code to heap scan the desired relation (sample attached).

Cheers

Mark





/*
 * fastcount.c
 *
 * Do a count that uses considerably less CPU time than an aggregate.
 */

#include "postgres.h"

#include "funcapi.h"
#include "access/heapam.h"
#include "catalog/namespace.h"
#include "utils/builtins.h"


extern Datum fastcount(PG_FUNCTION_ARGS);


PG_FUNCTION_INFO_V1(fastcount);
Datum
fastcount(PG_FUNCTION_ARGS)
{
    text        *relname = PG_GETARG_TEXT_P(0);
    RangeVar    *relrv;
    Relation     rel;
    HeapScanDesc scan;
    HeapTuple    tuple;
    int64        result = 0;

    /* Use the name to get a suitable range variable and open the relation. */
    relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
    rel = heap_openrv(relrv, AccessShareLock);

    /* Start a heap scan on the relation. */
    scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        result++;
    }

    /* End the scan and close up the relation. */
    heap_endscan(scan);
    heap_close(rel, AccessShareLock);

    PG_RETURN_INT64(result);
}







Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-23 Thread Mark Kirkwood

Luke Lonergan wrote:

Mark,

This is an excellent idea – unfortunately I’m in Maui right now 
(Mahalo!) and I’m not getting to testing with this.  My first try was 
with 8.0.3 and it’s an 8.1 function I presume.


Not to be lazy – but any hint as to how to do the same thing for 8.0?



Yeah, it's 8.1 - I didn't think to check against 8.0. The attached 
variant works with 8.0.4 (textToQualifiedNameList needs 2 args)


cheers

Mark

P.s. Maui eh, sounds real nice.
/*
 * fastcount.c
 *
 * Do a count that uses considerably less CPU time than an aggregate.
 *
 * (Variant for 8.0.x - textToQualifiedNameList needs 2 args)
 */

#include "postgres.h"

#include "funcapi.h"
#include "access/heapam.h"
#include "catalog/namespace.h"
#include "utils/builtins.h"


extern Datum fastcount(PG_FUNCTION_ARGS);


PG_FUNCTION_INFO_V1(fastcount);
Datum
fastcount(PG_FUNCTION_ARGS)
{
    text        *relname = PG_GETARG_TEXT_P(0);
    RangeVar    *relrv;
    Relation     rel;
    HeapScanDesc scan;
    HeapTuple    tuple;
    int64        result = 0;

    /*
     * Use the name to get a suitable range variable and open the relation.
     * The second argument (eaten by the archive) is just a caller tag used
     * in error messages; any short string will do.
     */
    relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname,
                                                             "fastcount"));
    rel = heap_openrv(relrv, AccessShareLock);

    /* Start a heap scan on the relation. */
    scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        result++;
    }

    /* End the scan and close up the relation. */
    heap_endscan(scan);
    heap_close(rel, AccessShareLock);

    PG_RETURN_INT64(result);
}
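
For completeness, a sketch of how one might register this on a test box - the
library name and installation path here are illustrative rather than taken
from the thread; the function itself is the fastcount() used in the timings
above:

-- assuming fastcount.c has been compiled into a shared library called
-- 'fastcount' and installed where the server can find it (e.g. $libdir):
CREATE FUNCTION fastcount(text) RETURNS bigint
    AS 'fastcount', 'fastcount'
    LANGUAGE C STRICT;

SELECT fastcount('lineitem');   -- as used in the earlier timings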



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-22 Thread Alan Stange

Luke,

- XFS will probably generate better data rates with larger files.   You 
really need to use the same file size as does postgresql.  Why compare 
the speed to reading a 16G file and the speed to reading a 1G file.   
They won't be the same.  If need be, write some code that does the test 
or modify lmdd to read a sequence of 1G files.   Will this make a 
difference?  You don't know until you do it.   Any time you cross a 
couple of 2^ powers in computing, you should expect some differences.


- you did umount the file system before reading the 16G file back in?   
Because if you didn't then your read numbers are possibly garbage.   
When the read began, 8G of the file was in memory.   You'd be very naive 
to think that somehow the read of the first 8GB somehow flushed that 
cached data out of memory.  After all, why would the kernel flush pages 
from file X when you're in the middle of a sequential read of...file 
X?   I'm not sure how Linux handles this, but Solaris would've found the 
8G still in memory.


- What was the hardware and disk configuration on which these numbers 
were generated?   For example, if you have a U320 controller, how did 
the read rate become larger than 320MB/s?


- how did the results change from before?   Just posting the new results 
is misleading given all the boasting we've had to read about your past 
results.


- there are two results below for writing to ext2:  one at 209 MB/s and 
one at 113MB/s.  Why are they different?


- what was the cpu usage during these tests?   We see postgresql doing 
200+MB/s of IO.   You've claimed many times that the machine would be 
compute bound at lower IO rates, so how much idle time does the cpu 
still have?


- You wrote: "We'll do a 16GB table size to ensure that we aren't 
reading from the read cache."  Do you really believe that??   You have 
to umount the file system before each test to ensure you're really 
measuring the disk IO rate.   If I'm reading your results correctly, it 
looks like you have three results for ext and xfs, each of which is 
faster than the prior one.  If I'm reading this correctly, then it looks 
like one is clearly reading from the read cache.


- Gee, it's so nice of you to drop your 120MB/s observation.  I guess my 
reading at 300MB/s wasn't convincing enough.  Yeah, I think it was the 
cpus too...


- I wouldn't focus on the flat 64% of the data rate number.  It'll 
probably be different on other systems.


I'm all for testing and testing.   It seems you still cut a corner 
without umounting the file system first.  Maybe I'm a little too old 
school on this, but I wouldn't spend a dime until you've done the 
measurements correctly. 

Good Luck. 


-- Alan



Luke Lonergan wrote:

Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling looks like 64% of the I/O subsystem speed is available to the
executor - so as the I/O subsystem increases in scan rate, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke
 
= Results ===


Unless noted otherwise all results posted are for block device readahead set
to 16M using blockdev --setra=16384 block_device.  All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:


[EMAIL PROTECTED] dbfast1]# time bash -c (dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)
800000+0 records in
800000+0 records out

real0m33.057s
user0m0.116s
sys 0m13.577s

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real0m33.032s
user0m0.087s
sys 0m13.129s


So lmdd with sync=1 is equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both write and read for this set of comparisons.

First, let's test ext2 versus ext3, data=ordered, versus xfs:


16GB write, then read

---
ext2:
---
[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

---
ext3, data=ordered:

Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-22 Thread Bruce Momjian
Greg Stark wrote:
 
 Alan Stange [EMAIL PROTECTED] writes:
 
  The point your making doesn't match my experience with *any* storage or 
  program
  I've ever used, including postgresql.   Your point suggests that the storage
  system is idle  and that postgresql is broken because it isn't able to use 
  the
  resources available...even when the cpu is very idle.  How can that make 
  sense?
 
 Well I think what he's saying is that Postgres is issuing a read, then waiting
 for the data to return. Then it does some processing, and goes back to issue
 another read. The CPU is idle half the time because Postgres isn't capable of
 doing any work while waiting for i/o, and the i/o system is idle half the time
 while the CPU intensive part happens.
 
 (Consider as a pathological example a program that reads 8k then sleeps for
 10ms, and loops doing that 1,000 times. Now consider the same program
 optimized to read 8M asynchronously and sleep for 10s. By the time it's
 finished sleeping it has probably read in all 8M. Whereas the program that
 read 8k in little chunks interleaved with small sleeps would probably take
 twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
 idle.)
 
 It's a reasonable theory and it's not inconsistent with the results you sent.
 But it's not exactly proven either. Nor is it clear how to improve matters.
 Adding additional threads to handle the i/o adds an enormous amount of
 complexity and creates lots of opportunity for other contention that could
 easily eat all of the gains.

Perfect summary.  We have a background writer now.  Ideally we would
have a background reader, that reads-ahead blocks into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
create a thread approach that could easily bog us down as outlined
above.

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options?  mmap()?  I have no idea.  Seems larger page
size does help.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-22 Thread Alan Stange

Bruce Momjian wrote:

Greg Stark wrote:
  

Alan Stange [EMAIL PROTECTED] writes:



The point your making doesn't match my experience with *any* storage or program
I've ever used, including postgresql.   Your point suggests that the storage
system is idle  and that postgresql is broken because it isn't able to use the
resources available...even when the cpu is very idle.  How can that make sense?
  

Well I think what he's saying is that Postgres is issuing a read, then waiting
for the data to return. Then it does some processing, and goes back to issue
another read. The CPU is idle half the time because Postgres isn't capable of
doing any work while waiting for i/o, and the i/o system is idle half the time
while the CPU intensive part happens.

(Consider as a pathological example a program that reads 8k then sleeps for
10ms, and loops doing that 1,000 times. Now consider the same program
optimized to read 8M asynchronously and sleep for 10s. By the time it's
finished sleeping it has probably read in all 8M. Whereas the program that
read 8k in little chunks interleaved with small sleeps would probably take
twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
idle.)

It's a reasonable theory and it's not inconsistent with the results you sent.
But it's not exactly proven either. Nor is it clear how to improve matters.
Adding additional threads to handle the i/o adds an enormous amount of
complexity and creates lots of opportunity for other contention that could
easily eat all of the gains.



Perfect summary.  We have a background writer now.  Ideally we would
have a background reader, that reads-ahead blocks into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
create a thread approach that could easily bog us down as outlined
above.

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options?  mmap()?  I have no idea.  Seems larger page
size does help.
For sequential scans, you do have a background reader.  It's the 
kernel.  As long as you don't issue a seek() between read() calls, the 
kernel will get the hint about sequential IO and begin to perform a read 
ahead for you.  This is where the above analysis isn't quite right:  
while postgresql is processing the returned data from the read() call, 
the kernel has also issued reads as part of the read ahead, keeping the 
device busy while the cpu is busy.  (I'm assuming these details for 
Linux; Solaris/UFS does work this way).  Issue one seek on the file and 
the read ahead algorithm will back off for a while.   This was my point 
about some descriptions of how the system works not being sensible.


If your goal is sequential IO, then one must use larger block sizes.   
No one would use 8KB IO for achieving high sequential IO rates.   Simply 
put, read() is about the slowest way to get 8KB of data. Switching 
to 32KB blocks reduces all the system call overhead by a large margin.  
Larger blocks would be better still, up to the stripe size of your 
mirror.   (Of course, you're using a mirror and not raid5 if you care 
about performance.)


I don't think the memcpy of data from the kernel to userspace is that 
big of an issue right now.  dd and all the high end network interfaces 
manage OK doing it, so I'd expect postgresql to do all right with it for 
now too.   Direct IO will avoid that memcpy, but then you also don't get 
any caching of the files in memory.  I'd be more concerned about any 
memcpy calls or general data management within postgresql.Does 
postgresql use the platform specific memcpy() in libc?  Some care might 
be needed to ensure that the memory blocks within postgresql are all 
properly aligned to make sure that one isn't ping-ponging cache lines 
around (usually done by padding the buffer sizes by an extra 32 bytes or 
L1 line size).   Whatever you do, all the usual high performance 
computing tricks should be used prior to considering any rewriting of 
major code sections.
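
For what it's worth, a minimal sketch of the padding/alignment idea being
described -- the 64-byte line size, buffer size and buffer count here are
assumptions for illustration only, not PostgreSQL's actual values:

#define _POSIX_C_SOURCE 200112L         /* for posix_memalign() */

#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE   64                 /* assumed L1/L2 cache line size */
#define BUF_SIZE     32768              /* one 32KB I/O buffer */
#define NUM_BUFFERS  8

/*
 * Per-buffer bookkeeping padded out to a whole cache line, so that updates
 * to one buffer's header never ping-pong the line holding its neighbour's
 * header between CPUs.
 */
typedef struct
{
    int     in_use;
    int     length;
    char    pad[CACHE_LINE - 2 * sizeof(int)];
} BufferHeader;

int main(void)
{
    void         *mem;
    BufferHeader *headers;
    int           i;

    /* Align the data area itself on a cache-line boundary. */
    if (posix_memalign(&mem, CACHE_LINE, (size_t) NUM_BUFFERS * BUF_SIZE) != 0)
    {
        perror("posix_memalign");
        return 1;
    }

    headers = calloc(NUM_BUFFERS, sizeof(BufferHeader));
    for (i = 0; i < NUM_BUFFERS; i++)
        headers[i].length = BUF_SIZE;

    printf("buffer area at %p, header size %zu bytes\n",
           mem, sizeof(BufferHeader));

    free(headers);
    free(mem);
    return 0;
}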


Personally, I'd like to see some detailed profiling being done using 
hardware counters for cpu cycles and cache misses, etc.   Given the poor 
quality of work that has been discussed here in this thread, I don't 
have much confidence in any other additional results at this time.   
None of the analysis would be acceptable in any environment in which 
I've worked.   Be sure to take a look at Sun's free Workshop tools as 
they are excellent for this sort of profiling and one doesn't need to 
recompile to use them.If I get a little time in the next week or two 
I might take a crack at this.


Cheers,

-- Alan


---(end of 

Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-22 Thread Greg Stark

Alan Stange [EMAIL PROTECTED] writes:

 For sequential scans, you do have a background reader.  It's the kernel.  As
 long as you don't issue a seek() between read() calls, the kernel will get the
 hint about sequential IO and begin to perform a read ahead for you.  This is
 where the above analysis isn't quite right:  while postgresql is processing 
 the
 returned data from the read() call, the kernel has also issued reads as part 
 of
 the read ahead, keeping the device busy while the cpu is busy.  (I'm assuming
 these details for Linux; Solaris/UFS does work this way).  Issue one seek on
 the file and the read ahead algorithm will back off for a while.   This was my
 point about some descriptions of how the system works not being sensible.

Well that's certainly the hope. But we don't know that this is actually as
effective as you assume it is. It's awfully hard in the kernel to make much
more than a vague educated guess about what kind of readahead would actually
help. 

This is especially true when a file isn't really being accessed in a
sequential fashion as Postgres may well do if, for example, multiple backends
are reading the same file. And as you pointed out it doesn't help at all for
random access index scans.

 If your goal is sequential IO, then one must use larger block sizes.   No one
 would use 8KB IO for achieving high sequential IO rates.   Simply put, read()
 is about the slowest way to get 8KB of data. Switching to 32KB blocks
 reduces all the system call overhead by a large margin.  Larger blocks would 
 be
 better still, up to the stripe size of your mirror.   (Of course, you're using
 a mirror and not raid5 if you care about performance.)

Switching to 32kB blocks throughout Postgres has pros but also major cons, not
the least is *extra* i/o for random access read patterns. One of the possible
advantages of the suggestions that were made, the ones you're shouting down,
would actually be the ability to use 32kB scatter/gather reads without
necessarily switching block sizes.
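
To make the scatter/gather idea concrete, here is a hedged C sketch (the
file path is hypothetical) of one 32kB read issued as a single readv() call
over four separate 8kB buffers, i.e. without changing the page size at all:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ  8192                    /* keep the existing 8kB page size */
#define NCHUNKS 4                       /* 4 x 8kB = 32kB per system call */

int main(void)
{
    struct iovec iov[NCHUNKS];
    char        *pages[NCHUNKS];
    ssize_t      n;
    int          i;
    int          fd = open("/fidb1/bigfile", O_RDONLY);    /* hypothetical */

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * Each 8kB page can live in its own, non-contiguous buffer -- think of
     * four separate shared-buffer slots.
     */
    for (i = 0; i < NCHUNKS; i++)
    {
        pages[i] = malloc(BLCKSZ);
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }

    /* One syscall fills all four pages: a quarter of the per-page overhead. */
    n = readv(fd, iov, NCHUNKS);
    printf("readv returned %zd bytes\n", n);

    for (i = 0; i < NCHUNKS; i++)
        free(pages[i]);
    close(fd);
    return 0;
}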

(Incidentally, your parenthetical comment is a bit confused. By mirror I
imagine you're referring to raid1+0 since mirrors alone, aka raid1, aren't a
popular way to improve performance. But raid5 actually performs better than
raid1+0 for sequential reads.)

 Does postgresql use the platform specific memcpy() in libc? Some care might
 be needed to ensure that the memory blocks within postgresql are all
 properly aligned to make sure that one isn't ping-ponging cache lines around
 (usually done by padding the buffer sizes by an extra 32 bytes or L1 line
 size). Whatever you do, all the usual high performance computing tricks
 should be used prior to considering any rewriting of major code sections.

So your philosophy is to worry about microoptimizations before worrying about
architectural issues?


-- 
greg


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-22 Thread Bruce Momjian
Alan Stange wrote:
 Bruce Momjian wrote:
  Right now the file system will do read-ahead for a heap scan (but not an
  index scan), but even then, there is time required to get that kernel
  block into the PostgreSQL shared buffers, backing up Luke's observation
  of heavy memcpy() usage.
 
  So what are our options?  mmap()?  I have no idea.  Seems larger page
  size does help.

 For sequential scans, you do have a background reader.  It's the 
 kernel.  As long as you don't issue a seek() between read() calls, the 

I guess you missed my text of "Right now the file system will do
read-ahead", meaning the kernel.

 I don't think the memcpy of data from the kernel to userspace is that 
 big of an issue right now.  dd and all the high end network interfaces 
 manage OK doing it, so I'd expect postgresql to do all right with it for 
 now too.   Direct IO will avoid that memcpy, but then you also don't get 
 any caching of the files in memory.  I'd be more concerned about any 
 memcpy calls or general data management within postgresql.Does 
 postgresql use the platform specific memcpy() in libc?  Some care might 
 be needed to ensure that the memory blocks within postgresql are all 
 properly aligned to make sure that one isn't ping-ponging cache lines 
 around (usually done by padding the buffer sizes by an extra 32 bytes or 
 L1 line size).   Whatever you do, all the usual high performance 
 computing tricks should be used prior to considering any rewriting of 
 major code sections.

We have dealt with alignment and MemCpy is what we used for small-sized
copies to reduce function call overhead.  If you want to improve it,
feel free to take a look.
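
Purely as an illustration of that idea -- this is not the actual PostgreSQL
macro, and __builtin_constant_p() is GCC-specific -- a small-copy wrapper
could look like the sketch below: small, compile-time-constant lengths are
copied with an inline loop the compiler can unroll, and everything else
falls through to the platform memcpy():

#include <stdio.h>
#include <string.h>

#define SMALL_COPY_LIMIT 64

/* Illustrative only.  For small constant lengths the inline loop avoids
 * the function call overhead; larger or variable lengths use libc memcpy(),
 * which is usually the fastest option for big blocks. */
#define SmallCopy(dst, src, len) \
    do { \
        if (__builtin_constant_p(len) && (len) <= SMALL_COPY_LIMIT) \
        { \
            char       *_d = (char *) (dst); \
            const char *_s = (const char *) (src); \
            size_t      _i; \
            for (_i = 0; _i < (size_t) (len); _i++) \
                _d[_i] = _s[_i]; \
        } \
        else \
            memcpy((dst), (src), (len)); \
    } while (0)

int main(void)
{
    char    a[16] = "hello, world";
    char    b[16];

    SmallCopy(b, a, sizeof(a));         /* constant size: inlined loop */
    printf("%s\n", b);
    return 0;
}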

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Luke Lonergan
Alan,


On 11/19/05 8:43 PM, Alan Stange [EMAIL PROTECTED] wrote:

 Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
 sdd             343.73    175035.73       277.55    5251072       8326
 
 while doing a select count(1) on the same large table as before.
 Subsequent iostat output all showed that this data rate was being
 maintained.  The system is otherwise mostly idle during this measurement.

Yes - interesting.  Note the other result using XFS that I posted earlier
where I got 240+MB/s.  XFS has more aggressive readahead, which is why I
used it.
 
 Can you explain again why you think there's an IO ceiling of 120MB/s
 because I really don't understand?

OK - slower this time:

We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
which all are capped at 120MB/s when doing sequential scans with different
versions of Postgres.

Understand my point: It doesn't matter that there is idle or iowait on the
CPU, the postgres executor is not able to drive the I/O rate for two
reasons: there is a lot of CPU used for the scan (the 40% you reported) and
a lack of asynchrony (the iowait time).  That means that by speeding up the
CPU you only reduce the first part, but you don't fix the second and v.v.

With more aggressive readahead, the second problem (the I/O asynchrony) is
handled better by the Linux kernel and filesystem.  That's what we're seeing
with XFS.

- Luke  



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 OK - slower this time:

 We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
 machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
 which all are capped at 120MB/s when doing sequential scans with different
 versions of Postgres.

Luke, sometime it would be nice if you would post your raw evidence
and let other people do their own analysis.  I for one have gotten
tired of reading sweeping generalizations unbacked by any data.

I find the notion of a magic 120MB/s barrier, independent of either
CPU or disk speed, to be pretty dubious to say the least.  I would
like to know exactly what the wide variety of data points you
haven't shown us are.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Alan Stange

Luke Lonergan wrote:

OK - slower this time:
We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
which all are capped at 120MB/s when doing sequential scans with different
versions of Postgres.
  
Postgresql issues the exact same sequence of read() calls as does dd.   
So why is dd so much faster?


I'd be careful with the dd read of a 16GB file on an 8GB system.  Make 
sure you umount the file system first, to make sure all of the file is 
flushed from memory.   Some systems use a freebehind on sequential reads 
to avoid flushing memory...and you'd find that 1/2 of your 16GB file is 
still in memory.   The same point also holds for the writes:  when dd 
finishes not all the data is on disk.   You need to issue a sync() call 
to make that happen.  Use lmdd to ensure that the data is actually all 
written.   In other words, I think your dd results are possibly misleading.


It's trivial to demonstrate:

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real    0m13.780s
user    0m0.134s
sys     0m13.510s

Oops.   I just wrote 470MB/s to a file system that has a peak write speed 
of 200MB/s.


Now, you might say that you wrote a 16GB file on an 8 GB machine so this 
isn't an issue.   It does make your dd numbers look fast as some of the 
data will be unwritten.



I'd also suggest running dd on the same files as postgresql.  I suspect 
you'd find that the layout of the postgresql files isn't that good as 
they are grown bit by bit, unlike the file created by simply dd'ing a 
large file.



Understand my point: It doesn't matter that there is idle or iowait on the
CPU, the postgres executor is not able to drive the I/O rate for two
reasons: there is a lot of CPU used for the scan (the 40% you reported) and
a lack of asynchrony (the iowait time).  That means that by speeding up the
CPU you only reduce the first part, but you don't fix the second and v.v.

With more aggressive readahead, the second problem (the I/O asynchrony) is
handled better by the Linux kernel and filesystem.  That's what we're seeing
with XFS.


I think your point doesn't hold up.  Every time you make it, I come away 
posting another result showing it to be incorrect.


The point you're making doesn't match my experience with *any* storage or 
program I've ever used, including postgresql.   Your point suggests that 
the storage system is idle  and that postgresql is broken because it 
isn't able to use the resources available...even when the cpu is very 
idle.  How can that make sense?   The issue here is that the storage 
system is very active doing reads on the files...which might be somewhat 
poorly allocated on disk because postgresql grows the tables bit by bit.


I had the same readahead in Reiser and in XFS.   The XFS performance was 
better because XFS does a better job of large file allocation on disk, 
thus resulting in many fewer seeks (generated by the file system itself) 
to read the files back in.   As an example, some file systems like UFS 
purposely scatter large files across cylinder groups to avoid forcing 
large seeks on small files; one can tune this behavior so that large 
files are more tightly allocated.




Of course, because this is engineering, I have another obligatory data 
point:   This time it's a 4.2GB table using 137,138  32KB pages with 
nearly 41 million rows.


A select count(1) on the table completes in 14.6 seconds, for an 
average read rate of 320 MB/s. 

One cpu was idle, the other averaged 32% system time and 68% user time 
for the 14 second period.   This is on a 2.2GHz Opteron.   A faster cpu 
would show increased performance as I really am cpu bound finally. 

Postgresql is clearly able to issue the relevant sequential read() 
system calls and sink the resulting data without a problem if the file 
system is capable of providing the data.  It can do this up to a speed 
of ~300MB/s on this class of system.   Now it should be fairly simple to 
tweak the few spots where some excess memory copies are being done and 
up this result substantially.  I hope postgresql is always using the 
libc memcpy as that's going to be a lot faster than some private routine.


-- Alan



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Greg Stark

Alan Stange [EMAIL PROTECTED] writes:

 The point your making doesn't match my experience with *any* storage or 
 program
 I've ever used, including postgresql.   Your point suggests that the storage
 system is idle  and that postgresql is broken because it isn't able to use the
 resources available...even when the cpu is very idle.  How can that make 
 sense?

Well I think what he's saying is that Postgres is issuing a read, then waiting
for the data to return. Then it does some processing, and goes back to issue
another read. The CPU is idle half the time because Postgres isn't capable of
doing any work while waiting for i/o, and the i/o system is idle half the time
while the CPU intensive part happens.

(Consider as a pathological example a program that reads 8k then sleeps for
10ms, and loops doing that 1,000 times. Now consider the same program
optimized to read 8M asynchronously and sleep for 10s. By the time it's
finished sleeping it has probably read in all 8M. Whereas the program that
read 8k in little chunks interleaved with small sleeps would probably take
twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
idle.)

It's a reasonable theory and it's not inconsistent with the results you sent.
But it's not exactly proven either. Nor is it clear how to improve matters.
Adding additional threads to handle the i/o adds an enormous amount of
complexity and creates lots of opportunity for other contention that could
easily eat all of the gains.
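
One lighter-weight direction, sketched here purely for illustration (this is
not how the executor currently works, and the file path and 1MB chunk size
are made up), is to keep a single process but hand the kernel an explicit
readahead hint for the next chunk before processing the current one:

#define _POSIX_C_SOURCE 200112L         /* for posix_fadvise() */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)             /* prefetch window: an arbitrary 1MB */

int main(void)
{
    char   *buf = malloc(CHUNK);
    off_t   pos = 0;
    ssize_t n;
    int     fd = open("/fidb1/bigfile", O_RDONLY);      /* hypothetical path */

    if (fd < 0 || buf == NULL)
    {
        perror("setup");
        return 1;
    }

    /* Ask the kernel to start fetching the first chunk. */
    posix_fadvise(fd, 0, CHUNK, POSIX_FADV_WILLNEED);

    while ((n = read(fd, buf, CHUNK)) > 0)
    {
        /*
         * Hint the chunk after the one we are about to process, so the
         * disk stays busy while the CPU-intensive part happens.
         */
        posix_fadvise(fd, pos + n, CHUNK, POSIX_FADV_WILLNEED);
        pos += n;

        /* ... per-tuple processing would happen here ... */
    }

    close(fd);
    free(buf);
    return 0;
}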

I also fear that heading in that direction could push Postgres even further
from the niche of software that works fine even on low end hardware into the
realm of software that only works on high end hardware. It's already suffering
a bit from that.

-- 
greg


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Alvaro Herrera
Greg Stark wrote:

 I also fear that heading in that direction could push Postgres even further
 from the niche of software that works fine even on low end hardware into the
 realm of software that only works on high end hardware. It's already suffering
 a bit from that.

What's high end hardware for you?  I do development on a Celeron 533
machine with 448 MB of RAM and I find it to work well (for a slow
value of "well", certainly.)  If you're talking about embedded hardware,
that's another matter entirely and I don't think we really support the
idea of running Postgres on one of those things.

There's certainly truth in the claim that the memory requirements have increased a
bit, but I don't think it really qualifies as high end even on 8.1.

-- 
Alvaro Herrera   Developer, http://www.PostgreSQL.org
Jude: I wish humans laid eggs
Ringlord: Why would you want humans to lay eggs?
Jude: So I can eat them

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Michael Stone

On Mon, Nov 21, 2005 at 02:01:26PM -0500, Greg Stark wrote:

I also fear that heading in that direction could push Postgres even further
from the niche of software that works fine even on low end hardware into the
realm of software that only works on high end hardware. It's already suffering
a bit from that.


Well, there are are alread a bunch of open source DB's that can handle
the low end. postgres is the closest thing to being able to handle the
high end.

Mike Stone

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Bill McGonigle
Would it be worth first agreeing on a common set of criteria to 
measure?  I see many data points going back and forth but not much 
agreement on what's worth measuring and how to measure.


I'm not necessarily trying to herd cats, but it sure would be swell to 
have the several knowledgeable minds here come up with something that 
could be uniformly tested on a range of machines, possibly even integrated 
into pg_bench or something.  Disagreements on criteria or methodology 
should be methodically testable.


Then I have dreams of a new pg_autotune that would know about these 
kinds of system-level settings.


I haven't been on this list for long, and only using postgres for a 
handful of years, so forgive me if this has been hashed out before.


-Bill
-
Bill McGonigle, Owner   Work: 603.448.4440
BFC Computing, LLC  Home: 603.448.1668
[EMAIL PROTECTED]   Mobile: 603.252.2606
http://www.bfccomputing.com/Pager: 603.442.1833
Jabber: [EMAIL PROTECTED]  Text: [EMAIL PROTECTED]
Blog: http://blog.bfccomputing.com/


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Luke Lonergan
Alan,

On 11/21/05 6:57 AM, Alan Stange [EMAIL PROTECTED] wrote:

 $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
 800000+0 records in
 800000+0 records out
 
 real    0m13.780s
 user    0m0.134s
 sys     0m13.510s
 
 Oops.   I just wrote 470MB/s to a file system that has a peak write speed
 of 200MB/s.

How much RAM on this machine?
 
 Now, you might say that you wrote a 16GB file on an 8 GB machine so this
 isn't an issue.   It does make your dd numbers look fast as some of the
 data will be unwritten.

This simple test, at 2x memory, correlates very closely to Bonnie++ numbers
for sequential scan.  What's more, we see close to the same peak in practice
with multiple scanners.  Furthermore, if you run two of them simultaneously
(on two filesystems), you can also see the I/O become the limit.
  
 I'd also suggest running dd on the same files as postgresql.  I suspect
 you'd find that the layout of the postgresql files isn't that good as
 they are grown bit by bit, unlike the file created by simply dd'ing a
 large file.

Can happen if you're not careful with filesystems (see above).

There's nothing wrong with the dd test.
 
 I think your point doesn't hold up.  Every time you make it, I come away
 posting another result showing it to be incorrect.

Prove it - your Reiserfs number was about the same.

I also posted an XFS number that was substantially higher than 110-120.

 The point your making doesn't match my experience with *any* storage or
 program I've ever used, including postgresql.   Your point suggests that
 the storage system is idle  and that postgresql is broken because it
 isn't able to use the resources available...even when the cpu is very
 idle.  How can that make sense?   The issue here is that the storage
 system is very active doing reads on the files...which might be somewhat
 poorly allocated on disk because postgresql grows the tables bit by bit.

Then you've made my point - if the problem is contiguity of files on disk,
then larger allocation blocks would help on the CPU side.

The objective is clear: given a high performance filesystem, how much of the
available bandwidth can Postgres achieve?  I think what we're seeing is that
XFS is dramatically improving that objective.
 
 I had the same readahead in Reiser and in XFS.   The XFS performance was
 better because XFS does a better job of large file allocation on disk,
 thus resulting in many fewer seeks (generated by the file system itself)
 to read the files back in.   As an example, some file systems like UFS
 purposely scatter large files across cylinder groups to avoid forcing
 large seeks on small files; one can tune this behavior so that large
 files are more tightly allocated.

Our other tests have used ext3, reiser and Solaris 10 UFS, so this might
make some sense.

 Of course, because this is engineering, I have another obligatory data
 point:   This time it's a 4.2GB table using 137,138  32KB pages with
 nearly 41 million rows.
 
 A select count(1) on the table completes in 14.6 seconds, for an
 average read rate of 320 MB/s.

So, assuming that the net memory scan rate is about 2GB/s, and two copies
(one from FS cache to buffer cache, one from buffer cache to the agg node),
you have a 700MB/s filesystem with the equivalent of DirectIO (no FS cache)
because you are reading directly from the I/O cache.  You got half of that
because the I/O processing in the executor is limited to 320MB/s on that
fast CPU. 

My point is this: if you were to decrease the filesystem speed to say
400MB/s and still use the equivalent of DirectIO, I think Postgres would not
deliver 320MB/s, but rather something like 220MB/s due to the
producer/consumer arch of the executor.  If you get that part, then we're on
the same track, otherwise we disagree.

 One cpu was idle, the other averaged 32% system time and 68 user time
 for the 14 second period.   This is on a 2.2Ghz Opteron.   A faster cpu
 would show increased performance as I really am cpu bound finally.

Yep, with the equivalent of DirectIO you are.
 
- Luke



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Luke Lonergan
Tom,

On 11/21/05 6:56 AM, Tom Lane [EMAIL PROTECTED] wrote:

 Luke Lonergan [EMAIL PROTECTED] writes:
 OK - slower this time:
 
 We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
 machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
 which all are capped at 120MB/s when doing sequential scans with different
 versions of Postgres.
 
 Luke, sometime it would be nice if you would post your raw evidence
 and let other people do their own analysis.  I for one have gotten
 tired of reading sweeping generalizations unbacked by any data.

This has partly been a challenge to get others to post their results.
  
 I find the notion of a magic 120MB/s barrier, independent of either
 CPU or disk speed, to be pretty dubious to say the least.  I would
 like to know exactly what the wide variety of data points you
 haven't shown us are.

I'll try to put up some of them, they've occurred over the last 3 years on
various platforms including:
- Dual 3.2GHz Xeon, 2 x Adaptec U320 SCSI attached to 6 x 10K RPM disks,
Linux 2.6.4(?) - 2.6.10 kernel, ext2/3 and Reiser filesystems
120-130MB/s Postgres seq scan rate on 7.4 and 8.0.

- Dual 1.8 GHz Opteron, 2 x LSI U320 SCSI attached to 6 x 10K RPM disks,
Linux 2.6.10 kernel, ext2/3 and Reiser filesystems
110-120MB/s Postgres seq scan rate on 8.0

- Same machine as above running Solaris 10, with UFS filesystem.  When I/O
caching is tuned, we reach the same 110-120MB/s Postgres seq scan rate

- Same machine as above with 7 x 15K RPM 144GB disks in an external disk
tray, same scan rate

Only when we got these new SATA systems and tried out XFS with large
readahead have we been able to break past the 120-130MB/s.  After Alan's
post, it seems that XFS might be a big part of that.  I think we'll test
ext2/3 against XFS on the same machine to find out.

It may have to wait a week, as many of us are on vacation.

- Luke



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Alan Stange

Luke,

it's time to back yourself up with some numbers.   You're claiming the 
need for a significant rewrite of portions of postgresql and you haven't 
done the work to make that case. 

You've apparently made some mistakes on the use of dd to benchmark a 
storage system.   Use lmdd and umount the file system before the read 
and post your results.  Using a file 2x the size of memory doesn't work 
correctly.  You can quote any other numbers you want, but until you use 
lmdd correctly you should be ignored.  Ideally, since postgresql uses 
1GB files, you'll want to use 1GB files for dd as well.


Luke Lonergan wrote:

Alan,

On 11/21/05 6:57 AM, Alan Stange [EMAIL PROTECTED] wrote:

  

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real    0m13.780s
user    0m0.134s
sys     0m13.510s

Oops.   I just wrote 470MB/s to a file system that has a peak write speed
of 200MB/s.


How much RAM on this machine?
  
Doesn't matter.  The result will always be wrong without a call to 
sync() or fsync() before the close() if you're trying to measure the 
speed of the disk subsystem.   Add that sync() and the result will be 
correct for any memory size.  Just for completeness:  Solaris implicitly 
calls sync() as part of close.   Bonnie used to get this wrong, so 
quoting Bonnie isn't any good.   Note that on some systems using 2x 
memory for these tests is almost OK.  For example, Solaris used to have 
a hiwater mark that would throttle processes and not allow more than a 
few 100K of  writes to be outstanding on a file.  Linux/XFS clearly 
allows a lot of write data to be outstanding.  It's best to understand 
the tools and know what they do and why they can be wrong than simply 
quoting some other tool that makes the same mistakes.


I find that postgresql is able to achieve about 175MB/s on average from 
a system capable of delivering 200MB/s peak and it does this with a lot 
of cpu time to spare.   Maybe dd can do a little better and deliver 
185MB/s.If I were to double the speed of my IO system, I might find 
that a single postgresql instance can sink about 300MB/s of data (based 
on the last numbers I posted).  That's why I have multi-cpu opterons and 
more than one query/client as they soak up the remaining IO capacity.


It is guaranteed that postgresql will hit some threshold of performance 
in the future and possible rewrites of some core functionality will be 
needed, but no numbers posted here so far have made the case that 
postgresql is in trouble now. In the mean time, build balanced 
systems with cpus that match the capabilities of the storage subsystems, 
use 32KB block sizes for large memory databases that are doing lots of 
sequential scans, use file systems tuned for large files, use opterons, etc.



As always, one has to post some numbers.   Here's an example of how dd 
doesn't do what you might expect:


mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k
16.7772 MB in 0.0235 secs, 714.5931 MB/sec

mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k sync=1
16.7772 MB in 0.1410 secs, 118.9696 MB/sec

Both numbers are correct.  But one measures the kernel's ability to 
absorb 2000 8KB writes with no guarantee that the data is on disk and 
the second measures the disk subsystem's ability to write 16MB of data.  
dd is equivalent to the first result.  You can't use the first type of 
result and complain that postgresql is slow.  If you wrote 16G of data 
on a machine with 8G memory then your dd result is possibly too fast by 
a factor of two as 8G of the data might not be on disk yet.  We won't 
know until you post some results.
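
To make the same point without leaning on any particular tool's flags, here
is a minimal C sketch (hypothetical path, 16MB of data) that reports both
numbers for one file: the rate at which the kernel absorbs the writes, and
the rate once fsync() has forced the data onto disk:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCKSZ 8192
#define NBLOCKS 2048                    /* 2048 x 8KB = 16MB, adjust to taste */

static double
elapsed(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

int main(void)
{
    char            buf[BLOCKSZ];
    struct timeval  t0, t1, t2;
    double          mb = (double) NBLOCKS * BLOCKSZ / 1e6;
    int             i;
    int             fd = open("/fidb2/bigfile",         /* hypothetical path */
                              O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 0, sizeof(buf));

    gettimeofday(&t0, NULL);
    for (i = 0; i < NBLOCKS; i++)
        write(fd, buf, BLOCKSZ);        /* returns once the kernel has it */
    gettimeofday(&t1, NULL);

    fsync(fd);                          /* now it is actually on disk */
    gettimeofday(&t2, NULL);

    printf("kernel absorb rate: %.1f MB/s\n", mb / elapsed(t0, t1));
    printf("with fsync:         %.1f MB/s\n", mb / elapsed(t0, t2));

    close(fd);
    return 0;
}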


Cheers,

-- Alan


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Luke Lonergan
Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling looks like 64% of the I/O subsystem speed is available to the
executor - so as the I/O subsystem increases in scan rate, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke
 
= Results ===

Unless noted otherwise all results posted are for block device readahead set
to 16M using blockdev --setra=16384 block_device.  All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:


[EMAIL PROTECTED] dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real    0m33.057s
user    0m0.116s
sys     0m13.577s

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real    0m33.032s
user    0m0.087s
sys     0m13.129s


So lmdd with sync=1 is equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both write and read for this set of comparisons.

First, let's test ext2 versus ext3, data=ordered, versus xfs:


16GB write, then read

---
ext2:
---
[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

---
ext3, data=ordered:
---
[EMAIL PROTECTED] ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 137.1607 secs, 119.4511 MB/sec

[EMAIL PROTECTED] ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 48.7398 secs, 336.1527 MB/sec

---
xfs:
---
[EMAIL PROTECTED] ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 52.6141 secs, 311.3994 MB/sec

[EMAIL PROTECTED] ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 40.2807 secs, 406.7453 MB/sec


I'm liking xfs!  Something about the way files are laid out, as Alan
suggested, seems to dramatically improve write performance and perhaps
consequently the read also improves.  There doesn't seem to be a difference
between ext3 and ext2, as expected.

Now on to the Postgres 8 tests.  We'll do a 16GB table size to ensure that
we aren't reading from the read cache.  I'll write this file through
Postgres COPY to be sure that the file layout is as Postgres creates it. The
alternative would be to use COPY once, then tar/untar onto different
filesystems, but that may not duplicate the real world results.

These tests will use Bizgres 0_8_1, which is an augmented 8.0.3.  None of
the augmentations act to improve the executor I/O though, so for these
purposes it should be the same as 8.0.3.


26GB of DBT-3 data from the lineitem table

llonergan=# select relpages from pg_class where relname='lineitem';
 relpages 
--
  3159138
(1 row)

3159138*8192/1000000
25879 Million Bytes, or 25.9GB

---
xfs:
---
llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count   
---
 119994608
(1 row)

Time: 394908.501 ms
llonergan=# select count(1) from lineitem;
   count   
---
 119994608
(1 row)

Time: 99425.223 ms
llonergan=# select count(1) from lineitem;
   count   
---
 119994608
(1 row)

Time: 99187.205 ms

---
ext2:
---
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages 
--
  3159138
(1 row)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count   
---
 119994608
(1 row)

Time: 395286.475 ms
llonergan=# select count(1) from lineitem;
   count   
---
 119994608
(1 row)

Time: 195756.381 ms
llonergan=# select count(1) from lineitem;
   count   
---
 

Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Mark Kirkwood

Luke Lonergan wrote:


So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?



It might be interesting to see what effect reducing the cpu consumption 
 entailed by the count aggregation has - by (say) writing a little bit 
of code to heap scan the desired relation (sample attached).


Cheers

Mark




/*
 * fastcount.c
 *
 * Do a count that uses considerably less CPU time than an aggregate.
 */

#include "postgres.h"

#include "funcapi.h"
#include "access/heapam.h"
#include "catalog/namespace.h"
#include "utils/builtins.h"


extern Datum fastcount(PG_FUNCTION_ARGS);


PG_FUNCTION_INFO_V1(fastcount);
Datum
fastcount(PG_FUNCTION_ARGS)
{
text   *relname = PG_GETARG_TEXT_P(0);
RangeVar   *relrv;
Relationrel;
HeapScanDesc scan;
HeapTuple   tuple;
int64   result = 0;

	/* Use the name to get a suitable range variable and open the relation. */
relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
rel = heap_openrv(relrv, AccessShareLock);

/* Start a heap scan on the relation. */
scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
{
result++;
}

/* End the scan and close up the relation. */
heap_endscan(scan);
heap_close(rel, AccessShareLock);


PG_RETURN_INT64(result);
}

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Mark Kirkwood

Mark Kirkwood wrote:



- I am happy that seqscan is cpu bound after ~110M/s (It's cpu bound on 
my old P3 system even earlier than that)


Ahem - after reading Alan's postings I am not so sure, ISTM that there 
is some more investigation required here too :-).




---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread William Yu

Alan Stange wrote:

Luke Lonergan wrote:

The aka iowait is the problem here - iowait is not idle (otherwise it
would be in the idle column).

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as 


iowait time is idle time. Period.   This point has been debated 
endlessly for Solaris and other OS's as well.


I'm sure the theory is nice but here's my experience with iowait 
just a minute ago. I run Linux/XFce as my desktop -- decided I wanted to 
lookup some stuff in Wikipedia under Mozilla and my computer system 
became completely unusable for nearly a minute while who knows what 
Mozilla was doing. (Probably loading all the language packs.) I could 
not even switch to IRC (already loaded) to chat with other people while 
Mozilla was chewing up all my disk I/O.


So I went to another computer, connected to mine remotely (slow...) and 
checked top. 90% in the wa column which I assume is the iowait column. 
It may be idle in theory but it's not a very useful idle -- wasn't able 
to switch to any programs already running, couldn't click on the XFce 
launchbar to run any new programs.


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Steinar H. Gunderson
On Sat, Nov 19, 2005 at 08:13:09AM -0800, Luke Lonergan wrote:
 Iowait is time spent waiting on blocking io calls. 

To be picky, iowait is time spent in the idle task while the I/O queue is not
empty. It does not matter if the I/O is blocking or not (from userspace's
point of view), and if the I/O was blocking (say, PIO) from the kernel's
point of view, it would be counted in system.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Alan Stange

William Yu wrote:

Alan Stange wrote:

Luke Lonergan wrote:

The aka iowait is the problem here - iowait is not idle (otherwise it
would be in the idle column).

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as 


iowait time is idle time. Period.   This point has been debated 
endlessly for Solaris and other OS's as well.


I'm sure the theory is nice but here's my experience with iowait 
just a minute ago. I run Linux/XFce as my desktop -- decided I wanted 
to lookup some stuff in Wikipedia under Mozilla and my computer system 
became completely unusable for nearly a minute while who knows what 
Mozilla was doing. (Probably loading all the language packs.) I could 
not even switch to IRC (already loaded) to chat with other people 
while Mozilla was chewing up all my disk I/O.


So I went to another computer, connected to mine remotely (slow...) 
and checked top. 90% in the wa column which I assume is the iowait 
column. It may be idle in theory but it's not a very useful idle -- 
wasn't able to switch to any programs already running, couldn't click 
on the XFce launchbar to run any new programs.


So, you have a sucky computer.I'm sorry, but iowait is still idle 
time, whether you believe it or not.


-- Alan


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Greg Stark

Alan Stange [EMAIL PROTECTED] writes:

  Iowait is time spent waiting on blocking io calls.  As another poster
  pointed out, you have a two CPU system, and during your scan, as predicted,
  one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
  be context switched to other users, but as I pointed out earlier, that's not
  useful for getting response on decision support queries.

I don't think that's true. If the syscall was preemptable then it wouldn't
show up under iowait, but rather idle. The time spent in iowait is time in
uninterruptable sleeps where no other process can be scheduled.

 iowait time is idle time. Period.   This point has been debated endlessly for
 Solaris and other OS's as well.
 
 Here's the man page:
   %iowait
  Show  the  percentage  of  time that the CPU or CPUs were
  idle during which the system had an outstanding disk  I/O
  request.
 
 If the system had some other cpu bound work to perform you wouldn't ever see
 any iowait time.  Anyone claiming the cpu was 100% busy on the sequential scan
 using the one set of numbers I posted is misunderstanding the actual metrics.

That's easy to test. rerun the test with another process running a simple C
program like main() {while(1);} (or two invocations of that on your system
because of the extra processor). I bet you'll see about half the percentage of
iowait because postres will get half as much opportunity to schedule i/o. If
what you are saying were true then you should get 0% iowait.
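
For completeness, the busy-loop spelled out as a compilable program (run one
copy per spare CPU, ideally reniced, while repeating the scan):

/*
 * busy.c -- burn one CPU so that time the scan would otherwise spend in
 * iowait gets consumed by this process instead.
 */
int main(void)
{
    for (;;)
        ;                               /* spin forever; kill it when done */
    return 0;
}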

-- 
greg


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Steinar H. Gunderson
On Sun, Nov 20, 2005 at 09:22:41AM -0500, Greg Stark wrote:
 I don't think that's true. If the syscall was preemptable then it wouldn't
 show up under iowait, but rather idle. The time spent in iowait is time in
 uninterruptable sleeps where no other process can be scheduled.

You are confusing userspace with kernel space. When a process is stuck in
uninterruptable sleep, it means _that process_ can't be interrupted (say,
by a signal). The kernel can preempt it without problems.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Alan Stange

Greg Stark wrote:

Alan Stange [EMAIL PROTECTED] writes:

  

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as predicted,
one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
be context switched to other users, but as I pointed out earlier, that's not
useful for getting response on decision support queries.
  


I don't think that's true. If the syscall was preemptable then it wouldn't
show up under iowait, but rather idle. The time spent in iowait is time in
uninterruptable sleeps where no other process can be scheduled.
  
That would be wrong.   The time spent in iowait is idle time.  The 
iowait stat would be 0 on a machine with a compute bound runnable 
process available for each cpu.


Come on people, read the man page or look at the source code.   Just 
stop making stuff up.



  

iowait time is idle time. Period.   This point has been debated endlessly for
Solaris and other OS's as well.

Here's the man page:
  %iowait
 Show  the  percentage  of  time that the CPU or CPUs were
 idle during which the system had an outstanding disk  I/O
 request.

If the system had some other cpu bound work to perform you wouldn't ever see
any iowait time.  Anyone claiming the cpu was 100% busy on the sequential scan
using the one set of numbers I posted is misunderstanding the actual metrics.



That's easy to test. rerun the test with another process running a simple C
program like main() {while(1);} (or two invocations of that on your system
because of the extra processor). I bet you'll see about half the percentage of
iowait because postres will get half as much opportunity to schedule i/o. If
what you are saying were true then you should get 0% iowait.
Yes, I did this once about 10 years ago.   But instead of saying "I bet" 
and guessing at the result, you should try it yourself. Without 
guessing, I can tell you that the iowait time will go to 0%.  You can do 
this loop in the shell, so there's no code to write.  Also, it helps to 
do this with the shell running at a lower priority.


-- Alan



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Mark Kirkwood

Alan Stange wrote:

Another data point.
We had some down time on our system today to complete some maintenance 
work.  It took the opportunity to rebuild the 700GB file system using 
XFS instead of Reiser.


One iostat output for 30 seconds is

avg-cpu:  %user   %nice    %sys  %iowait   %idle
           1.58    0.00   19.69    31.94   46.78

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             343.73    175035.73       277.55    5251072       8326

while doing a select count(1) on the same large table as before.   
Subsequent iostat output all showed that this data rate was being 
maintained.  The system is otherwise mostly idle during this measurement.


The sequential read rate is 175MB/s.  The system is the same as earlier, 
one cpu is idle and the second is ~40% busy doing the scan and ~60% 
idle.   This is  postgresql 8.1rc1, 32KB block size.  No tuning except 
for using a 1024KB read ahead.


The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel 
controller).  I see no reason why this configuration wouldn't generate 
higher IO rates if a faster IO connection were available.


Can you explain again why you think there's an IO ceiling of 120MB/s 
because I really don't understand?




I think what is going on here is that Luke's observation of the 120 MB/s 
rate is taken from data using 8K block size - it looks like we can get 
higher rates with 32K.


A quick test on my P3 system seems to support this (the numbers are a 
bit feeble, but the difference is interesting):


The test is SELECT 1 FROM table, stopping Pg and unmounting the file 
system after each test.


8K blocksize:
25 s elapsed
48 % idle from vmstat (dual cpu system)
70 % busy from gstat (Freebsd GEOM io monitor)
181819 pages in relation
56 MB/s effective IO throughput


32K blocksize:
23 s elapsed
44 % idle from vmstat
80 % busy from gstat
45249 pages in relation
60 MB/s effective IO throughput


I re-ran these several times - very repeatable (+/- 0.25 seconds).

This is Freebsd 6.0 with the readahead set to 16 blocks, UFS2 filesystem 
created with 32K blocksize (both cases). It might be interesting to see 
the effect of using 16K (the default) with the 8K Pg block size; I would 
expect this to widen the gap.


Cheers

Mark


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Mark Kirkwood

Mark Kirkwood wrote:


The test is SELECT 1 FROM table


That should read "The test is SELECT count(1) FROM table"

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-19 Thread Luke Lonergan
Mark,

On 11/18/05 6:27 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:

 That too, meaning the business of 1 executor random reading a given
 relation file whilst another is sequentially scanning (some other) part
 of it

I think it should actually improve things - each I/O will read 16MB into the
I/O cache, then the next scanner will seek for 10ms to get the next 16MB
into cache, etc.  It should minimize the seek/data ratio nicely.  As long as
the tables are contiguous it should rock and roll.

- Luke



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-19 Thread Luke Lonergan
Alan,

On 11/18/05 11:39 AM, Alan Stange [EMAIL PROTECTED] wrote:

 Yes and no.   The one cpu is clearly idle.   The second cpu is 40% busy
 and 60% idle (aka iowait in the above numbers).

The aka iowait is the problem here - iowait is not idle (otherwise it
would be in the idle column).

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as predicted,
one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
be context switched to other users, but as I pointed out earlier, that's not
useful for getting response on decision support queries.

Thanks for your data, it exemplifies many of the points brought up:
- Lots of disks and expensive I/O hardware does not help improve performance
on large table queries because I/O bandwidth does not scale beyond
110-120MB/s on the fastest CPUs
- OLTP performance optimizations are different than decision support

Regards,

- Luke



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-19 Thread Mark Kirkwood

Luke Lonergan wrote:

Mark,

On 11/18/05 3:46 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:
 


If you alter this to involve more complex joins (e.g. 4-way star) and
(maybe add a small number of concurrent executors too) - is it still the
case?



4-way star, same result, that's part of my point.  With Bizgres MPP, the
4-way star uses 4 concurrent scanners, though not all are active all the
time.  And that's per segment instance - we normally use one segment
instance per CPU, so our concurrency is NCPUs plus some.



Luke - I don't think I was clear enough about what I was asking, sorry.

I added the more complex joins comment because:

- I am happy that seqscan is cpu bound after ~110M/s (It's cpu bound on 
my old P3 system even earlier than that)
- I am curious if the *other* access methods (indexscan, nested loop, 
hash, merge, bitmap) also suffer then same fate.


I'm guessing from your comment that you have tested this too, but I 
think it's worth clarifying!


With respect to Bizgres MPP, scan parallelism is a great addition... 
very nice! (BTW - is that in 0.8, or are we talking a new product variant?)


regards

Mark



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-19 Thread Alan Stange

Luke Lonergan wrote:

Alan,

On 11/18/05 11:39 AM, Alan Stange [EMAIL PROTECTED] wrote:

  

Yes and no.   The one cpu is clearly idle.   The second cpu is 40% busy
and 60% idle (aka iowait in the above numbers).



The aka iowait is the problem here - iowait is not idle (otherwise it
would be in the idle column).

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as predicted,
one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
be context switched to other users, but as I pointed out earlier, that's not
useful for getting response on decision support queries.
  
iowait time is idle time. Period.   This point has been debated 
endlessly for Solaris and other OS's as well.


Here's the man page:
  %iowait
         Show the percentage of time that the CPU or CPUs were
         idle during which the system had an outstanding disk I/O
         request.

If the system had some other cpu bound work to perform you wouldn't ever 
see any iowait time.  Anyone claiming the cpu was 100% busy on the 
sequential scan using the one set of numbers I posted is 
misunderstanding the actual metrics.



Thanks for your data, it exemplifies many of the points brought up:
- Lots of disks and expensive I/O hardware does not help improve performance
on large table queries because I/O bandwidth does not scale beyond
110-120MB/s on the fastest CPUs
  
I don't think that is the conclusion to draw from the anecdotal numbers I 
posted.   This file subsystem doesn't perform as well as expected for any 
tool.  Bonnie, dd, star, etc., don't get a better data rate either.   In 
fact, the storage system wasn't built for performance; it was built to 
reliably hold a big chunk of data.   Even so, postgresql is reading at 
130MB/s on it, using about 30% of a single cpu, almost all of which was 
system time.   I would get the same 130MB/s on a system with cpus that 
were substantially slower; the limitation isn't the cpus, or 
postgresql.  It's the IO system that is poorly configured for this test, 
not postgresql's ability to use it.


In fact, given the numbers I posted, it's clear this system could 
handily generate more than 120 MB/s using a single cpu given a better IO 
subsystem;  it has cpu time to spare.   A simple test can be done:   
build the database in /dev/shm and time the scans.  It's the same read() 
system call being used and now one has made the IO system infinitely 
fast.  The claim is being made that standard postgresql is unable to 
generate more than 120MB/s of IO on any IO system due to an inefficient 
use of the kernel API and excessive memory copies, etc.  Having the 
database be on a ram based file system is an example of expensive IO 
hardware and all else would be the same.   Hmmm, now that I think about 
this,  I could throw a medium sized table onto /dev/shm using 
tablespaces on one of our 8GB linux boxes.So why is this experiment 
not valid, or what is it about the above assertion that I am missing?



Anyway, if one cares about high speed sequential IO, then one should use 
a much larger block size to start.   Using 8KB IOs is inappropriate for 
such a configuration.  We happen to be using 32KB blocks on our largest 
database and it's been the best move for us.


-- Alan

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-19 Thread Alan Stange
Another data point. 

We had some down time on our system today to complete some maintenance 
work.  It took the opportunity to rebuild the 700GB file system using 
XFS instead of Reiser.


One iostat output for 30 seconds is

avg-cpu:  %user   %nice    %sys  %iowait   %idle
           1.58    0.00   19.69    31.94   46.78

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             343.73    175035.73       277.55    5251072       8326

while doing a select count(1) on the same large table as before.   
Subsequent iostat output all showed that this data rate was being 
maintained.  The system is otherwise mostly idle during this measurement.


The sequential read rate is 175MB/s.  The system is the same as earlier, 
one cpu is idle and the second is ~40% busy doing the scan and ~60% 
idle.   This is  postgresql 8.1rc1, 32KB block size.  No tuning except 
for using a 1024KB read ahead.


The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel 
controller).  I see no reason why this configuration wouldn't generate 
higher IO rates if a faster IO connection were available.


Can you explain again why you think there's an IO ceiling of 120MB/s 
because I really don't understand?


-- Alan



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Dave Cramer


On 17-Nov-05, at 2:50 PM, Alex Turner wrote:


Just pick up a SCSI drive and a consumer ATA drive.

Feel their weight.

You don't have to look inside to tell the difference.
At one point stereo manufacturers put weights in the case just to  
make them heavier.

The older ones weighed more and the consumer liked heavy stereos.

Be careful what you measure.

Dave


Alex

On 11/16/05, David Boreham [EMAIL PROTECTED] wrote:



 I suggest you read this on the difference between enterprise/SCSI and
 desktop/IDE drives:

 http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf

 This is exactly the kind of vendor propaganda I was talking about
 and it proves my point quite well: that there's nothing specific relating
 to reliability that is different between SCSI and SATA drives cited in that
 paper.
 It does have a bunch of FUD such as 'oh yeah we do a lot more
 drive characterization during manufacturing'.






---(end of broadcast)---

TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Dave Cramer


On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:


Greg,


On 11/17/05 9:17 PM, Greg Stark [EMAIL PROTECTED] wrote:

Ok, a more productive point: it's not really the size of the database that
controls whether you're I/O bound or CPU bound. It's the available I/O
bandwidth versus your CPU speed.


Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after
110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.

A $1,000 system with one CPU and two SATA disks in a software RAID0 will
perform exactly the same as a $80,000 system with 8 dual core CPUs and the
world's best SCSI RAID hardware on a large database for decision support
(what the poster asked about).


Now there's an interesting line drawn in the sand. I presume you have
numbers to back this up?


This should draw some interesting posts.

Dave


Regards,

- Luke



---(end of  
broadcast)---

TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match




---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Richard Huxton

Dave Cramer wrote:


On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:


Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound  after
110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.

A $1,000 system with one CPU and two SATA disks in a software RAID0  will
perform exactly the same as a $80,000 system with 8 dual core CPUs  
and the

world's best SCSI RAID hardware on a large database for decision  support
(what the poster asked about).



Now there's an interesting line drawn in the sand. I presume you have  
numbers to back this up ?


This should draw some interesting posts.


Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000 
system if he's got one going :-)


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Luke Lonergan wrote:

Alan,

On 11/18/05 5:41 AM, Alan Stange [EMAIL PROTECTED] wrote:

  

That's interesting, as I occasionally see more than 110MB/s of
postgresql IO on our system.  I'm using a 32KB block size, which has
been a huge win in performance for our usage patterns.   300GB database
with a lot of turnover.  A vacuum analyze now takes about 3 hours, which
is much shorter than before.  Postgresql 8.1, dual opteron, 8GB memory,
Linux 2.6.11, FC drives.



300GB / 3 hours = 27MB/s.
  
That's 3 hours under load, with 80 compute clients beating on the 
database at the same time.   We have the stats turned way up, so the 
analyze tends to read a big chunk of the tables a second time as 
well.  We typically don't have three hours a day of idle time.


-- Alan

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan



Dave,

On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote:
 
 Now there's an interesting line drawn in the sand. I presume you have 
 numbers to back this up ?
 
 This should draw some interesting posts.

Part 2: The answer

System A:
This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.

On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: that's 66 MB/s. Not the efficiency I'd hope from the onboard SATA controller that I'd like, I would have expected to get 85% of the 100MB/s raw read performance.

So that's $1,200 / 66 MB/s (without adjusting for 2003 price versus now) = 18.2 $/MB/s

Raw data:
[EMAIL PROTECTED] IVP]$ cat scan.sh 
#!/bin/bash

time psql -c "select count(*) from ivp.bigtable1" dgtestdb
[EMAIL PROTECTED] IVP]$ cat sysout1
count 
--
10000000
(1 row)


real 0m32.565s
user 0m0.002s
sys 0m0.003s

Size of the table data:
[EMAIL PROTECTED] IVP]$ du -sk dgtestdb/base
2121648 dgtestdb/base

System B:
This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead. It's running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel.

Same test as above, but with 17GB of data takes 69.7 seconds to scan (!) That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s. This system is running with a 16MB Linux readahead setting, let's try it with the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s.

So, using the tuned setting of blockdev --setra 16384 we get $6,000 / 244MB/s = 24.6 $/MB/s
If we use the default Linux setting it's 2.5x worse.
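For reference, the readahead is set per block device and --setra is specified in 512-byte sectors; it is just the following, with the device name being whatever your data volume happens to be:

blockdev --getra /dev/sda      # show the current readahead, in 512-byte sectors
blockdev --setra 16384 /dev/sda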

Raw data:
[EMAIL PROTECTED] IVP]$ cat scan.sh 
#!/bin/bash

time psql -c "select count(*) from ivp.bigtable1" dgtestdb
[EMAIL PROTECTED] IVP]$ cat sysout3
count 
--
80000000
(1 row)


real 1m9.875s
user 0m0.000s
sys 0m0.004s
[EMAIL PROTECTED] IVP]$ !du
du -sk dgtestdb/base
17021260 dgtestdb/base

Summary:

cough, cough OK – you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O.

Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldn't go any faster than 244MB/s.

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases

2005-11-18 Thread Ron
While I agree with you in principle that pg becomes CPU bound 
relatively easily compared to other DB products (at ~110-120MBps 
according to a recent thread), there's a bit of hyperbole in your post.


a. There's a big difference between the worst performing 1C x86 ISA 
CPU available and the best performing 2C one (IIRC, that's the 
2.4GHz, 1MB L2 cache AMDx2 4800+ as of this writing)


b. Two 2C CPU's vs one 1C CPU means that a pg process will almost 
never be waiting on other non pg processes.  It also means that 3-4 
pg processes, CPU bound or not, can execute in parallel.  Not an 
option with one 1C CPU.


c. Mainboards with support for multiple CPUs and lots' of RAM are 
_not_ the cheap ones.


d.  No one should ever use RAID 0 for valuable data.  Ever.  So at 
the least you need 4 HD's for a RAID 10 set (RAID 5 is not a good 
option unless write performance is unimportant.  4HD RAID 5 is 
particularly not a good option.)


e. The server usually needs to talk to things over a network 
connection.  Often performance here matters.  Mainboards with 2 1GbE 
NICs and/or PCI-X (or PCI-E) slots for 10GbE cards are not the cheap ones.


f. Trash HDs mean poor IO performance and lower reliability.  While 
TOTL 15Krpm 4Gb FC HDs are usually overkill (Not always.  It depends 
on context.),
you at least want SATA II HDs with NCQ or TCQ support.  And you want 
them to have a decent media warranty- preferably a 5 year one if you 
can get it.  Again, these are not the cheapest HD's available.


g. Throughput limitations say nothing about latency 
considerations.  OLTP-like systems _want_ HD spindles.  AMAP.  Even 
non OLTP-like systems need a fair number of spindles to optimize HD 
IO: dedicated WAL set, multiple dedicated DB sets, dedicated OS and 
swap space set, etc, etc.  At 50MBps ASTR, you need 16 HD's operating 
in parallel to saturate the bandwidth of a PCI-X channel.
That's ~8 independent pg tasks (queries using different tables, 
dedicated WAL IO, etc) running in parallel.  Regardless of application domain.


h. Decent RAID controllers and HBAs are not cheap either.  Even SW 
RAID benefits from having a big dedicated RAM buffer to talk to.


While the above may not cost you $80K, it sure isn't costing you $1K either.
Maybe ~$15-$20K, but not $1K.

Ron


At 01:07 AM 11/18/2005, Luke Lonergan wrote:

Greg,


On 11/17/05 9:17 PM, Greg Stark [EMAIL PROTECTED] wrote:

 Ok, a more productive point: it's not really the size of the database that
 controls whether you're I/O bound or CPU bound. It's the available I/O
 bandwidth versus your CPU speed.

Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after
110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.

A $1,000 system with one CPU and two SATA disks in a software RAID0 will
perform exactly the same as a $80,000 system with 8 dual core CPUs and the
world's best SCSI RAID hardware on a large database for decision support
(what the poster asked about).

Regards,

- Luke



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match





---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan



Dave,


On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote:
 
 Now there's an interesting line drawn in the sand. I presume you have 
 numbers to back this up ?
 
 This should draw some interesting posts.

OK, here we go:

The $1,000 system (System A):

- I bought 16 of these in 2003 for $1,200 each. They have Intel or Asus motherboards, Intel P4 3.0GHz CPUs with an 800MHz FSB. They have a system drive and two RAID0 SATA drives, the Western Digital 74GB Raptor (10K RPM). They have 1GB of RAM.

A test of write and read performance on the RAID0:

[EMAIL PROTECTED] raid0]$ time dd if=/dev/zero of=bigfile bs=8k count=250000
250000+0 records in
250000+0 records out

real 0m17.453s
user 0m0.249s
sys 0m10.246s

[EMAIL PROTECTED] raid0]$ time dd if=bigfile of=/dev/null bs=8k
250000+0 records in
250000+0 records out

real 0m18.930s
user 0m0.130s
sys 0m3.590s

So, the write performance is 114MB/s and read performance is 106MB/s.
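As a sanity check on that arithmetic (assuming the dd count above was 250000 8KB blocks): 250000 x 8 KB = 2,000,000 KB, and 2,000,000 KB / 17.45 s is about 114,600 KB/s for the write pass, while 2,000,000 KB / 18.93 s is about 105,700 KB/s for the read pass; that is where the 114MB/s and 106MB/s come from.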

The $6,000 system (System B):

I just bought 5 of these systems for $6,000 each. They are dual Opteron systems with 8GB of RAM and 2x 250 model CPUs, which are close to the fastest. They have the new 3Ware 9550SX SATA RAID adapters coupled to Western Digital 400GB RE2 model hard drives. They are organized as a RAID5.

A test of write and read performance on the RAID5:

[EMAIL PROTECTED] dbfast1]# time dd if=/dev/zero of=bigfile bs=8k count=2000000
2000000+0 records in
2000000+0 records out

real 0m51.441s
user 0m0.288s
sys 0m29.119s

[EMAIL PROTECTED] dbfast1]# time dd if=bigfile of=/dev/null bs=8k
2000000+0 records in
2000000+0 records out

real 0m39.605s
user 0m0.244s
sys 0m19.207s

So, the write performance is 314MB/s and read performance is 404MB/s (!) This is the fastest I've seen 8 disk drives perform.

So, the question is: which of these systems (A or B) can scan a large table faster using non-MPP postgres? How much faster would you wager?

Send your answer, and I'll post the result.

Regards,

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Dave Cramer
On 18-Nov-05, at 8:30 AM, Luke Lonergan wrote:

Richard,

On 11/18/05 5:22 AM, "Richard Huxton" dev@archonet.com wrote:

Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000
system if he's got one going :-)

Finally, a game worth playing!

Except it's backward – I'll show you 80 $1,000 systems performing 80
times faster than one $80,000 system.

Now you wouldn't happen to be selling a system that would enable this
for postgres, now would ya ?

On your proposition – I don't have any $80,000 systems for trade, do you?

- Luke

Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Vivek Khera
On Nov 18, 2005, at 10:13 AM, Luke Lonergan wrote:

Still, there is a CPU limit here – this is not I/O bound, it is CPU
limited as evidenced by the sensitivity to readahead settings.  If the
filesystem could do 1GB/s, you wouldn't go any faster than 244MB/s.

Yeah, and mysql would probably be faster on your trivial queries.  Try
concurrent large joins and updates and see which system is faster.

Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Vivek Khera

On Nov 18, 2005, at 1:07 AM, Luke Lonergan wrote:

A $1,000 system with one CPU and two SATA disks in a software RAID0  
will
perform exactly the same as a $80,000 system with 8 dual core CPUs  
and the
world's best SCSI RAID hardware on a large database for decision  
support

(what the poster asked about).


Hahahahahahahahahahahahaha! Whooo... needed to fall out of my chair  
laughing this morning.


I can tell you from direct personal experience that you're just plain  
wrong.


I've had to move my primary DB server from a dual P3 1GHz with 4-disk  
RAID10 SCSI, to Dual P3 2GHz with 14-disk RAID10 and faster drives,  
to Dual Opteron 2GHz with 8-disk RAID10 and even faster disks to keep  
up with my load on a 60+ GB database.  The Dual opteron system has  
just a little bit of extra capacity if I offload some of the  
reporting operations to a replicated copy (via slony1).  If I run all  
the queries on the one DB it can't keep up.


The most telling point about the difference in speed is that the 14- 
disk array system cannot keep up with the replication being generated  
by the dual opteron, even when it is not doing any other queries of  
its own.  The I/O system just ain't fast enough.



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan
Alan,

On 11/18/05 5:41 AM, Alan Stange [EMAIL PROTECTED] wrote:

 
 That's interesting, as I occasionally see more than 110MB/s of
 postgresql IO on our system.  I'm using a 32KB block size, which has
 been a huge win in performance for our usage patterns.   300GB database
 with a lot of turnover.  A vacuum analyze now takes about 3 hours, which
 is much shorter than before.  Postgresql 8.1, dual opteron, 8GB memory,
 Linux 2.6.11, FC drives.

300GB / 3 hours = 27MB/s.

If you are using the 2.6 linux kernel, you may be fooled into thinking you
burst more than you actually get in net I/O because the I/O stats changed in
tools like iostat and vmstat.

The only meaningful stats are (size of data) / (time to process data).  Do a
sequential scan of one of your large tables that you know the size of, then
divide by the run time and report it.
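A minimal sketch of that measurement (table, database and the time format string are placeholders; pg_relation_size() should be available on 8.1):

psql -d mydb -c "select pg_relation_size('ivp.bigtable1');"
/usr/bin/time -f "%e seconds" psql -d mydb -c "select count(*) from ivp.bigtable1" > /dev/null
# MB/s = (bytes from the first query / 1,000,000) / (seconds from the second)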

I'm compiling some new test data to make my point now.

Regards,

- Luke



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Richard Huxton wrote:

Dave Cramer wrote:


On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:

Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound  
after

110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.

A $1,000 system with one CPU and two SATA disks in a software RAID0  
will
perform exactly the same as a $80,000 system with 8 dual core CPUs  
and the
world's best SCSI RAID hardware on a large database for decision  
support

(what the poster asked about).



Now there's an interesting line drawn in the sand. I presume you 
have  numbers to back this up ?


This should draw some interesting posts.


That's interesting, as I occasionally see more than 110MB/s of 
postgresql IO on our system.  I'm using a 32KB block size, which has 
been a huge win in performance for our usage patterns.   300GB database 
with a lot of turnover.  A vacuum analyze now takes about 3 hours, which 
is much shorter than before.  Postgresql 8.1, dual opteron, 8GB memory, 
Linux 2.6.11, FC drives.


-- Alan

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan



Richard,

On 11/18/05 5:22 AM, Richard Huxton dev@archonet.com wrote:

Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000
system if he's got one going :-)

Finally, a game worth playing!

Except it's backward – I'll show you 80 $1,000 systems performing 80 times faster than one $80,000 system.

On your proposition – I don't have any $80,000 systems for trade, do you?

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Luke Lonergan wrote:

Alan,

On 11/18/05 8:13 AM, Alan Stange [EMAIL PROTECTED] wrote:

I told you in my initial post that I was observing numbers in
excess of
what you were claiming, but you seemed to think I didn't know how to
measure
an IO rate.

Prove me wrong, post your data.

I should note too that our system uses about 20% of a single cpu when
performing a table scan at 100MB/s of IO. I think you claimed the
system would be cpu bound at this low IO rate.


See above.
Here's the output from one iteration of iostat -k 60 while the box is 
doing a select count(1) on a 238GB table.


avg-cpu:  %user   %nice    %sys  %iowait   %idle
           0.99    0.00   17.97    32.40   48.64

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             345.95    130732.53         0.00    7843952          0

We're reading 130MB/s for a full minute.  About 20% of a single cpu was 
being used.   The remainder being idle.
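If anyone wants to reproduce this kind of measurement, it's nothing more than the following (database and table names are placeholders):

iostat -k 60 > iostat.log &
psql -d mydb -c "select count(1) from bigtable" > /dev/null
kill %1      # stop iostat once the scan is done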


We've done nothing fancy and achieved results you claim shouldn't be 
possible.  This is a system that was re-installed yesterday, no tuning 
was done to the file systems, kernel or storage array.


What am I doing wrong?

9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO 
(for a DOE lab).   And now I don't know what I'm doing,


Cheers,

-- Alan

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan
Alex,

On 11/18/05 8:28 AM, Alex Turner [EMAIL PROTECTED] wrote:

 Ok - so I ran the same test on my system and get a total speed of
113MB/sec.
 Why is this?  Why is the system so limited to around just
110MB/sec?  I
 tuned read ahead up a bit, and my results improve a
bit..

OK! Now we're on the same page.  Finally someone who actually tests!

Check the CPU usage while it's doing the scan.  Know what it's doing?
Memory copies.  We've profiled it extensively.

So - that's the suckage - throwing more CPU power helps a bit, but the
underlying issue is poorly optimized code in the Postgres executor and lack
of I/O asynchrony.

- Luke



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan
Bill,

On 11/18/05 7:55 AM, Bill McGonigle [EMAIL PROTECTED] wrote:
 
 There is some truth to it.  For an app I'm currently running (full-text
 search using tsearch2 on ~100MB of data) on:

Do you mean 100GB?  Sounds like you are more like a decision support
/warehousing application.
 
 Dev System:
 Asus bare-bones bookshelf case/mobo
 3GHz P4 w/ HT
 800MHz memory Bus
 Fedora Core 3 (nightly update)
 1GB RAM
 1 SATA Seagate disk (7200RPM, 8MB Cache)
 $800
 worst-case query: 7.2 seconds

About the same machine I posted results for, except I had two faster disks.

 now, the machine I'm deploying to:
 
 Dell SomthingOrOther
 (4) 2.4GHz Xeons
 533MHz memory bus
 RedHat Enterprise 3.6
 1GB RAM
 (5) 15K RPM Ultra SCSI 320 on an Adaptec RAID 5 controller
 $1
 same worst-case query: 9.6 seconds

Your problem here is the HW RAID controller - if you dump it and use the
onboard SCSI channels and Linux RAID you will see a jump from 40MB/s to
about 220MB/s in read performance and from 20MB/s to 110MB/s write
performance.  It will use less CPU too.
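To make that concrete, a hypothetical software RAID sketch (device names, RAID level and filesystem are assumptions, and it needs a 2.6-era kernel with the md raid10 personality, so it is not a recipe for that exact box):

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.xfs /dev/md0
mount -o noatime /dev/md0 /var/lib/pgsql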
 
 Now it's not apples-to-apples.  There's a kernel 2.4 vs. 2.6 difference
 and the memory bus is much faster and I'm not sure what kind of context
 switching hit you get with the Xeon MP memory controller.  On a
 previous postgresql app I did I ran nearly identically spec'ed machines
 except for the memory bus and saw about a 30% boost in performance just
 with the 800MHz bus.  I imagine the Opteron bus does even better.

Memory bandwidth is so high on both that it's not a factor.  Context
switching / memory bus contention isn't either.
  
 So the small machine is probably slower on disk but makes up for it in
 single-threaded access to CPU and memory speed. But if this app were to
 be scaled it would make much more sense to cluster several $800
 machines than it would to buy 'big-iron'.

Yes it does - by a lot too.  Also, having a multiprocessing executor gets
the most out of each machine by having multiple CPUs scan simultaneously.

- Luke



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alex Turner
Ok - so I ran the same test on my system and get a total speed of
113MB/sec.  Why is this?  Why is the system so limited to around just
110MB/sec?  I tuned read ahead up a bit, and my results improve a
bit..

Alex


On 11/18/05, Luke Lonergan [EMAIL PROTECTED] wrote:
  Dave,

  On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote:
  
   Now there's an interesting line drawn in the sand. I presume you have
   numbers to back this up ?
  
   This should draw some interesting posts.

  Part 2: The answer

  System A:

 This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.

  On a single table with 15 columns (the Bizgres IVP) at a size double memory
 (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan
 the table: that's 66 MB/s.  Not the efficiency I'd hope from the onboard
 SATA controller that I'd like, I would have expected to get 85% of the
 100MB/s raw read performance.

  So that's $1,200 / 66 MB/s (without adjusting for 2003 price versus now) =
 18.2 $/MB/s

  Raw data:
  [EMAIL PROTECTED] IVP]$ cat scan.sh
  #!/bin/bash

  time psql -c "select count(*) from ivp.bigtable1" dgtestdb
  [EMAIL PROTECTED] IVP]$ cat sysout1
count
  --
   10000000
  (1 row)


  real0m32.565s
  user0m0.002s
  sys 0m0.003s

  Size of the table data:
  [EMAIL PROTECTED] IVP]$ du -sk dgtestdb/base
  2121648 dgtestdb/base

  System B:

 This system is running an XFS filesystem, and has been tuned to use very
 large (16MB) readahead.  It's running the Centos 4.1 distro, which uses a
 Linux 2.6.9 kernel.

  Same test as above, but with 17GB of data takes 69.7 seconds to scan (!)
 That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s.
  This system is running with a 16MB Linux readahead setting, let's try it
 with the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds
 or 99.3MB/s.

  So, using the tuned setting of blockdev --setra 16384 we get $6,000 /
 244MB/s = 24.6 $/MB/s
  If we use the default Linux setting it's 2.5x worse.

  Raw data:
  [EMAIL PROTECTED] IVP]$ cat scan.sh
  #!/bin/bash

  time psql -c "select count(*) from ivp.bigtable1" dgtestdb
  [EMAIL PROTECTED] IVP]$ cat sysout3
count
  --
   80000000
  (1 row)


  real1m9.875s
  user0m0.000s
  sys 0m0.004s
  [EMAIL PROTECTED] IVP]$ !du
  du -sk dgtestdb/base
  17021260dgtestdb/base

  Summary:

  cough, cough OK – you can get more I/O bandwidth out of the current I/O
 path for sequential scan if you tune the filesystem for large readahead.
 This is a cheap alternative to overhauling the executor to use asynch I/O.

  Still, there is a CPU limit here – this is not I/O bound, it is CPU limited
 as evidenced by the sensitivity to readahead settings.   If the filesystem
 could do 1GB/s, you wouldn't go any faster than 244MB/s.

  - Luke

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Luke Lonergan wrote:

Alan,

On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote:

  

Here's the output from one iteration of iostat -k 60 while the box is
doing a select count(1) on a 238GB table.

avg-cpu:  %user   %nice    %sys  %iowait   %idle
           0.99    0.00   17.97    32.40   48.64

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             345.95    130732.53         0.00    7843952          0

We're reading 130MB/s for a full minute.  About 20% of a single cpu was
being used.   The remainder being idle.



Cool - thanks for the results.  Is that % of one CPU, or of 2?  Was the
system otherwise idle?
  
Actually, this was dual cpu and there was other activity during the full 
minute, but it was on other file devices, which I didn't include in the 
above output.   Given that, and given what I see on the box now I'd 
raise the 20% to 30% just to be more conservative.  It's all in the 
kernel either way; using a different scheduler or file system would 
change that result.  Even better would be using direct IO to not flush 
everything else from memory and avoid some memory copies from kernel to 
user space.  Note that almost none of the time is user time.  Changing 
postgresql won't change the cpu usage.


One IMHO obvious improvement would be to have vacuum and analyze only do 
direct IO.  Now they appear to be very effective memory flushing tools.  
Table scans on tables larger than say 4x memory should probably also use 
direct IO for reads.


 
  

We've done nothing fancy and achieved results you claim shouldn't be
possible.  This is a system that was re-installed yesterday, no tuning
was done to the file systems, kernel or storage array.



Are you happy with 130MB/s?  How much did you pay for that?  Is it more than
$2,000, or double my 2003 PC?
  
I don't know what the system cost.   It was part of a block of dual 
opterons from Sun that we got some time ago.   I think the 130MB/s is 
slow given the hardware, but it's acceptable.  I'm not too price 
sensitive; I care much more about reliability, uptime, etc.  

 
  

What am I doing wrong?

9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
(for a DOE lab).   And now I don't know what I'm doing,


Cool.  Would that be Sandia?

We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
complex queries.
Disk?!  4 StorageTek tape silos.  That would be .002 TB/s.  One has to 
change how you think when you have that much data. And hope you don't 
have a fire, because there's no backup.   That work was while I was at 
BNL.   I believe they are now at 4PB of tape and 150TB of disk.


-- Alan

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan
Alan,

On 11/18/05 10:30 AM, Alan Stange [EMAIL PROTECTED] wrote:

 Actually, this was dual cpu and there was other activity during the full
 minute, but it was on other file devices, which I didn't include in the
 above output.   Given that, and given what I see on the box now I'd
 raise the 20% to 30% just to be more conservative.  It's all in the
 kernel either way; using a different scheduler or file system would
 change that result.  Even better would be using direct IO to not flush
 everything else from memory and avoid some memory copies from kernel to
 user space.  Note that almost none of the time is user time.  Changing
 postgresql won't change the cpu usage.

These are all things that help on the IO wait side possibly, however, there
is a producer/consumer problem in postgres that goes something like this:

- Read some (small number of, sometimes 1) 8k pages
- Do some work on those pages, including lots of copies
- repeat

This back and forth without threading (like AIO, or a multiprocessing
executor) causes cycling and inefficiency that limits throughput.
Optimizing some of the memcopies and other garbage out, plus increasing the
internal (postgres) readahead would probably double the disk bandwidth.

But to be disk-bound (meaning that the disk subsystem is running at full
speed), requires asynchronous I/O.  We do this now with Bizgres MPP, and we
get fully saturated disk channels on every machine.  That means that even on
one machine, we run many times faster than non-MPP postgres.

 One IMHO obvious improvement would be to have vacuum and analyze only do
 direct IO.  Now they appear to be very effective memory flushing tools.
 Table scans on tables larger than say 4x memory should probably also use
 direct IO for reads.

That's been suggested many times prior - I agree, but this also needs AIO to
be maximally effective.

  I don't know what the system cost.   It was part of a block of dual
 opterons from Sun that we got some time ago.   I think the 130MB/s is
 slow given the hardware, but it's acceptable.  I'm not too price
 sensitive; I care much more about reliability, uptime, etc.

Then I know what they cost - we have them too (V20z and V40z).  You should
be getting 400MB/s+ with external RAID.

 What am I doing wrong?
 
 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
 (for a DOE lab).   And now I don't know what I'm doing,

 Cool.  Would that be Sandia?
 
 We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
 complex queries.
 Disk?!  4 StorageTek tape silos.  That would be .002 TB/s.  One has to
 change how you think when you have that much data. And hope you don't
 have a fire, because there's no backup.   That work was while I was at
 BNL.   I believe they are now at 4PB of tape and 150TB of disk.

We had 1.5 Petabytes on 2 STK Silos at NAVO from 1996-1998 where I ran R&D.
We also had a Cray T932, an SGI Origin 3000 with 256 CPUs, a Cray T3E with
1280 CPUs, 2 Cray J916s with 1 TB of shared disk, a Cray C90-16, a Sun E10K,
etc etc, along with clusters of Alpha machines and lots of SGIs.  It's nice
to work with a $40M annual budget.

Later, working with FSL we implemented a weather forecasting cluster that
ultimately became the #5 fastest computer on the TOP500 supercomputing list
from 512 Alpha cluster nodes.  That machine had a 10-way shared SAN, tape
robotics and a Myrinet interconnect and ran 64-bit Linux (in 1998).

- Luke



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Greg Stark
Alan Stange [EMAIL PROTECTED] writes:

 Luke Lonergan wrote:
  Alan,
 
  On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote:
 
 
  Here's the output from one iteration of iostat -k 60 while the box is
  doing a select count(1) on a 238GB table.
 
   avg-cpu:  %user   %nice    %sys  %iowait   %idle
              0.99    0.00   17.97    32.40   48.64

   Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
   sdd             345.95    130732.53         0.00    7843952          0
 
  We're reading 130MB/s for a full minute.  About 20% of a single cpu was
  being used.   The remainder being idle.
 
 
  Cool - thanks for the results.  Is that % of one CPU, or of 2?  Was the
  system otherwise idle?
 
 Actually, this was dual cpu 

I hate to agree with him but that looks like a dual machine with one CPU
pegged. Yes most of the time is being spent in the kernel, but you're still
basically cpu limited.

That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
drives and quite respectable for a 3-disk stripe set, even reasonable for a
4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
only getting 130MB/s then it does seem likely the cpu is actually holding you
back here.

Still it doesn't show Postgres being nearly so CPU wasteful as the original
poster claimed.

 It's all in the kernel either way; using a different scheduler or file
 system would change that result. Even better would be using direct IO to not
 flush everything else from memory and avoid some memory copies from kernel
 to user space. Note that almost none of the time is user time. Changing
 postgresql won't change the cpu usage.

Well changing to direct i/o would still be changing Postgres so that's
unclear. And there are plenty of more mundane ways that Postgres is
responsible for how efficiently or not the kernel is used. Just using fewer
syscalls to do the same amount of reading would reduce cpu consumption.


 One IMHO obvious improvement would be to have vacuum and analyze only do 
 direct
 IO.  Now they appear to be very effective memory flushing tools.  Table scans
 on tables larger than say 4x memory should probably also use direct IO for
 reads.

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan
Greg,

On 11/18/05 11:07 AM, Greg Stark [EMAIL PROTECTED] wrote:

 That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
 drives and quite respectable for a 3-disk stripe set, even reasonable for a
 4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
 only getting 130MB/s then it does seem likely the cpu is actually holding you
 back here.

With an FC array, it's undoubtedly more like 14 drives, in which case
130MB/s is laughable.  On the other hand, I wouldn't be surprised if it were
a single 200MB/s Fibre Channel attachment.

It does make you wonder why people keep recommending 15K RPM drives, like it
would help *not*.

 Still it doesn't show Postgres being nearly so CPU wasteful as the original
 poster claimed.

It's partly about waste, and partly about lack of a concurrent I/O
mechanism.  We've profiled it for the waste, we've implemented concurrent
I/O to prove the other point.
 
 It's all in the kernel either way; using a different scheduler or file
 system would change that result. Even better would be using direct IO to not
 flush everything else from memory and avoid some memory copies from kernel
 to user space. Note that almost none of the time is user time. Changing
 postgresql won't change the cpu usage.
 
 Well changing to direct i/o would still be changing Postgres so that's
 unclear. And there are plenty of more mundane ways that Postgres is
 responsible for how efficiently or not the kernel is used. Just using fewer
 syscalls to do the same amount of reading would reduce cpu consumption.

Bingo.
 
- Luke



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Luke Lonergan wrote:

I don't know what the system cost.   It was part of a block of dual
opterons from Sun that we got some time ago.   I think the 130MB/s is
slow given the hardware, but it's acceptable.  I'm not too price
sensitive; I care much more about reliability, uptime, etc.

Then I know what they cost - we have them too (V20z and V40z).  You should
be getting 400MB/s+ with external RAID.

Yes, but we don't.   This is where I would normally begin a rant on how 
craptacular Linux can be at times.  But, for the sake of this 
discussion, postgresql isn't reading the data any more slowly than does 
any other program.


And we don't have the time to experiment with the box.

I know it should be better, but it's good enough for our purposes at 
this time.


-- Alan


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Greg Stark wrote:

Alan Stange [EMAIL PROTECTED] writes:

  

Luke Lonergan wrote:


Alan,

On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote:


  

Here's the output from one iteration of iostat -k 60 while the box is
doing a select count(1) on a 238GB table.

avg-cpu:  %user   %nice    %sys  %iowait   %idle
           0.99    0.00   17.97    32.40   48.64

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd             345.95    130732.53         0.00    7843952          0

We're reading 130MB/s for a full minute.  About 20% of a single cpu was
being used.   The remainder being idle.



Cool - thanks for the results.  Is that % of one CPU, or of 2?  Was the
system otherwise idle?

  
Actually, this was dual cpu 



I hate to agree with him but that looks like a dual machine with one CPU
pegged. Yes most of the time is being spent in the kernel, but you're still
basically cpu limited.

That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
drives and quite respectable for a 3-disk stripe set, even reasonable for a
4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
only getting 130MB/s then it does seem likely the cpu is actually holding you
back here.

Still it doesn't show Postgres being nearly so CPU wasteful as the original
poster claimed.
  
Yes and no.   The one cpu is clearly idle.   The second cpu is 40% busy 
and 60% idle (aka iowait in the above numbers).
Of that 40%, other things were happening as well during the 1 minute 
snapshot.   During some iostat outputs that I didn't post the cpu time 
was ~ 20%.


So, you can take your pick.   The single cpu usage is somewhere between 
20% and 40%.  As I can't remove other users of the system, it's the best 
measurement that I can make right now.


Either way, it's not close to being cpu bound.  This is with Opteron 
248, 2.2Ghz cpus.


Note that the storage system has been a bit disappointing:  it's an IBM 
Fast T600 with a 200MB/s fiber attachment.   It could be better, but 
it's not been the bottleneck in our work, so we haven't put any energy 
into it.  


It's all in the kernel either way; using a different scheduler or file
system would change that result. Even better would be using direct IO to not
flush everything else from memory and avoid some memory copies from kernel
to user space. Note that almost none of the time is user time. Changing
postgresql won't change the cpu usage.


Well changing to direct i/o would still be changing Postgres so that's
unclear. And there are plenty of more mundane ways that Postgres is
responsible for how efficiently or not the kernel is used. Just using fewer
syscalls to do the same amount of reading would reduce cpu consumption.
Absolutely.  This is why we're using a 32KB block size and also switched 
to using O_SYNC for the WAL syncing method. That's many MB/s that 
don't need to be cached in the kernel (thus evicting other data), and we 
avoid all the fsync/fdatasync syscalls.
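In postgresql.conf terms that WAL change is just the following (the data directory path is a placeholder; valid wal_sync_method values on 8.1 include fsync, fdatasync and open_sync):

echo "wal_sync_method = open_sync" >> $PGDATA/postgresql.conf
pg_ctl -D $PGDATA reload      # or restart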


The purpose of direct IO isn't to make the vacuum or analyze faster, but 
to lessen their impact on queries with someone waiting for the 
results.   That's our biggest hit:  running a sequential scan on 240GB 
of data and flushing everything else out of memory.


Now that I'm think about this a bit, a big chunk of time is probably 
being lost in TLB misses and other virtual memory events that would be 
avoided if a larger page size was being used.


-- Alan 



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan



Vivek,

On 11/18/05 8:07 AM, Vivek Khera [EMAIL PROTECTED] wrote:


On Nov 18, 2005, at 10:13 AM, Luke Lonergan wrote:

Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings.   If the filesystem could do 1GB/s, you wouldn't go any faster than 244MB/s.

Yeah, and mysql would probably be faster on your trivial queries.  Try concurrent large joins and updates and see which system is faster.

That's what we do to make a living. And it's Oracle that's a lot faster because they implemented a much tighter, optimized I/O path to disk than Postgres.

Since you asked, we bought the 5 systems as a cluster – and with Bizgres MPP we get close to 400MB/s per machine on complex queries.

- Luke 





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Dave Cramer
Luke,

Interesting numbers. I'm a little concerned about the use of blockdev
--setra 16384. If I understand this correctly it assumes that the table
is contiguous on the disk does it not ?

Dave

On 18-Nov-05, at 10:13 AM, Luke Lonergan wrote:

(mass snippage - Luke's "Part 2: The answer" benchmark results, quoted
in full earlier in the thread)

Re: [PERFORM] Hardware/OS recommendations for large databases

2005-11-18 Thread Ron
Breaking the ~120MBps pg IO ceiling by any means 
is an important result.  Particularly when you 
get a ~2x improvement.  I'm curious how far we 
can get using simple approaches like this.


At 10:13 AM 11/18/2005, Luke Lonergan wrote:

Dave,

On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote:

 Now there's an interesting line drawn in the sand. I presume you have
 numbers to back this up ?

 This should draw some interesting posts.

Part 2: The answer

System A:
This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.

On a single table with 15 columns (the Bizgres 
IVP) at a size double memory (2.12GB), Postgres 
8.0.3 with Bizgres enhancements takes 32 seconds 
to scan the table: that’s 66 MB/s.  Not the 
efficiency I’d hope from the onboard SATA 
controller that I’d like, I would have expected 
to get 85% of the 100MB/s raw read performance.
Have you tried the large read ahead trick with 
this system?  It would be interesting to see how 
much it would help.  It might even be worth it to 
do the experiment with all of [default, 2x default, 
4x default, 8x default, etc] read ahead until 
either a) you run out of resources to support the 
desired read ahead, or b) performance levels 
off.  I can imagine the results being very enlightening.
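Something as crude as the following loop would do it (device name and query are placeholders; for a fair comparison the table has to be bigger than RAM, or the cache cleared between runs):

for ra in 256 512 1024 2048 4096 8192 16384; do
    blockdev --setra $ra /dev/sdd
    /usr/bin/time -f "$ra sectors: %e s" \
        psql -c "select count(*) from ivp.bigtable1" dgtestdb > /dev/null
done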




System B:
This system is running an XFS filesystem, and 
has been tuned to use very large (16MB) 
readahead.  It’s running the Centos 4.1 distro, 
which uses a Linux 2.6.9 kernel.


Same test as above, but with 17GB of data takes 
69.7 seconds to scan (!)  That’s 244.2MB/s, 
which is obviously double my earlier point of 
110-120MB/s.  This system is running with a 16MB 
Linux readahead setting, let’s try it with the 
default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s.

The above experiment would seem useful here as well.



Summary:

cough, cough OK – you can get more I/O 
bandwidth out of the current I/O path for 
sequential scan if you tune the filesystem for 
large readahead.  This is a cheap alternative to 
overhauling the executor to use asynch I/O.


Still, there is a CPU limit here – this is not 
I/O bound, it is CPU limited as evidenced by the 
sensitivity to readahead settings.   If the 
filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.


- Luke


I respect your honesty in reporting results that 
were different than your expectations or 
previously taken stance.  Alan Stange's comment 
re: the use of direct IO along with your comments 
re: async IO and mem copies plus the results of 
these experiments could very well point us 
directly at how to most easily solve pg's CPU boundness during IO.


[HACKERS] are you watching this?

Ron



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Mark Kirkwood

Luke Lonergan wrote:

(mass snippage) 
time psql -c "select count(*) from ivp.bigtable1" dgtestdb

[EMAIL PROTECTED] IVP]$ cat sysout3
  count   
--

 80000000
(1 row)


real1m9.875s
user0m0.000s
sys 0m0.004s
[EMAIL PROTECTED] IVP]$ !du
du -sk dgtestdb/base
17021260dgtestdb/base


Summary:

cough, cough OK – you can get more I/O bandwidth out of the current 
I/O path for sequential scan if you tune the filesystem for large 
readahead.  This is a cheap alternative to overhauling the executor to 
use asynch I/O.


Still, there is a CPU limit here – this is not I/O bound, it is CPU 
limited as evidenced by the sensitivity to readahead settings.   If the 
filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.





Luke,

Interesting - but possibly only representative for a workload consisting 
entirely of one executor doing SELECT ... FROM my_single_table.


If you alter this to involve more complex joins (e.g. 4-way star) and 
(maybe add a small number of concurrent executors too) - is it still the 
case?
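For the concurrent-executors part, even something as simple as this would be interesting (database and query are placeholders; a real 4-way star obviously needs the fact and dimension tables to join against):

for i in 1 2 3 4; do
    psql -d dgtestdb -c "select count(*) from ivp.bigtable1" > /dev/null &
done
time wait      # wall-clock time for all four scans to finish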


Cheers

Mark

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan
Mark,

On 11/18/05 3:46 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:
 
 If you alter this to involve more complex joins (e.g. 4-way star) and
 (maybe add a small number of concurrent executors too) - is it still the
 case?

4-way star, same result, that's part of my point.  With Bizgres MPP, the
4-way star uses 4 concurrent scanners, though not all are active all the
time.  And that's per segment instance - we normally use one segment
instance per CPU, so our concurrency is NCPUs plus some.

The trick is the small number of concurrent executors part.  The only way
to get this with normal postgres is to have concurrent users, and normally
they are doing different things, scanning different parts of the disk.
These are competing things, and for concurrency enhancement something like
sync scan would be an effective optimization.

But in reporting, business analytics and warehousing in general, there are
reports that take hours to run.  If you can knock that down by factors of 10
using parallelism, it's a big win.  That's the reason that Teradata did $1.5
Billion in business last year.

More importantly - that's the kind of work that everyone using internet data
for analytics wants right now.

- Luke



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Luke Lonergan



Mark,

On 11/18/05 3:46 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:

If you alter this to involve more complex joins (e.g. 4-way star) and
(maybe add a small number of concurrent executors too) - is it still the
case?

I may not have listened to you - are you asking about whether the readahead works for these cases?

I'll be running some massive TPC-H benchmarks on these machines soon – we'll see then.

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Mark Kirkwood

Luke Lonergan wrote:

Mark,

On 11/18/05 3:46 PM, Mark Kirkwood [EMAIL PROTECTED] wrote:

If you alter this to involve more complex joins (e.g. 4-way star) and
(maybe add a small number of concurrent executors too) - is it still the
case?


I may not have listened to you - are you asking about whether the 
readahead works for these cases?


I’ll be running some massive TPC-H benchmarks on these machines soon – 
we’ll see then.



That too, meaning the business of 1 executor random reading a given 
relation file whilst another is sequentially scanning (some other) part 
of it


Cheers

Mark

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-17 Thread William Yu

Welty, Richard wrote:

David Boreham wrote:


I guess I've never bought into the vendor story that there are
two reliability grades. Why would they bother making two
different kinds of bearing, motor etc ? Seems like it's more
likely an excuse to justify higher prices.



then how to account for the fact that bleeding edge SCSI drives
turn at twice the rpms of bleeding edge consumer drives?


The motors spin twice as fast?

I'm pretty sure the original comment was based on drives w/ similar 
specs. E.g. 7200RPM enterprise drives versus 7200RPM consumer drives.


Next time one of my 7200RPM SCSIs fail, I'll take it apart and compare 
the insides to an older 7200RPM IDE from roughly the same era.


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Hardware/OS recommendations for large databases (5TB)

2005-11-17 Thread Vivek Khera


On Nov 16, 2005, at 4:50 PM, Claus Guttesen wrote:


I'm (also) FreeBSD-biased but I'm not sure whether the 5 TB fs will
work so well if tools like fsck are needed. Gvinum could be one option
but I don't have any experience in that area.


Then look into an external filer and mount via NFS.  Then it is not  
FreeBSD's responsibility to manage the volume.



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-17 Thread Alex Turner
On 11/16/05, William Yu [EMAIL PROTECTED] wrote:
 Alex Turner wrote:
  Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
  for most applications this system will be IO bound, and you will see a
  nice lot of drive failures in the first year of operation with
  consumer grade drives.
 
  Spend your money on better Disks, and don't bother with Dual Core IMHO
  unless you can prove the need for it.

 I would say the opposite -- you always want Dual Core nowadays. DC
 Opterons simply give you better bang for the buck than single core
 Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will
 be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC
 versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because
 those mega-CPU motherboards are astronomically expensive.


Opteron 242 - $178.00
Opteron 242 - $178.00
Tyan S2882 - $377.50
Total: $733.50

Opteron 265 - $719.00
Tyan K8E - $169.00
Total: $888.00

Tyan K8E - doesn't have any PCI-X, so it's also not apples to apples. 
In fact I couldn't find a single CPU slot board that did, so you pretty
much have to buy a dual CPU board to get PCI-X.

1xDC is _not_ cheaper.

Our DB application does about 5 queries/second peak, plus a heavy
insert job once per day.  We only _need_ two CPUs, which is true for a
great many DB applications.  Unless you like EJB of course, which will
thrash the crap out of your system.

Consider the two most used regions for DBs:

a) OLTP - probably IO bound, large number of queries/sec updating info
on _disks_, not requiring much CPU activity except to retrieve item
information which is well indexed and normalized.

b) Data warehouse - needs CPU, but probably still IO bound; a large
data set that won't fit in RAM will require large amounts of disk
reads.  CPU can easily keep up with disk reads.

I have yet to come across a DB system that wasn't IO bound.

 DC also gives you a better upgrade path. Let's say you do testing and
 figure 2x246 is the right setup to handle the load. Well instead of
 getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a
 DC/270. Now you have a server that can be upgraded to +80% more CPU by
 popping in another DC/270 versus throwing out the entire thing to get a
 4x1P setup.

No argument there.  But it's pointless if you are IO bound.


 The only questions would be:
 (1) Do you need a SMP server at all? I'd claim yes -- you always need 2+
 cores whether it's DC or 2P to avoid IO interrupts blocking other
 processes from running.

At least 2CPUs is always good for precisely those reasons.  More than
2CPUs gives diminishing returns.


 (2) Does a DC system perform better than it's Nx1P cousin? My experience
 is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244
 and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP,
 etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups.

Maybe true, but the 265 does have a 25% faster FSB than the 244, which
might perhaps play a role.


 (3) Do you need an insane amount of memory? Well here's the case where
 the more expensive motherboard will serve you better since each CPU slot
 has its own bank of memory. Spend more money on memory, get cheaper
 single-core CPUs.

Remember - large DB is going to be IO bound.  Memory will get thrashed
for file block buffers, even if you have large amounts, it's all gonna
be cycled in and out again.


 Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon
 DCs, while cheaper than their corresponding single-core SMPs, don't have
 the same performance profile of Opteron DCs. Basically, you're paying a
 bit extra so your server can generate a ton more heat.

Dell/Xeon/Postgres is just a bad combination any day of the week ;)

Alex.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-17 Thread Alex Turner
Just pick up a SCSI drive and a consumer ATA drive.

Feel their weight.

You don't have to look inside to tell the difference.

Alex

On 11/16/05, David Boreham [EMAIL PROTECTED] wrote:


  I suggest you read this on the difference between enterprise/SCSI and
 desktop/IDE drives:

 http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf


  This is exactly the kind of vendor propaganda I was talking about
  and it proves my point quite well : that there's nothing specific relating
  to reliability that is different between SCSI and SATA drives cited in that
 paper.
  It does have a bunch of FUD such as 'oh yeah we do a lot more
  drive characterization during manufacturing'.





---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-17 Thread Alex Turner
On 11/16/05, Joshua D. Drake [EMAIL PROTECTED] wrote:
 
  The only questions would be:
  (1) Do you need a SMP server at all? I'd claim yes -- you always need
  2+ cores whether it's DC or 2P to avoid IO interrupts blocking other
  processes from running.

 I would back this up. Even for smaller installations (single raid 1, 1
 gig of ram). Why? Well because many applications are going to be CPU
 bound. For example
 we have a PHP application that is a CMS. On a single CPU machine, RAID 1
 it takes about 300ms to deliver a single page, point to point. We are
 not IO bound.
 So what happens is that under reasonable load we are actually waiting
 for the CPU to process the code.


This is the performance profile for PHP, not for Postgresql.  This is
the postgresql mailing list.

 A simple upgrade to an SMP machine literally doubles our performance
 because we are still not IO bound. I strongly suggest that everyone use
 at least a single dual core because of this experience.


Performance of PHP, not postgresql.

 
  (3) Do you need an insane amount of memory? Well here's the case where
  the more expensive motherboard will serve you better since each CPU
  slot has its own bank of memory. Spend more money on memory, get
  cheaper single-core CPUs.
 Agreed. A lot of times the slowest dual-core is 5x what you actually
 need. So get the slowest, and bulk up on memory. If nothing else memory
 is cheap today and it might not be tomorrow.
[snip]

Running postgresql on a single drive RAID 1 with PHP on the same
machine is not a typical installation.

300ms for PHP in CPU time?  wow dude - that's quite a page.  PHP
typically can handle up to 30-50 pages per second for a typical OLTP
application on a single CPU box.  Something is really wrong with that
system if it takes 300ms per page.

Alex.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-17 Thread Merlin Moncure
 Remember - large DB is going to be IO bound.  Memory will get thrashed
 for file block buffers, even if you have large amounts, it's all gonna
 be cycled in and out again.

'fraid I have to disagree here. I manage ERP systems for manufacturing
companies of various sizes.  My systems are all completely CPU
bound...even though the larger databases are well into two-digit
gigabyte sizes, the data turnover, while huge, is relatively constrained and well
served by the O/S cache. OTOH, query latency is a *huge* factor and we
do everything possible to lower it.  Even if the cpu is not 100% loaded,
faster processors make the application 'feel' faster to the client.
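
(One way to chase that kind of latency is plain EXPLAIN ANALYZE.  The
tables and columns below are made up, purely to illustrate -- the point is
to get the server-side runtime and see which plan node eats it:

  -- hypothetical ERP-style query, names are invented for illustration
  EXPLAIN ANALYZE
  SELECT o.order_id, sum(l.qty * l.price) AS order_total
  FROM orders o
  JOIN order_lines l USING (order_id)
  WHERE o.customer_id = 42
  GROUP BY o.order_id;

psql's \timing gives a rough client-side number as well.)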

Merlin



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-17 Thread William Yu

Alex Turner wrote:

Opteron 242 - $178.00
Opteron 242 - $178.00
Tyan S2882 - $377.50
Total: $733.50

Opteron 265 - $719.00
Tyan K8E - $169.00
Total: $888.00


You're comparing the wrong CPUs. The 265 is the 2x of the 244 so you'll 
have to bump up the price more although not enough to make a difference.


Looks like the price of the 2X MBs has dropped since I last looked at 
it. Just a few months back, Tyan duals were $450-$500, which is what I 
was basing my "priced less" statement on.


Tyan K8E - doesn't have any PCI-X, so it's also not apples to apples. 
In fact I couldn't find a single CPU slot board that did, so you pretty
much have to buy a dual CPU board to get PCI-X.


You can get single CPU boards w/ PCIe and use PCIe controller cards. 
Probably expensive right now because they're so bleeding-edge new but 
definitely on the downswing.



a) OLTP - probably IO bound, large number of queries/sec updating info
on _disks_, not requiring much CPU activity except to retrieve item
information which is well indexed and normalized.


Not in my experience. I find on our OLTP servers, we run 98% in RAM and 
hence are 100% CPU-bound. Our DB is about 50GB in size now, servers run 
w/ 8GB of RAM. We were *very* CPU limited running 2x244. During busy 
hours of the day, our avg user transaction times were jumping from 
0.8sec to 1.3+sec. Did the 2x265 and now we're always in the 0.7sec to 
0.9sec range.
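
(A cheap way to keep an eye on those transaction times -- a sketch, and it
needs superuser since log_min_duration_statement is a superuser setting on
8.x -- is to have the server log anything slow:

  -- log every statement that runs longer than 500 ms; put it in
  -- postgresql.conf instead to make it stick across sessions
  SET log_min_duration_statement = 500;

Then the slow spots show up in the server log with their runtimes.)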



DC also gives you a better upgrade path. Let's say you do testing and
figure 2x246 is the right setup to handle the load. Well instead of
getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a
DC/270. Now you have a server that can be upgraded to +80% more CPU by
popping in another DC/270 versus throwing out the entire thing to get a
4x1P setup.



No argument there.  But it's pointless if you are IO bound.


Why would you just accept "we're IO bound, nothing we can do"? I'd do 
everything in my power to make my app go from IO bound to CPU bound -- 
whether by optimizing my code or buying more hardware. I can tell you if 
our OLTP servers were IO bound, it would run like crap. Instead of < 1 
sec, we'd be looking at 5-10 seconds per user transaction and our 
users would be screaming bloody murder.


In theory, you can always convert your IO bound DB to CPU bound by 
stuffing more and more RAM into your server. (Or partitioning the DB 
across multiple servers.) Whether it's cost effective depends on the DB 
and how much your users are paying you -- and that's a case-by-case 
analysis. Not a global statement of "IO bound, pointless."
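
As a back-of-the-envelope test of whether "just add RAM" is even on the
table, comparing total database size against installed memory is a start.
A sketch using the size functions shipped with 8.1 (contrib/dbsize offers
similar functions on 8.0):

  -- total on-disk size per database, to compare against installed RAM
  SELECT datname, pg_size_pretty(pg_database_size(oid)) AS size
  FROM pg_database
  ORDER BY pg_database_size(oid) DESC;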



(2) Does a DC system perform better than it's Nx1P cousin? My experience
is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244
and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP,
etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups.



Maybe true, but the 265 does have a 25% faster FSB than the 244, which
might perhaps play a role.


Nope. There's no such thing as FSB on Opterons. On-die memory controller 
runs @ CPU speed and hence connects at whatever the memory runs at 
(rounded off to some multiplier math). There's the HT speed that 
controls the max IO bandwidth but that's based on the motherboard, not 
the CPU. Plus the 265 and 244 both run at 1.8GHz so the memory 
multiplier & HT IO are both the same.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-17 Thread Joshua Marsh
On 11/17/05, William Yu [EMAIL PROTECTED] wrote:
 No argument there.  But it's pointless if you are IO bound.

 Why would you just accept "we're IO bound, nothing we can do"? I'd do
 everything in my power to make my app go from IO bound to CPU bound --
 whether by optimizing my code or buying more hardware. I can tell you if
 our OLTP servers were IO bound, it would run like crap. Instead of < 1
 sec, we'd be looking at 5-10 seconds per user transaction and our
 users would be screaming bloody murder.

 In theory, you can always convert your IO bound DB to CPU bound by
 stuffing more and more RAM into your server. (Or partitioning the DB
 across multiple servers.) Whether it's cost effective depends on the DB
 and how much your users are paying you -- and that's a case-by-case
 analysis. Not a global statement of "IO bound, pointless."

We all want our systems to be CPU bound, but it's not always
possible. Remember, he is managing a 5 TB database. That's
quite a bit different than a 100 GB or even 500 GB database. 


