Re: [PERFORM] Finding bottleneck
I think I have a solution for you. You have posted that you presently have these RAID volumes and behaviors:

sda: data (10 spindles, raid10)
sdb: xlog & clog (2 spindles, raid1)
sdc: os and other stuff

Usually iostat (2 second interval) says:

avg-cpu:  %user  %nice  %sys  %iowait  %idle
          32.38   0.00  12.88    11.62  43.12

Device:    tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda        202.00   1720.00    0.00       3440     0
sdb        152.50   4.00       2724.00    8        5448
sdc        0.00     0.00       0.00       0        0

And during checkpoint:

avg-cpu:  %user  %nice  %sys  %iowait  %idle
          31.25   0.00  14.75    54.00   0.00

Device:    tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda        3225.50  1562.00    35144.00   3124     70288
sdb        104.50   10.00      2348.00    20       4696
sdc        0.00     0.00       0.00       0        0

During checkpoints sda is becoming saturated, essentially halting all other DB activity involving sda. A lesser version of the problem is probably occurring every time multiple entities on sda are being accessed simultaneously, particularly simultaneous writes.

My Proposed Solution:

Put comment and its index on their own dedicated RAID volume.
Put comment_archive and its index on their own dedicated RAID volume.
Put the rest of the tables currently part of "data" on their own dedicated RAID volume.
Put the rest of the indexes to the tables currently part of "data" on their own dedicated RAID volume.
Put xlog on its own dedicated RAID volume.

The general idea here is to put any tables or indexes that tend to require simultaneous access, particularly write access, on different spindles. Like all things, there's a point of diminishing returns that is dependent on the HW used and the DB load. If you must wring every last bit of IO out of the HD subsystem, a more exact set of spindle assignments can be made by analyzing your queries and then 1) making sure writes that tend to be simultaneous go to different spindles, then (if you still need better IO) 2) making sure reads that tend to be simultaneous go to different spindles. At some point, your controller will become the bottleneck. At some point beyond that, the IO channels on the mainboard will become the bottleneck. My suggestion should get you to within 80-90% of optimal if I've understood the implications of your posts correctly.

The other suggestion I'd make is to bump your RAM from 16GB to 32GB as soon as you can afford it and then tune your PostgreSQL parameters to make best use of it. The more RAM resident your DB, the better.

Hope this helps,
Ron Peacetree

===Original Message Follows===
From: Kari Lavikka
To: Merlin Moncure
Subject: Re: Finding bottleneck
Date: Mon, 8 Aug 2005 19:19:09 +0300 (EETDST)

Actually I modified postgresql.conf a bit and there isn't commit delay any more. That didn't make a noticeable difference though.

Workload is generated by a website with about 1000 dynamic page views a second. Finland's biggest site among youths btw. Anyway, there are about 70 tables and here are some of the most important:

     relname     |  reltuples
-----------------+-------------
 comment         | 1.00723e+08
 comment_archive | 9.12764e+07
 channel_comment | 6.93912e+06
 image           | 5.80314e+06
 admin_event     | 5.1936e+06
 user_channel    | 3.36877e+06
 users           | 325929
 channel         | 252267

Queries to "comment" table are mostly IO-bound but are performing quite well.
Here's an example:

(SELECT u.nick, c.comment, c.private, c.admin, c.visible, c.parsable,
        c.uid_sender, to_char(c.stamp, 'DD.MM.YY HH24:MI') AS stamp,
        c.comment_id
   FROM comment c
   INNER JOIN users u ON u.uid = c.uid_sender
  WHERE u.status = 'a'
    AND c.image_id = 15500900
    AND c.uid_target = 780345
  ORDER BY uid_target DESC, image_id DESC, c.comment_id DESC)
LIMIT 36

And explain analyze:

 Limit  (cost=0.00..6.81 rows=1 width=103) (actual time=0.263..17.522 rows=12 loops=1)
   ->  Nested Loop  (cost=0.00..6.81 rows=1 width=103) (actual time=0.261..17.509 rows=12 loops=1)
         ->  Index Scan Backward using comment_uid_target_image_id_comment_id_20050527 on "comment" c  (cost=0.00..3.39 rows=1 width=92) (actual time=0.129..16.213 rows=12 loops=1)
               Index Cond: ((uid_target = 780345) AND (image_id = 15500900))
         ->  Index Scan using users_pkey on users u  (cost=0.00..3.40 rows=1 width=15) (actual time=0.084..0.085 rows=1 loops=12)
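For what it's worth, the volume split proposed above maps directly onto the tablespace feature in PostgreSQL 8.0. A minimal sketch, assuming hypothetical mount points for the new RAID volumes (the /vol_* paths and tablespace names below are made up for illustration and have not been tested against Kari's schema):

-- one tablespace per dedicated RAID volume
CREATE TABLESPACE comment_data LOCATION '/vol_comment/pgdata';
CREATE TABLESPACE comment_idx  LOCATION '/vol_comment_idx/pgdata';
CREATE TABLESPACE archive_data LOCATION '/vol_archive/pgdata';
CREATE TABLESPACE archive_idx  LOCATION '/vol_archive_idx/pgdata';

-- move the hot table and its index; each move rewrites the relation,
-- so schedule it for a quiet window
ALTER TABLE comment SET TABLESPACE comment_data;
ALTER INDEX comment_uid_target_image_id_comment_id_20050527 SET TABLESPACE comment_idx;
ALTER TABLE comment_archive SET TABLESPACE archive_data;
-- (repeat for comment_archive's index and for the remaining "data" tables/indexes)

-- pg_xlog is relocated at the filesystem level (move the directory and symlink it),
-- not via tablespaces.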
Re: [PERFORM] Need for speed
At 05:15 AM 8/17/2005, Ulrich Wisser wrote:

>Hello,
>
>thanks for all your suggestions. I can see that the Linux system is 90% waiting for disc io.

A clear indication that you need to improve your HD IO subsystem.

>At that time all my queries are *very* slow.

To be more precise, your server performance at that point is essentially equal to your HD IO subsystem performance.

>My scsi raid controller and disc are already the fastest available.

Oh, REALLY? This is the description of the system you gave us:

"We have a box with Linux Fedora Core 3, Postgres 7.4.2 Intel(R) Pentium(R) 4 CPU 2.40GHz 2 scsi 76GB disks (15.000RPM, 2ms)"

That is far, Far, FAR from "the fastest available" in terms of SW, OS, CPU host, _or_ HD subsystem.

The "fastest available" means
1= you should be running 8.0.3
2= you should be running the latest stable 2.6 based kernel
3= you should be running an Opteron based server
4= Fibre Channel HDs are higher performance than SCSI ones.
5= (and this is the big one) YOU NEED MORE SPINDLES AND A HIGHER END RAID CONTROLLER.

The absolute "top of the line" for RAID controllers is something based on Fibre Channel from Xyratex (who make the RAID engines for EMC and NetApps), Engino (the enterprise division of LSI Logic who sell mostly to IBM. Apple has a server based on an Engino card), or dot-hill (who bought Chaparral among others). I suspect you can't afford them even if they would do business with you. The ante for a FC-based RAID subsystem in this class is in the ~$32K to ~$128K range, even if you buy direct from the actual RAID HW manufacturer rather than an OEM like EMC, IBM, or NetApp.

In the retail commodity market, the current best RAID controllers are probably the 16 and 24 port versions of the Areca cards ( www.areca.us ). They come darn close to saturating the Real World Peak Bandwidth of a 64b 133MHz PCI-X bus.

>I did put pg_xlog on another file system on other discs. The query plan uses indexes and "vacuum analyze" is run once a day.
>
>To avoid aggregating too many rows, I already made some aggregation tables which will be updated after the import from the Apache logfiles. That did help, but only to a certain level. I believe the biggest problem is disc io.
>
>Reports for very recent data are quite fast, these are used very often and therefore already in the cache. But reports can contain (and regularly do) very old data. In that case the whole system slows down. To me this sounds like the recent data is flushed out of the cache and now all data for all queries has to be fetched from disc.
>
>My machine has 2GB memory,

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Need for speed
At 05:15 AM 8/17/2005, Ulrich Wisser wrote:

>Hello,
>
>thanks for all your suggestions. I can see that the Linux system is 90% waiting for disc io.

A clear indication that you need to improve your HD IO subsystem if possible.

>At that time all my queries are *very* slow.

To be more precise, your server performance at that point is essentially equal to your HD IO subsystem performance.

>My scsi raid controller and disc are already the fastest available.

Oh, REALLY? This is the description of the system you gave us:

"We have a box with Linux Fedora Core 3, Postgres 7.4.2 Intel(R) Pentium(R) 4 CPU 2.40GHz 2 scsi 76GB disks (15.000RPM, 2ms)"

That is far, Far, FAR from "the fastest available" in terms of SW, OS, CPU host, _or_ HD subsystem.

The "fastest available" means
1= you should be running PostgreSQL 8.0.3
2= you should be running the latest stable 2.6 based kernel
3= you should be running an Opteron based server
4= Fibre Channel HDs are slightly higher performance than SCSI ones.
5= (and this is the big one) YOU NEED MORE SPINDLES AND A HIGHER END RAID CONTROLLER.

Your description of your workload was:

"one of our services is click counting for on line advertising. We do this by importing Apache log files every five minutes. This results in a lot of insert and delete statements. At the same time our customers shall be able to do on line reporting."

There are two issues here:

1= your primary usage is OLTP-like, but you are also expecting to do reports against the same schema that is supporting your OLTP-like usage. Bad Idea. Schemas that are optimized for reporting and other data mining-like operations are pessimal for OLTP-like applications and vice versa. You need two schemas: one optimized for lots of inserts and deletes (OLTP-like), and one optimized for reporting (data-mining like).

2= 2 spindles, even 15K rpm spindles, is minuscule. Real enterprise class RAID subsystems have at least 10-20x that many spindles, usually split into 6-12 sets dedicated to different groups of tables in the DB. Putting xlog on its own dedicated spindles is just the first step.

The absolute "top of the line" for RAID controllers is something based on Fibre Channel from Xyratex (who make the RAID engines for EMC and NetApps), Engino (the enterprise division of LSI Logic who sell mostly to IBM. Apple has a server based on an Engino card), or dot-hill (who bought Chaparral among others). I suspect you can't afford them even if they would do business with you. The ante for a FC-based RAID subsystem in this class is in the ~$32K to ~$128K range, even if you buy direct from the actual RAID HW manufacturer rather than an OEM like EMC, IBM, or NetApp who will 2x or 4x the price. OTOH, these subsystems will provide OLTP or OLTP-like DB apps with performance that is head-and-shoulders better than anything else to be found. Numbers like 50K-200K IOPS. You get what you pay for.

In the retail commodity market where you are more realistically going to be buying, the current best RAID controllers are probably the Areca cards ( www.areca.us ). They come darn close to saturating the Real World Peak Bandwidth of a 64b 133MHz PCI-X bus and have better IOPS numbers than their commodity brethren. However, _none_ of the commodity RAID cards have IOPS numbers anywhere near as high as those mentioned above.

>To avoid aggregating too many rows, I already made some aggregation tables which will be updated after the import from the Apache logfiles. That did help, but only to a certain level. I believe the biggest problem is disc io.
>Reports for very recent data are quite fast, these are used very often and therefore already in the cache. But reports can contain (and regularly do) very old data. In that case the whole system slows down. To me this sounds like the recent data is flushed out of the cache and now all data for all queries has to be fetched from disc.

I completely agree. Hopefully my above suggestions make sense and are of use to you.

>My machine has 2GB memory,

...and while we are at it, OLTP-like apps benefit less from RAM than data mining ones, but still 2GB of RAM is just not that much for a real DB server...

Ron Peacetree

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq
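To make the "two schemas" advice above a bit more concrete, here is a minimal sketch of what the reporting side could look like: a summary table that the five-minute import job refreshes after loading the raw Apache log rows. Every table and column name below is invented for illustration; the real schema will obviously differ.

-- reporting-side summary table, kept apart from the raw click tables
CREATE TABLE clicks_per_ad_daily (
    ad_id     integer NOT NULL,
    click_day date    NOT NULL,
    clicks    bigint  NOT NULL,
    PRIMARY KEY (ad_id, click_day)
);

-- run once after each five-minute import:
-- throw away today's partial totals and rebuild them from the raw data
BEGIN;
DELETE FROM clicks_per_ad_daily WHERE click_day = current_date;
INSERT INTO clicks_per_ad_daily (ad_id, click_day, clicks)
SELECT ad_id, current_date, count(*)
  FROM raw_clicks
 WHERE stamp >= current_date
 GROUP BY ad_id;
COMMIT;

Reports then read only the small summary table, while the insert/delete churn stays on the raw tables (ideally on their own spindles, per the advice above).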
Re: [PERFORM] limit number of concurrent callers to a stored
At 09:40 PM 8/17/2005, Alan Stange wrote:

>is there a simple way to limit the number of concurrent callers to a stored proc?
>
>The problem we have is about 50 clients come and perform the same operation at nearly the same time. Typically, this query takes a few seconds to run, but in the case of this thundering herd the query time jumps to 70 seconds or much more. The query can return up to 15MB of data.

I'm assuming there is some significant write activity going on at some point as a result of the query, since MVCC should not care about concurrent read activity?

Is that "a few seconds each query" or "a few seconds total if we run 50 queries sequentially but 70+ seconds per query if we try to run 50 queries concurrently"?

A) If the former, "a few seconds" * 50 can easily be 70+ seconds, and things are what you should expect. Getting higher performance in that situation means reducing per query times, which may or may not be easy. Looking at the stored procedure code with an eye towards optimization would be a good place to start.

B) If the latter, then table access contention is driving performance into the ground, and there are a few things you can try:

1= lock the table(s) under these circumstances so only one query of the 50 can be acting on it at a time. If the table(s) is/are small enough to be made RAM resident, this may be a particularly low-cost, low-effort, reasonable solution (see the sketch below).

2= put a queue into place and only let some small number n of queries run against the table(s) concurrently. Adjust n until you get best performance. There are a few ways this could be done.

3= Buy a SSD and put the table(s) in question on it. IIRC, 3.5" format SSDs that can "drop in" replace HDs are available in up to 147GB capacities.

>The machine is a dual opteron, 8 GB memory, lots of fiber channel disk, Linux 2.6, etc.
>
>So, I'm thinking that a semaphore that will block more than N clients from being in the core of the function at one time would be a good thing.

This will only help in case "B" above. If you go the "hard" route of using systems programming, you will have a lot of details that must be paid attention to correctly or Bad Things (tm) will happen. Putting the semaphore in place is the tip of the iceberg.

Hope this helps,
Ron Peacetree

---(end of broadcast)---
TIP 4: Have you searched our list archives?

http://archives.postgresql.org
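As a rough illustration of option 1= above (serializing the heavy callers at the table level), something along these lines could wrap the call. The table and function names are placeholders, not Alan's actual objects, and this is an untested sketch:

-- EXCLUSIVE mode conflicts with itself but not with plain SELECTs,
-- so only one of the ~50 identical callers runs the expensive part at a time
-- while ordinary readers of the table are not blocked.
BEGIN;
LOCK TABLE big_report_table IN EXCLUSIVE MODE;
SELECT * FROM expensive_report_proc(42);  -- placeholder call to the stored proc in question
COMMIT;  -- releases the lock, letting the next waiting caller proceed

Option 2= (a queue admitting only n concurrent callers) needs either application-side coordination or the kind of systems programming Ron warns about above.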
Re: [PERFORM] extremly low memory usage
At 01:55 PM 8/18/2005, John Arbash Meinel wrote:

Jeremiah Jahn wrote:

>here's an example standard query. I really have to make the first hit go
>faster. The table is clustered on full_name as well. 'Smith%'
>took 87 seconds on the first hit. I wonder if I set up my array wrong.
>I remember seeing something about DMA access versus something else, and
>chose DMA access. LVM maybe?

It would be nice if you would format your queries to be a little bit easier to read before posting them. However, I believe I am reading it correctly, to say that the index scan on identity is not your slow point. In fact, as near as I can tell, it only takes 52ms to complete. The expensive parts are the 4915 lookups into the litigant_details (each one takes approx 4ms for a total of ~20s). And then you do it again on case_data (average 3ms each * 4906 loops = ~15s).

How big are litigant_details and case_data? If they can fit in RAM, preload them using methods like the "cat to /dev/null" trick and those table lookups will be ~100-1000x faster. If they won't fit into RAM but the machine can be expanded to hold enough RAM to fit the tables, it's well worth the ~$75-$150/GB to upgrade the server so that the tables will fit into RAM.

If they can't be made to fit into RAM as atomic entities, you have a few choices:

A= Put the data tables and indexes on separate dedicated spindles and put litigant_details and case_data each on their own dedicated spindles. This will lower seek conflicts. Again, if this requires buying some more HDs, it's well worth it.

B= Break litigant_details and case_data into a set of smaller tables (based on something sane like the first n characters of the primary key) such that the smaller tables easily fit into RAM. Given that you've said only 10GB/60GB is "hot", this could work very well. Combine it with "A" above (put all the litigant_details sub tables on one dedicated spindle set and all the case_data sub tables on another spindle set) for added oomph.

C= Buy a SSD big enough to hold litigant_details and case_data and put them there. Again, this can be combined with "A" and "B" above to lessen the size of the SSD needed.

So there is no need for preloading your indexes on the identity table. It is definitely not the bottleneck.

So a few design bits, which may help your database. Why is "actor_id" a text field instead of a number? You could try creating an index on "litigant_details (actor_id, court_ori)" so that it can do just an index lookup, rather than an index + filter.

Yes, that certainly sounds like it would be more efficient.

More importantly, though, the planner seems to think the join of identity to litigant_details will only return 1 row, not 5000. Do you regularly vacuum analyze your tables? Just as a test, try running:

set enable_nestloop to off;

And then run EXPLAIN ANALYZE again, just to see if it is faster. You probably need to increase some statistics targets, so that the planner can design better plans.
>  ->  Nested Loop  (cost=0.00..20411.83 rows=2 width=173) (actual time=12.891..38317.017 rows=4906 loops=1)
>        ->  Nested Loop  (cost=0.00..20406.48 rows=1 width=159) (actual time=12.826..23232.106 rows=4906 loops=1)
>              ->  Nested Loop  (cost=0.00..20403.18 rows=1 width=138) (actual time=12.751..22885.439 rows=4906 loops=1)
>                    Join Filter: (("outer".case_id)::text = ("inner".case_id)::text)
>                    ->  Index Scan using name_speed on identity  (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.142..52.538 rows=4915 loops=1)
>                          Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text < 'MILLES'::character varying))
>                          Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
>                    ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual time=4.631..4.635 rows=1 loops=4915)
>                          Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
>                          Filter: ('IL081025J'::text = (court_ori)::text)
>              ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=33) (actual time=0.053..0.062 rows=1 loops=4906)
>                    Filter: ('IL081025J'::text = (id)::text)
>        ->  Index Scan using case_speed on case_data  (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1 loops=4906)
>              Index Cond: (('IL081025J'::text = (case_data.court_ori)::text) AND ((case_data.case_id)::text = ("outer".case_id)::text))

---(end of broadcast)---
TIP 6: explain analyze is your friend
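A sketch of the specific experiments suggested above, in SQL form. The composite index name and the statistics target of 200 are illustrative guesses rather than tested values, and the column choices assume the filters shown in the plan:

-- let the litigant_details probe be satisfied by the index alone (no extra filter step)
CREATE INDEX lit_actor_court_idx ON litigant_details (actor_id, court_ori);

-- give the planner better row estimates for the mis-estimated join
ALTER TABLE litigant_details ALTER COLUMN actor_id SET STATISTICS 200;
ANALYZE litigant_details;
ANALYZE identity;

-- quick planner experiment, for this session only
SET enable_nestloop TO off;
-- ...then re-run EXPLAIN ANALYZE on the original query and compare plans...
SET enable_nestloop TO on;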
Re: [PERFORM] sustained update load of 1-2k/sec
Alex mentions a nice setup, but I'm pretty sure I know how to beat that IO subsystem HW's performance by at least 1.5x or 2x. Possibly more. (No, I do NOT work for any vendor I'm about to discuss.)

Start by replacing the WD Raptors with Maxtor Atlas 15K II's. At 5.5ms average access, 97.4MB/s outer track throughput, 85.9MB/s average, and 74.4MB/s inner track throughput, they have the best performance characteristics of any tested shipping HDs I know of. (Supposedly the new SAS versions will _sustain_ ~98MB/s, but I'll believe that only if I see it under independent testing.) In comparison, the numbers on the WD740GD are 8.1ms average access, 71.8, 62.9, and 53.9 MB/s outer, average and inner track throughputs respectively. Be prepared to use as many of them as possible (read: as many as you can afford) if you want to maximize transaction rates, particularly for small transactions like this application seems to be mentioning.

Next, use a better RAID card. The TOL enterprise stuff (Xyratex, Engino, Dot-hill) is probably too expensive, but in the commodity market benchmarks indicate that Areca's 1GB buffer RAID cards currently outperform all the other commodity RAID stuff. 9 Atlas II's per card in a RAID 5 set, or 16 per card in a RAID 10 set, should max the RAID card's throughput and come very close to, if not attaining, the real world peak bandwidth of the 64b 133MHz PCI-X bus they are plugged into. Say somewhere in the 700-800MB/s range.

Repeat the above for as many independent PCI-X buses as you have for a very fast commodity RAID IO subsystem. Two such configured cards used in the same manner as mentioned by Alex should easily attain 1.5x - 2x the transaction numbers mentioned by Alex unless there's a bottleneck somewhere else in the system design.

Hope this helps,
Ron Peacetree

At 08:40 AM 8/19/2005, Alex Turner wrote:

I have managed tx speeds that high from postgresql going even as high as 2500/sec for small tables, but it does require a good RAID controller card (yes I'm even running with fsync on). I'm using 3ware 9500S-8MI with Raptor drives in multiple RAID 10s. The box wasn't too $$$ at just around $7k. I have two independent controllers on two independent PCI buses to give max throughput, one with a 6 drive RAID 10 and the other with two 4 drive RAID 10s.

Alex Turner
NetEconomist

On 8/19/05, Mark Cotner <[EMAIL PROTECTED]> wrote:
> Hi all,
> I bet you get tired of the same ole questions over and over.
>
> I'm currently working on an application that will poll thousands of cable
> modems per minute and I would like to use PostgreSQL to maintain state
> between polls of each device. This requires a very heavy amount of updates
> in place on a reasonably large table (100k-500k rows, ~7 columns mostly
> integers/bigint). Each row will be refreshed every 15 minutes, or at least
> that's how fast I can poll via SNMP. I hope I can tune the DB to keep up.
>
> The app is threaded and will likely have well over 100 concurrent db
> connections. Temp tables for storage aren't a preferred option since this
> is designed to be a shared nothing approach and I will likely have several
> polling processes.
>
> Here are some of my assumptions so far . . .
>
> HUGE WAL
> Vacuum hourly if not more often
>
> I'm getting 1700tx/sec from MySQL and I would REALLY prefer to use PG.
> I don't need to match the number, just get close.
>
> Is there a global temp table option? In memory tables would be very
> beneficial in this case. I could just flush it to disk occasionally with
> an insert into blah select from memory table.
>
> Any help or creative alternatives would be greatly appreciated. :)
>
> 'njoy,
> Mark
>
> --
> Writing software requires an intelligent person,
> creating functional art requires an artist.
> -- Unknown
>
> ---(end of broadcast)---
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
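Since Mark asks about in-memory tables flushed to disk periodically, here is one hedged way that idea is usually approximated in plain PostgreSQL: each polling process batches its SNMP results into a per-session temp table and applies them to the real table in one transaction. All table and column names here are invented for this sketch, not anything from Mark's application.

-- per-connection staging area; backed by local temp buffers
CREATE TEMP TABLE modem_poll_staging (
    modem_id  bigint    NOT NULL,
    status    integer   NOT NULL,
    polled_at timestamp NOT NULL
);

-- the poller COPYs or INSERTs a batch into the staging table, then flushes it in one shot:
BEGIN;
UPDATE modem_state
   SET status    = s.status,
       polled_at = s.polled_at
  FROM modem_poll_staging s
 WHERE modem_state.modem_id = s.modem_id;
TRUNCATE modem_poll_staging;
COMMIT;

This is not a true global in-memory table (PostgreSQL has no such object), but it cuts the per-row update chatter down to one bulk statement per batch.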
Re: [PERFORM] sustained update load of 1-2k/sec
At 09:58 AM 8/19/2005, Andreas Pflug wrote:

The 1-2k xact/sec for MySQL seems suspicious, sounds very much like write-back cached, not write-through, esp. considering that heavy concurrent write access isn't said to be MySQL's strength...

Don't be suspicious. I haven't seen the code under discussion, but I have seen mySQL easily achieve these kinds of numbers using the myISAM storage engine in write-through cache mode. myISAM can be =FAST=. Particularly when decent HW is thrown at it.

Ron

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] sustained update load of 1-2k/sec
At 12:34 PM 8/19/2005, Jeffrey W. Baker wrote:

On Fri, 2005-08-19 at 10:54 -0400, Ron wrote:
> Maxtor Atlas 15K II's.
> Areca's 1GB buffer RAID cards

The former are SCSI disks and the latter is an SATA controller. The combination would have a transaction rate of approximately 0.

You are evidently thinking of the Areca ARC-11xx controllers (and you are certainly right for that HW combination ;-) ). Those are not the only products Areca makes that can be upgraded to a 1GB cache.

Until SAS infrastructure is good enough, U320 SCSI and FC HD's remain the top performing HD's realistically available. At the most fundamental, your DBMS is only as good as your HD IO subsystem, and your HD IO subsystem is only as good as your HDs. As others have said here, skimping on your HDs is _not_ a good design choice where DBMSs are concerned.

As an aside, the Atlas 15K II's are now available in SAS:
http://www.maxtor.com/portal/site/Maxtor/menuitem.ba88f6d7cf664718376049b291346068/?channelpath=/en_us/Products/SCSI%20Hard%20Drives/Atlas%2015K%20Family/Atlas%2015K%20II%20SAS

I haven't seen independent benches on them, so I explicitly referenced the U320 Atlas 15K II's known performance numbers instead. As I said, Maxtor is claiming even better for the SAS version of the Atlas 15K II.

None of the SAS <-> PCI-X or PCI-E RAID cards I know of are ready for mass market yet, although a few are in beta.

I can vouch for the Areca controllers, however. You can certainly achieve pgbench transaction rates in the hundreds per second even with only 5 7200RPM disks and 128MB cache. Don't forget to buy the battery.

Agreed.

Hope this is helpful,
Ron Peacetree

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] extremly low memory usage
At 01:18 PM 8/19/2005, John A Meinel wrote:

Jeremiah Jahn wrote:
> Sorry about the formatting.
>
> On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>
>> ...
>>The expensive parts are the 4915 lookups into the litigant_details (each
>>one takes approx 4ms for a total of ~20s).
>>And then you do it again on case_data (average 3ms each * 4906 loops =
>>~15s).
>
> Is there some way to avoid this?
>

Well, in general, 3ms for a single lookup seems really long. Maybe your index is bloated by not vacuuming often enough. Do you tend to get a lot of updates to litigant_details?

Given that the average access time for a 15Krpm HD is in the 5.5-6ms range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case) of the table accesses require a seek. This implies a poor match between physical layout and access pattern.

If I understand correctly, the table should not be very fragmented given that this is a reasonably freshly loaded DB? That implies that the fields being looked up are not well sorted in the table compared to the query pattern.

If the entire table could fit in RAM, this would be far less of a consideration. Failing that, the physical HD layout has to be improved or the query pattern has to be changed to reduce seeks.

There are a couple possibilities at this point. First, you can REINDEX the appropriate index, and see if that helps. However, if this is a test box, it sounds like you just did a dump and reload, which wouldn't have bloat in an index.

Another possibility. Is this the column that you usually use when pulling information out of litigant_details? If so, you can CLUSTER litigant_details on the appropriate index. This will help things be close together that should be, which decreases the index lookup costs. However, if this is not the common column, then you probably will slow down whatever other accesses you may have on this table.

After CLUSTER, the current data will stay clustered, but new data will not, so you have to continually CLUSTER, the same way that you might VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as expensive as a VACUUM FULL. Be aware of this, but it might vastly improve your performance, so it would be worth it.

CLUSTER can be a very large maintenance overhead/problem if the table(s) in question actually need to be "continually" re-CLUSTERed. If there is no better solution available, then you do what you have to, but it feels like there should be a better answer here.

Perhaps the DB schema needs examining to see if it matches up well with its real usage?

Ron Peacetree

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
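For reference, the REINDEX/CLUSTER experiments being discussed would look roughly like this on the 8.0-era server in question. The index and table names are taken from the posted plan; whether lit_actor_speed is actually the right clustering index depends on the common access pattern, as noted above.

-- rebuild the possibly-bloated index
REINDEX INDEX lit_actor_speed;

-- physically reorder the table by that index (takes an exclusive lock,
-- comparable in cost to VACUUM FULL, and must be repeated as new rows arrive)
CLUSTER lit_actor_speed ON litigant_details;
ANALYZE litigant_details;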
Re: [PERFORM] sustained update load of 1-2k/sec
At 03:31 PM 8/19/2005, Alex Turner wrote:

Don't forget that Ultra 320 is the speed of the bus, not each drive. No matter how many honking 15k disks you put on a 320MB bus, you can only get 320MB/sec, and only have so many outstanding IO/s on the bus!

Of course. This is exactly why multi-channel SCSI and multichannel Fibre Channel cards exist; and why external RAID enclosures usually have multiple such cards in them...

Even moderately acceptable U320 SCSI cards are dual channel at this point (think Adaptec dual channel AHA's), and Quad channel ones are just as common. The Quads will, of course, saturate a 64b 133MHz PCI-X bus. _IF_ the chipset on them can keep up.

The current kings of RAID card performance are all Fibre Channel based, and all the ones I know of are theoretically capable of saturating a 64b 133MHz PCI-X bus. Again, _IF_ the chipset on them can keep up.

Most commodity RAID cards have neither adequate CPU nor enough buffer. Regardless of the peripheral IO technology they use.

Not so with SATA! Each drive is on its own bus, and you are only limited by the speed of your PCI-X Bus, which can be as high as 800MB/sec at 133Mhz/64bit. That's the Theory anyway, and latency should be lower as well.

OTOH, as my wife likes to say "In theory, Theory and Practice are the same. In practice, they almost never are."

You are only getting the performance you mention as long as your card can keep up with multiplexing N IO streams, crunching RAID 5 XORs (assuming you are using RAID 5), etc, etc. As I'm sure you know, "The chain is only as strong as its weakest link."

Most commodity SATA RAID cards brag about being able to pump 300MB/s (they were all over LW SF bragging about this!?), which in this context is woefully unimpressive. Sigh.

I'm impressed with the Areca cards because they usually have CPUs that actually can come close to pushing the theoretical IO limit of the bus they are plugged into; and they can be upgraded to (barely) acceptable buffer amounts (come on, manufacturers! 4GB of DDR PC3200 is only -2- DIMMs, and shortly that will be enough to hold 8GB of DDR PC3200. Give us more buffer!).

It's cheap and it's fast - all you have to do is pay for the enclosure, which can be a bit pricey, but there are some nice 24bay and even 40bay enclosures out there for SATA.

I've even seen 48 bay ones. However, good enclosures, particularly for larger numbers of HDs, are examples of non-trivial engineering and priced accordingly. Too many times I see people buy "bargain" enclosures and set themselves and their organizations up for some _very_ unpleasant times that could easily have been avoided by being careful to buy quality products. "Pay when you buy or pay much more later."

Yes a 15k RPM drive will give you better seek time and better peak through put, but put them all on a single U320 bus and you won't see much return past a stripe size of 3 or 4.

Agreed. Same holds for 2Gbps FC. Haven't tested 4Gbps FC personally yet, but I'm told the limit is higher in the manner you'd expect.

If it's raw transactions per second data warehouse style, it's all about the xlog baby which is sequential writes, and all about large block reads, which is sequential reads.

Alex Turner
NetEconomist
P.S. Sorry if I'm a bit punchy, I've been up since yesterday with server upgrade nightmares that continue ;)

My condolences and sympathies. I've definitely been there and done that.

Ron Peacetree

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq
Re: [PERFORM] extremly low memory usage
At 04:11 PM 8/19/2005, Jeremiah Jahn wrote:

On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote:
> Ron wrote:
> > At 01:18 PM 8/19/2005, John A Meinel wrote:
> >
> >> Jeremiah Jahn wrote:
> >> > Sorry about the formatting.
> >> >
> >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> >> >
> >> >>Jeremiah Jahn wrote:
> >> >>
> >> ...
> >>
> >> >>The expensive parts are the 4915 lookups into the litigant_details
> >> >>(each one takes approx 4ms for a total of ~20s).
> >> >>And then you do it again on case_data (average 3ms each * 4906 loops =
> >> >>~15s).
> >> >
> >> > Is there some way to avoid this?
> >> >
> >>
> >> Well, in general, 3ms for a single lookup seems really long. Maybe your
> >> index is bloated by not vacuuming often enough. Do you tend to get a lot
> >> of updates to litigant_details?
> >
> > Given that the average access time for a 15Krpm HD is in the 5.5-6ms
> > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single
> > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case)
> > of the table accesses require a seek.
> >

I think LVM may be a problem, since it also seems to break things up on the file system. My access time on the seek should be around 1/7th the 15Krpm I believe since it's a 14 disk raid 10 array. And no other traffic at the moment.

Oops. There's a misconception here. RAID arrays increase _throughput_ AKA _bandwidth_ through parallel access to HDs. OTOH, access time is _latency_, and that is not changed. Access time for a RAID set is equal to that of the slowest access time, AKA highest latency, HD in the RAID set.

> Well, from what he has said, the total indexes are < 1GB and he has 6GB
> of ram. So everything should fit. Not to mention he is only accessing
> 5000/several million rows.

I table spaced some of the indexes and they are around 211066880 bytes for the name_speed index and 149825330 for the lit_actor_speed index. Tables seem to be about a gig.

Hmm. And you think you are only using 250MB out of your 6GB of RAM? Something doesn't seem to add up here. From what's been posted, I'd expect much more RAM to be in use.

> > This implies a poor match between physical layout and access pattern.
>
> This seems to be the case. But since this is not the only query, it may
> be that other access patterns are more important to optimize for.
>
> > If I understand correctly, the table should not be very fragmented given
> > that this is a reasonably freshly loaded DB? That implies that the
> > fields being looked up are not well sorted in the table compared to the
> > query pattern.
> >
> > If the entire table could fit in RAM, this would be far less of a
> > consideration. Failing that, the physical HD layout has to be improved
> > or the query pattern has to be changed to reduce seeks.
> >
> > ...
>
> >> After CLUSTER, the current data will stay clustered, but new data will
> >> not, so you have to continually CLUSTER, the same way that you might
> >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
> >> expensive as a VACUUM FULL. Be aware of this, but it might vastly
> >> improve your performance, so it would be worth it.
> >
> > CLUSTER can be a very large maintenance overhead/problem if the table(s)
> > in question actually need to be "continually" re-CLUSTERed.
> >
> > If there is no better solution available, then you do what you have to,
> > but it feels like there should be a better answer here.
> >
> > Perhaps the DB schema needs examining to see if it matches up well with
> > its real usage?
> >
> > Ron Peacetree
> >
>
> I certainly agree that CLUSTER is expensive, and is an on-going
> maintenance issue. If it is the normal access pattern, though, it may be
> worth it.

The query I've sent you is one of the most common I get; just change the name. I handle about 180K of them a day, mostly between 8 and 5. The clustering has never really been a problem. Like I said before, I do it about once a week. I handle about 3000 updates an hour consisting of about 1000-3000 statements per update, ie about 2.5 million updates per hour. In the last few months or so I've filtered these down to about 400K update/delete/insert statements per hour.

2.5M updates per hour = ~695 updates per second. 400K per hour = ~112 updates per sec. These should be well wi
Re: [PERFORM] extremly low memory usage
At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:

On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote:
> Jeremiah Jahn wrote:
> > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
> >
> > it's cached alright. I'm getting a read rate of about 150MB/sec. I would
> > have thought is would be faster with my raid setup. I think I'm going to
> > scrap the whole thing and get rid of LVM. I'll just do a straight ext3
> > system. Maybe that will help. Still trying to get suggestions for a
> > stripe size.
> >
> I don't think 150MB/s is out of the realm for a 14 drive array.
> How fast is
> time dd if=/dev/zero of=testfile bs=8192 count=1000000

time dd if=/dev/zero of=testfile bs=8192 count=1000000
1000000+0 records in
1000000+0 records out

real    1m24.248s
user    0m0.381s
sys     0m33.028s

> (That should create a 8GB file, which is too big to cache everything)
> And then how fast is:
> time dd if=testfile of=/dev/null bs=8192 count=1000000

time dd if=testfile of=/dev/null bs=8192 count=1000000
1000000+0 records in
1000000+0 records out

real    0m54.139s
user    0m0.326s
sys     0m8.916s

and on a second run:

real    0m55.667s
user    0m0.341s
sys     0m9.013s

>
> That should give you a semi-decent way of measuring how fast the RAID
> system is, since it should be too big to cache in ram.

about 150MB/Sec. Is there no better way to make this go faster...?

Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them doing raw sequential IO like this should be capable of ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s= 553MB/s if using Fujitsu MAU's, and ~7*86MB/s= 602MB/s if using Maxtor Atlas 15K II's to devices external to the RAID array.

_IF_ the controller setup is high powered enough to keep that kind of IO rate up. This will require a controller or controllers providing dual channel U320 bandwidth externally and quad channel U320 bandwidth internally. IOW, it needs a controller or controllers talking 64b 133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized IO buffer as well.

AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic MegaRAID controllers. What I don't know is which exact one yours is, nor do I know if it (or any of the MegaRAID controllers) are high powered enough. Talk to your HW supplier to make sure you have controllers adequate to your HD's.

...and yes, your average access time will be in the 5.5ms - 6ms range when doing a physical seek. Even with RAID, you want to minimize seeks and maximize sequential IO when accessing them. Best to not go to HD at all ;-)

Hope this helps,
Ron Peacetree

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] extremly low memory usage
At 02:16 PM 8/20/2005, Jeremiah Jahn wrote:

I'm just watching gnome-system-monitor. Which, after careful consideration and looking at dstat, means I'm on CRACK... GSM isn't showing cached memory usage. I assume that the cache memory usage is where data off of the disks would be cached...?

memory output from dstat is this for a few seconds:

---procs--- ----memory-usage---- ---paging-- --disk/sda----disk/sdb-- ---swap--- ----total-cpu-usage----
run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read write|_used _free|usr sys idl wai hiq siq
  0   0   0|1336M   10M 4603M   17M| 490B  833B|3823B 3503k:1607k 4285k| 160k 2048M|  4   1  89   7   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0   464k| 160k 2048M| 25   0  75   0   0   0
  1   0   0|1334M   10M 4596M   25M|   0     0 |   0     0 :   0    56k| 160k 2048M| 21   4  75   0   0   0

Then the "low memory usage" was a chimera. Excellent!

Given the evidence in this thread, IMO you should upgrade your box to 16GB of RAM ASAP. That should be enough to cache most, if not all, of the 10GB of the "hot" part of your DB; thereby dedicating your HD subsystem as much as possible to writes (which is unavoidable HD IO).

As I've posted before, at $75-$150/GB, it's well worth the investment whenever you can prove it will help as we have here.

Hope this helps,
Ron Peacetree

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
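If the RAM upgrade happens, the planner and buffer settings should be told about it. A hedged sketch for an 8.0-era setup on a 16GB box; the numbers below are illustrative guesses rather than tuned values, and 8.0 takes raw page/buffer counts rather than "12GB"-style unit strings:

-- in postgresql.conf (values are counts of 8KB pages/buffers; shared_buffers needs a restart):
--   shared_buffers = 50000            # roughly 390MB of dedicated buffer cache
--   effective_cache_size = 1500000    # roughly 11.5GB: what the OS is expected to cache
-- effective_cache_size can also be tried per session before touching the config file:
SET effective_cache_size = 1500000;
SHOW effective_cache_size;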
Re: [PERFORM] extremly low memory usage
At 10:54 AM 8/21/2005, Jeremiah Jahn wrote:

On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote:
> Ron wrote:
> >
> Well, since you can get a read of the RAID at 150MB/s, that means that
> it is actual I/O speed. It may not be cached in RAM. Perhaps you could
> try the same test, only using say 1G, which should be cached.

[EMAIL PROTECTED] pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m8.885s
user    0m0.299s
sys     0m6.998s

This is abysmally slow.

[EMAIL PROTECTED] pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m1.654s
user    0m0.232s
sys     0m1.415s

This transfer rate is the only one out of the 4 you have posted that is in the vicinity of where it should be.

The raid array I have is currently set up to use a single channel. But I have dual controllers in the array. And dual external slots on the card. The machine is brand new and has pci-e backplane.

So you have 2 controllers each with 2 external slots? But you are currently only using 1 controller and only one external slot on that controller?

> > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them
> > doing raw sequential IO like this should be capable of
> > ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's

BTW I'm using Seagate Cheetah 15K.4's

OK, now we have that nailed down.

> > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic
> > MegaRAID controllers. What I don't know is which exact one yours is,
> > nor do I know if it (or any of the MegaRAID controllers) are high
> > powered enough.

PERC4eDC-PCI Express, 128MB Cache, 2-External Channels

Looks like they are using the LSI Logic MegaRAID SCSI 320-2E controller. IIUC, you have 2 of these, each with 2 external channels?

The specs on these appear a bit strange. They are listed as being a PCI-Ex8 card, which means they should have a max bandwidth of 20Gb/s= 2GB/s, yet they are also listed as only supporting dual channel U320= 640MB/s when they could easily support quad channel U320= 1.28GB/s. Why bother building a PCI-Ex8 card when only a PCI-Ex4 card (which is a more standard physical format) would've been enough? Or if you are going to build a PCI-Ex8 card, why not support quad channel U320? This smells like there's a problem with LSI's design.

The 128MB buffer also looks suspiciously small, and I do not see any upgrade path for it on LSI Logic's site. "Serious" RAID controllers from companies like Xyratex, Engino, and Dot-hill can have up to 1-2GB of buffer, and there's sound technical reasons for it. See if there's a buffer upgrade available or if you can get controllers that have larger buffer capabilities.

Regardless of the above, each of these controllers should still be good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing raw sequential IO if you plug 3-4 fast enough HD's into each SCSI channel. Cheetah 15K.4's certainly are fast enough. Optimal setup is probably to split each RAID 1 pair so that one HD is on each of the SCSI channels, and then RAID 0 those pairs. That will also protect you from losing the entire disk subsystem if one of the SCSI channels dies.

That 128MB of buffer cache may very well be too small to keep the IO rate up, and/or there may be a more subtle problem with the LSI card, and/or you may have a configuration problem, but _something(s)_ need fixing since you are only getting raw sequential IO of ~100-150MB/s when it should be above 500MB/s. This will make the most difference for initial reads (first time you load a table, first time you make a given query, etc) and for any writes.

Your HW provider should be able to help you, even if some of the HW in question needs to be changed. You paid for a solution. As long as this stuff is performing at so much less than what it is supposed to, you have not received the solution you paid for.

BTW, on the subject of RAID stripes, IME the sweet spot tends to be in the 64KB to 256KB range (very large, very read heavy data mines can want larger RAID stripes). Only experimentation will tell you what results in the best performance for your application.

I'm not really worried about the writing, it's the reading that needs to be faster.

Initial reads are only going to be as fast as your HD subsystem, so there's a reason for making the HD subsystem faster even if all you care about is reads. In addition, I'll repeat my previous advice that upgrading to 16GB of RAM would be well worth it for you.

Hope this helps,
Ron Peacetree

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Need for speed 2
At 03:10 AM 8/25/2005, Ulrich Wisser wrote:

>I realize I need to be much more specific. Here is a more detailed description of my hardware and system design.
>
>Pentium 4 2.4GHz
>Memory 4x DIMM DDR 1GB PC3200 400MHZ CAS3, KVR
>Motherboard chipset 'I865G', two IDE channels on board

First suggestion: Get better server HW. An AMD Opteron based dual processor board is the current best in terms of price/performance ratio, _particularly_ for DB applications like the one you have described. Such mainboards cost ~$400-$500. RAM will cost about $75-$150/GB. Opteron 2xx are ~$200-$700 apiece. So a 2P AMD system can be had for as little as ~$850 + the cost of the RAM you need. In the worst case where you need 24GB of RAM (~$3600), the total comes in at ~$4450. As you can see from the numbers, buying only what RAM you actually need can save you a great deal of money.

Given what little you said about how much of your DB is frequently accessed, I'd suggest buying a server based around the 2P 16 DIMM slot IWill DK88 mainboard (Tyan has announced a 16 DIMM slot mainboard, but I do not think it is actually being sold yet). Then fill it with the minimum amount of RAM that will allow the "working set" of the DB to be cached in RAM. In the worst case where DB access is essentially uniform and essentially random, you will need 24GB of RAM to hold the 22GB DB + OS + etc. That worst case is _rare_. Usually DB's have a working set that is smaller than the entire DB. You want to keep that working set in RAM. If you can't identify the working set, buy enough RAM to hold the entire DB.

In particular, you want to make sure that any frequently accessed read only tables or indexes are kept in RAM. The "read only" part is very important. Tables (and their indexes) that are frequently written to _have_ to access HD. Therefore you get much less out of having them in RAM. Read only tables and their indexes can be loaded into tmpfs at boot time, thereby keeping them out of the way of the file system buffer cache. tmpfs does not save data if the host goes down, so it is very important that you ONLY use this trick with read only tables. The other half of the trick is to make sure that the file system buffer cache does _not_ cache whatever you have loaded into tmpfs.

>2x SEAGATE BARRACUDA 7200.7 80GB 7200RPM ATA/100 (software raid 1, system, swap, pg_xlog)
>ADAPTEC SCSI RAID 2100S ULTRA160 32MB 1-CHANNEL
>2x SEAGATE CHEETAH 15K.3 73GB ULTRA320 68-PIN WIDE (raid 1, /var/lib/pgsql)

Second suggestion: you need a MUCH better IO subsystem. In fact, given that you have described this system as being primarily OLTP like, this is more important than the above server HW. Best would be to upgrade everything, but if you are strapped for cash, upgrade the IO subsystem first.

You need many more spindles and a decent RAID card or cards. You want 15Krpm (best) or 10Krpm HDs. As long as all of the HD's are at least 10Krpm, more spindles is more important than faster spindles. If it's a choice between more 10Krpm discs or fewer 15Krpm discs, buy the 10Krpm discs. Get the spindle count as high as your RAID cards can handle.

Whatever RAID cards you get should have as much battery backed write buffer as possible. In the commodity market, presently the highest performance RAID cards I know of, and the ones that support the largest battery backed write buffer, are made by Areca.

>Database size on disc is 22GB. (without pg_xlog)

Find out what the working set, ie the most frequently accessed portion, of this 22GB is and you will know how much RAM is worth having.
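One hedged way to get a first cut at that working set from the catalogs themselves (approximate: relpages is only as current as the last VACUUM/ANALYZE, and it says nothing about access frequency, only size):

-- rough per-relation footprint, largest first; each page is 8KB
SELECT relname,
       relkind,
       relpages * 8 / 1024 AS approx_mb
  FROM pg_class
 WHERE relkind IN ('r', 'i')
 ORDER BY relpages DESC
 LIMIT 20;

Cross-checking the biggest relations against which reports actually touch them is what tells you whether the "hot" portion is a few GB or closer to the full 22GB.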
4GB is definitely too little!

>Please find my postgresql.conf below.

Third suggestion: make sure you are running a 2.6 based kernel and at least PG 8.0.3. Helping beta test PG 8.1 might be an option for you as well.

>Putting pg_xlog on the IDE drives gave about 10% performance improvement. Would faster disks give more performance?
>
>What my application does:
>
>Every five minutes a new logfile will be imported. Depending on the source of the request it will be imported in one of three "raw click" tables. (data from two months back, to be able to verify customer complaints) For reporting I have a set of tables. These contain data from the last two years. My app deletes all entries from today and reinserts updated data calculated from the raw data tables.

The raw data tables seem to be read only? If so, you should buy enough RAM to load them into tmpfs at boot time and have them be completely RAM resident in addition to having enough RAM for the OS to cache an appropriate amount of the rest of the DB.

>The queries contain no joins, only aggregates. I have several indexes to speed different kinds of queries. My problems occur when one user does a report that contains too much old data. In that case all cache mechanisms will fail and disc io is the limiting factor. If one query contains so much data, that a f
Re: [PERFORM] Read/Write block sizes
At 03:45 PM 8/25/2005, Josh Berkus wrote: Jeff, > Ask me sometime about my replacement for GNU sort. It uses the same > sorting algorithm, but it's an order of magnitude faster due to better > I/O strategy. Someday, in my infinite spare time, I hope to demonstrate > that kind of improvement with a patch to pg. Since we desperately need some improvements in sort performance, I do hope you follow up on this. --Josh

I'll generalize that. IMO we desperately need any and all improvements in IO performance. Even more so than we need improvements in sorting or sorting IO performance.

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Read/Write block sizes
At 04:49 PM 8/25/2005, Chris Browne wrote: [EMAIL PROTECTED] (Ron) writes: > At 03:45 PM 8/25/2005, Josh Berkus wrote: >> > Ask me sometime about my replacement for GNU sort. It uses the >> > same sorting algorithm, but it's an order of magnitude faster due >> > to better I/O strategy. Someday, in my infinite spare time, I >> > hope to demonstrate that kind of improvement with a patch to pg. >> >>Since we desperately need some improvements in sort performance, I >>do hope you follow up on this. > > I'll generalize that. IMO we desperately need any and all > improvements in IO performance. Even more so than we need > improvements in sorting or sorting IO performance. That's frankly a step backwards. Feel free to "specialise" that instead.

We can agree to disagree, I'm cool with that. I'm well aware that a Systems Approach to SW Architecture is not always popular in the Open Source world. Nonetheless, my POV is that if we want to be taken seriously and beat "the big boys", we have to do everything smarter and faster, as well as cheaper, than they do. You are not likely to be able to do that consistently without using some of the "icky" stuff one is required to study as part of formal training in the Comp Sci and SW Engineering fields.

A patch that improves some specific aspect of performance is a thousand times better than any sort of "desperate desire for any and all improvements in I/O performance."

Minor twisting of my words: substituting "desire" for "need". The need is provable. Just put "the big 5" (SQL Server, Oracle, DB2, mySQL, and PostgreSQL) into some realistic benches to see that.

Major twisting of my words: the apparent implication by you that I don't appreciate improvements in the IO behavior of specific things like sorting as much as I'd appreciate more "general" IO performance improvements. Performance optimization is best done as an iterative improvement process that starts with measuring where the need is greatest, then improving that greatest need by the most you can, then repeating the whole cycle. _Every_ improvement in such a process is a specific improvement, even if the improvement is a decision to re-architect the entire product to solve the current biggest issue. Improving sorting IO is cool. OTOH, if pg's biggest IO problems are elsewhere, then the amount of overall benefit we will get from improving sorting IO is going to be minimized until we improve the bigger problem(s). Amdahl's Law.

The "specialized patch" is also pointedly better in that a *confidently submitted* patch is likely to be way better than any sort of "desperate clutching at whatever may come to hand."

Another distortion of my statement and POV. I never suggested nor implied any sort of "desperate clutching...". We have _measurable_ IO issues that need to be addressed in order for pg to be a better competitor in the marketplace. Just as we do with sorting performance.

Far too often, I see people trying to address performance problems via the "desperate clutching at whatever seems near to hand," and that generally turns out very badly as a particular result of the whole "desperate clutching" part. If you can get a sort improvement submitted, that's a concrete improvement...

As I said, I'm all in favor of concrete, measurable improvement. I do not think I ever stated I was in favor of anything else. You evidently are mildly ranting because you've seen some examples of poor SW Engineering Discipline/Practice by people with perhaps inadequate skills for the issues they were trying to address.
We all have. "90% of everything is dreck (e.g., of too low a quality)." OTOH, I do not think I've given you any reason to think I lack such Clue, nor do I think my post was advocating such thrashing. My post was intended to say that we need an Overall Systems Approach to pg optimization rather than just applying what compiler writers call "peephole optimizations" to pg. No more, no less. I apologize if I somehow misled you, Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Inefficient queryplan for query with
At 10:27 AM 8/27/2005, Tom Lane wrote: Arjen van der Meijden <[EMAIL PROTECTED]> writes: > But apparently there is a bug in the explain mechanism of the 8.1devel > I'm using (I downloaded a nightly of 25 August somewhere in the morning > (CEST)), since it returned: > ERROR: bogus varno: 9 Yeah, someone else sent in a test case for this failure (or at least one with a similar symptom) yesterday. I'll try to fix it today. > Is a nested loop normally so much (3x) more costly than a hash join? Or > is it just this query that gets estimated wrongly? There's been some discussion that we are overestimating the cost of nestloops in general, because we don't take into account that successive scans of the inner relation are likely to find many pages already in cache from the earlier scans. So far no one's come up with a good cost model to use for this, though. regards, tom lane

It certainly seems common in the EXPLAIN ANALYZE output I see that the (estimated) cost of Nested Loop is far higher than the actual time measured (a quick way to collect such data points is sketched below).

What happened when someone tried the naive approach of telling the planner to estimate the cost of a nested loop based on fitting whatever entities are involved in the nested loop in RAM as much as possible? When there are multiple such mappings, use whichever one results in the lowest cost for the NL in question. Clearly, this should lead to an underestimate of the constant factor in the cost of the operation involved, but since nested loops have the only polynomial growth function of the planner's choices, NL's should still have a decent chance of being more expensive than other choices under most circumstances. In addition, if those costs are based on actual measurements of how long it takes to do such scans, then the estimated cost has a decent chance of being fairly accurate under such circumstances. It might not work well, but it seems like a reasonable first attempt at a solution?

Ron Peacetree ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
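One low tech way to collect estimated-vs-actual data points for a query you care about is to force the planner away from its preferred join and compare the two EXPLAIN ANALYZE outputs. The query below is made up purely for illustration; substitute one of your own:

-- whatever the planner picks on its own (often a hash join)
EXPLAIN ANALYZE
  SELECT count(*) FROM orders o JOIN customers c ON c.id = o.customer_id
  WHERE c.region = 42;

-- now force a nested loop, in this session only, and compare
SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN ANALYZE
  SELECT count(*) FROM orders o JOIN customers c ON c.id = o.customer_id
  WHERE c.region = 42;

RESET enable_hashjoin;
RESET enable_mergejoin;

Run the nested loop variant twice so the second run reflects a warm cache; the gap between its estimated cost and its actual time is exactly the effect Tom describes.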
Re: [PERFORM] RAID Configuration Suggestion
At 08:37 AM 8/30/2005, Alvaro Nunes Melo wrote: Hello, We are about to install a new PostgreSQL server, and despite being a very humble configuration compared to the ones we see in the list, it's the biggest one we've got till now. The server is a Dual Xeon 3.0 with 2 GB RAM and two SCSI disks. Our main doubt is what is the best configuration for the disks. We are thinking about using them in a RAID-0 array. Is this the best option? What do you suggest on partitioning? Separate partitions for the OS, data and pg_xlog?

This is _very_ modest HW. Unless your DB and/or DB load is similarly modest, you are not going to be happy with the performance of your DBMS.

At a minimum, for safety reasons you want 4 HDs: 2 for a RAID 1 set for the DB, and 2 for a RAID 1 set for the OS + pg_xlog. 2 extra HDs, even SCSI HDs, is cheap. Especially when compared to the cost of corrupted or lost data. HD's and RAM are cheap enough that you should be able to upgrade in more ways, but do at least that "upgrade"!

Beyond that, the best ways to spend your limited $ are highly dependent on your exact DB and its usage pattern.

Ron Peacetree ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] RAID Configuration Suggestion
At 12:56 PM 8/30/2005, Joshua D. Drake wrote: Ron wrote: At 08:37 AM 8/30/2005, Alvaro Nunes Melo wrote: Hello, We are about to install a new PostgreSQL server, and despite being a very humble configuration compared to the ones we see in the list, it's the biggest one we've got till now. The server is a Dual Xeon 3.0 with 2 GB RAM and two SCSI disks. Our main doubt is what is the best configuration for the disks. We are thinking about using them in a RAID-0 array. Is this the best option? What do you suggest on partitioning? Separate partitions for the OS, data and pg_xlog?

This is _very_ modest HW. Unless your DB and/or DB load is similarly modest, you are not going to be happy with the performance of your DBMS.

Well that is a pretty blanket statement. I have many customers who happily run on less hardware than what is mentioned above. It all depends on the application itself and how the database is utilized.

If your customers "run happily" on 2 HD's, then IME they have very modest DB storage and/or DB performance needs. For safety reasons, the best thing to do if you only have 2 HD's is to run them as a RAID 1 with everything on them. The slightly better performing but considerably less safe alternative is to put the OS + logs on 1 HD and the DB on the other. Any resemblance to a semi-serious OLTP load will reduce either such system to an HD IO bound one with poor IO rates.

If, as above, your DBMS is bounded by the performance of one HD, then you are AT BEST getting the raw IO rate of such a device: say ~70-80MB/s in average sustained raw sequential IO. File system overhead and any seeking behavior will rapidly reduce that number to considerably less. Consider that the CPU <-> memory IO subsystem is easily capable of ~3.2GBps. So you are talking about slowing the DB server to at most ~1/40, maybe even as little as ~1/200, of its potential under such circumstances. If your DB can fit completely in RAM and/or does light duty write IO, this may not be a serious issue. OTOH, once you start using those HD's to any reasonable extent, most of the rest of the investment you've made in server HW is wasted.

As I keep saying, the highest priority when buying HW for a DBMS is to make sure you have enough HD IO bandwidth. RAM comes second, and CPU is a distant third.

At a minimum, for safety reasons you want 4 HDs: 2 for a RAID 1 set for the DB, and 2 for a RAID 1 set for the OS + pg_xlog. 2 extra HDs, even SCSI HDs, is cheap. Especially when compared to the cost of corrupted or lost data.

Your real test is going to be prototyping the performance you need. A single RAID 1 mirror (don't use RAID 0) may be more than enough. However based on the fact that you speced Xeons my guess is you spent money on CPUs when you should have spent money on hard drives.

I agree with Josh on both points. Don't use RAID 0 for persistent data unless you like losing data. Spend more on HDs and RAM and less on CPU's (fast FSB is far more important than high clock rate. In general buy the highest FSB with the slowest clock rate.). In fact, if you are that strapped for cash, exchange those 2 SCSI HD's for their $ equivalent in SATA HD's. The extra spindles will be well worth it.

If you still have the budget, I would suggest considering either what Ron suggested or possibly using a 4 drive RAID 10 instead.

IME, with only 4 HDs, it's usually better to split them into two RAID 1's (one for the db, one for everything else including the logs) than it is to put everything on one RAID 10. YMMV.
Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] RAID Configuration Suggestion
At 03:27 PM 8/30/2005, Joshua D. Drake wrote: If you still have the budget, I would suggest considering either what Ron suggested or possibly using a 4 drive RAID 10 instead.

IME, with only 4 HDs, it's usually better to split them into two RAID 1's (one for the db, one for everything else including the logs) than it is to put everything on one RAID 10. YMMV.

Really? That's interesting. My experience is different, I assume SCSI? Software/Hardware Raid?

The issue exists regardless of technologies used, although the technology used does affect when things become an irritation or serious problem. The issue with "everything on the same HD set" seems to be that under light loads anything works reasonably well, but as load increases, contention between DB table access, OS access, and xlog writes can cause performance problems. In particular, _everything_ else hangs while logs are being written with "everything on the same HD set". That leaves you with the nasty choice of small log writes, which cause more seeking behavior and the resultant poor overall HD IO performance, or large log writes, which basically freeze the server until they are done.

Having the logs on a different HD, and if possible a different IO bus, reduces this effect to a minimum and seems to be a better choice than the "shared everything" approach.

Although this effect seems largest when there are fewest HDs, the general pattern is that one should use as many spindles as one can make use of and that they should be as dedicated as possible in their purpose(s). That's why the TPC benchmarked systems tend to have literally 100's of HD's and they tend to be split into very focused purposes.

Ron Peacetree ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] RAID Configuration Suggestion
At 08:04 PM 8/30/2005, Michael Stone wrote: On Tue, Aug 30, 2005 at 07:02:28PM -0400, Ron wrote: purpose(s). That's why the TPC benchmarked systems tend to have literally 100's of HD's and they tend to be split into very focused purposes. Of course, TPC benchmark systems are constructed such that cost and storage capacity are irrelevant--in the real world things tend to be more complicated.

The scary thing is that I've worked on RW production systems that bore a striking resemblance to a TPC benchmark system. As you can imagine, they uniformly belonged to BIG organizations (read: lots 'o $$$) who were using the systems for mission critical stuff where either it was company existence threatening for the system to be down, or they would lose much $$$ per min of down time, or both.

Financial institutions, insurance companies, central data mines for Fortune 2000 companies, etc _all_ build systems that push the state of the art in how much storage can be managed and how many HDs, CPUs, RAM DIMMs, etc are usable.

Historically, this has been the sole province of Oracle and DB2 on the SW side and equally outrageously priced custom HW. Clearly, I'd like to see PostgreSQL change that ;-)

Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] RAID Configuration Suggestion
At 08:43 PM 8/30/2005, Michael Stone wrote: On Tue, Aug 30, 2005 at 08:41:40PM -0400, Ron wrote: The scary thing is that I've worked on RW production systems that bore a striking resemblance to a TPC benchmark system. As you can imagine, they uniformly belonged to BIG organizations (read: lots 'o $$$) who were using the systems for mission critical stuff where either it was company existence threatening for the system to be down, or they would lose much $$$ per min of down time, or both. Yeah, and that market is relevant to someone with one dell server and 2 hard disks how?

Because successful small companies that _start_ with one small server and 2 HDs grow to _become_ companies that need far more HW; ...and in the perfect world their SW scales to their increased needs _without_ exponentially increasing their costs or overhead (as Oracle and DB2 currently do). THIS is the real long term promise of OS DBMS.

Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Poor performance on HP Package Cluster
Your HD raw IO rate seems fine, so the problem is not likely to be with the HDs. That consistent ~10x increase in how long it takes to do an import or a select is noteworthy. This "smells" like an interconnect problem. Was the Celeron locally connected to the HDs while the new Xeons are network connected? Getting 10's or even 100's of MBps throughput out of local storage is much easier than it is to do over a network. 1GbE is required if you want HDs to push 72.72MBps over a network, and not even one 10GbE line will allow you to match local buffered IO of 1885.34MBps. What size are those network connects (Server A <-> storage, Server B <-> storage, Server A <-> Server B)?

Ron Peacetree

At 10:16 AM 9/1/2005, Ernst Einstein wrote: I've set up a Package Cluster (Fail-Over Cluster) on our two HP DL380 G4 with MSA Storage G2 (Xeon 3.4GHz, 6GB RAM, 2x [EMAIL PROTECTED] Raid1). The system is running under Suse Linux Enterprise Server. My problem is that the performance is very low. On our old server (Celeron 2GHz with 2GB of RAM) an import of our data takes about 10 minutes (1.1GB data). On one of the DL380s it takes more than 90 minutes... Select response times have also increased: Celeron 3 sec, Xeon 30-40 sec. I've been trying to fix the problem for two days now, googled a lot, but I don't know what to do. Top says my CPU spends ~50% of its time in wait io.

top - 14:07:34 up 22 min, 3 users, load average: 1.09, 1.04, 0.78
Tasks: 74 total, 3 running, 71 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.0% us, 5.0% sy, 0.0% ni, 0.0% id, 45.0% wa, 0.0% hi, 0.0% si
Mem: 6050356k total, 982004k used, 5068352k free, 60300k buffers
Swap: 2097136k total, 0k used, 2097136k free, 786200k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 9939 postgres 18  0  254m 143m 140m R 49.3  2.4 8:35.43 postgres: postgres plate [local] INSERT
 9938 postgres 16  0 13720 1440 1120 S  4.9  0.0 0:59.08 psql -d plate -f dump.sql
10738 root     15  0  3988 1120  840 R  4.9  0.0 0:00.05 top -d 0.2
    1 root     16  0   640  264  216 S  0.0  0.0 0:05.03 init [3]
    2 root     34 19     0    0    0 S  0.0  0.0 0:00.00 [ksoftirqd/0]

vmstat 1:

ClusterNode2 root $ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd    free    buff   cache  si so  bi   bo   in   cs  us sy id wa
 1  0    0 5032012  60888  821008   0  0 216 6938 1952 5049  40  8 15 37
 0  1    0 5031392  60892  821632   0  0   0 8152 2126 5725  45  6  0 49
 0  1    0 5030896  60900  822144   0  0   0 8124 2052 5731  46  6  0 47
 0  1    0 5030400  60908  822768   0  0   0 8144 2124 5717  44  7  0 50
 1  0    0 5029904  60924  823272   0  0   0 8304 2062 5763  43  7  0 49

I've read (2004) that Xeon may have problems with context switching - does the problem still exist? Can I do something to minimize the problem?

postgresql.conf:
shared_buffers = 28672
effective_cache_size = 40
random_page_cost = 2
shmall & shmmax are set to 268435456

hdparm:
ClusterNode2 root $ hdparm -tT /dev/cciss/c0d0p1
/dev/cciss/c0d0p1:
 Timing buffer-cache reads: 3772 MB in 2.00 seconds = 1885.34 MB/sec
 Timing buffered disk reads: 150 MB in 2.06 seconds = 72.72 MB/sec

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Massive performance issues
This should be able to run _very_ fast.

At 01:42 PM 9/1/2005, Matthew Sackman wrote: Hi, I'm having performance issues with a table consisting of 2,043,133 rows. The schema is:

\d address
              Table "public.address"
     Column      |          Type          | Modifiers
-----------------+------------------------+-----------
 postcode_top    | character varying(2)   | not null
 postcode_middle | character varying(4)   | not null
 postcode_bottom | character varying(7)   | not null
 postcode        | character varying(10)  | not null
 property_type   | character varying(15)  | not null
 sale_type       | character varying(10)  | not null
 flat_extra      | character varying(100) | not null
 number          | character varying(100) | not null
 street          | character varying(100) | not null
 locality_1      | character varying(100) | not null
 locality_2      | character varying(100) | not null
 city            | character varying(100) | not null
 county          | character varying(100) | not null
Indexes:
 "address_city_index" btree (city)
 "address_county_index" btree (county)
 "address_locality_1_index" btree (locality_1)
 "address_locality_2_index" btree (locality_2)
 "address_pc_bottom_index" btree (postcode_bottom)
 "address_pc_middle_index" btree (postcode_middle)
 "address_pc_top_index" btree (postcode_top)
 "address_pc_top_middle_bottom_index" btree (postcode_top, postcode_middle, postcode_bottom)
 "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
 "address_postcode_index" btree (postcode)
 "address_property_type_index" btree (property_type)
 "address_street_index" btree (street)
 "street_prefix" btree (lower("substring"((street)::text, 1, 1)))

IOW, each row takes ~1KB on HD. First suggestion: format your HD to use 8KB pages with 1KB segments. That'll put each row down on HD as an atomic unit. 8KB pages also "play nice" with pg.

At 1KB per row, this table takes up ~2.1GB and should fit into RAM fairly easily on a decently configured DB server (my _laptop_ has 2GB of RAM after all...) Since you are using ~2.1GB for 2 years worth of data, 15 years worth should take no more than 2.1GB*7.5= 15.75GB.

If you replace some of those 100 char fields with integers for code numbers and have an auxiliary table for each of those fields mapping the code numbers to the associated 100 char string, you should be able to shrink a row considerably (a minimal sketch of the idea follows below). Your target is to have each row take <= 512B. Once a row fits into one 512B sector on HD, there's no point in making it smaller unless you can shrink it enough to fit 2 rows into one sector (<= 256B). Once two rows fit into one sector, there's no point shrinking a row unless you can make 3 rows fit into a sector. Etc.

Assuming each 100 char (e.g. 100B) field can be replaced with a 4B int, each row could be as small as 76B. That makes 85B per row the goal as it would allow you to fit 6 rows per 512B HD sector. So in the best case your table will be 12x smaller in terms of real HD space. Fitting one (or more) row(s) into one sector will cut down the real space used on HD for the table to ~7.88GB (or 1.32GB in the best case). Any such streamlining will make it faster to load, make the working set that needs to be in RAM for best performance smaller, etc, etc.

This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a SATA harddrive.

Upgrade pg to 8.0.3 and make sure you have enough RAM for your real day to day load. Frankly, RAM is so cheap ($75-$150/GB), I'd just upgrade the machine to 4GB as a matter of course. P4's have PAE, so if your mainboard can hold it, put more than 4GB of RAM in if you find you need it.
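A minimal sketch of the normalization suggested above, using the city column as the example (the lookup table and column names here are mine, not part of the OP's schema; repeat the pattern for street, locality_1, locality_2, county, etc.):

-- one small lookup table per big repeated varchar column
CREATE TABLE city_codes (
    city_id serial PRIMARY KEY,
    city    varchar(100) NOT NULL UNIQUE
);

INSERT INTO city_codes (city) SELECT DISTINCT city FROM address;

-- add the 4B integer column, fill it, then drop the wide text column
ALTER TABLE address ADD COLUMN city_id integer;
UPDATE address SET city_id = c.city_id FROM city_codes c WHERE c.city = address.city;
ALTER TABLE address DROP COLUMN city;
CREATE INDEX address_city_id_index ON address (city_id);

Queries then go through the lookup table, e.g. WHERE city_id = (SELECT city_id FROM city_codes WHERE city = 'Nottingham'), and the comparisons become cheap integer compares instead of long string compares.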
Since you are describing your workload as being predominantly reads, you can get away with far less HD capability as long as you crank up RAM high enough to hold the working set of the DB. The indications from the OP are that you may very well be able to hold the entire DB in RAM. That's a big win whenever you can achieve it. After these steps, there may still be performance issues that need attention, but the DBMS should be _much_ faster. Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Massive performance issues
At 04:25 PM 9/1/2005, Tom Lane wrote: Ron <[EMAIL PROTECTED]> writes: > ... Your target is to have each row take <= 512B. Ron, are you assuming that the varchar fields are blank-padded or something? I think it's highly unlikely that he's got more than a couple hundred bytes per row right now --- at least if the data is what it sounds like. As it stands, each row will take 55B - 748B and each field is variable in size up to the maximums given in the OP's schema. Since pg uses an underlying OS FS, and not a native one, there will be extra FS overhead no matter what we do, particularly to accommodate such flexibility... The goal is to minimize overhead and maximize regularity in layout. The recipe I know for HD IO speed is in keeping the data small, regular, and as simple as possible. Even better, if the table(s) can be made RAM resident, then searches, even random ones, can be very fast. He wants a 1000x performance improvement. Going from disk resident to RAM resident should help greatly in attaining that goal. In addition, by replacing as many variable sized text strings as possible with ints, the actual compare functions he used as examples should run faster as well. The upthread comment about strcoll() set off some alarm bells in my head. If the database wasn't initdb'd in C locale already, try making it so. Also, use a single-byte encoding if you can (LatinX is fine, Unicode not). Good thoughts I hadn't had. > Upgrade pg to 8.0.3 and make sure you have enough RAM for your real > day to day load. Newer PG definitely better. Some attention to the configuration parameters might also be called for. I fear though that these things are probably just chipping at the margins ... I don't expect 8.0.3 to be a major performance improvement. I do expect it to be a major _maintenance_ improvement for both him and those of us trying to help him ;-) The performance difference between not having the working set of the DB fit into RAM during ordinary operation vs having it be so (or better, having the whole DB fit into RAM during ordinary operation) has been considerably more effective than "chipping at the margins" IME. Especially so if the HD IO subsystem is wimpy. Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
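Checking what Tom describes takes a few seconds from psql; locale and encoding are fixed at initdb/createdb time, so if these come back as anything other than C and a single-byte encoding, the fix is a dump and reload into a cluster initdb'd with --locale=C:

SHOW lc_collate;       -- want 'C' here for cheap text comparisons
SHOW lc_ctype;
SHOW server_encoding;  -- a single-byte encoding such as LATIN1 rather than UNICODE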
Re: [PERFORM] Massive performance issues
don't think I've set anything to UTF8 or such like. > > I need to get to the stage where I can run queries such as: > > select street, locality_1, locality_2, city from address > > where (city = 'Nottingham' or locality_2 = 'Nottingham' > >or locality_1 = 'Nottingham') > > and upper(substring(street from 1 for 1)) = 'A' > > group by street, locality_1, locality_2, city > > order by street > > limit 20 offset 0 > > This might be a lot quicker than pulling all the records like in your example > queries... Yes, that certainly does seem to be the case - around 4 seconds. But I need it to be 10 times faster (or thereabouts) otherwise I have big problems! *beats drum* Get it in RAM, Get it in RAM, ... Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Massive performance issues
At 06:22 PM 9/1/2005, Matthew Sackman wrote: On Thu, Sep 01, 2005 at 06:05:43PM -0400, Ron wrote: > > Since I assume you are not going to run anything with the string > "unstable" in its name in production (?!), why not try a decent > production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a > OS more representative of what you are likely (or at least what is > safe...) to run in production? Well, you see, as ever, it's a bit complicated. The company I'm doing the development for has been subcontracted to do it and the contractor was contracted by the actual "client". So there are two companies involved in addition to the "client". Sadly, the "client" actually has dictated things like "it will be deployed on FreeBSD and thou shall not argue". At least get them to promise they will use a release the BSD folks mark "stable"! At this point in time, I actually have very little information about the specification of the boxen that'll be running this application. This is something I'm hoping to solve very soon. The worst part of it is that I'm not going have direct (ssh) access to the box and all configuration changes will most likely have to be relayed through techies at the "client" so fine tuning this is going to be a veritable nightmare. IME, what you have actually just said is "It will not be possible to safely fine tune the DB unless or until I have direct access; and/or someone who does have direct access is correctly trained." Ick. > >> > I need to get to the stage where I can run queries such as: > >> > select street, locality_1, locality_2, city from address > >> > where (city = 'Nottingham' or locality_2 = 'Nottingham' > >> >or locality_1 = 'Nottingham') > >> > and upper(substring(street from 1 for 1)) = 'A' > >> > group by street, locality_1, locality_2, city > >> > order by street > >> > limit 20 offset 0 > >> > >> This might be a lot quicker than pulling all the records like in > >your example > >> queries... > > > >Yes, that certainly does seem to be the case - around 4 seconds. But I > >need it to be 10 times faster (or thereabouts) otherwise I have big > >problems! > > *beats drum* Get it in RAM, Get it in RAM, ... Ok, but I currently have 2 million rows. When this launches in a couple of weeks, it'll launch with 5 million+ and then gain > a million a year. At my previously mentioned optimum of 85B per row, 2M rows is 170MB. 5M rows is 425MB. Assuming the gain of 1M rows per year, that's +85MB per year for this table. Up to 2GB DIMMs are currently standard, and 4GB DIMMs are just in the process of being introduced. Mainboards with anything from 4 to 16 DIMM slots are widely available. IOW, given the description you've provided this DB should _always_ fit in RAM. Size the production system such that the entire DB fits into RAM during ordinary operation with an extra 1GB of RAM initially tossed on as a safety measure and the client will be upgrading the HW because it's obsolete before they run out of room in RAM. I think the upshot of this all is 4GB RAM as a minimum and judicious use of normalization so as to avoid more expensive string comparisons and reduce table size is my immediate plan (along with proper configuration of pg). My suggestion is only slightly different. Reduce table size(s) and up the RAM to the point where the whole DB fits comfortably in RAM. You've got the rare opportunity to build a practical Memory Resident Database. It should run like a banshee when you're done. I'd love to see the benches on the final product. 
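As a sanity check on the bytes-per-row arithmetic above, the system catalogs already hold the numbers needed (they are only as fresh as the last VACUUM/ANALYZE, and this assumes the default 8KB block size):

SELECT relname,
       relpages::bigint * 8192 AS approx_bytes_on_disk,
       reltuples,
       CASE WHEN reltuples > 0
            THEN (relpages::bigint * 8192) / reltuples
            ELSE 0
       END AS approx_bytes_per_row
FROM pg_class
WHERE relname = 'address';

If approx_bytes_per_row comes back near 1KB before the normalization and drops substantially afterwards, the table really will fit in a very modest amount of RAM.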
Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] 8.x index insert performance
At 09:43 AM 11/14/2005, Kelly Burkhart wrote: On Fri, 2005-11-11 at 18:02 -0500, Tom Lane wrote: > > There very well could be a pattern in the data which could affect > > things, however, I'm not sure how to identify it in 100K rows out of > > 100M. > > I conjecture that the problem areas represent places where the key > sequence is significantly "more random" than it is elsewhere. Hard > to be more specific than that though. > OK, I understand the pattern now. My two tables hold orders, and order state transitions. Most orders have two transitions: creation and termination. The problem happens when there is a significant number of orders where termination is happening a long time after creation, causing order_transition rows with old ord_id values to be inserted. This is valid, so I have to figure out a way to accommodate it.

Perhaps a small schema change would help? Instead of having the order state transitions explicitly listed in the table, why not create two new tables: 1 for created orders and 1 for terminated orders. When an order is created, its ord_id goes into the CreatedOrders table. When an order is terminated, its ord_id is added to the TerminatedOrders table and then deleted from the CreatedOrders table (a rough sketch follows below).

Downsides to this approach are some extra complexity and that you will have to make sure that system disaster recovery includes making sure that no ord_id appears in both the CreatedOrders and TerminatedOrders tables. Upsides are that the insert problem goes away and certain kinds of accounting and inventory reports are now easier to create.

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
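A rough sketch of the two table variant described above (table and column names are illustrative only, not Kelly's actual schema):

CREATE TABLE created_orders (
    ord_id     bigint PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE terminated_orders (
    ord_id        bigint PRIMARY KEY,
    terminated_at timestamptz NOT NULL DEFAULT now()
);

-- terminating an order moves its id from one table to the other atomically
BEGIN;
INSERT INTO terminated_orders (ord_id) VALUES (12345);
DELETE FROM created_orders WHERE ord_id = 12345;
COMMIT;

Because created_orders only ever holds the currently open orders, its primary key index stays small and hot, which is the point of the exercise.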
Re: [PERFORM] Hardware/OS recommendations for large databases
Got some hard numbers to back your statement up? IME, the Areca 1160's with >= 1GB of cache beat any other commodity RAID controller. This seems to be in agreement with at least one independent testing source: http://print.tweakers.net/?reviews/557 RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any commodity HW solution, but their price point is considerably higher. ...on another note, I completely agree with the poster who says we need more cache on RAID controllers. We should all be beating on the RAID HW manufacturers to use standard DIMMs for their caches and to provide 2 standard DIMM slots in their full height cards (allowing for up to 8GB of cache using 2 4GB DIMMs as of this writing). It should also be noted that 64 drive chassis' are going to become possible once 2.5" 10Krpm SATA II and FC HDs become the standard next year (48's are the TOTL now). We need controller technology to keep up. Ron At 12:16 AM 11/16/2005, Alex Turner wrote: Not at random access in RAID 10 they aren't, and anyone with their head screwed on right is using RAID 10. The 9500S will still beat the Areca cards at RAID 10 database access patern. Alex. On 11/15/05, Dave Cramer <[EMAIL PROTECTED]> wrote: > Luke, > > Have you tried the areca cards, they are slightly faster yet. > > Dave > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote: > > > > > > I agree - you can get a very good one from www.acmemicro.com or > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read > > performance on these (with tuning) on Linux using the xfs filesystem, > > which is one of the most critical factors for large databases. > > > > > Note that you want to have your DBMS use all of the CPU and disk channel > > bandwidth you have on each query, which takes a parallel database like > > Bizgres MPP to achieve. > > > > > Regards, > ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases
You _ARE_ kidding right? In what hallucination? The performance numbers for the 1GB cache version of the Areca 1160 are the _grey_ line in the figures, and were added after the original article was published: "Note: Since the original Dutch article was published in late January, we have finished tests of the 16-port Areca ARC-1160 using 128MB, 512MB and 1GB cache configurations and RAID 5 arrays of up to 12 drives. The ARC-1160 was using the latest 1.35 beta firmware. The performance graphs have been updated to include the ARC-1160 results. Discussions of the results have not been updated, however. " With 1GB of cache, the 1160's beat everything else in almost all of the tests they participated in. For the few where they do not win hands down, the Escalade's (very occasionally) essentially tie. These are very easy to read full color graphs where higher is better and the grey line representing the 1GB 1160's is almost always higher on the graph than anything else. Granted the Escalades seem to give them the overall best run for their money, but they still are clearly second best when looking at all the graphs and the CPU utilization numbers in aggregate. Ron At 12:08 PM 11/16/2005, Alex Turner wrote: Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10, the 3ware controllers beat the Areca card. Alex. On 11/16/05, Ron <[EMAIL PROTECTED]> wrote: > Got some hard numbers to back your statement up? IME, the Areca > 1160's with >= 1GB of cache beat any other commodity RAID > controller. This seems to be in agreement with at least one > independent testing source: > > http://print.tweakers.net/?reviews/557 > > RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any > commodity HW solution, but their price point is considerably higher. > > ...on another note, I completely agree with the poster who says we > need more cache on RAID controllers. We should all be beating on the > RAID HW manufacturers to use standard DIMMs for their caches and to > provide 2 standard DIMM slots in their full height cards (allowing > for up to 8GB of cache using 2 4GB DIMMs as of this writing). > > It should also be noted that 64 drive chassis' are going to become > possible once 2.5" 10Krpm SATA II and FC HDs become the standard next > year (48's are the TOTL now). We need controller technology to keep up. > > Ron > > At 12:16 AM 11/16/2005, Alex Turner wrote: > >Not at random access in RAID 10 they aren't, and anyone with their > >head screwed on right is using RAID 10. The 9500S will still beat the > >Areca cards at RAID 10 database access patern. > > > >Alex. > > > >On 11/15/05, Dave Cramer <[EMAIL PROTECTED]> wrote: > > > Luke, > > > > > > Have you tried the areca cards, they are slightly faster yet. > > > > > > Dave > > > > > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote: > > > > > > > > > > > > > > > > > > I agree - you can get a very good one from www.acmemicro.com or > > > > > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA > > > > > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM > > > > > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read > > > > > > performance on these (with tuning) on Linux using the xfs filesystem, > > > > > > which is one of the most critical factors for large databases. > > > > > > > > > > > > > > > Note that you want to have your DBMS use all of the CPU and disk channel > > > > > > bandwidth you have on each query, which takes a parallel database like > > > > > > Bizgres MPP to achieve. 
> > > > > > > > > > > > > > > Regards, > > > > > > >---(end of broadcast)--- > >TIP 2: Don't 'kill -9' the postmaster > > > > ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases
Amendment: there are graphs where the 1GB Areca 1160's do not do as well. Given that they are mySQL specific and that similar usage scenarios not involving mySQL (as well as most of the usage scenarios involving mySQL; as I said these did not follow the pattern of the rest of the benchmarks) show the usual pattern of the 1GB 1160's in 1st place or tied for 1st place, it seems reasonable that mySQL has something to due with the aberrant results in those 2 (IIRC) cases. Ron At 03:57 PM 11/16/2005, Ron wrote: You _ARE_ kidding right? In what hallucination? The performance numbers for the 1GB cache version of the Areca 1160 are the _grey_ line in the figures, and were added after the original article was published: "Note: Since the original Dutch article was published in late January, we have finished tests of the 16-port Areca ARC-1160 using 128MB, 512MB and 1GB cache configurations and RAID 5 arrays of up to 12 drives. The ARC-1160 was using the latest 1.35 beta firmware. The performance graphs have been updated to include the ARC-1160 results. Discussions of the results have not been updated, however. " With 1GB of cache, the 1160's beat everything else in almost all of the tests they participated in. For the few where they do not win hands down, the Escalade's (very occasionally) essentially tie. These are very easy to read full color graphs where higher is better and the grey line representing the 1GB 1160's is almost always higher on the graph than anything else. Granted the Escalades seem to give them the overall best run for their money, but they still are clearly second best when looking at all the graphs and the CPU utilization numbers in aggregate. Ron At 12:08 PM 11/16/2005, Alex Turner wrote: Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10, the 3ware controllers beat the Areca card. Alex. On 11/16/05, Ron <[EMAIL PROTECTED]> wrote: > Got some hard numbers to back your statement up? IME, the Areca > 1160's with >= 1GB of cache beat any other commodity RAID > controller. This seems to be in agreement with at least one > independent testing source: > > http://print.tweakers.net/?reviews/557 > > RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any > commodity HW solution, but their price point is considerably higher. > > ...on another note, I completely agree with the poster who says we > need more cache on RAID controllers. We should all be beating on the > RAID HW manufacturers to use standard DIMMs for their caches and to > provide 2 standard DIMM slots in their full height cards (allowing > for up to 8GB of cache using 2 4GB DIMMs as of this writing). > > It should also be noted that 64 drive chassis' are going to become > possible once 2.5" 10Krpm SATA II and FC HDs become the standard next > year (48's are the TOTL now). We need controller technology to keep up. > > Ron > > At 12:16 AM 11/16/2005, Alex Turner wrote: > >Not at random access in RAID 10 they aren't, and anyone with their > >head screwed on right is using RAID 10. The 9500S will still beat the > >Areca cards at RAID 10 database access patern. > > > >Alex. > > > >On 11/15/05, Dave Cramer <[EMAIL PROTECTED]> wrote: > > > Luke, > > > > > > Have you tried the areca cards, they are slightly faster yet. 
> > > > > > Dave > > > > > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote: > > > > > > > > > > > > > > > > > > I agree - you can get a very good one from www.acmemicro.com or > > > > > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA > > > > > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM > > > > > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read > > > > > > performance on these (with tuning) on Linux using the xfs filesystem, > > > > > > which is one of the most critical factors for large databases. > > > > > > > > > > > > > > > Note that you want to have your DBMS use all of the CPU and disk channel > > > > > > bandwidth you have on each query, which takes a parallel database like > > > > > > Bizgres MPP to achieve. > > > > > > > > > > > > > > > Regards, > > > > > > >---(end of broadcast)--- > >TIP 2: Don't 'kill -9' the postmaster > > > > ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases
While I agree with you in principle that pg becomes CPU bound relatively easily compared to other DB products (at ~110-120MBps according to a recent thread), there's a bit of hyperbole in your post.

a. There's a big difference between the worst performing 1C x86 ISA CPU available and the best performing 2C one (IIRC, that's the 2.4GHz, 1MB L2 cache AMDx2 4800+ as of this writing).

b. Two 2C CPU's vs one 1C CPU means that a pg process will almost never be waiting on other non pg processes. It also means that 3-4 pg processes, CPU bound or not, can execute in parallel. Not an option with one 1C CPU.

c. Mainboards with support for multiple CPUs and lots of RAM are _not_ the cheap ones.

d. No one should ever use RAID 0 for valuable data. Ever. So at the least you need 4 HD's for a RAID 10 set (RAID 5 is not a good option unless write performance is unimportant. 4HD RAID 5 is particularly not a good option.)

e. The server usually needs to talk to things over a network connection. Often performance here matters. Mainboards with 2 1GbE NICs and/or PCI-X (or PCI-E) slots for 10GbE cards are not the cheap ones.

f. Trash HDs mean poor IO performance and lower reliability. While TOTL 15Krpm 4Gb FC HDs are usually overkill (not always; it depends on context), you at least want SATA II HDs with NCQ or TCQ support. And you want them to have a decent media warranty - preferably a 5 year one if you can get it. Again, these are not the cheapest HD's available.

g. Throughput limitations say nothing about latency considerations. OLTP-like systems _want_ HD spindles. AMAP. Even non OLTP-like systems need a fair number of spindles to optimize HD IO: dedicated WAL set, multiple dedicated DB sets, dedicated OS and swap space set, etc, etc. At 50MBps ASTR, you need 16 HD's operating in parallel to saturate the bandwidth of a PCI-X channel. That's ~8 independent pg tasks (queries using different tables, dedicated WAL IO, etc) running in parallel. Regardless of application domain.

h. Decent RAID controllers and HBAs are not cheap either. Even SW RAID benefits from having a big dedicated RAM buffer to talk to.

While the above may not cost you $80K, it sure isn't costing you $1K either. Maybe ~$15-$20K, but not $1K.

Ron

At 01:07 AM 11/18/2005, Luke Lonergan wrote: Greg, On 11/17/05 9:17 PM, "Greg Stark" <[EMAIL PROTECTED]> wrote: > Ok, a more productive point: it's not really the size of the database that > controls whether you're I/O bound or CPU bound. It's the available I/O > bandwidth versus your CPU speed. Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1. A $1,000 system with one CPU and two SATA disks in a software RAID0 will perform exactly the same as a $80,000 system with 8 dual core CPUs and the world's best SCSI RAID hardware on a large database for decision support (what the poster asked about). Regards, - Luke ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases
Breaking the ~120MBps pg IO ceiling by any means is an important result. Particularly when you get a ~2x improvement. I'm curious how far we can get using simple approaches like this.

At 10:13 AM 11/18/2005, Luke Lonergan wrote: Dave, On 11/18/05 5:00 AM, "Dave Cramer" <[EMAIL PROTECTED]> wrote: > > Now there's an interesting line drawn in the sand. I presume you have > numbers to back this up ? > > This should draw some interesting posts. Part 2: The answer

System A: This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel. On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: that's 66 MB/s. Not the efficiency I'd like from the onboard SATA controller; I would have expected to get 85% of the 100MB/s raw read performance.

Have you tried the large read ahead trick with this system? It would be interesting to see how much it would help. It might even be worth it to do the experiment at all of [default, 2x default, 4x default, 8x default, etc] read ahead until either a) you run out of resources to support the desired read ahead, or b) performance levels off. I can imagine the results being very enlightening.

System B: This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead. It's running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel. Same test as above, but with 17GB of data takes 69.7 seconds to scan (!) That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s. This system is running with a 16MB Linux readahead setting, let's try it with the default (I think) setting of 256KB - AHA! Now we get 171.4 seconds or 99.3MB/s.

The above experiment would seem useful here as well (a quick way to measure the scan rate from inside psql is sketched below).

Summary: OK - you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O. Still, there is a CPU limit here - this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldn't go any faster than 244MB/s. - Luke

I respect your honesty in reporting results that were different from your expectations or previously taken stance. Alan Stange's comment re: the use of direct IO along with your comments re: async IO and mem copies plus the results of these experiments could very well point us directly at how to most easily solve pg's CPU boundness during IO. [HACKERS] are you watching this?

Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
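For anyone repeating the readahead sweep, the scan rate is easy to measure from inside psql as well. A hedged sketch against a hypothetical table named bigtable; remember to flush or overflow the OS cache between runs, otherwise you are measuring RAM rather than the IO path:

\timing
-- table size in MB, assuming the default 8KB block size (128 blocks = 1MB)
SELECT relpages / 128 AS size_mb FROM pg_class WHERE relname = 'bigtable';

-- force a full sequential scan; MB/s = size_mb / elapsed seconds
SELECT count(*) FROM bigtable;

Re-run the count(*) after each readahead change and plot MB/s against the readahead setting; the knee of that curve is where the CPU, not the IO path, becomes the limit.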
Re: [PERFORM] weird performances problem
If I understand your HW config correctly, all of the pg stuff is on the same RAID 10 set? If so, give WAL its own dedicated RAID 10 set. This looks like the old problem of everything stalling while WAL is being committed to HD.

This concept works for other tables as well. If you have tables that both want service at the same time, disk arm contention will drag performance through the floor when they are on the same HW set. Profile your HD access (one way to see which tables actually hit the disk is sketched below) and put tables that want to be accessed at the same time on different HD sets. Even if you have to buy more HW to do it.

Ron

At 04:56 AM 11/22/2005, Guillaume Smet wrote: Qingqing Zhou wrote: Someone is doing a massive *write* at this time, which makes your query *read* quite slow. Can you find out which process is doing the write? Indexes should be in memory so I don't expect a massive write to slow down the select queries. sdb is the RAID10 array dedicated to our data so the postgresql process is the only one to write on it. I'll check which write queries are running because there should really be only a few updates/inserts on our db during the day. Over a four day log analysis, I have the following:
SELECT 403,964
INSERT 574
UPDATE 393
DELETE 26
So it's not really normal to have a massive write during the day. Thanks for your help -- Guillaume ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
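To see which tables (and indexes) are actually going to disk rather than being served from cache - and so are the candidates for their own spindles - the block level statistics views help. This assumes the stats collector is running with stats_block_level = on in postgresql.conf:

SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
FROM pg_statio_user_tables
ORDER BY heap_blks_read + idx_blks_read DESC
LIMIT 10;

Tables that show up at the top of this list at the same times of day are the ones fighting each other for the disk arms.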
Re: [PERFORM] weird performances problem
At 09:26 AM 11/22/2005, Guillaume Smet wrote: Ron wrote: If I understand your HW config correctly, all of the pg stuff is on the same RAID 10 set? No, the system and the WAL are on a RAID 1 array and the data on their own RAID 10 array.

As has been noted many times around here, put the WAL on its own dedicated HD's. You don't want any head movement on those HD's.

As I said earlier, there's only a few writes in the database so I'm not really sure the WAL can be a limitation: IIRC, it's only used for writes isn't it?

When a WAL checkpoint is reached, pg flushes all the dirty pages it has accumulated to HD... ...and anything else contending for those spindles suffers until that flush is done.

Don't you think we should have some io wait if the database was waiting for the WAL? We _never_ have any io wait on this server but our CPUs are still 30-40% idle.

_Something_ is doing long bursts of write IO on sdb and sdb1 every 30 minutes or so according to your previous posts. Profile your DBMS and find out what (one quick checkpoint test is sketched below).

A typical top we have on this server is:

15:22:39 up 24 days, 13:30, 2 users, load average: 3.86, 3.96, 3.99
156 processes: 153 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user  nice  system  irq  softirq  iowait  idle
             total  50.6%  0.0%   4.7%  0.0%    0.6%    0.0%  43.8%
             cpu00  47.4%  0.0%   3.1%  0.3%    1.5%    0.0%  47.4%
             cpu01  43.7%  0.0%   3.7%  0.0%    0.5%    0.0%  51.8%
             cpu02  58.9%  0.0%   7.7%  0.0%    0.1%    0.0%  33.0%
             cpu03  52.5%  0.0%   4.1%  0.0%    0.1%    0.0%  43.0%
Mem: 3857224k av, 3307416k used, 549808k free, 0k shrd, 80640k buff
     2224424k actv, 482552k in_d, 49416k in_c
Swap: 4281272k av, 10032k used, 4271240k free, 2602424k cached

As you can see, we don't swap, we have free memory, we have all our data cached (our database size is 1.5 GB). Context switches are between 10,000 and 20,000 per second.

That's actually a reasonably high CS rate. Again, why?

This concept works for other tables as well. If you have tables that both want service at the same time, disk arm contention will drag performance through the floor when they are on the same HW set. Profile your HD access and put tables that want to be accessed at the same time on different HD sets. Even if you have to buy more HW to do it.

I use iostat and I can only see a little write activity and no read activity on both raid arrays.

Remember it's not just the overall amount, it's _when_ and _where_ the write activity takes place. If you have almost no write activity, but whenever it happens it all happens to the same place by multiple things contending for the same HDs, your performance during that time will be poor. Since the behavior you are describing fits that cause very well, I'd see if you can verify that's what's going on.

Ron ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
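One cheap way to test whether checkpoints are behind those bursts: force one by hand while watching the arrays, and see if the same write pattern shows up on sdb. CHECKPOINT is superuser only:

-- run from psql as a superuser while iostat is running in another window;
-- if sdb lights up the same way it does during the slowdowns,
-- checkpoints are the culprit
CHECKPOINT;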
Re: [PERFORM] weird performances problem
At 10:26 AM 11/22/2005, Guillaume Smet wrote: Ron, First of all, thanks for your time.

Happy to help.

As has been noted many times around here, put the WAL on its own dedicated HD's. You don't want any head movement on those HD's.

Yep, I know that. It's just that we supposed it was not so important for a nearly readonly database, which is wrong according to you.

It's just good practice with pg that pg_xlog should always get its own dedicated HD set. OTOH, I'm not at all convinced given the scant evidence so far that it is the primary problem here; particularly since, if I understand you correctly, pg_xlog is not on sdb or sdb1 where you are having the write storm.

_Something_ is doing long bursts of write IO on sdb and sdb1 every 30 minutes or so according to your previous posts.

It's not every 30 minutes. It's a 20-30 minute slowdown 3-4 times a day when we have a high load.

Thanks for the correction, and I apologize for the misunderstanding. Clearly the first step is to instrument sdb and sdb1 so that you understand exactly what is being accessed and written on them.

Possibilities that come to mind:
a) Are some of your sorts requiring more than 32MB during high load? If sorts do not usually require HD temp files and suddenly do, you could get behavior like yours (a quick work_mem experiment is sketched below).
b) Are you doing too many 32MB sorts during high load? Same comment as above.
c) Are you doing some sort of data updates or appends during high load that do not normally occur?
d) Are you constantly doing "a little" write IO that turns into a write storm under high load because of queuing issues?

Put in the scaffolding needed to trace _exactly_ what's happening on sdb and sdb1 throughout the day and then examine the traces over a day, a few days, and a week. I'll bet you will notice some patterns that will be helpful in identifying what's going on.

Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
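For points (a) and (b), the current sort memory setting is easy to check and to raise for a single session as an experiment (the parameter is work_mem on 8.0 and later, sort_mem on older releases; the value is in KB on these versions):

SHOW work_mem;          -- 32768 here would match the 32MB mentioned above

-- raise it for this session only, then re-run one of the slow reports
SET work_mem = 131072;  -- 128MB, purely as a test

If the slowdown shrinks when the big reporting queries get more sort memory, the write bursts on sdb are almost certainly sort/temp file spill.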
Re: [PERFORM] Open request for benchmarking input
At 03:15 PM 11/26/2005, Luke Lonergan wrote: I suggest specifying a set of basic system / HW benchmarks to baseline the hardware before each benchmark is run. This has proven to be a major issue with most performance tests. My pick for I/O is bonnie++. Your equipment allows you the opportunity to benchmark all 5 machines running together as a cluster - this is important to measure maturity of solutions for high performance warehousing. Greenplum can provide you a license for Bizgres MPP for this purpose.

...and detailed config / tuning specs as well for it, or everyone is probably wasting their time. For instance, it seems fairly clear that the default 8KB block size and default read ahead size are both pessimal, at least for non OLTP-like apps. In addition, there's been a reasonable amount of evidence that xfs should be the file system of choice for pg. Things like optimal RAID stripe size, how to allocate tables to various IO HW, and what levels of RAID to use for each RAID set also have to be defined.

The 16x SATA drives should be great, provided you have a high performance RAID adapter configured properly. You should be able to get 800MB/s of sequential scan performance by using a card like the 3Ware 9550SX. I've also heard that the Areca cards are good (how good?). Configuration of the I/O must be validated though - I've seen as low as 25MB/s from a misconfigured system.

The Areca cards, particularly with 1-2GB of buffer cache, are the current commodity RAID controller performance leader. Better performance can be gotten out of HW from vendors like Xyratex, but it will cost much more.

Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases
At 01:18 AM 11/27/2005, Luke Lonergan wrote: For data warehousing it's pretty well open and shut. To use all cpus and io channels on each query you will need mpp. Has anyone done the math on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-gb per second range.

More than a bit of hyperbole there Luke. Some common RW scenarios:

Dual 1GbE NICs => 200MBps => 5TB in 5x10^12/2x10^8= 25000secs= ~6hrs57mins. Network stuff like re-transmits of dropped packets can increase this, so network SLA's are critical.

Dual 10GbE NICs => ~1.6GBps (10GbE NICs can't yet do over ~800MBps apiece) => 5x10^12/1.6x10^9= 3125secs= ~52mins. SLA's are even more critical here.

If you are pushing 5TB around on a regular basis, you are not wasting your time & money on commodity <= 300MBps RAID HW. You'll be using 800MBps and 1600MBps high end stuff, which means you'll need ~1-2hrs to sequentially scan 5TB on physical media. Clever use of RAM can get a 5TB sequential scan down to ~17mins.

Yes, it's a lot of data. But sequential scan times should be in the mins or low single digit hours, not days. Particularly if you use RAM to maximum advantage.

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases
At 02:11 PM 11/27/2005, Luke Lonergan wrote: Ron, On 11/27/05 9:10 AM, "Ron" <[EMAIL PROTECTED]> wrote: > Clever use of RAM can get a 5TB sequential scan down to ~17mins. > > Yes, it's a lot of data. But sequential scan times should be in the > mins or low single digit hours, not days. Particularly if you use > RAM to maximum advantage. Unfortunately, RAM doesn't help with scanning from disk at all. I agree with you if you are scanning a table "cold", having never loaded it before, or if the system is not (or can't be) set up properly with appropriate buffers. However, outside of those 2 cases there are often tricks you can use with enough RAM (and no, you don't need RAM equal to the size of the item(s) being scanned) to substantially speed things up. Best case, you can get performance approximately equal to that of a RAM resident scan. WRT using network interfaces to help - it's interesting, but I think what you'd want to connect to is other machines with storage on them. Maybe. Or maybe you want to concentrate your storage in a farm that is connected by network or Fiber Channel to the rest of your HW. That's what a NAS or SAN is after all. "The rest of your HW" nowadays is often a cluster of RAM rich hosts. Assuming 64GB per host, 5TB can be split across ~79 hosts if you want to make it all RAM resident. Most don't have that kind of budget, but thankfully it is not usually necessary to make all of the data RAM resident in order to obtain if not all of the performance benefits you'd get if all of the data was. Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
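The "~79 hosts" figure is just a ceiling division. A two-line check, ignoring the RAM each host needs for its own OS and database processes (which pushes the real number somewhat higher):

# hosts needed to hold 5TB entirely in RAM at 64GB usable per host
import math
print(math.ceil(5e12 / 64e9))   # 78.125 rounds up to 79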
Re: [PERFORM] pg_dump slow
At 08:35 AM 11/30/2005, Franklin Haut wrote: Hi, I'm using PostgreSQL on Windows 2000; pg_dump takes around 50 minutes to back up 200MB of data (with no compression, and 15MB with compression),

Compression is reducing the data to 15/200 = 3/40 = 7.5% of original size?

but on Windows XP it takes no more than 40 seconds... :(

You mean that pg_dump dies after 40 secs under WinXP, and therefore you have a WinXP problem? Or do you mean that pg_dump takes 40 secs to complete under WinXP and 50 minutes under W2K, and therefore you have a W2K problem? In fact, either 15MB/40secs = 375KBps or 200MB/40secs = 5MBps is _slow_, so there's a problem under either platform!

This happens with both 8.1 and 8.0 -- has anybody else run into the same situation? Could it be a configuration issue in the priorities of the existing processes? On Windows XP the process runs at about 70% CPU with constant accesses to the hard disk, while on Windows 2000 it does not exceed 3%.

Assuming Win XP completes the dump, the first thing to do is *don't use W2K*. M$ has stopped supporting it in anything but absolutely minimum fashion anyway. _If_ you are going to use an M$ OS you should be using WinXP. (You want to pay licensing fees for your OS, but you are using free DB SW? Huh? If you are trying to save $$$, use Open Source SW like Linux or *BSD. pg will perform better under it, and it's cheaper!)

Assuming that for some reason you can't/won't migrate to a non-M$ OS, the next problem is the slow HD IO you are getting under WinXP. What is the HW involved here? Particularly the HD subsystem and the IO bus(es) it is plugged into? For some perspective, raw HD average IO rates for even reasonably modern 7200rpm HD's are in the ~50MBps per HD range. Top of the line 15Krpm SCSI and FC HD's have raw average IO rates of just under 80MBps per HD as of this post. Given that most DB's are not on 1 HD (if your DB _is_ on only 1 HD, change that ASAP before you lose data...), for anything other than a 2 HD RAID 1 set I'd expect raw HD average IO rates to be at least 100MBps. If you are getting >= 100MBps of average HD IO, you should be getting > 5MBps during pg_dump, and certainly > 375KBps!

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
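When comparing the two platforms, it helps to measure the dump the same way on both. A minimal sketch that times pg_dump and reports effective throughput is below; the database name, output file, and the assumption that pg_dump is on the PATH are all examples.

# dump_rate.py -- time pg_dump and report effective MB/s so the W2K and
# WinXP numbers can be compared on equal terms. DBNAME and OUTFILE are
# placeholders; add -Fc to the command to measure the compressed case.
import os, subprocess, time

DBNAME, OUTFILE = "mydb", "mydb.dump"

t0 = time.time()
subprocess.check_call(["pg_dump", "-f", OUTFILE, DBNAME])
elapsed = time.time() - t0
mb = os.path.getsize(OUTFILE) / (1024.0 * 1024.0)
print("%.1f MB in %.1f secs = %.2f MB/s" % (mb, elapsed, mb / elapsed))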
Re: RES: [PERFORM] pg_dump slow
At 12:27 PM 11/30/2005, Richard Huxton wrote: Franklin Haut wrote: Hi, Yes, my problem is that the pg_dump takes 40 secs to complete under WinXP and 50 minutes under W2K! The same database, the same hardware!, only diferrent Operational Systems. The hardware is:Pentium4 HT 3.2 GHz 1024 MB Memory Get the RAM up to at least 4096MB= 4GB for a DB server. 4 1GB DIMMs or 2 2GB DIMMS are ~ the same $$ as a HD (~$250-$300 US) and well worth the expense. HD 120GB SATA "b" is "bit". "B" is "Byte". I made the correction. You have =1= HD? and you are using it for everything: OS, pq, swap, etc? Very Bad Idea. At the very least, a DB server should have the OS on separate spindles from pg, and pg tables should be on something like a 4 HD RAID 10. At the very least. DB servers are about HDs. Lots and lots of HDs compared to anything outside the DB realm. Start thinking in terms of at least 6+ HD's attached to the system in question (I've worked on system with literally 100's). Usually only a few of these are directly attached to the DB server and most are attached by LAN or FC. But the point remains: DBs and DB servers eat HDs in prodigious quantities. There have been reports of very slow network performance on Win2k systems with the default configuration. You'll have to check the archives for details I'm afraid. This might apply to you. Unless you are doing IO across a network, this issue will not apply to you. By default W2K systems often had a default TCP/IP packet size of 576B and a tiny RWIN. Optimal for analog modems talking over noisy POTS lines, but horrible for everything else Packet size needs to be boosted to 1500B, the maximum. RWIN should be boosted to _at least_ the largest number <= 2^16 that you can use without TCP scaling. Benchmark network IO rates. Then TCP scaling should be turned on and RWIN doubled and network IO benched again. Repeat until there is no performance benefit to doubling RWIN or you run out of RAM that you can afford to toss at the problem or you hit the max for RWIN (very doubtful). If you're happy that doesn't affect you then I'd look at the disk system - perhaps XP has newer drivers than Win2k. I'll reiterate: Do _not_ run a production DB server on W2K. M$ has obsoleted the platform and that it is not supported _nor_ any of reliable, secure, etc. etc. A W2K based DB server, particularly one with a connection to the Internet, is a ticking time bomb at this point. Get off W2K as a production platform ASAP. Take to your CEO/Dean/whatever you call your Fearless Leader if you have to. Economically and probably performance wise, it's best to use an Open Source OS like Linux or *BSD. However, if you must use M$, at least use OS's that M$ is actively supporting. Despite M$ marketing propaganda and a post in this thread to the contrary, you =CAN= often run a production DB server under WinXP and not pay M$ their usurious licensing fees for W2003 Server or any of their other products with "server" in the title. How much RAM and how many CPUs you want in your DB server is the main issue. For a 1P, <= 4GB RAM vanilla box, WinXp will work just fine. What do the MS performance-charts show is happening? Specifically, CPU and disk I/O. His original post said ~3% CPU under W2K and ~70% CPU under WinXP Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: RES: [PERFORM] pg_dump slow
ral or illegal for anyone to use the industry standard layered architecture of having a DB connection layer separate from a Queuing system. M$MQ is provided _specifically_ for that use. Certainly "twiddling the bits" inside a M$ OS violates the EULA, and I'm not advocating anything of the kind. OTOH, that Draconian EULA is yet _another_ reason to get rid of M$ OS's in one's organization. When I buy something, it is _mine_. You can tell me you won't support it if I modify it, but it's the height of Hubris to tell me that I'm not allowed to modify SW I paid for and own. Tell your managers/employers at M$ that Customer Service and Respecting Customers =keeps= customers. The reverse loses them. Period. > how many CPUs you want in your DB server is the main issue. For a > 1P, <= 4GB RAM vanilla box, WinXp will work just fine. Now, who is guilty of propaganda here? There is no propaganda here. The statement is accurate in terms of the information given. The biggest differentiations among M$ licenses is the CPU and RAM limit. Also, your comments regarding hard disks while correct in the general sense are not helpful. This is clearly not a disk bandwidth problem. As Evidenced By? His IO numbers are p*ss poor for any reasonable RAID setup, and 375KBps is bad even for a single HD. He's claiming this is local IO, not network, so that possibility is out. If you feel this is "clearly not a disk bandwidth problem", I fail to see your evidence or your alternative hypothesis. > >What do the MS performance-charts show is happening? Specifically, > >CPU and disk I/O. > His original post said ~3% CPU under W2K and ~70% CPU under WinXP Slow performance in extraction of bytea column strongly suggests tcp/ip. issue. I bet if you blanked out bytea column pg_dump will be fast. Franlin: are you making pg_dump from local or remote box and is this a clean install? Try fresh patched win2k install and see what happens. He claimed this was local, not network. It is certainly an intriguing possibility that W2K and WinXP handle bytea differently. I'm not competent to comment on that however. Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] 15,000 tables
Agreed. Also, the odds of fs corruption or data loss are higher in a non-journaling fs. Best practice seems to be to use a journaling fs but to put the fs log on dedicated spindles separate from the actual fs or pg_xlog.

Ron

At 01:40 PM 12/1/2005, Tino Wildenhain wrote: On Thursday, 2005-12-01 at 10:07 -0800, Gavin M. Roy wrote: > Hi Michael, > > I'm a fan of ReiserFS, and I could be wrong, but I believe using a > journaling filesystem for the PgSQL database could be slowing things > down. Have a 200G+ database, someone pulling the power plug or a regular reboot after a year or so. Wait for the fsck to finish. Now think again :-) ++Tino

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] 15,000 tables
Agreed, and I apologize for the imprecision of my post below. I should have written: "Best practice seems to be to use a journaling fs and log metadata only and put it on separate dedicated spindles." I've seen enough HD failures that I tend to be paranoid and log the metadata of fs dedicated to WAL as well, but that may very well be overkill. Ron At 01:57 PM 12/1/2005, Tom Lane wrote: Ron <[EMAIL PROTECTED]> writes: > Agreed. Also the odds of fs corruption or data loss are higher in a > non journaling fs. Best practice seems to be to use a journaling fs > but to put the fs log on dedicated spindles separate from the actual > fs or pg_xlog. I think we've determined that best practice is to journal metadata only (not file contents) on PG data filesystems. PG does expect the filesystem to remember where the files are, so you need metadata protection, but journalling file content updates is redundant with PG's own WAL logging. On a filesystem dedicated to WAL, you probably do not need any filesystem journalling at all --- we manage the WAL files in a way that avoids changing metadata for a WAL file that's in active use. A conservative approach would be to journal metadata here too, though. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] two disks - best way to use them?
At 01:58 PM 12/2/2005, Rick Schumeyer wrote: I installed another drive in my linux pc in an attempt to improve performance on a large COPY to a table with a geometry index. Based on previous discussion, it seems there are three things competing for the hard drive: 1) the input data file 2) the pg table 3) the WAL What is the best way to distribute these among two drives? From Tom's comments I would think that the pg table and the WAL should be separate. Does it matter where the input data is? Best is to have 3 HD or HD sets, one for each of the above. With only 2, and assuming the input file is too large to fit completely into RAM at once, I'd test to see whether: a= input on one + pg table & WAL on the other, or b= WAL on one + pg table & input file on the other is best. If the input file can be made 100% RAM resident, then use c= pg table on one + WAL and input file on the other. The big goal here is to minimize HD head seeks. Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
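For option b (WAL on one drive, pg table plus input file on the other), the usual way to move the WAL with this vintage of PostgreSQL is to relocate pg_xlog and leave a symlink behind. A sketch, with hypothetical paths and the obvious caveat that the postmaster must be stopped first:

# move_xlog.py -- put pg_xlog on the second drive and symlink it back
# into the data directory. Run only with the postmaster shut down.
# DATA_DIR and NEW_HOME are assumptions; NEW_HOME must not already exist.
import os, shutil

DATA_DIR = "/var/lib/pgsql/data"       # hypothetical data directory
NEW_HOME = "/mnt/disk2/pg_xlog"        # location on the second drive

old = os.path.join(DATA_DIR, "pg_xlog")
shutil.move(old, NEW_HOME)             # moves the WAL segments across
os.symlink(NEW_HOME, old)              # the postmaster follows the symlink
print("pg_xlog now lives on", NEW_HOME)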
Re: [PERFORM] BLCKSZ
At 04:32 PM 12/5/2005, Olleg wrote: Tom Lane wrote: Olleg Samoylov <[EMAIL PROTECTED]> writes: I tried to test this. Linux, memory page 4kb, disk page 4kb. I set BLCKSZ to 4kb. I got some performance improvement, but not big, maybe because I have 4Gb on the test server (amd64).

It's highly unlikely that reducing BLCKSZ is a good idea. There are bad side-effects on the maximum index entry size, maximum number of tuple fields, etc.

Yes, when I set BLCKSZ=512 the database doesn't work. With BLCKSZ=1024 the database is very slow. (This surprised me. I expected an 8x performance increase with 1024 BLCKSZ. :) )

No wonder pg did not work or was very slow. BLCKSZ= 512 or 1024 means 512 or 1024 *Bytes* respectively. That's 1/16 and 1/8 the default 8KB BLCKSZ.

As I have already seen on this mailing list, increasing BLCKSZ can reduce performance too.

Where? BLCKSZ as large as the 32KB compile-time maximum has been shown to improve performance. If running a RAID, BLCKSZ of ~1/2 the RAID stripe size seems to be a good value.

Maybe an optimum value exists? Theoretically, BLCKSZ equal to the memory/disk page/block size may reduce the defragmentation drawback of memory and disk.

Of course there's an optimal value... ...and of course it is dependent on your HW, OS, and DB application. In general, and in a very fuzzy sense, "bigger is better". pg files are laid down in 1GB chunks, so there's probably one limitation. Given the HW you have mentioned, I'd try BLCKSZ= 32768 (the maximum; you have to recompile PostgreSQL to change it) and a RAID stripe of 64KB or 128KB as a first guess.

In any case, when you didn't say *what* you tested, it's impossible to judge the usefulness of the change. regards, tom lane

I tested performance on a database test server. This is a copy of a working billing system used to test new features and experiments. The test task was one day's traffic log. Average time of one test was 260 minutes.

How large is a record in your billing system? You want it to be an integer divisor of BLCKSZ (so, for instance, odd sizes in Bytes are BAD). Beyond that, your application domain matters. OLTP-like systems need low latency access for frequent small transactions. Data mining-like systems need to do IO in as big a chunk as the HW and OS will allow. Probably a good idea for BLCKSZ to be _at least_ max(8KB, 2x record size).

Postgresql 7.4.8. Server dual Opteron 240, 4Gb RAM.

_Especially_ with that HW, upgrade to at least 8.0.x ASAP. It's a good idea to not be running pg 7.x anymore anyway, but it's particularly so if you are running 64b SMP boxes.

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
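To see how a given record size packs into candidate block sizes, a rough calculation like the one below is enough for a first pass. It ignores the page header, item pointers, and per-tuple header overhead (which all shrink the usable space), and the 613-byte record size is purely an example.

# blcksz_fit.py -- records per block and bytes left over per block for a
# few candidate BLCKSZ values. Overheads are ignored, so real packing is
# somewhat worse than shown.
def fit(blcksz, record_bytes):
    per_block = blcksz // record_bytes
    return per_block, blcksz - per_block * record_bytes

RECORD = 613   # hypothetical billing record size in bytes
for blcksz in (8192, 16384, 32768):
    n, waste = fit(blcksz, RECORD)
    print("BLCKSZ %5d: %3d records/block, %4d bytes left over" % (blcksz, n, waste))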
Re: [PERFORM] two disks - best way to use them?
At 12:52 AM 12/6/2005, Thomas Harold wrote: David Lang wrote: in that case you logicly have two disks, so see the post from Ron earlier in this thread. And it's a very nice performance gain. Percent spent waiting according to "top" is down around 10-20% instead of 80-90%. While I'm not prepared to benchmark, database performance is way up. The client machines that are writing the data are running closer to 100% CPU (before they were well below 50% CPU utilization). For accuracy's sake, which exact config did you finally use? How did you choose the config you finally used? Did you test the three options or just pick one? Ron ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Can this query go faster???
At 04:43 AM 12/6/2005, Joost Kraaijeveld wrote: Hi, Is it possible to get this query run faster than it does now, by adding indexes, changing the query? SELECT customers.objectid FROM prototype.customers, prototype.addresses WHERE customers.contactaddress = addresses.objectid ORDER BY zipCode asc, housenumber asc LIMIT 1 OFFSET 283745 Explain: Limit (cost=90956.71..90956.71 rows=1 width=55) -> Sort (cost=90247.34..91169.63 rows=368915 width=55) Sort Key: addresses.zipcode, addresses.housenumber -> Hash Join (cost=14598.44..56135.75 rows=368915 width=55) Hash Cond: ("outer".contactaddress = "inner".objectid) -> Seq Scan on customers (cost=0.00..31392.15 rows=368915 width=80) -> Hash (cost=13675.15..13675.15 rows=369315 width=55) -> Seq Scan on addresses (cost=0.00..13675.15 rows=369315 width=55) The customers table has an index on contactaddress and objectid. The addresses table has an index on zipcode+housenumber and objectid. TIA customer names, customers.objectid, addresses, and addresses.objectid should all be static (addresses do not change, just the customers associated with them; and once a customer has been assigned an id that better never change...). To me, this sounds like the addresses and customers tables should be duplicated and then physically laid out in sorted order by .objectid in one set and by the "human friendly" associated string in the other set. Then a finding a specific .objectid or it's associated string can be done in at worse O(lgn) time assuming binary search instead of O(n) time for a sequential scan. If pg is clever enough, it might be able to do better than that. IOW, I'd try duplicating the addresses and customers tables and using the appropriate CLUSTERed Index on each. I know this breaks Normal Form. OTOH, this kind of thing is common practice for data mining problems on static or almost static data. Hope this is helpful, Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 02:50 PM 12/24/2005, Frank Wiles wrote: On Wed, 21 Dec 2005 22:31:54 -0500 Juan Casero <[EMAIL PROTECTED]> wrote: > Sorry folks. I had a couple of glasses of wine as I wrote this. > Anyway I originally wanted the box to have more than two drives so I > could do RAID 5 but that is going to cost too much. Also, contrary > to my statement below it seems to me I should run the 32 bit > postgresql server on the 64 bit kernel. Would you agree this will > probably yield the best performance?I know it depends alot on the > system but for now this database is about 20 gigabytes. Not too large > right now but it may grow 5x in the next year. You definitely DO NOT want to do RAID 5 on a database server. That is probably the worst setup you could have, I've seen it have lower performance than just a single hard disk. RAID 1 and RAID 1+0 are optimal, but you want to stay far away from RAID 5. IMHO RAID 5 is only useful on near line backup servers or Samba file servers where space is more important than speed. That's a bit misleading. RAID 5 excels when you want read speed but don't care as much about write speed. Writes are typical ~2/3 the speed of reads on a typical decent RAID 5 set up. Side Note: Some years ago Mylex had a family of fast (for the time) RAID 5 HW controllers that actually read and wrote at the same speed. IBM bought them to kill them and protect LSI Logic. Mylex X24's (?IIRC the model number correctly?) are still reasonable HW. So if you have tables that are read often and written to rarely or not at all, putting them on RAID 5 is optimal. In both data mining like and OLTP like apps there are usually at least some such tables. RAID 1 is good for stuff where speed doesn't matter and all you are looking for is an insurance policy. RAID 10 is the best way to get high performance on both reads and writes, but it has a significantly greater cost for the same amount of usable physical media. If you've got the budget or are dealing with small enough physical storage needs, by all means use RAID 10. OTOH, if you are dealing with large enterprise class apps like Sarbanes Oxley compliance, medical and/or insurance, etc, etc, the storage needs can get so large that RAID 10 for everything or even most things is not possible. Even if economically feasible. RAID levels are like any other tool. Each is useful in the proper circumstances. Happy holidays, Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 04:42 PM 12/24/2005, Joshua D. Drake wrote: If you've got the budget or are dealing with small enough physical storage needs, by all means use RAID 10. OTOH, if you are dealing with large enterprise class apps like Sarbanes Oxley compliance, medical and/or insurance, etc, etc, the storage needs can get so large that RAID 10 for everything or even most things is not possible. Even if economically feasible. RAID levels are like any other tool. Each is useful in the proper circumstances. There is also RAID 50 which is quite nice. The "quite nice" part that Joshua is referring to is that RAID 50 gets most of the write performance of RAID 10 w/o using nearly as many HD's as RAID 10. OTOH, there still is a significant increase in the number of HD's used, and that means MBTF's become more frequent but you are not getting protection levels you would with RAID 10. IME RAID 50 gets mixed reviews. My two biggest issues are a= Admin of RAID 50 is more complex than the other commonly used versions (1, 10, 5, and 6) b= Once a HD failure takes place, you suffer a _permenent_ performance drop, even after the automatic volume rebuild, until you take the entire RAID 50 array off line, reinitialize it, and rebuild it from scratch. IME "a" and "b" make RAID 50 inappropriate for any but the biggest and most dedicated of DB admin groups. YMMV, Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 04:54 PM 12/24/2005, David Lang wrote: raid 5 is bad for random writes as you state, but how does it do for sequential writes (for example data mining where you do a large import at one time, but seldom do other updates). I'm assuming a controller with a reasonable amount of battery-backed cache. The issue with RAID 5 writes centers on the need to recalculate checksums for the ECC blocks distributed across the array and then write the new ones to physical media. Caches help, and the bigger the cache the better, but once you are doing enough writes fast enough (and that doesn't take much even with a few GBs of cache) the recalculate-checksums-and-write-new-ones overhead will decrease the write speed of real data. Bear in mind that the HD's _raw_ write speed hasn't been decreased. Those HD's are pounding away as fast as they can for you. Your _effective_ or _data level_ write speed is what decreases due to overhead. Side Note: people often forget the other big reason to use RAID 10 over RAID 5. RAID 5 is always only 2 HD failures from data loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss unless you get unlucky and lose both members of a RAID 1 set. This can be seen as an example of the classic space vs. time trade off in performance tuning. You can use 2x the HDs you need and implement RAID 10 for best performance and reliability or you can dedicate less HD's to RAID and implement RAID 5 for less (write) performance and lower reliability. TANSTAAFL. Ron Peacetree ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
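The read-modify-write overhead described above is easy to put numbers on. For random writes too small to fill a stripe, the textbook accounting is 4 physical I/Os per logical write on RAID 5 (read old data, read old parity, write both back) versus 2 on RAID 10 (one write per mirror half). A sketch, assuming ~150 random IOPS per drive and no controller cache; caches and full-stripe writes soften the RAID 5 penalty but do not remove it under sustained load.

# raid_write_iops.py -- small-random-write throughput for RAID 10 vs
# RAID 5 on the same number of spindles. 150 IOPS/drive is a rough
# figure for a 10K rpm disk.
def raid10_writes(disks, iops):
    return disks * iops / 2.0   # 2 physical writes per logical write

def raid5_writes(disks, iops):
    return disks * iops / 4.0   # read data + read parity + 2 writes back

for n in (6, 8, 14):
    print("%2d disks: RAID 10 ~%4.0f writes/s, RAID 5 ~%4.0f writes/s"
          % (n, raid10_writes(n, 150), raid5_writes(n, 150)))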
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 08:35 AM 12/27/2005, Michael Stone wrote: On Mon, Dec 26, 2005 at 10:11:00AM -0800, David Lang wrote: what slows down raid 5 is that to modify a block you have to read blocks from all your drives to re-calculate the parity. this interleaving of reads and writes when all you are logicly doing is writes can really hurt. (this is why I asked the question that got us off on this tangent, when doing new writes to an array you don't have to read the blocks as they are blank, assuming your cacheing is enough so that you can write blocksize*n before the system starts actually writing the data) Correct; there's no reason for the controller to read anything back if your write will fill a complete stripe. That's why I said that there isn't a "RAID 5 penalty" assuming you've got a reasonably fast controller and you're doing large sequential writes (or have enough cache that random writes can be batched as large sequential writes). Sorry. A decade+ RWE in production with RAID 5 using controllers as bad as Adaptec and as good as Mylex, Chaparral, LSI Logic (including their Engino stuff), and Xyratex under 5 different OS's (Sun, Linux, M$, DEC, HP) on each of Oracle, SQL Server, DB2, mySQL, and pg shows that RAID 5 writes are slower than RAID 5 reads With the one notable exception of the Mylex controller that was so good IBM bought Mylex to put them out of business. Enough IO load, random or sequential, will cause the effect no matter how much cache you have or how fast the controller is. The even bigger problem that everyone is ignoring here is that large RAID 5's spend increasingly larger percentages of their time with 1 failed HD in them. The math of having that many HDs operating simultaneously 24x7 makes it inevitable. This means you are operating in degraded mode an increasingly larger percentage of the time under exactly the circumstance you least want to be. In addition, you are =one= HD failure from data loss on that array an increasingly larger percentage of the time under exactly the least circumstances you want to be. RAID 5 is not a silver bullet. On Mon, Dec 26, 2005 at 06:04:40PM -0500, Alex Turner wrote: Yes, but those blocks in RAID 10 are largely irrelevant as they are to independant disks. In RAID 5 you have to write parity to an 'active' drive that is part of the stripe. Once again, this doesn't make any sense. Can you explain which parts of a RAID 10 array are inactive? I agree totally that the read+parity-calc+write in the worst case is totaly bad, which is why I alway recommend people should _never ever_ use RAID 5. In this day and age of large capacity chassis, and large capacity SATA drives, RAID 5 is totally inapropriate IMHO for _any_ application least of all databases. I vote with Michael here. This is an extreme position to take that can't be followed under many circumstances ITRW. So I've got a 14 drive chassis full of 300G SATA disks and need at least 3.5TB of data storage. In your mind the only possible solution is to buy another 14 drive chassis? Must be nice to never have a budget. I think you mean an infinite budget. That's even assuming it's possible to get the HD's you need. I've had arrays that used all the space I could give them in 160 HD cabinets. Two 160 HD cabinets was neither within the budget nor going to perform well. I =had= to use RAID 5. RAID 10 was just not usage efficient enough. 
Must be a hard sell if you've bought decent enough hardware that your benchmarks can't demonstrate a difference between a RAID 5 and a RAID 10 configuration on that chassis except in degraded mode (and the customer doesn't want to pay double for degraded mode performance) I have =never= had this situation. RAID 10 latency is better than RAID 5 latency. RAID 10 write speed under heavy enough load, of any type, is faster than RAID 5 write speed under the same circumstances. RAID 10 robustness is better as well. Problem is that sometimes budget limits or number of HDs needed limits mean you can't use RAID 10. In reality I have yet to benchmark a system where RAID 5 on the same number of drives with 8 drives or less in a single array beat a RAID 10 with the same number of drives. Well, those are frankly little arrays, probably on lousy controllers... Nah. Regardless of controller I can take any RAID 5 and any RAID 10 built on the same HW under the same OS running the same DBMS and =guarantee= there is an IO load above which it can be shown that the RAID 10 will do writes faster than the RAID 5. The only exception in my career thus far has been the aforementioned Mylex controller. OTOH, sometimes you have no choice but to "take the hit" and use RAID 5. cheers, Ron ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 02:05 PM 12/27/2005, Michael Stone wrote: On Tue, Dec 27, 2005 at 11:50:16AM -0500, Ron wrote: Sorry. A decade+ RWE in production with RAID 5 using controllers as bad as Adaptec and as good as Mylex, Chaparral, LSI Logic (including their Engino stuff), and Xyratex under 5 different OS's (Sun, Linux, M$, DEC, HP) on each of Oracle, SQL Server, DB2, mySQL, and pg shows that RAID 5 writes are slower than RAID 5 reads What does that have to do with anything? That wasn't the question... Your quoted position is "there isn't a 'RAID 5 penalty' assuming you've got a reasonably fast controller and you're doing large sequential writes (or have enough cache that random writes can be batched as large sequential writes)." My experience across a wide range of HW, OSs, DBMS, and applications says you are wrong. Given enough IO, RAID 5 takes a bigger performance hit for writes than RAID 10 does. Enough IO, sequential or otherwise, will result in a situation where a RAID 10 array using the same number of HDs (and therefore of ~1/2 the usable capacity) will have better write performance than the equivalent RAID 5 built using the same number of HDs. There is a 'RAID 5 write penalty'. Said RAID 10 array will also be more robust than a RAID 5 built using the same number of HDs. OTOH, that does not make RAID 5 "bad". Nor are statements like "Never use RAID 5!" realistic or reasonable. Also, performance is not the only or even most important reason for choosing RAID 10 or RAID 50 over RAID 5. Robustness considerations can be more important than performance ones. cheers, Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 04:15 PM 12/27/2005, Michael Stone wrote: I don't understand why you keep using the pejorative term "performance hit". Try describing the "performance characteristics" instead. pe·jor·a·tive( P ) Pronunciation Key (p-jôr-tv, -jr-, pj-rtv, pj-) adj. Tending to make or become worse. Disparaging; belittling. RAID 5 write performance is significantly enough less than RAID 5 read performance as to be a matter of professional note and concern. That's not "disparaging or belittling" nor is it "tending to make or become worse". It's measurable fact that has an adverse impact on capacity planning, budgeting, HW deployment, etc. If you consider calling a provable decrease in performance while doing a certain task that has such effects "a hit" or "bad" pejorative, you are using a definition for the word that is different than the standard one. Also, claims about performance claims based on experience are fairly useless. Either you have data to provide (in which case claiming vast experience is unnecessary) or you don't. My experience _is_ the data provided. Isn't it convenient for you that I don't have the records for every job I've done in 20 years, nor do I necessarily have the right to release some specifics for some of what I do have. I've said what I can as a service to the community. Including to you. Your reaction implies that I and others with perhaps equally or more valuable experience to share shouldn't bother. "One of the major differences between Man and Beast is that Man learns from others experience." It's also impressive that you evidently seem to be implying that you do such records for your own job experience _and_ that you have the legal right to publish them. In which case, please feel free to impress me further by doing so. Said RAID 10 array will also be more robust than a RAID 5 built using the same number of HDs. And a RAID 6 will be more robust than either. Basing reliability on "hopefully you wont have both disks in a mirror fail" is just silly. Either you need double disk failure protection or you don't. That statement is incorrect and ignores both probability and real world statistical failure patterns. The odds of a RAID 10 array of n HDs suffering a failure that loses data are less than the odds of it happening in a RAID 6 array of n HDs. You are correct that RAID 6 is more robust than RAID 5. cheers, Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] How import big amounts of data?
At 04:48 AM 12/29/2005, Arnau wrote: Hi all, Which is the best way to import data to tables? I have to import 9 rows into a column and doing it as inserts takes ages. Would be faster with copy? is there any other alternative to insert/copy? Compared to some imports, 90K rows is not that large. Assuming you want the table(s) to be in some sorted order when you are done, the fastest way to import a large enough amount of data is: -put the new data into a temp table (works best if temp table fits into RAM) -merge the rows from the original table and the temp table into a new table -create the indexes you want on the new table -DROP the old table and its indexes -rename the new table and its indexes to replace the old ones. If you _don't_ care about having the table in some sorted order, -put the new data into a new table -COPY the old data to the new table -create the indexes you want on the new table -DROP the old table and its indexes -rename the new table and its indexes to replace the old ones Either of these procedures will also minimize your downtime while you are importing. If one doesn't want to go to all of the trouble of either of the above, at least DROP your indexes, do your INSERTs in batches, and rebuild your indexes. Doing 90K individual INSERTs should usually be avoided. cheers, Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
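A sketch of the second recipe (new data into a staging table via COPY, then swap the rebuilt table in) is below, written against psycopg2; COPY ... CSV needs PostgreSQL 8.0 or later, and copy_expert is the driver-side hook for it. Every table, column, and file name is made up; the point is the sequence of steps and the fact that the rows arrive as one COPY rather than 90K individual INSERTs.

# bulk_import.py -- load new rows with COPY, rebuild, index, and swap.
# Names (items, item_id, new_rows.csv) are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

cur.execute("CREATE TEMP TABLE staging (LIKE items)")
with open("new_rows.csv") as f:
    cur.copy_expert("COPY staging FROM STDIN WITH CSV", f)   # one bulk load

cur.execute("""CREATE TABLE items_new AS
               SELECT * FROM items
               UNION ALL
               SELECT * FROM staging
               ORDER BY item_id""")                          # merged, in sorted order
cur.execute("CREATE INDEX items_new_item_id_idx ON items_new (item_id)")
cur.execute("DROP TABLE items")
cur.execute("ALTER TABLE items_new RENAME TO items")
conn.commit()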
Re: [PERFORM] Invulnerable VACUUM process thrashing everything
Ick. Can you get users and foreign connections off that machine, lock them out for some period, and renice the VACUUM? Shedding load and keeping it off while VACUUM runs high priority might allow it to finish in a reasonable amount of time. Or Shedding load and dropping the VACUUM priority might allow a kill signal to get through. Hope this helps, Ron At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote: A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally decided to VACUUM a table which has not been updated in over a year and is more than one terabyte on the disk. Because of the very high transaction load on this database, this VACUUM has been ruining performance for almost a month. Unfortunately is seems invulnerable to killing by signals: # ps ax | grep VACUUM 15308 ?D588:00 postgres: postgres skunk [local] VACUUM # kill -HUP 15308 # ps ax | grep VACUUM 15308 ?D588:00 postgres: postgres skunk [local] VACUUM # kill -INT 15308 # ps ax | grep VACUUM 15308 ?D588:00 postgres: postgres skunk [local] VACUUM # kill -PIPE 15308 # ps ax | grep VACUUM 15308 ?D588:00 postgres: postgres skunk [local] VACUUM o/~ But the cat came back, the very next day ... I assume that if I kill this with SIGKILL, that will bring down every other postgres process, so that should be avoided. But surely there is a way to interrupt this. If I had some reason to shut down the instance, I'd be screwed, it seems. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
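If dropping the VACUUM's priority is the route taken, it can be done for that one backend without touching the rest of the cluster. A tiny sketch (Python 3.3+ for os.setpriority; the PID is the one from the ps output above and is obviously an example):

# renice_vacuum.py -- lower the scheduling priority of one VACUUM backend.
# Root is only needed if you later want to raise the priority back up.
import os

VACUUM_PID = 15308                                # from: ps ax | grep VACUUM
os.setpriority(os.PRIO_PROCESS, VACUUM_PID, 19)   # 19 = nicest / lowest priority
print("reniced backend", VACUUM_PID)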
Re: [PERFORM] improving write performance for logging
2B is a lot of inserts. If you had to guess, what do you think is the maximum number of inserts you could do in a day? How large is each record being inserted? How much can you put in a COPY and how many COPYs can you put into a transactions? What values are you using for bgwriter* and checkpoint*? What HW on you running on and what kind of performance do you typically get? Inquiring minds definitely want to know ;-) Ron At 08:54 AM 1/4/2006, Ian Westmacott wrote: We have a similar application thats doing upwards of 2B inserts per day. We have spent a lot of time optimizing this, and found the following to be most beneficial: 1) use COPY (BINARY if possible) 2) don't use triggers or foreign keys 3) put WAL and tables on different spindles (channels if possible) 4) put as much as you can in each COPY, and put as many COPYs as you can in a single transaction. 5) watch out for XID wraparound 6) tune checkpoint* and bgwriter* parameters for your I/O system On Tue, 2006-01-03 at 16:44 -0700, Steve Eckmann wrote: > I have questions about how to improve the write performance of PostgreSQL for logging data from a real-time simulation. We found that MySQL 4.1.3 could log about 1480 objects/second using MyISAM tables or about 1225 objects/second using InnoDB tables, but PostgreSQL 8.0.3 could log only about 540 objects/second. (test system: quad-Itanium2, 8GB memory, SCSI RAID, GigE connection from simulation server, nothing running except system processes and database system under test) > > We also found that we could improve MySQL performance significantly using MySQL's "INSERT" command extension allowing multiple value-list tuples in a single command; the rate for MyISAM tables improved to about 2600 objects/second. PostgreSQL doesn't support that language extension. Using the COPY command instead of INSERT might help, but since rows are being generated on the fly, I don't see how to use COPY without running a separate process that reads rows from the application and uses COPY to write to the database. The application currently has two processes: the simulation and a data collector that reads events from the sim (queued in shared memory) and writes them as rows to the database, buffering as needed to avoid lost data during periods of high activity. To use COPY I think we would have to split our data collector into two processes communicating via a pipe. > > Query performance is not an issue: we found that when suitable indexes are added PostgreSQL is fast enough on the kinds of queries our users make. The crux is writing rows to the database fast enough to keep up with the simulation. > > Are there general guidelines for tuning the PostgreSQL server for this kind of application? The suggestions I've found include disabling fsync (done), increasing the value of wal_buffers, and moving the WAL to a different disk, but these aren't likely to produce the 3x improvement that we need. On the client side I've found only two suggestions: disable autocommit and use COPY instead of INSERT. I think I've effectively disabled autocommit by batching up to several hundred INSERT commands in each PQexec() call, and it isn’t clear that COPY is worth the effort in our application. > > Thanks. > > > ---(end of broadcast)--- > TIP 2: Don't 'kill -9' the postmaster -- Ian Westmacott <[EMAIL PROTECTED]> Intellivid Corp. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 6: explain analyze is your friend
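To put those questions in perspective, the arithmetic on what 2B rows/day implies is below; it is the sustained rate the COPY batching and checkpoint tuning have to keep up with.

# insert_rate.py -- what "2 billion inserts per day" means per second and
# per COPY batch. Straight division; bursts will run higher than average.
ROWS_PER_DAY = 2e9
per_sec = ROWS_PER_DAY / 86400.0
print("sustained average: %.0f rows/sec" % per_sec)     # ~23,148 rows/sec
for batches_per_sec in (1, 5, 10):
    print("%3d COPY batches/sec -> %8.0f rows per COPY"
          % (batches_per_sec, per_sec / batches_per_sec))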
Re: [PERFORM] help tuning queries on large database
I'll second all of Luke Lonergan's comments and add these. You should be able to increase both "cold" and "warm" performance (as well as data integrity. read below.) considerably. Ron At 05:59 PM 1/6/2006, peter royal wrote: Howdy. I'm running into scaling problems when testing with a 16gb (data +indexes) database. I can run a query, and it returns in a few seconds. If I run it again, it returns in a few milliseconds. I realize this is because during subsequent runs, the necessary disk pages have been cached by the OS. I have experimented with having all 8 disks in a single RAID0 set, a single RAID10 set, and currently 4 RAID0 sets of 2 disks each. There hasn't been an appreciable difference in the overall performance of my test suite (which randomly generates queries like the samples below as well as a few other types. this problem manifests itself on other queries in the test suite as well). So, my question is, is there anything I can do to boost performance with what I've got, or am I in a position where the only 'fix' is more faster disks? I can't think of any schema/index changes that would help, since everything looks pretty optimal from the 'explain analyze' output. I'd like to get a 10x improvement when querying from the 'cold' state. Thanks for any assistance. The advice from reading this list to getting to where I am now has been invaluable. -peter Configuration: PostgreSQL 8.1.1 shared_buffers = 1 # (It was higher, 50k, but didn't help any, so brought down to free ram for disk cache) work_mem = 8196 random_page_cost = 3 effective_cache_size = 25 Hardware: CentOS 4.2 (Linux 2.6.9-22.0.1.ELsmp) Upgrade your kernel to at least 2.6.12 There's a known issue with earlier versions of the 2.6.x kernel and 64b CPUs like the Opteron. See kernel.org for details. Areca ARC-1220 8-port PCI-E controller Make sure you have 1GB or 2GB of cache. Get the battery backup and set the cache for write back rather than write through. 8 x Hitachi Deskstar 7K80 (SATA2) (7200rpm) 2 x Opteron 242 @ 1.6ghz 3gb RAM (should be 4gb, but separate Linux issue preventing us from getting it to see all of it) Tyan Thunder K8WE The K8WE has 8 DIMM slots. That should be good for 16 or 32 GB of RAM (Depending on whether the mainboard recognizes 4GB DIMMs or not. Ask Tyan about the latest K8WE firmare.). If nothing else, 1GB DIMMs are now so cheap that you should have no problems having 8GB on the K8WE. A 2.6.12 or later based Linux distro should have NO problems using more than 4GB or RAM. Among the other tricks having lots of RAM allows: If some of your tables are Read Only or VERY rarely written to, you can preload them at boot time and make them RAM resident using the /etc/tmpfs trick. In addition there is at least one company making a cheap battery backed PCI-X card that can hold up to 4GB of RAM and pretend to be a small HD to the OS. I don't remember any names at the moment, but there have been posts here and at storage.review.com on such products. RAID Layout: 4 2-disk RAID0 sets created You do know that a RAID 0 set provides _worse_ data protection than a single HD? Don't use RAID 0 for any data you want kept reliably. With 8 HDs, the best config is probably 1 2HD RAID 1 + 1 6HD RAID 10 or 2 4HD RAID 10's It is certainly true that once you have done everything you can with RAM, the next set of HW optimizations is to add HDs. The more the better up to a the limits of your available PCI-X bandwidth. In short, a 2nd RAID fully populated controller is not unreasonable. 
---(end of broadcast)--- TIP 6: explain analyze is your friend
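On the "cold vs. warm" gap specifically: short of actually hosting files on a tmpfs, much of it can be closed by pulling the hot relations' files through the OS page cache at boot. The sketch below just reads the underlying files and throws the data away; the data directory path and OIDs are placeholders, and the relfilenode to look for comes from pg_class.

# prewarm.py -- warm the OS page cache for a table's underlying files so
# the first "cold" queries do not pay the full disk penalty.
# Find the file name with:
#   SELECT relfilenode FROM pg_class WHERE relname = 'yourtable';
import glob

relfiles = glob.glob("/var/lib/pgsql/data/base/16385/17001*")  # hypothetical OIDs
for path in relfiles:
    with open(path, "rb") as f:
        while f.read(8 * 1024 * 1024):   # read and discard, 8MB at a time
            pass
    print("warmed", path)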
Re: [PERFORM] File Systems Compared
At 10:40 AM 12/6/2006, Brian Wipf wrote: All tests are with bonnie++ 1.03a Main components of system: 16 WD Raptor 150GB 1 RPM drives all in a RAID 10 ARECA 1280 PCI-Express RAID adapter with 1GB BB Cache (Thanks for the recommendation, Ron!) 32 GB RAM Dual Intel 5160 Xeon Woodcrest 3.0 GHz processors OS: SUSE Linux 10.1 xfs (with write cache disabled on disks): /usr/local/sbin/bonnie++ -d bonnie/ -s 64368:8k Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- -- Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec % CP /sec %CP hulk464368M 90621 99 283916 35 105871 11 88569 97 433890 23 644.5 0 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- -- Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec % CP /sec %CP 16 28435 95 + +++ 28895 82 28523 91 + ++ + 24369 86 hulk4,64368M, 90621,99,283916,35,105871,11,88569,97,433890,23,644.5,0,16,28435,95,++ +++,+++,28895,82,28523,91,+,+++,24369,86 xfs (with write cache enabled on disks): /usr/local/sbin/bonnie++ -d bonnie -s 64368:8k Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- -- Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec % CP /sec %CP hulk464368M 90861 99 348401 43 131887 14 89412 97 432964 23 658.7 0 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- -- Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec % CP /sec %CP 16 28871 90 + +++ 28923 91 30879 93 + ++ + 28012 94 hulk4,64368M, 90861,99,348401,43,131887,14,89412,97,432964,23,658.7,0,16,28871,90,++ +++,+++,28923,91,30879,93,+,+++,28012,94 Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K rpm HDs, you should be seeing higher absolute performance numbers. Find out what HW the Areca guys and Tweakers guys used to test the 1280s. At LW2006, Areca was demonstrating all-in-cache reads and writes of ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained Rates of ~900MBps read, and ~850MBps write. Luke, I know you've managed to get higher IO rates than this with this class of HW. Is there a OS or SW config issue Brian should closely investigate? Ron Peacetree ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] Areca 1260 Performance (was: File Systems
The 1100 series is PCI-X based. The 1200 series is PCI-E x8 based. Apples and oranges. I still think Luke Lonergan or Josh Berkus may have some interesting ideas regarding possible OS and SW optimizations. WD1500ADFDs are each good for ~90MBps read and ~60MBps write ASTR. That means your 16 HD RAID 10 should be sequentially transferring ~720MBps read and ~480MBps write. Clearly more HDs will be required to allow a ARC-12xx to attain its peak performance. One thing that occurs to me with your present HW is that your CPU utilization numbers are relatively high. Since 5160s are clocked about as high as is available, that leaves trying CPUs with more cores and trying more CPUs. You've got basically got 4 HW threads at the moment. If you can, evaluate CPUs and mainboards that allow for 8 or 16 HW threads. Intel-wise, that's the new Kentfields. AMD-wise, you have lot's of 4S mainboard options, but the AMD 4C CPUs won't be available until sometime late in 2007. I've got other ideas, but this list is not the appropriate venue for the level of detail required. Ron Peacetree At 05:30 PM 12/6/2006, Brian Wipf wrote: On 6-Dec-06, at 2:47 PM, Brian Wipf wrote: Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K rpm HDs, you should be seeing higher absolute performance numbers. Find out what HW the Areca guys and Tweakers guys used to test the 1280s. At LW2006, Areca was demonstrating all-in-cache reads and writes of ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained Rates of ~900MBps read, and ~850MBps write. Luke, I know you've managed to get higher IO rates than this with this class of HW. Is there a OS or SW config issue Brian should closely investigate? I wrote 1280 by a mistake. It's actually a 1260. Sorry about that. The IOP341 class of cards weren't available when we ordered the parts for the box, so we had to go with the 1260. The box(es) we build next month will either have the 1261ML or 1280 depending on whether we go 16 or 24 disk. I noticed Bucky got almost 800 random seeks per second on her 6 disk 1 RPM SAS drive Dell PowerEdge 2950. The random seek performance of this box disappointed me the most. Even running 2 concurrent bonnies, the random seek performance only increased from 644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm missing? This card looked pretty impressive on tweakers.net. Areca has some performance numbers in a downloadable PDF for the Areca ARC-1120, which is in the same class as the ARC-1260, except with 8 ports. With all 8 drives in a RAID 0 the card gets the following performance numbers: Card single thread write20 thread write single thread read20 thread read ARC-1120 321.26 MB/s404.76 MB/s 412.55 MB/ s 672.45 MB/s My numbers for sequential i/o for the ARC-1260 in a 16 disk RAID 10 are slightly better than the ARC-1120 in an 8 disk RAID 0 for a single thread. I guess this means my numbers are reasonable. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
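The 720/480MBps estimate above comes from counting half the spindles (one data stream per mirror pair) times the per-drive sustained rate. As a reusable check against whatever bonnie++ reports:

# raid10_astr.py -- expected aggregate sequential transfer rate for a
# RAID 10, using the same accounting as the estimate above (half the
# spindles carry unique data). Controller, bus, and filesystem limits
# are ignored, so treat the result as an upper bound.
def raid10_astr(disks, per_disk_mbps):
    return (disks // 2) * per_disk_mbps

print("16x WD1500ADFD read : ~%d MBps" % raid10_astr(16, 90))   # ~720
print("16x WD1500ADFD write: ~%d MBps" % raid10_astr(16, 60))   # ~480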
Re: [PERFORM] Areca 1260 Performance
At 06:40 PM 12/6/2006, Brian Wipf wrote: I appreciate your suggestions, Ron. And that helps answer my question on processor selection for our next box; I wasn't sure whether the lower clock speed of the Kentsfield compared to the Woodcrest, but with double the cores, would be better for us overall or not.

Please do not misunderstand me. I am not endorsing the use of Kentsfield. I am recommending =evaluating= Kentsfield. I am also recommending the evaluation of 2C 4S AMD solutions. All this stuff is so leading edge that it is far from clear what the RW performance of a DBMS based on these components will be without extensive testing of =your= app under =your= workload.

One thing that is clear from what you've posted thus far is that you are going to need more HDs if you want to have any chance of fully utilizing your Areca HW.

Out of curiosity, where are you geographically?

Hoping I'm being helpful, Ron ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Areca 1260 Performance
At 03:37 AM 12/7/2006, Brian Wipf wrote: On 6-Dec-06, at 5:26 PM, Ron wrote: All this stuff is so leading edge that it is far from clear what the RW performance of DBMS based on these components will be without extensive testing of =your= app under =your= workload. I want the best performance for the dollar, so I can't rule anything out. Right now I'm leaning towards Kentsfield, but I will do some more research before I make a decision. We probably won't wait much past January though. Kentsfield's outrageously high pricing and operating costs (power and cooling) are not likely to make it the cost/performance winner. OTOH, 1= ATM it is the way to throw the most cache per socket at a DBMS within the Core2 CPU line (Tulsa has even more at 16MB per CPU). 2= SSSE3 and other Core2 optimizations have led to some impressive performance numbers- unless raw clock rate is the thing that can help you the most. If what you need for highest performance is the absolute highest clock rate or most cache per core, then bench some Intel Tulsa's. Apps with memory footprints too large for on die or in socket caches or that require extreme memory subsystem performance are still best served by AMD CPUs. If you are getting the impression that it is presently complicated deciding which CPU is best for any specific pg app, then I am making the impression I intend to. One thing that is clear from what you've posted thus far is that you are going to needmore HDs if you want to have any chance of fully utilizing your Areca HW. Do you know off hand where I might find a chassis that can fit 24[+] drives? The last chassis we ordered was through Supermicro, and the largest they carry fits 16 drives. www.pogolinux.com has 24 and 48 bay 3.5" HD chassis'; and a 64 bay 2.5" chassis. Tell them I sent you. www.impediment.com are folks I trust regarding all things storage (and RAM). Again, tell them I sent you. www.aberdeeninc.com is also a vendor I've had luck with, but try Pogo and Impediment first. Good luck and please post what happens, Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Areca 1260 Performance
At 11:02 AM 12/7/2006, Gene wrote: I'm building a SuperServer 6035B server (16 scsi drives). My schema has basically two large tables (million+ per day) each which are partitioned daily, and queried independently of each other. Would you recommend a raid1 system partition and 14 drives in a raid 10 or should i create separate partitions/tablespaces for the two large tables and indexes? Not an easy question to answer w/o knowing more about your actual queries and workload. To keep the math simple, let's assume each SCSI HD has and ASTR of 75MBps. A 14 HD RAID 10 therefore has an ASTR of 7* 75= 525MBps. If the rest of your system can handle this much or more bandwidth, then this is most probably the best config. Dedicating spindles to specific tables is usually best done when there is HD bandwidth that can't be utilized if the HDs are in a larger set +and+ there is a significant hot spot that can use dedicated resources. My first attempt would be to use other internal HDs for a RAID 1 systems volume and use all 16 of your HBA HDs for a 16 HD RAID 10 array. Then I'd bench the config to see if it had acceptable performance. If yes, stop. Else start considering the more complicated alternatives. Remember that adding HDs and RAM is far cheaper than even a few hours of skilled technical labor. Ron Peacetree ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] New to PostgreSQL, performance considerations
This definitely is the correct approach. Actually, Daniel van Ham Colchete may not be as "all wet" as some around here think. We've had previous data that shows that pg can become CPU bound (see previous posts by Josh Berkus and others regarding CPU overhead in what should be IO bound tasks). In addition, we know that x86 compatible 64b implementations differ enough between AMD and Intel products that it sometimes shows on benches. Evidence outside the DBMS arena supports the hypothesis that recent CPUs are needing more hand-holding and product specific compiling, not less, compared to their previous versions. Side Note: I wonder what if anything pg could gain from using SWAR instructions (SSE*, MMX, etc)? I'd say the fairest attitude is to do everything we can to support having the proper experiments done w/o presuming the results. Ron Peacetree At 10:47 AM 12/11/2006, Chris Browne wrote: In order to prove what you want to prove, you need to run the benchmarks all on Gentoo, where you run with 4 categorizations: 1. Where you run PostgreSQL and GLIBC without any processor-specific optimizations 2. Where you run PostgreSQL and GLIBC with all relevant processor-specific optimizations 3. Where you run PostgreSQL with, and GLIBC without processor-specific optimizations 4. Where you run PostgreSQL without, and GLIBC with processor-specific optimizations That would allow one to clearly distinguish which optimizations are particularly relevant. > I'm not saying that Gentoo is faster than FC6. I just want to prove > that if you compile your software to make better use of your > processor, it will run faster. > > It might take a few days because I'm pretty busy right now at my > job. I expect that you'll discover, if you actually do these tests, that this belief is fairly much nonsense. - Modern CPUs do a huge amount of on-CPU self-tuning. - CPU features that could have a material effect tend to be unusable when compiling general purpose libraries and applications. GCC doesn't generate MMX-like instructions. - Database application performance tends to be I/O driven. - When database application performance *isn't* I/O driven, it is likely to be driven by cache management, which compiler options won't affect. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] New to PostgreSQL, performance considerations
Statements like these can not be reasonably interpreted in any manner _except_ that of presuming the results: "I expect that you'll discover, if you actually do these tests, that this belief (that using arch specific compiler options lead to better performing SW) is fairly much nonsense." "...IMO a waste of time..." etc The correct objective response to claims w/o evidence is to request evidence, and to do everything we can to support it being properly gathered. Not to try to discourage the claimant from even trying by ganging up on them with multiple instances of Argument From Authority or variations of Ad Hominem attacks. (The validity of the claim has nothing to do with the skills or experience of the claimant or anyone else in the discussion. Only on the evidence.) It is a tad unfair and prejudicial to call claims that CPU optimizations matter to the performance of DB product "extraordinary". Evidence outside the DBMS field exists; and previous posts here show that pg can indeed become CPU-bound during what should be IO bound tasks. At the moment, Daniel's claims are not well supported. That is far different from being "extraordinary" given the current circumstantial evidence. Let's also bear in mind that as a community project, we can use all the help we can get. Driving potential resources away is in opposition to that goal. [1] The evidence that arch specific flags matter to performance can be found as easily as recompiling your kernel or your compiler. While it certainly could be argued how "general purpose" such SW is, the same could be said for just about any SW at some level of abstraction. Ron Peacetree At 12:31 PM 12/11/2006, Michael Stone wrote: On Mon, Dec 11, 2006 at 12:15:51PM -0500, Ron wrote: I'd say the fairest attitude is to do everything we can to support having the proper experiments done w/o presuming the results. Who's presuming results?[1] It is fair to say that making extraordinary claims without any evidence should be discouraged. It's also fair to say that if there are specific things that need cpu-specific tuning they'll be fairly limited critical areas (e.g., locks) which would probably be better implemented with a hand-tuned code and runtime cpu detection than by magical mystical compiler invocations. Mike Stone [1] I will say that I have never seen a realistic benchmark of general code where the compiler flags made a statistically significant difference in the runtime. There are some particularly cpu-intensive codes, like some science simulations or encoding routines where they matter, but that's not the norm--and many of those algorithms already have hand-tuned versions which will outperform autogenerated code. You'd think that with all the talk that the users of certain OS's generate about CFLAG settings, there'd be some well-published numbers backing up the hype. At any rate if there were numbers to back the claim then I think they could certainly be considered without prejudice. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] New to PostgreSQL, performance considerations
At 01:47 PM 12/11/2006, Michael Stone wrote: On Mon, Dec 11, 2006 at 01:20:50PM -0500, Ron wrote: (The validity of the claim has nothing to do with the skills or experience of the claimant or anyone else in the discussion. Only on the evidence.) Please go back and reread the original post. I don't think the response was unwarranted. So he's evidently young and perhaps a trifle over-enthusiastic. We were once too. ;-) We are not going to get valuable contributions nor help people become more valuable to the community by "flaming them into submission". ...and who knows, a =properly= done experiment may provide both surprises and unexpected insights/benefits. I agree completely with telling him he needs to get better evidence and even with helping him understand how he should go about getting it. It should be noted that his opposition has not yet done these experiments either. (Else they could simply point to the results that refute Daniel's hypothesis.) The reality is that a new CPU architecture and multiple new memory technologies are part of this discussion. I certainly do not expect them to change the fundamental thinking regarding how to get best performance for a DBMS. OTOH, there are multiple valid reasons to give such new stuff a thorough and rigorous experimental shake-down. ATM, =both= sides of this debate are lacking evidence for their POV. Let's support getting definitive evidence. No matter who brings it to the table ;-) Ron Peacetree ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] New to PostgreSQL, performance considerations
At 02:28 PM 12/11/2006, Merlin Moncure wrote: also, some people posting here, not necessarily me, are authority figures. :-) merlin Noam Chomsky was one of the most influential thinkers Linguistics has yet produced. He was proven wrong a number of times, even within Linguistics. There are plenty of other historical examples. As others have said, opinion without evidence and logic is just that: opinion. And even Expert Opinion has been known to be wrong, sometimes very much so. Part of what makes an expert an expert is that they can back up their statements with evidence and logic that are compelling even to the non-expert when asked to do so. All I'm saying is let's all remember how "assume" is spelled and support the getting of some hard data. Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] New to PostgreSQL, performance considerations
1= In all these results I'm seeing, no one has yet reported what their physical IO subsystem is... ...when we are benching a DB. 2= So far we've got ~ a factor of 4 performance difference between Michael Stone's 1S 1C Netburst era 2.5GHz P4 PC and Guido Neitzer's 1S 2C MacBook Pro 2.33GHz C2D. If the physical IO subsystems are even close to equivalent across the systems benched so far, we've clearly established that pg performance is more sensitive to factors outside the physical IO subsystem than might usually be thought with regard to a DBMS. (At least for this benchmark SW.) 3= Daniel van Ham Colchete is running Gentoo. That means every SW component on his box has been compiled to be optimized for the HW it is running on. There may be a combination of effects going on for him that others not running a system optimized from the ground up for its HW do not see. 4= If we are testing arch specific compiler options and only arch specific compiler options, we should remove the OS as a variable. Since Daniel has presented evidence in support of his hypothesis, the first step should be to duplicate his environment as =exactly= as possible and see if someone can independently reproduce the results when the only significant difference is the human involved. This will guard against procedural error in the experiment. Possible Outcomes A= Daniel made a procedural error. We all learn what it is and how to avoid it. B= The Gentoo results are confirmed but no other OS shows this effect. Much digging ensues ;-) C= Daniel's results are confirmed as platform independent once we take all factors into account properly. We all learn more re: how to best set up pg for highest performance. Ron Peacetree At 01:35 AM 12/12/2006, Greg Smith wrote: On Mon, 11 Dec 2006, Michael Stone wrote: Can anyone else reproduce these results? I'm on similar hardware (2.5GHz P4, 1.5G RAM)... There are two likely candidates for why Daniel's P4 3.0GHz significantly outperforms your 2.5GHz system. 1) Most 2.5GHz P4 processors use a 533MHz front-side bus (FSB); most 3.0GHz ones use an 800MHz bus. 2) A typical motherboard paired with a 2.5GHz era processor will have a single-channel memory interface; a typical 3.0GHz era board supports dual-channel DDR. These changes could easily explain the magnitude of difference in results you're seeing, especially when combined with a 20% greater raw CPU clock. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] New to PostgreSQL, performance considerations
At 10:47 AM 12/12/2006, Tom Lane wrote: It's notoriously hard to get repeatable numbers out of pgbench :-( That's not a good characteristic in benchmarking SW... Does the OSDL stuff have an easier time getting reproducible results? A couple of tips: * don't put any faith in short runs. I usually use -t 1000 plus -c whatever. * make sure you loaded the database (pgbench -i) with a scale factor (-s) at least equal to the maximum -c you want to test. Otherwise you're mostly measuring update contention. * pay attention to when checkpoints occur. You probably need to increase checkpoint_segments if you want pgbench not to be checkpoint-bound. This all looks very useful. Can you give some guidance as to what checkpoint_segments should be increased to? Do the values you are running pgbench with suggest what value checkpoint_segments should be? Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] New to PostgreSQL, performance considerations
At 01:35 PM 12/12/2006, Daniel van Ham Colchete wrote: I just made another test with a second Gentoo machine: The results showed no significant change. The conclusion of today's test would be that there is no improvement in PostgreSQL when using -march=prescott. I only see 3 differences between yesterday's server and today's: the kernel version (y: 2.6.18, t: 2.6.17), the server uses an IDE hard drive (yesterday was SATA), and the gcc version (3.4.6 -> 4.1.1). I don't know why we saw an improvement yesterday and not today. SATA HDs, particularly SATA II HDs and _especially_ 10Krpm 150GB SATA II Raptors, are going to have far better performance than older IDE HDs. Do some raw bonnie++ benches on the two systems. If the numbers from bonnie++ are close to those obtained during the pgbench runs, then the HDs are limiting pgbench. Best would be to use the exact same HD IO subsystem on both boxes, but that may not be feasible. In general, it would be helpful if the entire config, HW + OS + pg stuff, was documented when submitting benchmark results. (For instance, it would not be outside the realm of plausibility for Guido's C2D laptop to be HD IO limited and for Michael's 2.5GHz P4 PC to be CPU limited during pgbench runs.) Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
At 11:11 AM 12/13/2006, Cosimo Streppone wrote: Interesting, eh? Cosimo What I find interesting is that so far Guido's C2D Mac laptop has gotten the highest values by far in this set of experiments, and no one else is even close. The slowest results, Michael's, are on the system with what appears to be the slowest CPU of the bunch; and the ranking of the rest of the results seem to similarly depend on relative CPU performance. This is not what one would naively expect when benching a IO intensive app like a DBMS. Given that the typical laptop usually has 1 HD, and a relatively modest one at that (the fastest available are SATA 7200rpm or Seagate's perpendicular recording 5400rpm) in terms of performance, this feels very much like other factors are bottlenecking the experiments to the point where Daniel's results regarding compiler options are not actually being tested. Anyone got a 2.33 GHz C2D box with a decent HD IO subsystem more representative of a typical DB server hooked up to it? Again, the best way to confirm/deny Daniel's results is to duplicate the environment he obtained those results with as closely as possible (preferably exactly) and then have someone else try to duplicate his results. Also, I think the warnings regarding proper configuration of pgbench and which version of pgbench to use are worthy of note. Do we have guidance yet as to what checkpoint_segments should be set to? Should we be considering using something other than pgbench for such experiments? Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
At 01:49 PM 12/13/2006, Bucky Jordan wrote: I've only seen pg_bench numbers > 2,000 tps on either really large hardware (none of the above mentioned comes close) or the results are in memory due to a small database size (aka measuring update contention). Which makes a laptop achieving such numbers all the more interesting IMHO. Just a guess, but these tests (compiler opts.) seem like they sometimes show a benefit where the database is mostly in RAM (which I'd guess many people have) since that would cause more work to be put on the CPU/Memory subsystems. The cases where the working set, or the performance critical part of the working set, of the DB is RAM resident are very important ones ITRW. Other people on the list hinted at this, but I share their hypothesis that once you get IO involved as a bottleneck (which is a more typical DB situation) you won't notice compiler options. Certainly makes intuitive sense. OTOH, this list has seen discussion of what should be IO bound operations being CPU bound. Evidently due to the expense of processing pg datastructures. Only objective benches are going to tell us where the various limitations on pg performance really are. I've got a 2 socket x 2 core woodcrest poweredge 2950 with a BBC 6 disk RAID I'll run some tests on as soon as I get a chance. I'm also thinking for this test, there's no need to tweak the default config other than maybe checkpoint_segments, since I don't really want postgres using large amounts of RAM (all that does is require me to build a larger test DB). Daniel's orginal system had 512MB RAM. This suggests to me that tests involving 256MB of pg memory should be plenty big enough. Thoughts? Hope they are useful. Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
Benchmarks, like any other SW, need modernizing and updating from time to time. Given the multi-core CPU approach to higher performance as the current fad in CPU architecture, we need a benchmark that is appropriate. If SPEC feels it is appropriate to rev their benchmark suite regularly, we probably should as well. Ron Peacetree At 12:44 AM 12/14/2006, Tom Lane wrote: "Joshua D. Drake" <[EMAIL PROTECTED]> writes: > On Wed, 2006-12-13 at 18:36 -0800, Josh Berkus wrote: >> Mostly, though, pgbench just gives the I/O system a workout. It's not a >> really good general workload. > It also will not utilize all cpus on a many cpu machine. We recently > found that the only way to *really* test with pgbench was to actually > run 4+ copies of pgbench at the same time. The pgbench app itself becomes the bottleneck at high transaction rates. Awhile back I rewrote it to improve its ability to issue commands concurrently, but then desisted from submitting the changes --- if we change the app like that, future numbers would be incomparable to past ones, which sort of defeats the purpose of a benchmark no? ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] New to PostgreSQL, performance considerations
At 10:00 AM 12/14/2006, Greg Smith wrote: On Wed, 13 Dec 2006, Ron wrote: The slowest results, Michael's, are on the system with what appears to be the slowest CPU of the bunch; and the ranking of the rest of the results seem to similarly depend on relative CPU performance. This is not what one would naively expect when benching a IO intensive app like a DBMS. pgbench with 3000 total transactions and fsync off is barely doing I/O to disk; it's writing a bunch of data to the filesystem cache and ending the benchmark before the data even makes it to the hard drive. This is why his results become completely different as soon as the number of transactions increases. With little or no actual disk writes, you should expect results to be ranked by CPU speed. I of course agree with you in the general sense. OTOH, I'm fairly sure the exact point where this cross-over occurs is dependent on the components and configuration of the system involved. (Nor do I want to dismiss this scenario as irrelevant or unimportant. There are plenty of RW situations where this takes place or where the primary goal of a tuning effort is to make it take place. Multi-GB BB RAID caches anyone?) In addition, let's keep in mind that we all know that overall system performance is limited by whatever component hits its limits first. Local pg performance has been known to be limited by any of : CPUs, memory subsystems, or physical IO subsystems. Intuitively, one tends to expect only the later to be a limiting factor in the vast majority of DBMS tasks. pg has a history of regularly surprising such intuition in many cases. IMO, this makes good bench marking tools and procedures more important to have. (If nothing else, knowing what component is likely to be the bottleneck in system "X" made of components "x1, x2, x3, " for task "Y" is valuable lore for the pg community to have as preexisting data when first asked any given question on this list! ) One plausible positive effect of tuning like that Daniel advocates is to "move" the level of system activity where the physical IO subsystem becomes the limiting factor on overall system performance. We are not going to know definitively if such an effect exists, or to what degree, or how to achieve it, if we don't have appropriately rigorous and reproducible experiments (and documentation of them) in place to test for it. So it seem to make sense that the community should have a discussion about the proper bench marking of pg and to get some results based on said. Ron Peacetree ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] New to PostgreSQL, performance considerations
Alexander, Good stuff. Can you do runs with just CFLAGS="-O3" and just CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 - march=opteron -pipe" as well ? As it is, you've given a good lower and upper bound on your performance obtainable using compiler options, but you've given no data to show what effect arch specific compiler options have by themselves. Also, what HDs are you using? How many in what config? Thanks in Advance, Ron Peacetree At 02:14 PM 12/14/2006, Alexander Staubo wrote: My PostgreSQL config overrides, then, are: shared_buffers = 1024MB work_mem = 1MB maintenance_work_mem = 16MB fsync = off Environment: Linux 2.6.15-23-amd64-generic on Ubuntu. Dual-core AMD Opteron 280 with 4GB of RAM. LSI PCI-X Fusion-MPT SAS. Running with: pgbench -S -v -n -t 5000 -c 5. Results as a graph: http://purefiction.net/paste/pgbench.pdf Stats for CFLAGS="-O0": 18440.181894 19207.882300 19894.432185 19635.625622 19876.858884 20032.597042 19683.597973 20370.19 19989.157881 20207.343510 19993.745956 20081.353580 20356.416424 20047.810017 20319.834190 19417.807528 19906.788454 20536.039929 19491.308046 20002.144230 Stats for CFLAGS="-O3 -msse2 -mfpmath=sse -funroll-loops -m64 - march=opteron -pipe": 23830.358351 26162.203569 25569.091264 26762.755665 26590.822550 26864.908197 26608.029665 26796.116921 26323.742015 26692.576261 26878.859132 26106.770425 26328.371664 26755.595130 25488.304946 26635.527959 26377.485023 24817.590708 26480.245737 26223.427801 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] New to PostgreSQL, performance considerations
At 05:39 PM 12/14/2006, Alexander Staubo wrote: On Dec 14, 2006, at 20:28 , Ron wrote: Can you do runs with just CFLAGS="-O3" and just CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" as well ? All right. From my perspective, the effect of -O3 is significant, whereas architecture-related optimizations have no statistically significant effect. Is this opinion? Or have you rerun the tests using the flags I suggested? If so, can you post the results? If "-O3 -msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" results in a 30-40% speed up over "-O0", and "-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" results in a 5-10% speedup, then ~ 1/8 - 1/3 of the total possible speedup is due to arch specific optimizations. (Testing "-O3" in isolation in addition tests for independence of factors as well as showing what "plain" "-O3" can accomplish.) Some might argue that a 5-10% speedup which represents 1/8 - 1/3 of the total speedup is significant... But enough speculating. I look forward to seeing your data. Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] New to PostgreSQL, performance considerations
At 07:27 PM 12/14/2006, Alexander Staubo wrote: Sorry, I neglected to include the pertinent graph: http://purefiction.net/paste/pgbench2.pdf In fact, your graph suggests that using arch specific options in addition to -O3 actually =hurts= performance. ...that seems unexpected... Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] New to PostgreSQL, performance considerations
At 04:54 AM 12/15/2006, Alexander Staubo wrote: On Dec 15, 2006, at 04:09 , Ron wrote: At 07:27 PM 12/14/2006, Alexander Staubo wrote: Sorry, I neglected to include the pertinent graph: http://purefiction.net/paste/pgbench2.pdf In fact, your graph suggests that using arch specific options in addition to -O3 actually =hurts= performance. According to the tech staff, this is a Sun X4100 with a two-drive RAID 1 volume. No idea about the make of the hard drives. Alexander. http://www.sun.com/servers/entry/x4100/features.xml So we are dealing with a 1U 1-4S (which means 1-8C) AMD Kx box with up to 32GB of ECC RAM (DDR2 ?) and 2 Seagate 2.5" SAS HDs. http://www.seagate.com/cda/products/discsales/index/1,,,00.html?interface=SAS My bet is the X4100 contains one of the 3 models of Cheetah 15K.4's. A simple du, dkinfo, whatever, will tell you which. I'm looking more closely into exactly what the various gcc -O optimizations do on Kx's as well. 64b vs 32b gets x86 compatible code access to ~ 2x as many registers; and MMX or SSE instructions get you access to not only more registers, but wider ones as well. As one wit has noted, "all optimization is an exercise in caching." (Terje Mathisen- one of the better assembler coders on the planet.) It seems unusual that code generation options which give access to more registers would ever result in slower code... Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] New to PostgreSQL, performance considerations
At 09:23 AM 12/15/2006, Merlin Moncure wrote: On 12/15/06, Ron <[EMAIL PROTECTED]> wrote: It seems unusual that code generation options which give access to more registers would ever result in slower code... The slower is probably due to the unroll loops switch which can actually hurt code due to the larger footprint (less cache coherency). I have seen that effect as well occasionally in the last few decades ;-) OTOH, suspicion is not _proof_; and I've seen other "optimizations" turn out to be "pessimizations" over the years as well. The extra registers are not all that important because of pipelining and other hardware tricks. No. Whoever told you this or gave you such an impression was mistaken. There are many instances of x86 compatible code that get 30-40% speedups just because they get access to 16 rather than 8 GPRs when recompiled for x86-64. Pretty much all the old assembly strategies such as forcing local variables to registers are basically obsolete...especially with regards to integer math. Again, not true. OTOH, humans are unlikely at this point to be able to duplicate the accuracy of the compiler's register coloring algorithms. Especially on x86 compatibles. (The flip side is that _expert_ humans can often put the quirky register set and instruction pipelines of x86 compatibles to more effective use for a specific chunk of code than even the best compilers can.) As I said before, modern CPUs are essentially RISC engines with a CISC preprocessing engine laid on top. I'm sure you meant modern =x86 compatible= CPUs are essentially RISC engines with a CISC engine on top. Just as "all the world's not a VAX", "all CPUs are not x86 compatibles". Forgetting this has occasionally cost folks I know... Things are much more complicated than they were in the old days where you could count instructions for the assembly optimization process. Those were the =very= old days in computer time... I suspect that there is little or no difference between -march=686 and the various specific architectures. There should be. The FSF compiler folks (and the rest of the industry compiler folks for that matter) are far from stupid. They are not just adding compiler switches because they randomly feel like it. Evidence suggests that the most recent CPUs are in need of =more= arch specific TLC compared to their ancestors, and that this trend is not only going to continue, it is going to accelerate. Did anybody think to look at the binaries and look for the amount of differences? I bet you code compiled for march=opteron will run just fine on a Pentium 2 if compiled for 32 bit. Sucker bet, given that the whole point of a 32b x86 compatible is to be able to run code on any IA32 ISA CPU. OTOH, I bet that code optimized for best performance on a P2 is not getting best performance on a P4. Or vice versa. ;-) The big arch specific differences in Kx's are in 64b mode. Not 32b. Ron Peacetree ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] New to PostgreSQL, performance considerations
At 09:50 AM 12/15/2006, Greg Smith wrote: On Fri, 15 Dec 2006, Merlin Moncure wrote: The slower is probably due to the unroll loops switch which can actually hurt code due to the larger footprint (less cache coherency). The cache issues are so important with current processors that I'd suggest throwing -Os (optimize for size) into the mix people test. That one may stack usefully with -O2, but probably not with -O3 (3 includes optimizations that increase code size). -Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size. -Os disables the following optimization flags: -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version Hmmm. That list of disabled flags bears thought. -falign-functions -falign-jumps -falign-loops -falign-labels 1= Most RISC CPUs performance is very sensitive to misalignment issues. Not recommended to turn these off. -freorder-blocks Reorder basic blocks in the compiled function in order to reduce number of taken branches and improve code locality. Enabled at levels -O2, -O3. -freorder-blocks-and-partition In addition to reordering basic blocks in the compiled function, in order to reduce number of taken branches, partitions hot and cold basic blocks into separate sections of the assembly and .o files, to improve paging and cache locality performance. This optimization is automatically turned off in the presence of exception handling, for link once sections, for functions with a user-defined section attribute and on any architecture that does not support named sections. 2= Most RISC CPUs are cranky about branchy code and (lack of) cache locality. Wouldn't suggest punting these either. -fprefetch-loop-arrays If supported by the target machine, generate instructions to prefetch memory to improve the performance of loops that access large arrays. This option may generate better or worse code; results are highly dependent on the structure of loops within the source code. 3= OTOH, This one looks worth experimenting with turning off. -ftree-vect-loop-version Perform loop versioning when doing loop vectorization on trees. When a loop appears to be vectorizable except that data alignment or data dependence cannot be determined at compile time then vectorized and non-vectorized versions of the loop are generated along with runtime checks for alignment or dependence to control which version is executed. This option is enabled by default except at level -Os where it is disabled. 4= ...and this one looks like a 50/50 shot. Ron Peacetree ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PERFORM] EXPLAIN ANALYZE on 8.2
At 10:45 AM 12/15/2006, Tom Lane wrote: Gregory Stark <[EMAIL PROTECTED]> writes: > There are various attempts at providing better timing infrastructure at low > overhead but I'm not sure what's out there currently. I expect to do this what > we'll have to do is invent a pg_* abstraction that has various implementations > on different architectures. You've got to be kidding. Surely it's glibc's responsibility, not ours, to implement gettimeofday correctly for the hardware. regards, tom lane I agree with Tom on this. Perhaps the best compromise is for the pg community to make thoughtful suggestions to the glibc community? Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] New to PostgreSQL, performance considerations
At 10:55 AM 12/15/2006, Merlin Moncure wrote: On 12/15/06, Ron <[EMAIL PROTECTED]> wrote: There are many instances of x86 compatible code that get 30-40% speedups just because they get access to 16 rather than 8 GPRs when recompiled for x86-64. ...We benchmarked PostgreSQL internally here and found it to be fastest in 32 bit mode running on a 64 bit platform -- this was on a quad opteron 870 running our specific software stack, your results might be different of course. On AMD Kx's, you probably will get best performance in 64b mode (so you get all those extra registers and other stuff) while using 32b pointers (to keep code size and memory footprint down). On Intel C2's, things are more complicated since Intel's x86-64 implementation and memory IO architecture are both different enough from AMD's to have caused some consternation on occasion when Intel's 64b performance did not match AMD's. The big arch specific differences in Kx's are in 64b mode. Not 32b I don't think so. IMO all the processor specific instruction sets were hacks of 32 bit mode to optimize specific tasks. Except for certain things, these instructions are rarely, if ever, used in 64 bit mode, especially in integer math (i.e. database binaries). Since Intel and AMD 64 bit are virtually identical, I submit that -march is not really important anymore except for very, very specific (but important) cases like spinlocks. Take a good look at the processor specific manuals and the x86-64 benches around the net. The evidence outside the DBMS domain is pretty solidly in contrast to your statement and position. Admittedly, DBMS are not web servers or FPS games or ... That's why we need to do our own rigorous study of the subject. This thread is about how much architecture dependent binaries can beat standard ones. I say they don't very much at all, and with the specific exception of Daniel's benchmarking the results posted to this list bear that out. ...and IMHO the issue is still very much undecided given that we don't have enough properly obtained and documented evidence. ATM, the most we can say is that in a number of systems with modest physical IO subsystems that are not running Gentoo Linux we have not been able to duplicate the results. (We've also gotten some interesting results like yours suggesting the arch specific optimizations are bad for pg performance in your environment.) In the process, questions have been asked and issues raised regarding both the tools involved and the proper way to use them. We really do need to have someone other than Daniel duplicate his Gentoo environment and independently try to duplicate his results. ...and let us bear in mind that this is not just intellectual curiosity. The less mysterious pg is, the better the odds pg will be adopted in any specific case. Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
At 07:06 PM 12/15/2006, Michael Stone wrote: On Fri, Dec 15, 2006 at 12:24:46PM -0500, Ron wrote: ATM, the most we can say is that in a number of systems with modest physical IO subsystems So I reran it on a 3.2GHz Xeon with 6G RAM off a ramdisk; I/O ain't the bottleneck on that one. Results didn't show any significant gains regardless of compilation options (results hovered around 12k tps). If people want to continue this, I will point out that they should make sure they're linked against the optimized libpq rather than an existing one elsewhere in the library path. Beyond that, I'm done with this thread. Maybe there are some gains to be found somewhere, but the testing done thus far (while limited) is sufficient, IMO, to demonstrate that compiler options aren't going to provide a blow-your-socks-off dramatic performance improvement. AFAICT, no one has stated there would be a "blow-your-socks-off dramatic performance improvement" for pg due to compilation options. Just that there might be some, and there might be some that are arch specific. So far these experiments have shown = multiple instances of a ~30-35% performance improvement going from -O0 to -O3 = 1 instance of arch specific options hurting performance when combined with -O3 = 1 instance of arch specific options helping performance on an OS that only one person has tested (Gentoo Linux) = that a 2.33 GHz C2D Mac laptop (under what OS?) with a typical laptop's modest physical IO subsystem can do ~2100 tps = that pg has a "speed limit" on a 3.2GHz Xeon (which kind?) with 6G RAM off a ramdisk (under what OS?) of ~12K tps (I'd be curious to see what this limit is with better CPUs and memory subsystems) Note that except for the first point, all the other results are singletons that as yet have not been reproduced. The most important "gain" IMO is Knowledge, and I'd say there is still more to learn and/or verify IMHO. YMMV. Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
Sorry for the delay in responding. I had familial obligations. As a matter of fact, I am spending a decent amount of time on this. I don't usually pore through documentation for compilers and OS's to the degree I've been since this thread started. Nor do I usually try and get access to the HW I'm presently tracking down. I'll post my thoughts re: detailed analysis of gcc/g++ compiler options later today or tomorrow as work schedule allows. Why this is worth it: 1= Any gains from setup and configuration are the cheapest ones available once we codify how to obtain them. 2= any public database or knowledge about how to best setup, configure, and test pg is very good for the community. 3= developers need to know and agree on proper procedure and technique for generating results for discussion or we end up wasting a lot of time. 4= measuring and documenting pg performance means we know where best to allocate resources for improving pg. Or where using pg is (in)appropriate compared to competitors. Potential performance gains are not the only value of this thread. Ron Peacetree At 12:33 PM 12/16/2006, Michael Stone wrote: On Sat, Dec 16, 2006 at 10:53:21AM -0500, Ron wrote: The most important "gain" IMO is Knowledge, and I'd say there is still more to learn and/or verify IMHO. YMMV. Well, I think there are other areas where I can spend my time where potential gains are more likely. YMMV (although, I note, you don't seem to be spending much of your own time testing this) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
At 12:46 AM 12/28/2006, Guy Rouillier wrote: I don't want to violate any license agreement by discussing performance, so I'll refer to a large, commercial PostgreSQL-compatible DBMS only as BigDBMS here. I'm trying to convince my employer to replace BigDBMS with PostgreSQL for at least some of our Java applications. As a proof of concept, I started with a high-volume (but conceptually simple) network data collection application. This application collects files of 5-minute usage statistics from our network devices, and stores a raw form of these stats into one table and a normalized form into a second table. We are currently storing about 12 million rows a day in the normalized table, and each month we start new tables. For the normalized data, the app inserts rows initialized to zero for the entire current day first thing in the morning, then throughout the day as stats are received, executes updates against existing rows. So the app has very high update activity. In my test environment, I have a dual-x86 Linux platform running the application, and an old 4-CPU Sun Enterprise 4500 running BigDBMS and PostgreSQL 8.2.0 (only one at a time.) The Sun box has 4 disk arrays attached, each with 12 SCSI hard disks (a D1000 and 3 A1000, for those familiar with these devices.) The arrays are set up with RAID5. So I'm working with a consistent hardware platform for this comparison. I'm only processing a small subset of files (144.) BigDBMS processed this set of data in 2 seconds, with all foreign keys in place. With all foreign keys in place, PG took 54000 seconds to complete the same job. I've tried various approaches to autovacuum (none, 30-seconds) and it doesn't seem to make much difference. What does seem to make a difference is eliminating all the foreign keys; in that configuration, PG takes about 3 seconds. Better, but BigDBMS still has it beat significantly. If you are using pg configured as default installed, you are not getting pg's best performance. Ditto using data structures optimized for BigDBMS. A= go through each query and see what work_mem needs to be for that query to be as RAM resident as possible. If you have enough RAM, set work_mem for that query that large. Remember that work_mem is =per query=, so queries running in parallel eat the sum of each of their work_mem's. B= Make sure shared buffers is set reasonably. A good rule of thumb for 8.x is that shared buffers should be at least ~1/4 your RAM. If your E4500 is maxed with RAM, there's a good chance shared buffers should be considerably more than 1/4 of RAM. C= What file system are you using? Unlike BigDBMS, pg does not have its own native one, so you have to choose the one that best suits your needs. For update heavy applications involving lots of small updates jfs and XFS should both be seriously considered. D= Your table schema and physical table layout probably needs to change. What BigDBMS likes here is most likely different from what pg likes. E= pg does not actually update records in place. It appends new records to the table and marks the old version invalid. This means that things like pages size, RAID stripe size, etc etc may need to have different values than they do for BigDBMS. Another consequence is that pg likes RAID 10 even more than most of its competitors. F= This may seem obvious, but how many of the foreign keys and other overhead do you actually need? Get rid of the unnecessary. 
G= Bother the folks at Sun, like Josh Berkus, who know pg inside and out +and+ know your HW inside and out (or have access to those that do ;-) ). I'll bet they'll have ideas I'm not thinking of. H= Explain Analyze is your friend. Slow queries may need better table statistics, or better SQL, or may be symptoms of issues "C" or "D" above or ... I've got PG configured so that the system database is on disk array 2, as are the transaction log files. The default table space for the test database is disk array 3. I've got all the reference tables (the tables to which the foreign keys in the stats tables refer) on this array. I also store the stats tables on this array. Finally, I put the indexes for the stats tables on disk array 4. I don't use disk array 1 because I believe it is a software array. I= With 4 arrays of 12 HDs each, you definitely have enough spindles to place pg_xlog somewhere separate from all the other pg tables. In addition, you should analyze your table access patterns and then scatter them across your 4 arrays in such a way as to minimize head contention. I'm out of ideas how to improve this picture any further. I'd appreciate some suggestions. Thanks. Hope this helps, Ron Peacetree ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
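A minimal sketch of point I using tablespaces (available in 8.0 and later); the tablespace names, directory paths, and table/index names here are invented for illustration and would need to match the actual arrays and schema:

CREATE TABLESPACE stats_data  LOCATION '/array3/pgdata';
CREATE TABLESPACE stats_index LOCATION '/array4/pgdata';
ALTER TABLE stats_200612 SET TABLESPACE stats_data;
ALTER INDEX stats_200612_pkey SET TABLESPACE stats_index;
-- pg_xlog itself is relocated at the filesystem level (e.g. a symlink), not via SQL.

Separating heap, index, and WAL traffic this way is what lets each array service one access pattern instead of all of them fighting over the same heads.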
Re: [PERFORM] slow result
At 07:34 AM 1/23/2007, Laurent Manchon wrote: Hi, I have a slow response of my PostgreSQL database 7.4 using this query below on a table with 80 rows: 1= Upgrade to the latest stable version of pg. That would be 8.2.x You are very much in the Dark Ages pg version wise. pg 8.x has significant IO enhancements. Especially compared to 7.4. select count(*)from tbl; PostgreSQL return result in 28 sec every time. although MS-SQL return result in 0.02 sec every time. 2= pg actually counts how many rows there are in a table. MS-SQL looks up a count value from a internal data table... which can be wrong in extraordinarily rare circumstances in a MVCC DBMS (which MS-SQL is !not!. MS-SQL uses the older hierarchical locking strategy for data protection.) Since pg actually scans the table for the count, pg's count will always be correct. No matter what. Since MS-SQL does not use MVCC, it does not have to worry about the corner MVCC cases that pg does. OTOH, MVCC _greatly_ reduces the number of cases where one transaction can block another compared to the locking strategy used in MS-SQL. This means in real day to day operation, pg is very likely to handle OLTP loads and heavy loads better than MS-SQL will. In addition, MS-SQL is a traditional Codd & Date table oriented DBMS. pg is an object oriented DBMS. Two very different products with very different considerations and goals (and initially designed at very different times historically.) Compare them under real loads using real queries if you are going to compare them. Comparing pg and MS-SQL using "fluff" queries like count(*) is both misleading and a waste of effort. My server is a DELL PowerEdge 2600 with bi-processor Xeon at 3.2 Ghz with 3GBytes RAM My PostgreSQL Conf is * log_connections = yes syslog = 2 effective_cache_size = 5 sort_mem = 1 max_connections = 200 shared_buffers = 3000 vacuum_mem = 32000 wal_buffers = 8 max_fsm_pages = 2000 max_fsm_relations = 100 Can you tell me is there a way to enhence performance ? There are extensive FAQs on what the above values should be for pg. The lore is very different for pg 8.x vs pg 7.x Thank you You're welcome. Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
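If an approximate count is acceptable for this kind of "fluff" query, the planner's own estimate can be read back instantly from the catalog. A sketch only; the figure is only as fresh as the last VACUUM/ANALYZE of the table:

SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'tbl';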
Re: [PERFORM] Tuning
At 06:24 PM 1/28/2007, Josh Berkus wrote: John, > -work_mem Depends on the number of concurrent queries you expect to run and what size sorts you expect them to do. EXPLAIN ANALYZE is your friend. It will tell you how much data each query is manipulating and therefore how much memory each query will chew. The next step is to figure out how many of each query will be running concurrently. Summing those will tell you the maximum work_mem each kind of query will be capable of using. If you have a deep enough understanding of how your pg system is working, then you can set work_mem on a per query basis to get the most efficient use of the RAM in your system. > -maintenance_work_mem - 50% of the largest table? Actually, in current code I've found that anything over 256mb doesn't actually get used. Is this considered a bug? When will this limit go away? Does work_mem have a similar limit? > -shared_buffers - max value 5 Actually, I need to update that. On newer faster multi-core machines you may want to allocate up to 1GB of shared buffers. > -effective_cache_size - max 2/3 of available ram, ie 24GB on the > hardware described above Yes. Why? "max of 2/3 of available RAM" sounds a bit hand-wavy. Especially with 32gb, 64GB, and 128GB systems available. Is there are hidden effective or hard limit here as well? For a dedicated pg machine, I'd assume one would want to be very aggressive about configuring the kernel, minimizing superfluous services, and configuring memory use so that absolutely as much as possible is being used by pg and in the most intelligent way given one's specific pg usage scenario. > -shmmax - how large dare I set this value on dedicated postgres servers? Set it to 2GB and you'll be covered. I thought that on 32b systems the 2GB shmmax limit had been raised to 4GB? and that there essentially is no limit to shmmax on 64b systems? What are Oracle and EnterpriseDB recommending for shmmax these days? My random thoughts, Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
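A sketch of the per-query work_mem approach described above; the query and the 128MB figure are invented and should be sized against what EXPLAIN ANALYZE reports (the unit suffix needs 8.2; earlier releases take the value in kB):

BEGIN;
SET LOCAL work_mem = '128MB';   -- applies to this transaction only
EXPLAIN ANALYZE
SELECT customer_id, sum(amount)
FROM big_sales_table
GROUP BY customer_id
ORDER BY sum(amount) DESC;
COMMIT;

Raising work_mem only where a specific sort or hash needs it avoids multiplying a large global setting by every concurrent connection.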
Re: [PERFORM] Querying distinct values from a large table
I strongly encourage anyone who is interested in the general external sorting problem peruse Jim Gray's site: http://research.microsoft.com/barc/SortBenchmark/ Ron Peacetree At 08:24 AM 1/31/2007, Gregory Stark wrote: Tom Lane <[EMAIL PROTECTED]> writes: > Alvaro Herrera <[EMAIL PROTECTED]> writes: > > Gregory Stark wrote: > >> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on > >> your data distribution. It's not hard to come up with distributions where it's > >> 1000x as fast and others where there's no speed difference.) > > > So the figure is really "1-1000x"? I bet this one is more impressive in > > PHB terms. > > Luke has a bad habit of quoting numbers that are obviously derived from > narrow benchmarking scenarios as Universal Truths, rather than providing > the context they were derived in. I wish he'd stop doing that... In fairness I may have exaggerated a bit there. There is a limit to how much of a speedup you can get in valid benchmarking situations. A single sequential scan is always going to be necessary so you're only saving the cost of writing out the temporary file and subsequent merge passes. It's hard to generate lots of intermediate merge passes since there are only O(log(n)) of them. So to get 1000x speedup on a large I/O bound sort you would have to be sorting something on order of 2^1000 records which is ridiculous. Realistically you should only be able to save 2-5 intermediate merge passes. On the other there are some common situations where you could see atypical increases. Consider joining a bunch of small tables to generate a large result set. The small tables are probably all in memory and the result set may only have a small number of distinct values. If you throw out the duplicates early you save *all* the I/O. If you have to do a disk sort it could be many orders slower. This is actually not an uncommon coding idiom for MySQL programmers accustomed to fast DISTINCT working around the lack of subqueries and poor performance of IN and EXISTS. They often just join together all the tables in a big cross join and then toss in a DISTINCT at the top to get rid of the duplicates. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
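A small sketch of the idiom Gregory describes and a rewrite that never produces the duplicates in the first place (table names are invented, and customer_id is assumed unique in customers):

-- the "join everything, then DISTINCT away the duplicates" idiom
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;

-- the equivalent existence test, which skips the big sort/unique step
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);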
Re: [PERFORM] Vacuumdb - Max_FSM_Pages Problem.
At 10:53 AM 2/26/2007, Peter Childs wrote: On 26/02/07, Pallav Kalva <[EMAIL PROTECTED]> wrote: Hi, I am in the process of cleaning up one of our big table, this table has 187 million records and we need to delete around 100 million of them. I am deleting around 4-5 million of them daily in order to catchup with vacuum and also with the archive logs space. So far I have deleted around 15million in past few days. max_fsm_pages value is set to 120. Vacuumdb runs once daily, here is the output from last night's vacuum job === INFO: free space map: 999 relations, 798572 pages stored; 755424 total pages needed DETAIL: Allocated FSM size: 1000 relations + 120 pages = 7096 kB shared memory. VACUUM From the output it says 755424 total pages needed , this number keeps growing daily even after vacuums are done daily. This was around 350K pages before the delete process started. I am afraid that this number will reach the max_fsm_pages limit soon and vacuums thereafter will never catch up . Can anyone please explain this behavior ? What should I do to catch up with vacuumdb daily ? Vacuum adds to free pages to the fsm so that they can be reused. If you don't fill up those free pages the fsm will fill up. Once the fsm is full no more pages can be added to the fsm. If you start writing to the free pages via inserts when vacuum next runs more free pages will be added that did not fit previously in the free space map due to it being full. If you are really deleting that many records you may be better coping those you want to a new table and dropping the old one. To actually recover space you need to either run vacuum full or cluster. This ought to be in the manual somewhere as this question gets asked about once a week. Peter. In fact , a= copying data to a new table and dropping the original table rather than b= updating the original table is a "standard best DBA practice" regardless of DB product. The only thing that changes from DB product to DB product is the exact point where the copy is large enough to make "copy, replace" better than "update in place". Rule of Thumb: No matter what DB product you are using, if it's more than 1/2 of any table or more than 1/4 of any table that does not fit into memory, it's usually better to copy replace rather then update in place. ...and I completely agree that we should document this sort of Industry Best Practice in a way that is easily usable by the pg community. Cheers, Ron ---(end of broadcast)--- TIP 6: explain analyze is your friend
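A minimal sketch of the copy-and-replace approach described above; the table names, date cutoff, and index are all invented for illustration, and any other indexes, constraints, and grants must be recreated on the new table before the swap:

BEGIN;
CREATE TABLE big_table_new AS
    SELECT * FROM big_table WHERE stamp >= '2006-06-01';  -- the rows being kept
CREATE INDEX big_table_new_stamp ON big_table_new (stamp);
ALTER TABLE big_table RENAME TO big_table_old;
ALTER TABLE big_table_new RENAME TO big_table;
COMMIT;
DROP TABLE big_table_old;  -- reclaims the space immediately, no VACUUM FULL needed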
Re: [PERFORM] Opinions on Raid
At 08:12 AM 2/27/2007, Joe Uhl wrote: We have been running Postgres on a 2U server with 2 disks configured in raid 1 for the os and logs and 4 disks configured in raid 10 for the data. I have since been told raid 5 would have been a better option given our usage of Dell equipment and the way they handle raid 10. I have just a few general questions about raid with respect to Postgres: [1] What is the performance penalty of software raid over hardware raid? Is it truly significant? We will be working with 100s of GB to 1-2 TB of data eventually. The real CPU overhead when using SW RAID is when using any form of SW RAID that does XOR operations as part of writes (RAID 5, 6, 50, ..., etc). At that point, you are essentially hammering on the CPU just as hard as you would on a dedicated RAID controller... ...and the dedicated RAID controller probably has custom HW helping it do this sort of thing more efficiently. That being said, SW RAID 5 in this sort of scenario can be reasonable if you =dedicate= a CPU core to it. So in such a system, your "n" core box is essentially a "n-1" core box because you have to lock a core to doing nothing but RAID management. Religious wars aside, this actually can work well. You just have to understand and accept what needs to be done. SW RAID 1, or 10, or etc should not impose a great deal of CPU overhead, and often can be =faster= than a dedicated RAID controller. SW RAID 5 etc in usage scenarios involving far more reads than writes and light write loads can work quite well even if you don't dedicate a core to RAID management, but you must be careful about workloads that are, or that contain parts that are, examples of the first scenario I gave. If you have any doubts about whether you are doing too many writes, dedicate a core to RAID stuff as in the first scenario. [2] How do people on this list monitor their hardware raid? Thus far we have used Dell and the only way to easily monitor disk status is to use their openmanage application. Do other controllers offer easier means of monitoring individual disks in a raid configuration? It seems one advantage software raid has is the ease of monitoring. Many RAID controller manufacturers and storage product companies offer reasonable monitoring / management tools. 3ware AKA AMCC has a good reputation in this area for their cards. So does Areca. I personally do not like Adaptec's SW for this purpose, but YMMV. LSI Logic has had both good and bad SW in this area over the years. Dell, HP, IBM, etc's offerings in this area tend to be product line specific. I'd insist on some sort of "try before you buy" if the ease of use / quality of the SW matters to your overall purchase decision. Then there are the various CSSW and OSSW packages that contain this functionality or are dedicated to it. Go find some reputable reviews. (HEY LURKERS FROM Tweakers.net: ^^^ THAT"S AN ARTICLE IDEA ;-) ) Cheers, Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] increasing database connections
At 01:18 AM 3/1/2007, Joshua D. Drake wrote: Jonah H. Harris wrote: > On 3/1/07, Shiva Sarna <[EMAIL PROTECTED]> wrote: >> I am sorry if it is a repeat question but I want to know if database >> performance will decrease if I increase the max-connections to 2000. At >> present it is 100. > > Most certainly. Adding connections over 200 will degrade performance > dramatically. You should look into pgpool or connection pooling from > the application. huh? That is certainly not my experience. I have systems that show no depreciable performance hit on even 1000+ connections. To be fair to the discussion, these are on systems with 4+ cores. Usually 8+ and significant ram 16/32 gig fo ram. Sincerely, Joshua D. Drake Some caveats. Keeping a DB connection around is relatively inexpensive. OTOH, building and tearing down a DB connection =can be= expensive. Expensive or not, connection build and tear down are pure overhead activities. Any overhead you remove from the system is extra capacity that the system can use in actually answering DB queries (...at least until the physical IO system is running flat out...) So having 1000+ DB connections open should not be a problem in and of itself (but you probably do not want 1000+ queries worth of simultaneous HD IO!...). OTOH, you probably do !not! want to be constantly creating and destroying 1000+ DB connections. Better to open 1000+ DB connections once at system start up time and use them as a connection pool. The potential =really= big performance hit in having lots of connections around is in lots of connections doing simultaneous heavy, especially seek heavy, HD IO. Once you have enough open connections that your physical IO subsystem tends to be maxed out performance wise on the typical workload being handled, it is counter productive to allow any more concurrent DB connections. So the issue is not "how high a max-connections is too high?". It's "how high a max connections is too high for =my= HW running =my= query mix?" The traditional advice is to be conservative and start with a relatively small number of connections and increase that number only as long as doing so results in increased system performance on your job mix. Once you hit the performance plateau, stop increasing max-connections and let connection caching and pooling handle things. If that does not result in enough performance, it's time to initiate the traditional optimization hunt. Also, note Josh's deployed HW for systems that can handle 1000+ connections. ...and you can bet the IO subsystems on those boxes are similarly "beefy". Don't expect miracles out of modest HW. Ron ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
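One quick way to see how much of max_connections is actually in use before deciding to raise it; plain catalog/stats queries, nothing beyond the standard 8.x views:

SELECT count(*) AS connections_in_use,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;

If the in-use number sits far below the limit even under peak load, raising max_connections buys nothing; pooling is the better lever.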
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 07:36 PM 3/1/2007, Jeff Frost wrote: On Thu, 1 Mar 2007, Alex Deucher wrote: >> Postgresql might be choosing a bad plan because your effective_cache_size >> is >> way off (it's the default now right?). Also, what was the block read/write > > yes it's set to the default. > >> speed of the SAN from your bonnie tests? Probably want to tune >> random_page_cost as well if it's also at the default. >> > > --Sequential Output-- --Sequential Input- > --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- > --Seeks-- > MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec > %CP > luna12-san 16000M 58896 91 62931 9 35870 5 54869 82 145504 13 397.7 > 0 > So, you're getting 62MB/s writes and 145MB/s reads. Just FYI, that write speed is about the same as my single SATA drive write speed on my workstation, so not that great. The read speed is decent, though and with that sort of read performance, you might want to lower random_page_cost to something like 2.5 or 2 so the planner will tend to prefer index scans. Right, but the old box was getting ~45MBps on both reads and writes, so it's an improvement for me :) Thanks for the advice, I'll let you know how it goes. Do you think that is because you have a different interface between you and the SAN? ~45MBps is pretty slow - your average 7200RPM ATA133 drive can do that and costs quite a bit less than a SAN. Is the SAN being shared between the database servers and other servers? Maybe it was just random timing that gave you the poor write performance on the old server which might be also yielding occassional poor performance on the new one. Remember that pg, even pg 8.2.3, has a known history of very poor insert speed (see comments on this point by Josh Berkus, Luke Lonergan, etc) For some reason, the code changes that have resulted in dramatic improvements in pg's read speed have not had nearly the same efficacy for writes. Bottom line: pg presently has a fairly low and fairly harsh upper bound on write performance. What exactly that bound is has been the subject of some discussion, but IIUC the fact of its existence is well established. Various proposals for improving the situation exist, I've even made some of them, but AFAIK this is currently considered one of the "tough pg problems". Cheers, Ron Peacetree ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
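The random_page_cost suggestion above can be trialed per-session before touching postgresql.conf; the query below is only an invented stand-in for one of the affected queries:

SET random_page_cost = 2.5;
EXPLAIN ANALYZE
SELECT * FROM some_big_table WHERE indexed_col = 42;
-- compare plan choice and runtime against the default (4.0) in another session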
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 08:56 AM 3/2/2007, Carlos Moreno wrote: Florian Weimer wrote: * Alex Deucher: I have noticed a strange performance regression and I'm at a loss as to what's happening. We have a fairly large database (~16 GB). Sorry for asking, but is this a typo? Do you mean 16 *TB* instead of 16 *GB*? If it's really 16 GB, you should check if it's cheaper to buy more RAM than to fiddle with the existing infrastructure. This brings me to a related question: Do I need to specifically configure something to take advantage of such increase of RAM? In particular, is the amount of things that postgres can do with RAM limited by the amount of shared_buffers or some other parameter? Should shared_buffers be a fixed fraction of the total amount of physical RAM, or should it be the total amount minus half a gigabyte or so? As an example, if one upgrades a host from 1GB to 4GB, what would be the right thing to do in the configuration, assuming 8.1 or 8.2? (at least what would be the critical aspects?) Thanks, Carlos Unfortunately, pg does not (yet! ;-) ) treat all available RAM as a common pool and dynamically allocate it intelligently to each of the various memory data structures. So if you increase your RAM, you will have to manually change the entries in the pg config file to take advantage of it. (and start pg after changing it for the new config values to take effect) The pertinent values are all those listed under "Memory" in the annotated pg conf file: shared_buffers, work_mem, maintenance_work_mem, etc. http://www.powerpostgresql.com/Downloads/annotated_conf_80.html Cheers, Ron Peacetree ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
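After editing postgresql.conf for the larger RAM and restarting, it is worth confirming the server actually picked up the new values; a minimal check:

SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;
SHOW effective_cache_size;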
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 10:16 AM 3/2/2007, Alex Deucher wrote:

On 3/2/07, Florian Weimer <[EMAIL PROTECTED]> wrote:

* Alex Deucher:

> I have noticed a strange performance regression and I'm at a loss as
> to what's happening. We have a fairly large database (~16 GB).

Sorry for asking, but is this a typo? Do you mean 16 *TB* instead of 16 *GB*? If it's really 16 GB, you should check if it's cheaper to buy more RAM than to fiddle with the existing infrastructure.

Yes, 16 GB. I'd rather not shell out for more ram if I'm not even sure that will help. The new system should be faster, or at least as fast, so I'd like to sort out what's going on before I buy more ram.

OK. You
a= went from pg 7.4.x to 8.1.4, AND
b= changed from 4 SPARC CPUs (how many cores? If this is > 4...) to 2 2C Opterons (SPEC and TPC bench differences between these CPUs?), AND
c= went from a Sun box to a "white box" (memory subsystem differences? other differences?), AND
d= went from local HD IO to a SAN (many differences hidden in that one line... ...and is the physical layout of tables and things like pg_xlog sane on the SAN?)
...and you did this by just pulling over the old DB onto the new HW?

May I suggest that it is possible that your schema, queries, etc were all optimized for pg 7.x running on the old HW? (explain analyze shows the old system taking ~1/10 the time per row as well as estimating the number of rows more accurately)

RAM is =cheap=. Much cheaper than the cost of a detective hunt followed by rework to queries, schema, etc. Fitting the entire DB into RAM is guaranteed to help unless this is an OLTP-like application where HD IO is required to be synchronous. If you can fit the entire DB comfortably into RAM, do it and buy yourself the time to figure out the rest of the story w/o impacting production performance.

Cheers,
Ron Peacetree
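As a quick sanity check on the "fit it all in RAM" math, the server itself can report how big the database actually is on disk; these functions exist from 8.1 on, so they are available on the new box:

  -- Total on-disk size of the current database, human readable.
  SELECT pg_size_pretty(pg_database_size(current_database()));

  -- Or per database, to see everything in the cluster.
  SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
  FROM pg_database
  ORDER BY pg_database_size(datname) DESC;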
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 11:03 AM 3/2/2007, Alex Deucher wrote:

On 3/2/07, Ron <[EMAIL PROTECTED]> wrote:

May I suggest that it is possible that your schema, queries, etc were all optimized for pg 7.x running on the old HW? (explain analyze shows the old system taking ~1/10 the time per row as well as estimating the number of rows more accurately)

RAM is =cheap=. Much cheaper than the cost of a detective hunt followed by rework to queries, schema, etc. Fitting the entire DB into RAM is guaranteed to help unless this is an OLTP-like application where HD IO is required to be synchronous. If you can fit the entire DB comfortably into RAM, do it and buy yourself the time to figure out the rest of the story w/o impacting production performance.

Perhaps so. I just don't want to spend $1000 on ram and have it only marginally improve performance, if at all. The old DB works, so we can keep using that until we sort this out.

Alex

1= $1000 worth of RAM is very likely less than the $ worth of, say, 10 hours of your time to your company. Perhaps much less. (Your =worth=, not your pay or even your fully loaded cost. This number tends to be >= 4x what you are paid unless the organization you are working for is in imminent financial danger.) You've already put considerably more than 10 hours of your time into this...

2= If the DB goes from not fitting completely into RAM to being completely RAM resident, you are almost 100% guaranteed a big performance boost. The exception is an OLTP-like app where DB writes can't be done asynchronously (financial transactions, real-time control systems, etc). Data mines should never have this issue.

3= Whether adding enough RAM to make the DB RAM resident (and re-configuring conf, etc, appropriately) solves the problem or not, you will have gotten a serious lead as to what's wrong.

...and I still think looking closely at the actual physical layout of the tables in the SAN is likely to be worth it.

Cheers,
Ron Peacetree
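If it helps to start that inspection from the database side, a catalog query along these lines lists the largest relations and the tablespace each one currently lives in (a rough sketch; pg_relation_size is available from 8.1 on):

  -- Twenty largest user tables and indexes and where they currently live.
  SELECT n.nspname,
         c.relname,
         c.relkind,
         pg_size_pretty(pg_relation_size(c.oid))   AS size,
         COALESCE(t.spcname, '<database default>') AS tablespace
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace
  WHERE c.relkind IN ('r', 'i')
    AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(c.oid) DESC
  LIMIT 20;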
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 02:43 PM 3/2/2007, Alex Deucher wrote:

On 3/2/07, Ron <[EMAIL PROTECTED]> wrote:

...and I still think looking closely at the actual physical layout of the tables in the SAN is likely to be worth it.

How would I go about doing that?

Alex

Hard for me to give specific advice when I don't know what SAN product we are talking about, nor what kind of HDs are in it, nor how those HDs are presently configured...

I quote you in an earlier post: "The RAID groups on the SAN were set up for maximum capacity rather than for performance. Using it for the databases just came up recently."

That implies to me that the SAN is more or less set up as a huge 105-HD (assuming this number is correct? We all know how "assume" is spelled...) JBOD or RAID 5 (or 6, or 5*, or 6*) set.

=IF= that is true, tables are not being given dedicated RAID groups. That implies that traditional lore like having pg_xlog on dedicated spindles is being ignored. Nor is the more general Best Practice of putting the most heavily used tables onto dedicated spindles being followed. In addition, the most space-efficient RAID levels, 5* or 6*, are not the best-performing one; that is RAID 10 (striping your mirrors).

In short, configuring a SAN for maximum capacity is exactly the wrong thing to do if one is planning to use it in the best way to support DB performance.

I assume (there's that word again...) that there is someone in your organization who understands how the SAN is configured and administered. You need to talk to them about these issues.

Cheers,
Ron
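On the database side, the mechanism for giving a table or index its own RAID group is a tablespace (available since 8.0). A minimal sketch follows; the mount points, table, and index names are purely hypothetical, and the actual volumes would come out of the conversation with the SAN admin:

  -- Each LOCATION should be a directory on its own dedicated RAID group,
  -- owned by the postgres OS user.
  CREATE TABLESPACE busy_tables  LOCATION '/mnt/raid10_a/pgdata';
  CREATE TABLESPACE busy_indexes LOCATION '/mnt/raid10_b/pgdata';

  -- Move a hot table and its index onto the dedicated volumes.
  -- (This takes an exclusive lock and rewrites the relation's files,
  -- so do it during a maintenance window.)
  ALTER TABLE orders      SET TABLESPACE busy_tables;
  ALTER INDEX orders_pkey SET TABLESPACE busy_indexes;

pg_xlog has no SQL-level knob in 8.x; it is traditionally relocated by stopping the server and replacing the pg_xlog directory with a symlink to a directory on its own volume.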
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 10:25 AM 3/6/2007, Alex Deucher wrote:

On 3/5/07, Guido Neitzer <[EMAIL PROTECTED]> wrote:

On 05.03.2007, at 19:56, Alex Deucher wrote:

> Yes, I started setting that up this afternoon. I'm going to test that
> tomorrow and post the results.

Good - that may or may not give some insight in the actual bottleneck. You never know, but it seems to be one of the easiest to find out ...

Well, the SAN appears to be the limiting factor. I set up the DB on the local scsi discs (software RAID 1) and performance is excellent (better than the old server). Thanks for everyone's help.

Alex

What kind of SAN is it and how many + what kind of HDs are in it?

Assuming the answers are reasonable...

Profile the table IO pattern your workload generates and start allocating RAID sets to tables or groups of tables based on IO pattern.

For any table or group of tables that has a significant level of write IO, say >= ~25% of the IO mix, try RAID 5 or 6 first, but be prepared to go RAID 10 if performance is not acceptable.

Don't believe any of the standard "lore" regarding what tables to put where or what tables to give dedicated spindles to. Profile, benchmark, and only then start allocating dedicated resources. For instance, I've seen situations where putting pg_xlog on its own spindles was !not! the right thing to do.

Best Wishes,
Ron Peacetree
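One way to get such a profile without external tools is the statistics collector's own views (stats_block_level and stats_row_level need to be enabled in 8.1/8.2 for the block and row counters to be populated). A rough sketch of a per-table read/write mix query, using only columns present in those releases:

  -- Rows written vs. rows read per table, plus blocks actually fetched
  -- from disk, as a rough proxy for each table's IO mix.  Counters
  -- accumulate since the last stats reset.
  SELECT s.relname,
         s.n_tup_ins + s.n_tup_upd + s.n_tup_del           AS rows_written,
         s.seq_tup_read + COALESCE(s.idx_tup_fetch, 0)     AS rows_read,
         io.heap_blks_read + COALESCE(io.idx_blks_read, 0) AS blks_read_from_disk
  FROM pg_stat_user_tables s
  JOIN pg_statio_user_tables io USING (relid)
  ORDER BY rows_written DESC;

Tables near the top of that list are the candidates for their own RAID sets; tables whose reads dwarf their writes can usually share.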