[PERFORM] Samsung 32GB SATA SSD tested

2008-07-22 Thread Jeffrey W. Baker
For background, please read the thread Fusion-io ioDrive, archived at

http://archives.postgresql.org/pgsql-performance/2008-07/msg00010.php

To recap, I tested an ioDrive versus a 6-disk RAID with pgbench on an
ordinary PC.  I now also have a 32GB Samsung SATA SSD, and I have tested
it in the same machine with the same software and configuration.  I
tested it connected to the NVIDIA CK804 SATA controller on the
motherboard, and as a pass-through disk on the Areca RAID controller,
with write-back caching enabled.

                        Service Time Percentile, millis
             R/W TPS   R-O TPS    50th   80th   90th   95th
 RAID            182       673      18     32     42     64
 Fusion          971      4792       8      9     10     11
 SSD+NV          442      4399      12     18     36     43
 SSD+Areca       252      5937      12     15     17     21

As you can see, there are tradeoffs.  The motherboard's ports are
substantially faster on the TPC-B type of workload.  This little, cheap
SSD achieves almost half the performance of the ioDrive (i.e. similar
performance to a 50-disk SAS array.)  The RAID controller does a better
job on the read-only workload, surpassing the ioDrive by 20%.
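For anyone reproducing these numbers, the percentile columns can be derived
from raw per-transaction latencies (pgbench can log them).  A minimal
nearest-rank sketch in Python, with hypothetical sample data rather than the
measured runs:

```python
def percentile(samples, p):
    # Nearest-rank percentile of latency samples, in milliseconds.
    ranked = sorted(samples)
    k = max(0, round(p / 100.0 * len(ranked)) - 1)
    return ranked[k]

# Hypothetical per-transaction service times (millis), not measured data.
latencies = [8, 9, 10, 11, 12, 15, 17, 21, 36, 43]
for p in (50, 80, 90, 95):
    print(p, percentile(latencies, p))
```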

Strangely the RAID controller behaves badly on the TPC-B workload.  It
is faster than disk, but not by a lot, and it's much slower than the
other flash configurations.  The read/write benchmark did not vary when
changing the number of clients between 1 and 8.  I suspect this is some
kind of problem with Areca's kernel driver or firmware.

On the bright side, the Samsung+Areca configuration offers excellent
service time distribution, comparable to that achieved by the ioDrive.
Using the motherboard's SATA ports gave service times comparable to the
disk RAID.

The performance is respectable for a $400 device.  You get about half
the tps and half the capacity of the ioDrive, but for one fifth the
price and in the much more convenient SATA form factor.

Your faithful investigator,
jwb


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Storing Digital Video

2006-02-04 Thread Jeffrey W. Baker
On Tue, 2006-01-31 at 16:32 -0800, Rodrigo Madera wrote:
 I am concerned with performance issues involving the storage of DV on
 a database.
 
 I thought of some options; which would be the most advisable for speed?
 
 1) Pack N frames inside a container and store the container to the db.
 2) Store each frame in a separate record in the table frames.
 3) (type something here)

How about some more color?  _Why_, for example, would you store video in
a relational database?

-jwb

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Huge Data sets, simple queries

2006-02-01 Thread Jeffrey W. Baker
On Tue, 2006-01-31 at 21:53 -0800, Luke Lonergan wrote:
 Jeffrey,
 
 On 1/31/06 8:09 PM, Jeffrey W. Baker [EMAIL PROTECTED] wrote:
  ... Prove it.
  I think I've proved my point.  Software RAID1 read balancing provides
  0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
  respectively.  In the presence of random I/O, the results are even
  better.
  Anyone who thinks they have a single-threaded workload has not yet
  encountered the autovacuum daemon.
 
 Good data - interesting case.  I presume from your results that you had to
 make the I/Os non-overlapping (the skip option to dd) in order to get the
 concurrent access to work.  Why the particular choice of offset - 3.2GB in
 this case?

No particular reason.  8k x 10 is what the last guy used upthread.
 
 So - the bandwidth doubles in specific circumstances under concurrent
 workloads - not relevant to Huge Data sets, simple queries, but possibly
 helpful for certain kinds of OLTP applications.

Ah, but someday Pg will be able to concurrently read from two
datastreams to complete a single query.  And that day will be glorious
and fine, and you'll want as much disk concurrency as you can get your
hands on.

-jwb




Re: [PERFORM] Index Usage using IN

2006-02-01 Thread Jeffrey W. Baker
On Thu, 2006-02-02 at 09:12 +1300, Ralph Mason wrote:
 Hi,
 
 I have 2 tables both have an index on ID (both ID columns are an oid).
 
 I want to find only the rows in one and not the other.
 
 Select ID from TableA where ID not IN ( Select ID from Table B)

Have you considered this:

SELECT ID from TableA EXCEPT Select ID from Table B

?
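The EXCEPT form can be tried on any SQL engine; here is a self-contained
sketch using Python's sqlite3, with hypothetical IDs standing in for the
poster's TableA/TableB:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE TableA (ID INTEGER PRIMARY KEY);
    CREATE TABLE TableB (ID INTEGER PRIMARY KEY);
    INSERT INTO TableA VALUES (1), (2), (3), (4);
    INSERT INTO TableB VALUES (2), (4);
""")
rows = con.execute(
    "SELECT ID FROM TableA EXCEPT SELECT ID FROM TableB"
).fetchall()
print(sorted(r[0] for r in rows))  # IDs present only in TableA
```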

-jwb




Re: [PERFORM] Index Usage using IN

2006-02-01 Thread Jeffrey W. Baker
On Wed, 2006-02-01 at 12:22 -0800, Jeffrey W. Baker wrote:
 On Thu, 2006-02-02 at 09:12 +1300, Ralph Mason wrote:
  Hi,
  
  I have 2 tables both have an index on ID (both ID columns are an oid).
  
  I want to find only the rows in one and not the other.
  
  Select ID from TableA where ID not IN ( Select ID from Table B)
 
 Have you considered this:
 
 SELECT ID from TableA EXCEPT Select ID from Table B

Alternately:

   SELECT a.ID 
 FROM TableA AS a 
LEFT JOIN TableB AS b 
   ON a.ID = b.ID 
WHERE b.ID IS NULL
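For non-NULL unique IDs this anti-join returns the same rows as the EXCEPT
form; a sketch with Python's sqlite3 and hypothetical data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE TableA (ID INTEGER PRIMARY KEY);
    CREATE TABLE TableB (ID INTEGER PRIMARY KEY);
    INSERT INTO TableA VALUES (1), (2), (3), (4);
    INSERT INTO TableB VALUES (2), (4);
""")
anti = con.execute("""
    SELECT a.ID
      FROM TableA AS a
 LEFT JOIN TableB AS b ON a.ID = b.ID
     WHERE b.ID IS NULL
""").fetchall()
print(sorted(r[0] for r in anti))  # rows with no match in TableB
```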

-jwb



Re: [PERFORM] Huge Data sets, simple queries

2006-01-29 Thread Jeffrey W. Baker
On Sun, 2006-01-29 at 13:44 -0500, Luke Lonergan wrote:
 Depesz,
 
  [mailto:[EMAIL PROTECTED] On Behalf Of 
  hubert depesz lubaczewski
  Sent: Sunday, January 29, 2006 3:25 AM
 
  hmm .. do i understand correctly that you're suggesting that 
  using raid 10 and/or hardware raid adapter might hurt disc 
  subsystem performance? could you elaborate on the reasons, 
  please? it's not that i'm against the idea - i'm just curious 
  as this is very against-common-sense. and i always found it 
  interesting when somebody states something that uncommon...

 Oh - and about RAID 10 - for large data work it's more often a waste of
 disk performance-wise compared to RAID 5 these days.  RAID5 will almost
 double the performance on a reasonable number of drives.

I think you might want to be more specific here.  I would agree with you
for data warehousing, decision support, data mining, and similar
read-mostly non-transactional loads.  For transactional loads RAID-5 is,
generally speaking, a disaster due to the read-before-write problem.
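The read-before-write problem in a nutshell: updating one block of a RAID-5
stripe requires reading the old data block and the old parity block first,
because the new parity is old_parity XOR old_data XOR new_data.  A sketch of
the arithmetic (illustrative only; real controllers do this in firmware):

```python
def raid5_small_write(old_data, old_parity, new_data):
    # Two reads (old data, old parity) must precede the two writes:
    # the read-modify-write penalty for small random writes on RAID-5.
    new_parity = old_parity ^ old_data ^ new_data
    return new_data, new_parity

# Parity is the XOR of all data blocks in the stripe; the small-write
# shortcut preserves that invariant without touching the other blocks.
stripe = [0b1010, 0b0110, 0b0001]
parity = stripe[0] ^ stripe[1] ^ stripe[2]
stripe[1], parity = raid5_small_write(stripe[1], parity, 0b1111)
assert parity == stripe[0] ^ stripe[1] ^ stripe[2]
```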

While we're on the topic, I just installed another one of those Areca
ARC-1130 controllers with 1GB cache.  It's ludicrously fast: 250MB/sec
burst writes, CPU-limited reads.  I can't recommend them highly enough.

-jwb

PS: Could you look into fixing your mailer?  Your messages sometimes
don't contain In-Reply-To headers, and therefore don't thread properly.




Re: [PERFORM] Huge Data sets, simple queries

2006-01-28 Thread Jeffrey W. Baker
On Sat, 2006-01-28 at 10:55 -0500, Tom Lane wrote:
 
 Assuming that month means what it sounds like, the above would
 result
 in running twelve parallel sort/uniq operations, one for each month
 grouping, to eliminate duplicates before counting.  You've got sortmem
 set high enough to blow out RAM in that scenario ...

Hrmm, why is it that with a similar query I get a far simpler plan than
you describe, and relatively snappy runtime?

  select date
   , count(1) as nads
   , sum(case when premium then 1 else 0 end) as npremium
   , count(distinct(keyword)) as nwords
   , count(distinct(advertiser)) as nadvertisers 
from data 
group by date 
order by date asc

                               QUERY PLAN
---------------------------------------------------------------------
 GroupAggregate  (cost=0.00..14452743.09 rows=721 width=13)
   ->  Index Scan using data_date_idx on data  (cost=0.00..9075144.27
         rows=430206752 width=13)
(2 rows)

=# show server_version;
 server_version
----------------
 8.1.2
(1 row)
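The query shape itself is portable; a self-contained sketch of the same
aggregates over a handful of hypothetical rows, using Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE data (date TEXT, premium INTEGER,
                       keyword TEXT, advertiser TEXT);
    INSERT INTO data VALUES
      ('2006-01-01', 1, 'cars',  'acme'),
      ('2006-01-01', 0, 'cars',  'bravo'),
      ('2006-01-01', 1, 'boats', 'acme'),
      ('2006-01-02', 0, 'cars',  'acme');
""")
rows = con.execute("""
    SELECT date,
           COUNT(1)                                 AS nads,
           SUM(CASE WHEN premium THEN 1 ELSE 0 END) AS npremium,
           COUNT(DISTINCT keyword)                  AS nwords,
           COUNT(DISTINCT advertiser)               AS nadvertisers
      FROM data
  GROUP BY date
  ORDER BY date ASC
""").fetchall()
print(rows)
```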

-jwb




Re: [PERFORM] Huge Data sets, simple queries

2006-01-27 Thread Jeffrey W. Baker
On Fri, 2006-01-27 at 20:23 -0500, Mike Biamonte wrote:
 
 Does anyone have any experience with extremely large data sets?
 I'm mean hundreds of millions of rows.

Sure, I think more than a few of us do.  Just today I built a summary
table from a 25GB primary table with ~430 million rows.  This took about
45 minutes.

 The queries I need to run on my 200 million transactions are relatively
 simple:
 
select month, count(distinct(cardnum)), count(*), sum(amount) from
 transactions group by month;
 
 This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3, (2.4 Kernel) with
 RAID-10 (15K drives)
 and 12 GB Ram.  I was expecting it to take about 4 hours - based on some
 experience with a
 similar dataset on a different machine (RH9, PG7.3 Dual Xeon, 4GB RAM,
 Raid-5 10K drives)

Possibly the latter machine has a faster I/O subsystem.  How large is
the table on disk?

   This machine is COMPLETELY devoted to running these relatively simple
 queries one at a
 time. (No multi-user support needed!)  I've been tooling with the various
 performance settings:
 effective_cache at 5GB, shared_buffers at 2 GB, workmem, sortmem at 1 GB
 each.
 ( Shared buffers puzzles me a bit - my instinct says to set it as high as
 possible,
 but everything I read says that too high can hurt performance.)
 
Any ideas for performance tweaking in this kind of application would be
 greatly appreciated.
 We've got indexes on the fields being grouped, 
 and always vacuum analyze
 after building them.

Probably vacuum makes no difference.

It's difficult to just try various ideas because each attempt takes a
 full day to test.  Real
 experience is needed here!

Can you send us an EXPLAIN of the query?  I believe what you're seeing
here is probably:

Aggregate
+-Sort
  +-Sequential Scan

or perhaps:

Aggregate
+-Index Scan

I have a feeling that the latter will be much faster.  If your table has
been created over time, then it is probably naturally ordered by date,
and therefore also ordered by month.  You might expect a Sequential Scan
to be the fastest, but the Sort step will be a killer.  On the other
hand, if your table is badly disordered by date, the Index Scan could
also be very slow.

Anyway, send us the query plan and also perhaps a sample of vmstat
during the query.

For what it's worth, I have:

effective_cache_size| 70
cpu_tuple_cost  | 0.01
cpu_index_tuple_cost| 0.001
random_page_cost| 3
shared_buffers  | 5
temp_buffers| 1000
work_mem| 1048576 = for this query only

And here's a few lines from vmstat during the query:

procs ---memory-- ---swap-- -io --system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   incs us sy id wa
 2  1 76  43476  94916 765514800 78800 0 1662   788 68 12  0 20
 1  1 76  45060  91196 765808800 78028 0 1639   712 71 11  0 19
 2  0 76  44668  87624 766296000 7892452 1650   736 69 12  0 19
 2  0 76  45300  83672 766743200 8353616 1688   768 71 12  0 18
 1  1 76  45744  80652 767071200 84052 0 1691   796 70 12  0 17

That's about 80MB/sec sequential input, for comparison purposes.
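For reference, that figure comes from vmstat's "bi" column, which reports
blocks read in per second (one block being 1024 bytes on Linux); a sketch of
the arithmetic over a hypothetical line in the same shape:

```python
# Hypothetical vmstat line; "bi" is the ninth whitespace-separated field.
line = " 2  1 76  43476  94916 7655148 0 0 78800 0 1662 788 68 12 0 20"
fields = line.split()
bi = int(fields[8])           # blocks read in per second
mb_per_sec = bi * 1024 / 1e6
print(round(mb_per_sec, 1))   # about the 80 MB/sec quoted above
```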

-jwb




[PERFORM] Process executing COPY opens and reads every table on the system

2005-12-29 Thread Jeffrey W. Baker
I have an instance of PG 7.4 where I would really like to execute some
schema changes, but every schema change is blocked waiting for a process
doing a COPY.  That query is:

COPY drill.trades (manager, sec_id, ticker, bridge_tkr, date, type,
short, quantity, price, prin, net_money, factor) TO stdout;

So it's only involved with a single table in a single schema.
Unfortunately, what this process is doing is opening and reading every
table in the database:

# strace -e open,close -p 29859
Process 29859 attached - interrupt to quit
open(/var/lib/postgres/data/base/7932340/2442094542, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.1, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.3, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.5, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.5, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.3, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.6, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.9, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.3, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2414561511, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2426495316, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2426495316, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2414561511, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2426495316, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2426495316, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2414561511, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2429205386, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2429205433, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2426495316, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2414561511, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2429205441, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2414561511, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2426495316, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.3, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.10, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2023517557.9, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/2298808676/2298808939.10, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/358185104.15, O_RDWR) = 49
close(49)   = 0
open(/var/lib/postgres/data/base/7932340/2414561511, O_RDWR) = 49
close(49)   = 0
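To see how far beyond the COPY'd table the process strays, one can count the
distinct files in such a trace; a sketch over a few hypothetical lines in the
same shape as the output above:

```python
import re
from collections import Counter

# A few hypothetical strace lines, shaped like the trace above.
trace = """\
open(/var/lib/postgres/data/base/7932340/2442094542, O_RDWR) = 49
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
open(/var/lib/postgres/data/base/7932340/358185104.16, O_RDWR) = 49
open(/var/lib/postgres/data/base/7932340/2023517557.1, O_RDWR) = 49
"""
opens = Counter(re.findall(r"open\(([^,]+), O_RDWR\)", trace))
print(len(opens))                  # distinct files touched
print(opens.most_common(1)[0][0])  # the most re-opened file
```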

[PERFORM] Invulnerable VACUUM process thrashing everything

2005-12-29 Thread Jeffrey W. Baker
A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally
decided to VACUUM a table which has not been updated in over a year and
is more than one terabyte on the disk.  Because of the very high
transaction load on this database, this VACUUM has been ruining
performance for almost a month.  Unfortunately it seems invulnerable to
killing by signals:

# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -HUP 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -INT 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM
# kill -PIPE 15308
# ps ax | grep VACUUM
15308 ?        D    588:00 postgres: postgres skunk [local] VACUUM

o/~ But the cat came back, the very next day ...

I assume that if I kill this with SIGKILL, that will bring down every
other postgres process, so that should be avoided.  But surely there is
a way to interrupt this.  If I had some reason to shut down the
instance, I'd be screwed, it seems.

-jwb



Re: [PERFORM] Invulnerable VACUUM process thrashing everything

2005-12-29 Thread Jeffrey W. Baker
On Thu, 2005-12-29 at 22:53 +, Russ Garrett wrote:
 In my experience a kill -9 has never resulted in any data loss in this 
 situation (it will cause postgres to detect that the process died, shut 
 down, then recover), and most of the time it only causes a 5-10sec 
 outage. I'd definitely hesitate to recommend it in a production context 
 though, especially since I think there are some known race-condition 
 bugs in 7.4.
 
 VACUUM *will* respond to a SIGTERM, but it doesn't check very often - 
 I've often had to wait hours for it to determine that it's been killed, 
 and my tables aren't anywhere near 1TB. Maybe this is a place where 
 things could be improved...

FWIW, I murdered this process with SIGKILL, and the recovery was very
short.


 Incidentally, I have to kill -9 some of our MySQL instances quite 
 regularly because they do odd things. Not something you want to be 
 doing, especially when MySQL takes 30mins to recover.

Agreed.  After mysql shutdown with MyISAM, all tables must be checked
and usually many need to be repaired.  This takes a really long
time.

-jwb

 Russ Garrett
 Last.fm Ltd.
 [EMAIL PROTECTED]
 
 Ron wrote:
 
  Ick.  Can you get users and foreign connections off that machine, lock 
  them out for some period, and renice the VACUUM?
 
  Shedding load and keeping it off while VACUUM runs high priority might 
  allow it to finish in a reasonable amount of time.
  Or
  Shedding load and dropping the VACUUM priority might allow a kill 
  signal to get through.
 
  Hope this helps,
  Ron
 
 
  At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote:
 
  A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally
  decided to VACUUM a table which has not been updated in over a year and
  is more than one terabyte on the disk.  Because of the very high
  transaction load on this database, this VACUUM has been ruining
   performance for almost a month.  Unfortunately it seems invulnerable to
  killing by signals:
 
  # ps ax | grep VACUUM
  15308 ?D588:00 postgres: postgres skunk [local] VACUUM
  # kill -HUP 15308
  # ps ax | grep VACUUM
  15308 ?D588:00 postgres: postgres skunk [local] VACUUM
  # kill -INT 15308
  # ps ax | grep VACUUM
  15308 ?D588:00 postgres: postgres skunk [local] VACUUM
  # kill -PIPE 15308
  # ps ax | grep VACUUM
  15308 ?D588:00 postgres: postgres skunk [local] VACUUM
 
  o/~ But the cat came back, the very next day ...
 
  I assume that if I kill this with SIGKILL, that will bring down every
  other postgres process, so that should be avoided.  But surely there is
  a way to interrupt this.  If I had some reason to shut down the
  instance, I'd be screwed, it seems.
 
 
 
 
 
  ---(end of broadcast)---
  TIP 2: Don't 'kill -9' the postmaster
 
 



Re: [PERFORM] opinion on disk speed

2005-12-08 Thread Jeffrey W. Baker
On Thu, 2005-12-08 at 11:52 -0500, Vivek Khera wrote:
 I have a choice to make on a RAID enclosure:
 
 14x 36GB 15kRPM ultra 320 SCSI drives
 
 OR
 
 12x 72GB 10kRPM ultra 320 SCSI drives
 
 both would be configured into RAID 10 over two SCSI channels using a  
 megaraid 320-2x card.
 
 My goal is speed.  Either would provide more disk space than I would  
 need over the next two years.
 
 The database does a good number of write transactions, and a decent  
 number of sequential scans over the whole DB (about 60GB including  
 indexes) for large reports.

The STR of 15k is quite a bit higher than 10k.  I'd be inclined toward
the 15k if it doesn't impact the budget.

For the write transactions, the speed and size of the DIMM on that LSI
card will matter the most.  I believe the max memory on that adapter is
512MB.  These cost so little that it wouldn't make sense to go with
anything smaller.

When comparing the two disks, don't forget to check for supported SCSI
features.  In the past I've been surprised that some 10k disks don't
support packetization, QAS, and so forth.  All 15k disks seem to support
these.

Don't forget to post some benchmarks when your vendor delivers ;)

-jwb



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Jeffrey W. Baker
On Wed, 2005-10-05 at 12:14 -0400, Ron Peacetree wrote:
 I've now gotten verification from multiple working DBA's that DB2, Oracle, and
 SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in
 setups akin to Oracle RAC) when attached to a decent (not outrageous, but
 decent) HD subsystem...
 
 I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR
 is attainable.  Cache based bursts that high, yes.  ASTR, no.

I find your tone annoying.  That you do not have access to this level of
hardware proves nothing, other than pointing out that your repeated
emails on this list are based on supposition.

If you want 1GB/sec STR you need:

1) 1 or more Itanium CPUs
2) 24 or more disks
3) 2 or more SATA controllers
4) Linux

Have fun.

-jwb



Re: [PERFORM] Ultra-cheap NVRAM device

2005-10-03 Thread Jeffrey W. Baker
On Mon, 2005-10-03 at 11:15 -0600, Dan Harris wrote:
 On Oct 3, 2005, at 5:02 AM, Steinar H. Gunderson wrote:
 
  I thought this might be interesting, not the least due to the  
  extremely low
  price ($150 + the price of regular DIMMs):
 
 
 
 
 This has been posted before, and the main reason nobody got very  
 excited is that:
 
 a) it only uses the PCI bus to provide power to the device, not for I/O
 b) It is limited to SATA bandwidth
 c) The benchmarks did not prove it to be noticeably faster than a  
 good single SATA drive
 
 A few of us were really excited at first too, until seeing the  
 benchmarks..

Also, no ECC support.  You'd be crazy to use it for anything.

-jwb



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Jeffrey W. Baker
On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
 Josh,
 
 On 9/29/05 9:54 AM, Josh Berkus josh@agliodbs.com wrote:
 
  Following an index creation, we see that 95% of the time required is the
  external sort, which averages 2mb/s.  This is with seperate drives for
  the WAL, the pg_tmp, the table and the index.  I've confirmed that
  increasing work_mem beyond a small minimum (around 128mb) had no benefit
  on the overall index creation speed.
 
 Yep!  That about sums it up - regardless of taking 1 or 2 passes through
 the heap being sorted, 1.5 - 2 MB/s is the wrong number.

Yeah this is really bad ... approximately the speed of GNU sort.

Josh, do you happen to know how many passes are needed in the multiphase
merge on your 60GB table?

Looking through tuplesort.c, I have a couple of initial ideas.  Are we
allowed to fork here?  That would open up the possibility of using the
CPU and the I/O in parallel.  I see that tuplesort.c also suffers from
the kind of postgresql-wide disease of calling all the way up and down a
big stack of software for each tuple individually.  Perhaps it could be
changed to work on vectors.

I think the largest speedup will be to dump the multiphase merge and
merge all tapes in one pass, no matter how large M.  Currently M is
capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
the tape.  It could be done in a single pass heap merge with N*log(M)
comparisons, and, more importantly, far less input and output.
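The single-pass alternative is a heap-based M-way merge; Python's
heapq.merge sketches the idea on small hypothetical in-memory runs standing
in for sort tapes:

```python
import heapq

# Hypothetical sorted runs standing in for M sort tapes.
runs = [[1, 4, 9], [2, 3, 10], [5, 6, 7], [0, 8]]

# A heap-based merge reads each of the N tuples exactly once and does
# O(log M) comparisons per tuple, rather than re-reading the data on
# every pass of a capped multiphase merge.
merged = list(heapq.merge(*runs))
print(merged)
```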

I would also recommend using an external processes to asynchronously
feed the tuples into the heap during the merge.

What's the timeframe for 8.2?

-jwb






Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-27 Thread Jeffrey W. Baker
On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

 That Btree can be used to generate a physical reordering of the data
 in one pass, but that's the weakest use for it.  The more powerful
 uses involve allowing the Btree to persist and using it for more
 efficient re-searches or combining it with other such Btrees (either as
 a step in task distribution across multiple CPUs or as a more efficient
 way to do things like joins by manipulating these Btrees rather than
 the actual records.)

Maybe you could describe some concrete use cases.  I can see what you
are getting at, and I can imagine some advantageous uses, but I'd like
to know what you are thinking.

Specifically I'd like to see some cases where this would beat sequential
scan.  I'm thinking that in your example of a terabyte table with a
column having only two values, all the queries I can think of would be
better served with a sequential scan.

Perhaps I believe this because you can now buy as much sequential I/O as
you want.  Random I/O is the only real savings.

-jwb





Re: [PERFORM] Battery Backed Cache for RAID

2005-09-14 Thread Jeffrey W. Baker
On Wed, 2005-09-14 at 11:25 -0700, Peter Darley wrote:
   I'm getting a new server for our database, and I have a quick question
 about RAID controllers with a battery backed cache.  I understand that the
 cache will allow the cache to be written out if the power fails to the box,
 which allows it to report a write as committed safely when it's not actually
 committed.

Actually the cache will just hold its contents while the power is out.
When the power is restored, the RAID controller will complete the writes
to disk.  If the battery does not last through the outage, the data is
lost.

   My question is, if the power goes off, and the drives stop, how does the
 battery backed cache save things out to the dead drives?  Is there another
 component that is implied that will provide power to the drives that I
 should be looking into as well?

A UPS would allow you to do an orderly shutdown and write contents to
disk during a power failure.  However a UPS can be an extra point of
failure.

-jwb



Re: [PERFORM] Observation about db response time

2005-09-05 Thread Jeffrey W. Baker
On Tue, 2005-08-30 at 08:13 -0500, Frank Wiles wrote:
 On Tue, 30 Aug 2005 18:35:30 +0530
 Akshay Mathur [EMAIL PROTECTED] wrote:
 
  Hello Friends,
   
  We were having a database in pgsql7.4.2 The database was responding
  very slowly even after full vacuum analyze (select count(*) from
  some_table_having_18000_records was taking 18 Sec).
   
  We took a backup of that db and restored it back. Now the same db on
  same PC is responding fast (same query is taking 18 ms).
   
  But we can't do the same as a solution of slow response. Do anybody
  has faced similar problem? Is this due to any internal problem of
  pgsql? Is there any clue to fasten the database?
 
   This could be because you don't have max_fsm_pages and
   max_fsm_relations setup correctly or are not doing full vacuums 
   often enough. 
 
   If your database deletes a ton of data as a matter of course then
   sometimes a full vacuum will not clear up as much space as it could.
 
   Try increasing those configuration values and doing vacuums more
   often. 
 
   You should also explore upgrading to the latest 8.0, as you will
   no doubt see noticeable speed improvements. 

This can also be caused by index bloat.  VACUUM does not clear out the
index.  You must use REINDEX for that.

-jwb



Re: [PERFORM] Limit + group + join

2005-08-25 Thread Jeffrey W. Baker
On Thu, 2005-08-25 at 18:56 -0700, Jeffrey W. Baker wrote:
 On Fri, 2005-08-26 at 02:27 +0200, Tobias Brox wrote:
  Consider this setup - which is a gross simplification of parts of our
  production system ;-)
  
create table c (id integer primary key);
create table b (id integer primary key, c_id integer);
create index b_on_c on b(c_id)
  
insert into c (select ... lots of IDs ...);
insert into b (select id, id from c); /* keep it simple :-) */

  Now, I'm just interessted in some few rows.  
  
  All those gives good plans:
  
  explain select c.id from c order by c.id limit 1;
  explain select c.id from c group by c.id order by c.id limit 1;
  explain select c.id from c join b on c_id=c.id order by c.id limit 1;
  
  ... BUT ... combining join, group and limit makes havoc:
  
  explain select c.id from c join b on c_id=c.id  group by c.id order by c.id
  desc limit 5;
 
 Where's b in this join clause?  It looks like a cartesian product to me.

Nevermind.  I read c_id as c.id.

-jwb




Re: [PERFORM] Read/Write block sizes

2005-08-24 Thread Jeffrey W. Baker
On Wed, 2005-08-24 at 01:56 -0400, Tom Lane wrote:
 Jeffrey W. Baker [EMAIL PROTECTED] writes:
  On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
  Dont forget that already in postgres, you have a process per connection, 
  and
  all the processes take care of their own I/O.
 
  That's the problem.  Instead you want 1 or 4 or 10 i/o slaves
  coordinating the I/O of all the backends optimally.  For instance, with
  synchronous scanning.
 
 And why exactly are we going to do a better job of I/O scheduling than
 the OS itself can do?
...
 There are some things we could do to reduce the impedance between us and
 the OS --- for instance, the upthread criticism that a seqscan asks the
 OS for only 8K at a time is fair enough.  But that doesn't translate
 to a conclusion that we should schedule the I/O instead of the OS.

Synchronous scanning is a fairly huge and obvious win.  If you have two
processes 180 degrees out-of-phase in a linear read, neither process is
going to get anywhere near the throughput they would get from a single
scan.

I think you're being deliberately obtuse with regards to file I/O and
the operating system.  The OS isn't magical.  It has to strike a balance
between a reasonable read latency and a reasonable throughput.  As far
as the kernel is concerned, a busy postgresql server is
indistinguishable from 100 unrelated activities.  All backends will be
served equally, even if in this case equally means quite badly all
around.

An I/O slave process could be a big win in Postgres for many kinds of
reads.  Instead of opening and reading files the backends would connect
to the I/O slave and request the file be read.  If a scan of that file
were already underway, the new backends would be attached.  Otherwise a
new scan would commence.  In either case, the slave process can issue
(sometimes non-dependent) reads well ahead of the needs of the backend.
You may think the OS can do this for you, but it can't.  Only postgres
knows that it needs the whole file from beginning to end.  The OS can
only guess.
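
A minimal sketch of the attach-to-a-scan-in-progress idea (all names here are mine, purely illustrative; real synchronized scanning later landed in PostgreSQL with a different design):

```python
class SharedScan:
    """Toy I/O-slave: one linear pass over a file's blocks is shared by
    every backend, no matter when each backend joins the scan."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.pos = 0  # next block the shared linear pass will read

    def attach(self):
        """A backend joins at the current scan position and reads one
        full circle, so a single sequential pass serves all backends."""
        start = self.pos
        for i in range(self.nblocks):
            block = (start + i) % self.nblocks
            self.pos = (block + 1) % self.nblocks
            yield block  # stand-in for handing the block's contents over

scan = SharedScan(8)
g1 = scan.attach()
first_half = [next(g1) for _ in range(4)]  # first backend reads 0..3
late = list(scan.attach())                 # second backend joins mid-file
print(first_half, late)  # [0, 1, 2, 3] [4, 5, 6, 7, 0, 1, 2, 3]
```

The late joiner starts at block 4 and wraps around, so the disk never seeks back to serve it a second linear pass.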

Ask me sometime about my replacement for GNU sort.  It uses the same
sorting algorithm, but it's an order of magnitude faster due to better
I/O strategy.  Someday, in my infinite spare time, I hope to demonstrate
that kind of improvement with a patch to pg.

-jwb


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Read/Write block sizes (Was: Caching by Postgres)

2005-08-23 Thread Jeffrey W. Baker
On Tue, 2005-08-23 at 19:12 -0400, Michael Stone wrote:
 On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote:
 Actually the OS already does some of that readahead etc., if it does
 some sort of throttling/clubbing of reads/writes.
 
 Note that I specified the fully cached case--even with the workload in
 RAM the system still has to process a heck of a lot of read calls.
 
 * Introduce a multiblock or extent tunable variable where you can
 define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk
 and store it in the bufferpool.  (Maybe writes too.)  (Most devices now
 support up to 1MB chunks for reads and writes.)
 
 Yeah. The problem with relying on OS readahead is that the OS doesn't
 know whether you're doing a sequential scan or an index scan; if you
 have the OS aggressively read ahead you'll kill your seek performance.
 OTOH, if you don't do readaheads you'll kill your sequential scan
 performance. At the app level you know which makes sense for each
 operation.

This is why we have madvise(2), with MADV_RANDOM and MADV_SEQUENTIAL.
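
For the record, those hints live in madvise(2) and its file-descriptor cousin posix_fadvise(2); a small sketch of declaring the access pattern from userspace (Linux-specific behavior, shown here via Python's thin wrappers):

```python
import mmap
import os
import tempfile

# Scratch file standing in for a relation segment.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * mmap.PAGESIZE * 4)

# fadvise: tell the kernel a sequential scan is coming, so read ahead...
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
# ...or that access is random, so aggressive readahead would only hurt.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)

# The mmap equivalents of the same two hints:
m = mmap.mmap(fd, 0)
m.madvise(mmap.MADV_SEQUENTIAL)
m.madvise(mmap.MADV_RANDOM)

m.close()
os.close(fd)
os.unlink(path)
print("hints issued")
```

Either way the application, which knows whether it is doing a seqscan or an index probe, is the one telling the kernel, rather than the kernel guessing.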

-jwb



Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jeffrey W. Baker
On Tue, 2005-08-23 at 19:31 -0700, Josh Berkus wrote:
 Steve,
 
  I would assume that dbt2 with STP helps minimize the amount of hours
  someone has to invest to determine performance gains with configurable
  options?
 
 Actually, these I/O operation issues show up mainly with DW workloads, so the 
 STP isn't much use there.   If I can ever get some of these machines back 
 from the build people, I'd like to start testing some stuff.
 
 One issue with testing this is that currently PostgreSQL doesn't support block
 sizes above 128K.  We've already done testing on that (well, Mark has) and
 the performance gains aren't even worth the hassle of remembering you're on a 
 different block size (like, +4%).
 
 What the Sun people have done with other DB systems is show that substantial 
 performance gains are possible on large databases (100G) using block sizes 
 of 1MB.   I believe that's possible (and that it probably makes more of a 
 difference on Solaris than on BSD) but we can't test it without some hackery 
 first.

To get decent I/O you need 1MB fundamental units all the way down the
stack.  You need a filesystem that can take a 1MB write well, and you
need an I/O scheduler that will keep it together, and you need a storage
controller that can eat a 1MB request at once.  Ideally you'd like an
architecture with a 1MB page (Itanium has this, and AMD64 Linux will
soon have this.)  The Lustre people have done some work in this area,
opening up the datapaths in the kernel so they can keep the hardware
really working.  They even modified the QLogic SCSI/FC driver so it
supports such large transfers.  Their work has shown that you can get
significant perf boost on Linux just by thinking in terms of larger
transfers.

Unfortunately I'm really afraid that this conversation is about trees
when the forest is the problem.  PostgreSQL doesn't even have an async
reader, which is the sort of thing that could double or triple its
performance.  You're talking about block sizes and such, but the kinds
of improvements you can get there are in the tens of percents at most.

-jwb




Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jeffrey W. Baker
On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
 As for the async IO, sure you might think 'oh async IO would be so cool!!'
 and I did, once, too. But then I sat down and _thought_ about it, and
 decided well, no, actually, there are _very_ few areas it could actually help,
 and in most cases it just makes it easier to drive your box into
 lseek()-induced IO collapse.
 
 Don't forget that already in postgres, you have a process per connection, and
 all the processes take care of their own I/O.

That's the problem.  Instead you want 1 or 4 or 10 i/o slaves
coordinating the I/O of all the backends optimally.  For instance, with
synchronous scanning.

-jwb




Re: [PERFORM] Query plan looks OK, but slow I/O - settings advice?

2005-08-19 Thread Jeffrey W. Baker
On Tue, 2005-08-16 at 10:46 -0700, Roger Hand wrote:
 The disks are ext3 with journalling type of ordered, but this was later 
 changed to writeback with no apparent change in speed.
 
 They're on a Dell poweredge 6650 with LSI raid card, setup as follows:
 4 disks raid 10 for indexes (145GB) - sdc1
 6 disks raid 10 for data (220GB) - sdd1
 2 mirrored disks for logs - sdb1
 
 stripe size is 32k
 cache policy: cached io (am told the controller has bbu)
 write policy: write-back
 read policy: readahead

I assume you are using Linux 2.6.  Have you considered booting your
machine with elevator=deadline?  You can also change this at runtime
using sysfs.
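
A hedged sketch of what "change this at runtime using sysfs" means (the helper functions are hypothetical, but the path layout is the stock 2.6 one): the active elevator is the bracketed entry in /sys/block/<dev>/queue/scheduler, and writing a name to that file switches it.

```python
def active_elevator(scheduler_line: str) -> str:
    """The kernel marks the active elevator with brackets, e.g.
    'noop anticipatory [deadline] cfq' -> 'deadline'."""
    for word in scheduler_line.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no active elevator found")

def scheduler_path(dev: str) -> str:
    # Runtime equivalent of booting with elevator=deadline:
    #   echo deadline > /sys/block/sdc/queue/scheduler
    return f"/sys/block/{dev}/queue/scheduler"

print(active_elevator("noop anticipatory [deadline] cfq"))
print(scheduler_path("sdc"))
```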

These read speeds are not too impressive.  Perhaps this is a slow
controller.  Alternatively, you might need bigger CPUs.

There are a lot of possibilities, obviously :)  I'd start with the
elevator, since that's easily tested.

-jwb




Re: [PERFORM] Need for speed

2005-08-17 Thread Jeffrey W. Baker
On Wed, 2005-08-17 at 11:15 +0200, Ulrich Wisser wrote:
 Hello,
 
 thanks for all your suggestions.
 
 I can see that the Linux system is 90% waiting for disc io. At that time 
 all my queries are *very* slow. My scsi raid controller and disc are 
 already the fastest available.

What RAID controller?  Initially you said you have only 2 disks, and
since you have your xlog on a separate spindle, I assume you have 1 disk
for the xlog and 1 for the data.  Even so, if you have a RAID, I'm going
to further assume you are using RAID 1, since no sane person would use
RAID 0.  In those cases you are getting the performance of a single
disk, which is never going to be very impressive.  You need a RAID.

Please be more precise when describing your system to this list.

-jwb




Re: [PERFORM] Need for speed

2005-08-16 Thread Jeffrey W. Baker
On Tue, 2005-08-16 at 17:39 +0200, Ulrich Wisser wrote:
 Hello,
 
 one of our services is click counting for on line advertising. We do 
 this by importing Apache log files every five minutes. This results in a 
 lot of insert and delete statements. At the same time our customers 
 shall be able to do on line reporting.
 
 We have a box with
 Linux Fedora Core 3, Postgres 7.4.2
 Intel(R) Pentium(R) 4 CPU 2.40GHz

This is not a good CPU for this workload.  Try an Opteron or Xeon.  Also
of major importance is the amount of memory.  If possible, you would
like to have memory larger than the size of your database.

 2 scsi 76GB disks (15.000RPM, 2ms)

If you decide your application is I/O bound, here's an obvious place for
improvement.  More disks == faster.

 I did put pg_xlog on another file system on other discs.

Did that have a beneficial effect?

 Still when several users are on line the reporting gets very slow. 
 Queries can take more than 2 min.

Is this all the time or only during the insert?

 I need some ideas how to improve performance in some orders of 
 magnitude. I already thought of a box with the whole database on a ram 
 disc. So really any idea is welcome.

You don't need a RAM disk, just a lot of RAM.  Your operating system
will cache disk contents in memory if possible.  You have a very small
configuration, so more CPU, more memory, and especially more disks will
probably all yield improvements.



Re: [SPAM?] Re: [PERFORM] PG8 Tuning

2005-08-11 Thread Jeffrey W. Baker
On Fri, 2005-08-12 at 08:47 +, Steve Poe wrote:
 Paul,
 
 Before I say anything else, one online document which may be of
 assistance to you is:
 http://www.powerpostgresql.com/PerfList/
 
 Some thoughts I have:
 
 3) Your shared RAM setting seems overkill to me. Part of the challenge
 is you're going from 1000 to 262K with no assessment in between. Each
 situation can be different, but try in the range of 10 - 50K.
 
 4) pg_xlog: If your pg_xlog is on a spindle that is *only* for pg_xlog,
 you're better off.

Like Mr. Stone said earlier, this is pure dogma.  In my experience,
xlogs on the same volume with data is much faster if both are on
battery-backed write-back RAID controller memory.  Moving from this
situation to xlogs on a single normal disk is going to be much slower in
most cases.

-jwb



Re: [PERFORM] Performance problems on 4/8way Opteron (dualcore) HP

2005-07-29 Thread Jeffrey W. Baker
On Fri, 2005-07-29 at 10:46 -0700, Josh Berkus wrote:
 Dirk,
 
  does anybody have expierence with this machine (4x 875 dual core Opteron
  CPUs)?

I'm using dual 275s without problems.

 Nope.   I suspect that you may be the first person to report in on 
 dual-cores.  There may be special compile issues with dual-cores that 
 we've not yet encountered.

Doubtful.  However you could see improvements using recent Linux kernel
code.  There have been some patches for optimizing scheduling and memory
allocations.

However, if you are running this machine in 32-bit mode, why did you
bother paying $14,000 for your CPUs?  You will get FAR better
performance in 64-bit mode.  64-bit mode will give you 30-50% better
performance on PostgreSQL loads, in my experience.  Also, if I remember
correctly, the 32-bit x86 kernel doesn't understand Opteron NUMA
topology, so you may be seeing poor memory allocation decisions.

-jwb

  We run RHEL 3.0, 32bit and under high load it is a drag. We 
  mostly run memory demanding queries. Context switches are pretty much
  around 20,000 on average, no cs spikes when we run many processes in
  parallel. Actually we only see two processes in running state! When
  there are only a few processes running context switches go much higher.
  At the moment we are much slower that with a 4way XEON box (DL580).
 
 Um, that was a bit incoherent.  Are you seeing a CS storm or aren't you?
 



Re: [PERFORM] Cheap RAM disk?

2005-07-26 Thread Jeffrey W. Baker
On Tue, 2005-07-26 at 11:34 -0500, John A Meinel wrote:
 I saw a review of a relatively inexpensive RAM disk over at 
 anandtech.com, the Gigabyte i-RAM
 http://www.anandtech.com/storage/showdoc.aspx?i=2480
 
 Basically, it is a PCI card, which takes standard DDR RAM, and has a 
 SATA port on it, so that to the system, it looks like a normal SATA drive.
 
 The card costs about $100-150, and you fill it with your own ram, so for 
 a 4GB (max size) disk, it costs around $500. Looking for solid state 
 storage devices, the cheapest I found was around $5k for 2GB.
 
 Gigabyte claims that the battery backup can last up to 16h, which seems 
 decent, if not really long (the $5k solution has a built-in harddrive so 
 that if the power goes out, it uses the battery power to copy the 
 ramdisk onto the harddrive for more permanent storage).
 
 Anyway, would something like this be reasonable as a drive for storing 
 pg_xlog? With 4GB you could have as many as 256 checkpoint segments.

I haven't tried this product, but the microbenchmarks seem truly slow.
I think you would get a similar benefit by simply sticking a 1GB or 2GB
DIMM -- battery-backed, of course -- in your RAID controller.

-jwb



Re: [PERFORM] Huge performance problem between 7.4.1 and 8.0.3 -

2005-07-17 Thread Jeffrey W. Baker
On Sun, 2005-07-17 at 21:34 -0600, Robert Creager wrote:
 Sigh...
 
 I recently upgraded from 7.4.1 to 8.0.3.  The application did not change.  I'm
 now running both databases concurrently (on different ports, same machine) just
 so I could verify the problem really exists.
 
 The application is a custom test application for testing mechanical systems.
 The runs in question (4 at a time) each generate 16 queries at a time, the
 results of which are sent to the mechanical system, which processes them
 anywhere from 10 to 120 seconds.  The system is capable of
 completing between 4 and 8 jobs at once.  So, once the system is running, at
 most there will be 8 queries per run simultaneously.
 
 The entire database fits into RAM (2Gb), as evidenced by no disk activity and
 relatively small database size.  pg_xlog is on different disks from the db.
 
 The problem is that on version 8.0.3, once I get 3 or more concurrent runs
 going, the query times start tanking (20 seconds).  On 7.4.1, the 
 applications
 hum along with queries typically below .2 seconds on over 5 concurrent runs. 
 Needless to say, 7.4.1 behaves as expected...  The only change between runs is
 the port connecting to.  Both DBs are up at the same time.
 
 For 8.0.3, pg_autovacuum is running.  On 7.4.1, I set up a cron job to vacuum
 analyze every 5 minutes.
 
 The system is Mandrake Linux running a 2.4.22 kernel with dual Intel Xeon CPUs
 with HT enabled.  On an 8.0.3 run, the context switching is up around 60k.  On
 7.4.1, it maxes around 23k and averages < 1k.

Did you build 8.0.3 yourself, or install it from packages?  I've seen in
the past where pg would build with the wrong kind of mutexes on some
machines, and that would send the CS through the roof.  If you did build
it yourself, check your ./configure logs.  If not, try strace.

-jwb



[PERFORM] more filesystem benchmarks

2005-07-16 Thread Jeffrey W. Baker
In our last installment, we saw that JFS provides higher pgbench
performance than either XFS or ext3.  Using a direct-I/O patch stolen
from 8.1, JFS achieved 105 tps with 100 clients.

To refresh, the machine in question has 5 7200RPM SATA disks, an Areca
RAID controller with 128MB cache, and 1GB of main memory.  pgbench is
being run with a scale factor of 1000 and 100,000 total transactions.

At the suggestion of Andreas Dilger of clusterfs, I tried modulating the
size of the ext3 journal, and the mount options (data=journal,
writeback, and ordered).  It turns out that you can achieve a substantial
improvement (almost 50%) by simply mounting the ext3 volume with
data=writeback instead of data=ordered (the default).  Changing the
journal size did not seem to make a difference, except that 256MB is for
some reason pathological (9% slower than the best time).  128MB, the
default for a large volume, gave the same performance as 400MB (the max)
or 32MB.

In the end, the ext3 volume mounted with -o noatime,data=writeback
yielded 88 tps with 100 clients.  This is about 16% off the performance
of JFS with default options.

Andreas pointed me to experimental patches to ext3's block allocation
code and writeback strategy.  I will test these, but I expect the
database community, which seems so attached to its data, will be very
interested in code that has not yet entered mainstream use.

Another frequent suggestion is to put the xlog on a separate device.  I
tried this, and, for a given number of disks, it appears to be
counter-productive.  A RAID5 of 5 disks holding both logs and data is
about 15% faster than a RAID5 of 3 disks with the data, and a mirror of
two disks holding the xlog.

Here are the pgbench results for each permutation of ext3:

Journal Size (MB) | Journal Mode | 1 Client | 10 Clients | 100 Clients
------------------+--------------+----------+------------+------------
               32 | ordered      |       28 |         51 |          57
               32 | writeback    |       34 |         70 |          88
               64 | ordered      |       29 |         52 |          61
               64 | writeback    |       32 |         69 |          87
              128 | ordered      |       32 |         54 |          62
              128 | writeback    |       34 |         70 |          88
              256 | ordered      |       28 |         51 |          60
              256 | writeback    |       29 |         64 |          79
              400 | ordered      |       26 |         49 |          59
              400 | writeback    |       32 |         70 |          87
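
The writeback-vs-ordered gain can be checked straight from the table; a throwaway calculation using the rows for the default 128MB journal:

```python
# tps at (1, 10, 100) clients for the 128MB journal, copied from the table
ordered   = {1: 32, 10: 54, 100: 62}
writeback = {1: 34, 10: 70, 100: 88}

for clients in (1, 10, 100):
    gain = 100.0 * (writeback[clients] - ordered[clients]) / ordered[clients]
    print(f"{clients:>3} clients: data=writeback is {gain:.0f}% faster")
```

The gain grows with concurrency, which is consistent with ordered mode serializing data writes behind the journal commit.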

-jwb



Re: [PERFORM] Really bad diskio

2005-07-15 Thread Jeffrey W. Baker
On Fri, 2005-07-15 at 15:04 -0600, Ron Wills wrote:
 At Fri, 15 Jul 2005 13:45:07 -0700,
 Joshua D. Drake wrote:
  
  Ron Wills wrote:
   Hello all
   
 I'm running a postgres 7.4.5, on a dual 2.4Ghz Athlon, 1Gig RAM and
   an 3Ware SATA raid. 
  
  2 drives?
  4 drives?
  8 drives?
 
   3 drives raid 5. I don't believe it's the raid. I've tested this by
 moving the database to the mirrored software raid where the root is
 found and onto the SATA raid. Neither relieved the IO problems.

Hard or soft RAID?  Which controller?  Many of the 3Ware controllers
(85xx and 95xx) have extremely bad RAID 5 performance.

Did you take any pgbench or other benchmark figures before you started
using the DB?

-jwb



Re: [PERFORM] Really bad diskio

2005-07-15 Thread Jeffrey W. Baker
On Fri, 2005-07-15 at 15:29 -0600, Ron Wills wrote:
 Here's a bit of a dump of the system that should be useful.
 
 Processors x2:
 
 vendor_id   : AuthenticAMD
 cpu family  : 6
 model   : 8
 model name  : AMD Athlon(tm) MP 2400+
 stepping: 1
 cpu MHz : 2000.474
 cache size  : 256 KB
 
 MemTotal:   903804 kB
 
 Mandrake 10.0 Linux kernel 2.6.3-19mdk
 
 The raid controller, which is using the hardware raid configuration:
 
 3ware 9000 Storage Controller device driver for Linux v2.26.02.001.
 scsi0 : 3ware 9000 Storage Controller
 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xe802, IRQ: 17.
 3w-9xxx: scsi0: Firmware FE9X 2.02.00.011, BIOS BE9X 2.02.01.037, Ports: 4.
   Vendor: 3ware Model: Logical Disk 00   Rev: 1.00
   Type:   Direct-Access  ANSI SCSI revision: 00
 SCSI device sda: 624955392 512-byte hdwr sectors (319977 MB)
 SCSI device sda: drive cache: write back, no read (daft)
 
 This is also on a 3.6 reiser filesystem.
 
 Here's the iostat for 10mins every 10secs. I've removed the stats from
 the idle drives to reduce the size of this email.
 
 Linux 2.6.3-19mdksmp (photo_server)   07/15/2005
 
 avg-cpu:  %user   %nice   %sys  %iowait   %idle
            2.85    1.53   2.15    39.52   53.95
 
 Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
 sda  82.49  4501.73   188.38 1818836580   76110154
 
 avg-cpu:  %user   %nice   %sys  %iowait   %idle
            0.30    0.00   1.00    96.30    2.40
 
 Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
 sda  87.80  6159.20   340.00  61592   3400

These I/O numbers are not so horrible, really.  100% iowait is not
necessarily a symptom of misconfiguration.  It just means you are disk
limited.  With a database 20 times larger than main memory, this is no
surprise.

If I had to speculate about the best way to improve your performance, I
would say:

1a) Get a better RAID controller.  The 3ware hardware RAID5 is very bad.
1b) Get more disks.
2) Get a (much) newer kernel.
3) Try XFS or JFS.  Reiser3 has never looked good in my pgbench runs.

By the way, are you experiencing bad application performance, or are you
just unhappy with the iostat figures?

Regards,
jwb




[PERFORM] JFS fastest filesystem for PostgreSQL?

2005-07-14 Thread Jeffrey W. Baker
[reposted due to delivery error -jwb]

I just took delivery of a new system, and used the opportunity to
benchmark postgresql 8.0 performance on various filesystems.  The system
in question runs Linux 2.6.12, has one CPU and 1GB of system memory, and
5 7200RPM SATA disks attached to an Areca hardware RAID controller
having 128MB of cache.  The caches are all write-back.

I ran pgbench with a scale factor of 1000 and a total of 100,000
transactions per run.  I varied the number of clients between 10 and
100.  It appears from my test JFS is much faster than both ext3 and XFS
for this workload.  JFS and XFS were made with the mkfs defaults.  ext3
was made with -T largefile4 and -E stride=32.  The deadline scheduler
was used for all runs (anticipatory scheduler is much worse).
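
For anyone reproducing this, the setup above translates into pgbench invocations roughly like the following (the helper is mine; -i, -s, -c, and -t are standard pgbench flags, with -t counting transactions per client, so the total is divided across clients):

```python
def pgbench_cmds(scale=1000, total_tx=100_000, clients=(10, 100), db="pgbench"):
    """Build the init and run commands for the benchmark described above."""
    init = ["pgbench", "-i", "-s", str(scale), db]
    runs = []
    for c in clients:
        per_client = total_tx // c  # -t is per client, so divide the total
        runs.append(["pgbench", "-c", str(c), "-t", str(per_client), db])
    return init, runs

init, runs = pgbench_cmds()
print(" ".join(init))
for cmd in runs:
    print(" ".join(cmd))
```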

Here's the result, in transactions per second.

             ext3   jfs   xfs
-----------------------------
 10 Clients    55    81    68
100 Clients    61   100    64


-jwb



Re: [PERFORM] JFS fastest filesystem for PostgreSQL?

2005-07-14 Thread Jeffrey W. Baker
On Thu, 2005-07-14 at 10:03 +0200, Dawid Kuroczko wrote:
 On 7/14/05, Jeffrey W. Baker [EMAIL PROTECTED] wrote:
  [reposted due to delivery error -jwb]
  
  I just took delivery of a new system, and used the opportunity to
  benchmark postgresql 8.0 performance on various filesystems.  The system
  in question runs Linux 2.6.12, has one CPU and 1GB of system memory, and
  5 7200RPM SATA disks attached to an Areca hardware RAID controller
  having 128MB of cache.  The caches are all write-back.
  
  I ran pgbench with a scale factor of 1000 and a total of 100,000
  transactions per run.  I varied the number of clients between 10 and
  100.  It appears from my test JFS is much faster than both ext3 and XFS
  for this workload.  JFS and XFS were made with the mkfs defaults.  ext3
  was made with -T largefile4 and -E stride=32.  The deadline scheduler
  was used for all runs (anticipatory scheduler is much worse).
  
  Here's the result, in transactions per second.
  
               ext3   jfs   xfs
  -----------------------------
   10 Clients    55    81    68
  100 Clients    61   100    64
  
 
 If you still have a chance, could you do tests with other journaling
 options for ext3 (journal=writeback, journal=data)?  And could you
 give figures about performace of other IO elevators?  I mean, you
 wrote that anticipatory is much worse -- how much worse? :)  Could
 you give numbers for deadline,anticipatory,cfq elevators? :)
 
 And, additionally would it be possible to give numbers for bonnie++
 results?  To see how does pgbench to bonnie++ relate?

Phew, that's a lot of permutations.  At 20-30 minutes per run, I'm
thinking 5-8 hours or so.  Still, for you dear readers, I'll somehow
accomplish this tedious feat.

As for Bonnie, JFS is a good 60-80% faster than ext3.  See my message to
ext3-users yesterday.

Using bonnie++ with a 10GB fileset, in MB/s:

          ext3   jfs   xfs
Read       112   188   141
Write       97   157   167
Rewrite     51    71    60

-jwb


