Re: [PERFORM] Hardware/OS recommendations for large databases (5TB)

2005-11-15 Thread Claus Guttesen
 Does anyone have recommendations for hardware and/or OS to work with around
 5TB datasets?

Hardware-wise I'd say dual-core Opterons. One dual-core Opteron
performs better than two single-cores at the same speed. Tyan makes
some boards that have four sockets, thereby giving you 8 cpu's (if you
need that many). Sun and HP also make nice hardware, although the Tyan
board is more competitively priced.

OS-wise I would choose the FreeBSD amd64 port, but partitions larger
than 2 TB need some special care: use gpt rather than disklabel etc.,
and tools like fsck may not be able to completely check partitions
larger than 2 TB. Linux or Solaris with either LVM or Veritas FS
sound like candidates.

 I have been working with datasets no bigger than around 30GB, and that (I'm
 afraid to admit) has been in MSSQL.

Well, our data are just below 30 GB so I can't help you there :-)

regards
Claus

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Help speeding up delete

2005-11-15 Thread Magnus Hagander
 Because I think we need to.  The above would only delete rows 
 that have name = 'obsid' and value = 'oid080505'.  We need to 
 delete all rows that have the same ids as those rows.  
 However, from what you note, I bet we could do:
 
DELETE FROM tmp_table2 WHERE id IN
   (SELECT id FROM temp_table2 WHERE name = 'obsid' and 
 value= 'oid080505');
 
 However, even that seems to have a much higher cost than I'd expect:
 
lab.devel.configdb=# explain delete from tmp_table2 where id in
 (select id from tmp_table2 where name='obsid' and 
 value = 'oid080505');
NOTICE:  QUERY PLAN:
 
Seq Scan on tmp_table2  (cost=0.00..65705177237.26 rows=769844 width=6)
  SubPlan
    ->  Materialize  (cost=42674.32..42674.32 rows=38 width=50)
          ->  Seq Scan on tmp_table2  (cost=0.00..42674.32 rows=38 width=50)
 
EXPLAIN
 
 And, sure enough, it is taking an extraordinarily long time to
 run (more than 10 minutes so far, compared to <10 seconds for
 the select).  Is this really typical of deletes?  It appears
 (to me) to be the Seq Scan on tmp_table2 that is the killer
 here.  If we put an index on, would it help?  (The user
 claims she tried that and its EXPLAIN cost went even higher,
 but I haven't checked that...)


Earlier pg versions have always been bad at dealing with IN subqueries.
Try rewriting it as (with fixing any broken syntax, I'm not actually
testing this :P)

DELETE FROM tmp_table2 WHERE EXISTS 
 (SELECT * FROM tmp_table2 t2 WHERE t2.id=tmp_table2.id AND
t2.name='obsid' AND t2.value='oid080505')


I assume you do have an index on tmp_table2.id :-) And that it's
non-unique? (If it was unique, the previous simplification of the query
really should've worked..)

Do you also have an index on name,value or something like that, so you
get an index scan from it?
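
If not, something along these lines might be worth a try (index names
made up, and as above I'm not actually testing this):

   CREATE INDEX tmp_table2_id_idx ON tmp_table2 (id);
   CREATE INDEX tmp_table2_name_value_idx ON tmp_table2 (name, value);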

//Magnus

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Luke Lonergan
Adam,

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Claus Guttesen
 Sent: Tuesday, November 15, 2005 12:29 AM
 To: Adam Weisberg
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] Hardware/OS recommendations for large 
 databases ( 5TB)
 
  Does anyone have recommendations for hardware and/or OS to 
 work with 
  around 5TB datasets?
 
 Hardware-wise I'd say dual core opterons. One 
 dual-core-opteron performs better than two single-core at the 
 same speed. Tyan makes some boards that have four sockets, 
 thereby giving you 8 cpu's (if you need that many). Sun and 
 HP also makes nice hardware although the Tyan board is more 
 competetive priced.
 
 OS wise I would choose the FreeBSD amd64 port but 
 partititions larger than 2 TB needs some special care, using 
 gpt rather than disklabel etc., tools like fsck may not be 
 able to completely check partitions larger than 2 TB. Linux 
 or Solaris with either LVM or Veritas FS sounds like candidates.

I agree - you can get a very good one from www.acmemicro.com or
www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read
performance on these (with tuning) on Linux using the xfs filesystem,
which is one of the most critical factors for large databases.  

Note that you want to have your DBMS use all of the CPU and disk channel
bandwidth you have on each query, which takes a parallel database like
Bizgres MPP to achieve.

Regards,

- Luke


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (5TB)

2005-11-15 Thread Merlin Moncure
 Hardware-wise I'd say dual core opterons. One dual-core-opteron
 performs better than two single-core at the same speed. Tyan makes
 some boards that have four sockets, thereby giving you 8 cpu's (if you
 need that many). Sun and HP also makes nice hardware although the Tyan
 board is more competetive priced.

Just FYI: Tyan makes an 8-socket motherboard (up to 16 cores!):
http://www.swt.com/vx50.html

It can be loaded with up to 128 gb memory if all the sockets are filled
:).

Merlin

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Dave Cramer
Luke,

Have you tried the areca cards, they are slightly faster yet.

Dave

On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote:

 I agree - you can get a very good one from www.acmemicro.com or
 www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
 RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
 on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read
 performance on these (with tuning) on Linux using the xfs filesystem,
 which is one of the most critical factors for large databases.

 Note that you want to have your DBMS use all of the CPU and disk channel
 bandwidth you have on each query, which takes a parallel database like
 Bizgres MPP to achieve.

 Regards,

Re: [PERFORM] Help speeding up delete

2005-11-15 Thread Steve Wampler
Magnus Hagander wrote:
Because I think we need to.  The above would only delete rows 
that have name = 'obsid' and value = 'oid080505'.  We need to 
delete all rows that have the same ids as those rows.  
However, from what you note, I bet we could do:

   DELETE FROM tmp_table2 WHERE id IN
  (SELECT id FROM temp_table2 WHERE name = 'obsid' and 
value= 'oid080505');

However, even that seems to have a much higher cost than I'd expect:

   lab.devel.configdb=# explain delete from tmp_table2 where id in
(select id from tmp_table2 where name='obsid' and 
value = 'oid080505');
   NOTICE:  QUERY PLAN:

   Seq Scan on tmp_table2  (cost=0.00..65705177237.26 rows=769844 width=6)
     SubPlan
       ->  Materialize  (cost=42674.32..42674.32 rows=38 width=50)
             ->  Seq Scan on tmp_table2  (cost=0.00..42674.32 rows=38 width=50)

   EXPLAIN
...
 
 Earlier pg versions have always been bad at dealing with IN subqueries.
 Try rewriting it as (with fixing any broken syntax, I'm not actually
 testing this :P)
 
 DELETE FROM tmp_table2 WHERE EXISTS 
  (SELECT * FROM tmp_table2 t2 WHERE t2.id=tmp_table2.id AND
 t2.name='obsid' AND t2.value='oid080505')

Thanks - that looks *significantly* better:

   lab.devel.configdb=# explain delete from tmp_table2 where exists
   (select 1 from tmp_table2 t2 where
   t2.id=tmp_table2.id and
   t2.name='obsid' and t2.value='oid080505');
   NOTICE:  QUERY PLAN:

   Seq Scan on tmp_table2  (cost=0.00..9297614.80 rows=769844 width=6)
     SubPlan
       ->  Index Scan using inv_index_2 on tmp_table2 t2  (cost=0.00..6.02 rows=1 width=0)

   EXPLAIN

(This is after putting an index on the (id,name,value) tuple.)  That outer seq
scan is still annoying, but maybe this will be fast enough.
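
For reference, the index statement was something like the following
(column order per the tuple above; the exact definition is from memory,
so treat it as approximate):

   CREATE INDEX inv_index_2 ON tmp_table2 (id, name, value);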

I've passed this on, along with the (strong) recommendation that they
upgrade PG.

Thanks!!

-- 
Steve Wampler -- [EMAIL PROTECTED]
The gods that smiled on your birth are now laughing out loud.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Luke Lonergan



Dave,

  
  
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of Dave Cramer
  Sent: Tuesday, November 15, 2005 6:15 AM
  To: Luke Lonergan
  Cc: Adam Weisberg; pgsql-performance@postgresql.org
  Subject: Re: [PERFORM] Hardware/OS recommendations for large databases (

  Luke,

  Have you tried the areca cards, they are slightly faster yet.

No, I've been curious since I read an earlier posting here.  I've had a
lot more experience with the 3Ware cards, mostly good, and they've been
doing a lot of volume with Rackable/Yahoo, which gives me some more
confidence.

The new 3Ware 9550SX cards use a PowerPC for checksumming, so their write 
performance is now up to par with the best cards I believe. We find that 
you still need to set Linux readahead to at least 8MB (blockdev --setra) to get 
maximum read performance on them, is that your experience with the Arecas? 
We get about 260MB/s read on 8 drives in RAID5 without the readahead tuning and 
about 400MB/s with it.
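
Concretely, something like this (assuming the array shows up as /dev/sda;
the value is in 512-byte sectors, so 16384 = 8MB):

   blockdev --setra 16384 /dev/sda
   blockdev --getra /dev/sda     # verify the new setting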

- Luke


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Luke Lonergan
Merlin, 

 just FYI: tyan makes a 8 socket motherboard (up to 16 cores!):
 http://www.swt.com/vx50.html
 
 It can be loaded with up to 128 gb memory if all the sockets 
 are filled :).

Cool!

Just remember that you can't get more than 1 CPU working on a query at a
time without a parallel database.

- Luke


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Luke Lonergan
Merlin,

  just FYI: tyan makes a 8 socket motherboard (up to 16 cores!):
  http://www.swt.com/vx50.html
  
  It can be loaded with up to 128 gb memory if all the sockets are 
  filled :).

Another thought - I priced out a maxed out machine with 16 cores and
128GB of RAM and 1.5TB of usable disk - $71,000.

You could instead buy 8 machines that total 16 cores, 128GB RAM and 28TB
of disk for $48,000, and it would be 16 times faster in scan rate, which
is the most important factor for large databases.  The size would be 16
rack units instead of 5, and you'd have to add a GigE switch for $1500.

Scan rate for above SMP: 200MB/s

Scan rate for above cluster: 3,200MB/s

You could even go dual core and double the memory on the cluster and
you'd about match the price of the god box.

- Luke


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Michael Stone

On Tue, Nov 15, 2005 at 09:33:25AM -0500, Luke Lonergan wrote:

write performance is now up to par with the best cards I believe.  We
find that you still need to set Linux readahead to at least 8MB
(blockdev --setra) to get maximum read performance on them, is that your


What on earth does that do to your seek performance?

Mike Stone

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases ( 5TB)

2005-11-15 Thread Merlin Moncure
 Merlin,
 
   just FYI: tyan makes a 8 socket motherboard (up to 16 cores!):
   http://www.swt.com/vx50.html
  
   It can be loaded with up to 128 gb memory if all the sockets are
   filled :).
 
 Another thought - I priced out a maxed out machine with 16 cores and
 128GB of RAM and 1.5TB of usable disk - $71,000.
 
 You could instead buy 8 machines that total 16 cores, 128GB RAM and 28TB
 of disk for $48,000, and it would be 16 times faster in scan rate, which
 is the most important factor for large databases.  The size would be 16
 rack units instead of 5, and you'd have to add a GigE switch for $1500.
 
It's hard to say what would be better.  My gut says the 5u box would be
a lot better at handling high cpu/high concurrency problems...like your
typical business erp backend.  This is pure speculation of course...I'll
defer to the experts here.

Merlin


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Hardware/OS recommendations for large databases ( 5TB)

2005-11-15 Thread Adam Weisberg
Luke,

-Original Message-
From: Luke Lonergan [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 15, 2005 7:10 AM
To: Adam Weisberg
Cc: pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Hardware/OS recommendations for large databases (
5TB)

Adam,

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Claus 
 Guttesen
 Sent: Tuesday, November 15, 2005 12:29 AM
 To: Adam Weisberg
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] Hardware/OS recommendations for large databases

 ( 5TB)
 
  Does anyone have recommendations for hardware and/or OS to
 work with
  around 5TB datasets?
 
 Hardware-wise I'd say dual core opterons. One dual-core-opteron 
 performs better than two single-core at the same speed. Tyan makes 
 some boards that have four sockets, thereby giving you 8 cpu's (if you

 need that many). Sun and HP also makes nice hardware although the Tyan

 board is more competetive priced.
 
 OS wise I would choose the FreeBSD amd64 port but partititions larger 
 than 2 TB needs some special care, using gpt rather than disklabel 
 etc., tools like fsck may not be able to completely check partitions 
 larger than 2 TB. Linux or Solaris with either LVM or Veritas FS 
 sounds like candidates.

I agree - you can get a very good one from www.acmemicro.com or
www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read
performance on these (with tuning) on Linux using the xfs filesystem,
which is one of the most critical factors for large databases.  

Note that you want to have your DBMS use all of the CPU and disk channel
bandwidth you have on each query, which takes a parallel database like
Bizgres MPP to achieve.

Regards,

- Luke


The What's New FAQ for PostgreSQL 8.1 says the buffer manager for 8.1
has been enhanced to scale almost linearly with the number of
processors, leading to significant performance gains on 8-way, 16-way,
dual-core, and multi-core CPU servers.

Why not just use it as-is?

Cheers,

Adam

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Luke Lonergan
Because only 1 cpu is used on each query.
- Luke
--
Sent from my BlackBerry Wireless Device


-Original Message-
From: Adam Weisberg [EMAIL PROTECTED]
To: Luke Lonergan [EMAIL PROTECTED]
CC: pgsql-performance@postgresql.org pgsql-performance@postgresql.org
Sent: Tue Nov 15 10:40:53 2005
Subject: RE: [PERFORM] Hardware/OS recommendations for large databases ( 5TB)

Luke,

-Original Message-
From: Luke Lonergan [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 15, 2005 7:10 AM
To: Adam Weisberg
Cc: pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Hardware/OS recommendations for large databases (
5TB)

Adam,

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Claus 
 Guttesen
 Sent: Tuesday, November 15, 2005 12:29 AM
 To: Adam Weisberg
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] Hardware/OS recommendations for large databases

 ( 5TB)
 
  Does anyone have recommendations for hardware and/or OS to
 work with
  around 5TB datasets?
 
 Hardware-wise I'd say dual core opterons. One dual-core-opteron 
 performs better than two single-core at the same speed. Tyan makes 
 some boards that have four sockets, thereby giving you 8 cpu's (if you

 need that many). Sun and HP also makes nice hardware although the Tyan

 board is more competetive priced.
 
 OS wise I would choose the FreeBSD amd64 port but partititions larger 
 than 2 TB needs some special care, using gpt rather than disklabel 
 etc., tools like fsck may not be able to completely check partitions 
 larger than 2 TB. Linux or Solaris with either LVM or Veritas FS 
 sounds like candidates.

I agree - you can get a very good one from www.acmemicro.com or
www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read
performance on these (with tuning) on Linux using the xfs filesystem,
which is one of the most critical factors for large databases.  

Note that you want to have your DBMS use all of the CPU and disk channel
bandwidth you have on each query, which takes a parallel database like
Bizgres MPP to achieve.

Regards,

- Luke


The What's New FAQ for PostgreSQL 8.1 says the buffer manager for 8.1
has been enhanced to scale almost linearly with the number of
processors, leading to significant performance gains on 8-way, 16-way,
dual-core, and multi-core CPU servers.

Why not just use it as-is?

Cheers,

Adam


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Performance PG 8.0 on dual opteron / 4GB / 3ware

2005-11-15 Thread Joost Kraaijeveld
Hi Dave,

On Mon, 2005-11-14 at 18:51 -0500, Dave Cramer wrote:
 Joost,
 
 I've got experience with these controllers and which version do you  
 have. I'd expect to see higher than 50MB/s although I've never tried  
 RAID 5
 
 I routinely see closer to 100MB/s with RAID 1+0 on their 9000 series
OK, then there must be hope.

 I would also suggest that shared buffers should be higher than 7500,
 closer to 30000, and effective cache should be up around 200k
In my current 8.1 situation I use shared_buffers = 40000,
effective_cache_size = 131072.
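
I.e. in postgresql.conf (the comments are just my arithmetic at 8 KB
pages):

   shared_buffers = 40000            # ~320 MB
   effective_cache_size = 131072     # ~1 GB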

 work_mem is awfully high, remember that this will be given to each  
 and every connection and can be more than 1x this number per  
 connection depending on the number of sorts
 done in the query.
I use such a high number because I am the only user querying and my
queries do sorted joins etc. 


 fsync=false ? I'm not even sure why we have this option, but I'd  
 never set it to false.
I want as much speed as possible for a database conversion that MUST be
handled in 1 weekend (at the current speed it would take almost 7
centuries; I may be off by a millennium). If it fails because of a
hardware problem (the only reason we want and need fsync?) we will try
again next weekend until it finally goes right.

What I can see is that only the *write* performance of *long updates*
(and not inserts) is slow, and it gets slower over time: the first few
thousand go relatively fast, after that PostgreSQL crawls to a halt
(other benchmarks like bonnie++ or just dd'ing a big file don't show
this behavior).

I did notice that changing the I/O scheduler's nr_requests from the
default 128 to 1024 or even 4096 made a remarkable performance
improvement. I suspect that experimenting with other I/O schedulers
could improve performance. But it is hard to find any useful
documentation about I/O schedulers.
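
For what it's worth, I change it through sysfs with something like the
following (assuming a 2.6 kernel and the array showing up as sda):

   echo 1024 > /sys/block/sda/queue/nr_requests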



-- 
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: [EMAIL PROTECTED]
web: www.askesis.nl 



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Performance PG 8.0 on dual opteron / 4GB / 3ware

2005-11-15 Thread Luke Lonergan



Joost,

On 11/15/05 8:35 AM, Joost Kraaijeveld [EMAIL PROTECTED] wrote:

thousand go relatively fast, after that PostgreSQL crawls to a halt
(other benchmarks like bonnie++ or just dd'ing a big file don't have
this behavior).

With RAID5, it could matter a lot what block size you run your "dd bigfile" test with. You should run "dd if=/dev/zero of=bigfile bs=8k count=500000" for a 2GB main memory machine, multiplying the count by (your mem/2GB).

It is very important with the 3Ware cards to match the driver to the firmware revision.

I did notice that changing the I/O scheduler's nr_request from the
default 128 to 1024 or even 4096 made a remarkable performance
improvement. I suspect that experimenting with other I/O schedululers
could improve performance. But it is hard to find any useful
documentation about I/O schedulers.

You could try deadline, there's no harm, but I've found that when you reach the point of experimenting with schedulers, you are probably not addressing the real problem.
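
If you do want to try it, on a 2.6 kernel it is a runtime switch, something like this (assuming the data array is sda):

   echo deadline > /sys/block/sda/queue/scheduler
   cat /sys/block/sda/queue/scheduler     # the active scheduler is shown in brackets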

On a 3Ware 9500 with HW RAID5 and 4 or more disks I think you should get 100MB/s write rate, which is double what Postgres can use. We find that Postgres, even with fsync=false, will only run at a net COPY speed of about 8-12 MB/s, where 12 is the Bizgres number. 8.1 might do 10. But to get the 10 or 12, the WAL writing and other writing is about 4-5X more than the net write speed, or the speed at which the input file is parsed and read into the database.

So, if you can get your "dd bigfile" test to write data at 50MB/s+ with a blocksize of 8KB, you should be doing well enough.

Incidentally, we also find that using the XFS filesystem and setting the readahead to 8MB or more is extremely beneficial for performance with the 3Ware cards (and with others, but especially for the older 3Ware cards).

Regards,

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Luke Lonergan



Mike, 

On 11/15/05 6:55 AM, Michael Stone [EMAIL PROTECTED] wrote:

On Tue, Nov 15, 2005 at 09:33:25AM -0500, Luke Lonergan wrote:
write performance is now up to par with the best cards I believe. We
find that you still need to set Linux readahead to at least 8MB
(blockdev --setra) to get maximum read performance on them, is that your

What on earth does that do to your seek performance?

We're in decision support, as is our poster here, so seek isn't the issue, it's sustained sequential transfer rate that we need. At 8MB, I'd not expect too much damage though - the default is 1.5MB.

- Luke








Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Luke Lonergan



Merlin,

On 11/15/05 7:20 AM, Merlin Moncure [EMAIL PROTECTED] wrote:

It's hard to say what would be better. My gut says the 5u box would be
a lot better at handling high cpu/high concurrency problems...like your
typical business erp backend. This is pure speculation of course...I'll
defer to the experts here.

With Oracle RAC, which is optimized for OLTP and uses a shared memory caching model, maybe or maybe not. I'd put my money on the SMP in that case as you suggest, but what happens when the OS dies?

For data warehousing, OLAP and decision support applications, RAC and other shared memory/disk architectures don't do you any good and the SMP machine is better by a bit.

However, if you have an MPP database, where disk and memory are not shared, then the SMP machine is tens or hundreds of times slower than the cluster of the same price.

- Luke 






Re: [PERFORM] Hardware/OS recommendations for large databases ( 5TB)

2005-11-15 Thread William Yu

Merlin Moncure wrote:

You could instead buy 8 machines that total 16 cores, 128GB RAM and
 
It's hard to say what would be better.  My gut says the 5u box would be

a lot better at handling high cpu/high concurrency problems...like your
typical business erp backend.  This is pure speculation of course...I'll
defer to the experts here.


In this specific case (data warehouse app), multiple machines is the 
better bet. Load data on 1 machine, copy to other servers and then use a 
middleman to spread out SQL statements to each machine.


I was going to suggest pgpool as the middleman but I believe it's 
limited to 2 machines max at this time. I suppose you could daisy chain 
pgpools running on every machine.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Performance PG 8.0 on dual opteron / 4GB / 3ware

2005-11-15 Thread Joost Kraaijeveld
Hi Luke,

On Tue, 2005-11-15 at 10:42 -0800, Luke Lonergan wrote:
 With RAID5, it could matter a lot what block size you run your “dd
 bigfile” test with.  You should run “dd if=/dev/zero of=bigfile bs=8k
 count=500000” for a 2GB main memory machine, multiply the count by
 (your mem/2GB).
If I understand correctly (I have 4GB ram):

[EMAIL PROTECTED]:~/tmp$ dd if=/dev/zero of=bigfile bs=8k count=1000000
1000000+0 records in
1000000+0 records out
8192000000 bytes transferred in 304.085269 seconds (26939812 bytes/sec)

Which looks suspicious: 26308 MB/sec???

 It is very important with the 3Ware cards to match the driver to the
 firmware revision.
OK, I am running 1 driver behind the firmware.
   
 I did notice that changing the I/O scheduler's nr_request from
 the
 default 128 to 1024 or even 4096 made a remarkable performance
 improvement. I suspect that experimenting with other I/O
 schedululers
 could improve performance. But it is hard to find any useful
 documentation about I/O schedulers.
 
 You could try deadline, there’s no harm, but I’ve found that when you
 reach the point of experimenting with schedulers, you are probably not
 addressing the real problem.
It depends. I/O schedulers (I assume) have a purpose: some schedulers
should be more appropriate for some circumstances. And maybe my specific
circumstance (converting a database with *many updates*) is such a
circumstance. I really don't know.

 On a 3Ware 9500 with HW RAID5 and 4 or more disks I think you should
 get 100MB/s write rate, which is double what Postgres can use.  We
 find that Postgres, even with fsync=false, will only run at a net COPY
 speed of about 8-12 MB/s, where 12 is the Bizgres number.  8.1 might
 do 10.  But to get the 10 or 12, the WAL writing and other writing is
 about 4-5X more than the net write speed, or the speed at which the
 input file is parsed and read into the database.
As I have an (almost) separate WAL disk: iostat does not show any
significant writing on the WAL disk.

 So, if you can get your “dd bigfile” test to write data at 50MB/s+
 with a blocksize of 8KB, you should be doing well enough.
See above.

 Incidentally, we also find that using the XFS filesystem and setting
 the readahead to 8MB or more is extremely beneficial for performance
 with the 3Ware cards (and with others, but especially for the older
 3Ware cards).
I don't have problems with my read performance but *only* with my
*update* performance (and not even insert performance). But then again I
am not the only one with these problems:

http://www.issociate.de/board/goto/894541/3ware_+_RAID5_+_xfs_performance.html#msg_894541
http://lkml.org/lkml/2005/4/20/110
http://seclists.org/lists/linux-kernel/2005/Oct/1171.html

I am happy to share the tables against which I am running my checks

-- 
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: [EMAIL PROTECTED]
web: www.askesis.nl 



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[PERFORM] Too Many OR's?

2005-11-15 Thread Bill McGonigle
I have a query that's making the planner do the wrong thing (for my 
definition of wrong) and I'm looking for advice on what to tune to make 
it do what I want.


The query consists of SELECT'ing a few fields from a table for a large 
number of rows.  The table has about seventy thousand rows and the user 
is selecting some subset of them.  I first do a SELECT...WHERE to 
determine the unique identifiers I want (works fine) and then I do a 
SELECT WHERE IN giving the list of id's I need additional data on 
(which I see from EXPLAIN just gets translated into a very long list of 
OR's).
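
Schematically it looks like this (table and column names here are made
up, not my real schema):

   SELECT id FROM things WHERE <some criteria>;
   SELECT id, field1, field2 FROM things WHERE id IN (101, 102, 103, ...);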


Everything works perfectly until I get to 65301 rows.  At 65300 rows, 
it does an index scan and takes 2197.193 ms.  At 65301 rows it switches 
to a sequential scan and takes 778951.556 ms.  Values known not to 
affect this are: work_mem, effective_cache_size.  Setting 
random_page_cost from 4 to 1 helps (79543.214 ms) but I'm not really 
sure what '1' means, except it's relative.  Of course, setting 
'enable_seqscan false' helps immensely (2337.289 ms) but that's as 
inelegant of a solution as I've found - if there were other databases 
on this install that wouldn't be the right approach.


Now I can break this down into multiple SELECT's in code, capping each 
query at 65300 rows, and that's a usable workaround, but academically 
I'd like to know how to convince the planner to do it my way.  It's 
making a bad guess about something but I'm not sure what.  I didn't see 
any hard-coded limits grepping through the source (though it is close 
to the 16-bit unsigned boundary - probably coincidental) so if anyone 
has ideas or pointers to how I might figure out what's going wrong that 
would be helpful.


Thanks,
-Bill

-
Bill McGonigle, Owner   Work: 603.448.4440
BFC Computing, LLC  Home: 603.448.1668
[EMAIL PROTECTED]   Mobile: 603.252.2606
http://www.bfccomputing.com/    Pager: 603.442.1833
Jabber: [EMAIL PROTECTED]  Text: [EMAIL PROTECTED]
Blog: http://blog.bfccomputing.com/


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Performance PG 8.0 on dual opteron / 4GB / 3ware

2005-11-15 Thread Steve Wampler
Joost Kraaijeveld wrote:
 If I understand correctly (I have 4GB ram):
 
 [EMAIL PROTECTED]:~/tmp$ dd if=/dev/zero of=bigfile bs=8k count=1000000
 1000000+0 records in
 1000000+0 records out
 8192000000 bytes transferred in 304.085269 seconds (26939812 bytes/sec)
 
 Which looks suspicious: 26308 MB/sec???

Eh?  That looks more like ~25.7 MB/sec, assuming 1MB = 1024*1024 bytes.

-- 
Steve Wampler -- [EMAIL PROTECTED]
The gods that smiled on your birth are now laughing out loud.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Performance PG 8.0 on dual opteron / 4GB / 3ware

2005-11-15 Thread Joost Kraaijeveld
On Tue, 2005-11-15 at 12:41 -0700, Steve Wampler wrote:
 Joost Kraaijeveld wrote:
  If I understand correctly (I have 4GB ram):
  
  [EMAIL PROTECTED]:~/tmp$ dd if=/dev/zero of=bigfile bs=8k count=1000000
  1000000+0 records in
  1000000+0 records out
  8192000000 bytes transferred in 304.085269 seconds (26939812 bytes/sec)
  
  Which looks suspicious: 26308 MB/sec???
 
 Eh?  That looks more like ~25.7 MB/sec, assuming 1MB = 1024*1024 bytes.
Oooops. This calculation error is not typical for my testing (I think ;-)).

-- 
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: [EMAIL PROTECTED]
web: www.askesis.nl 



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Too Many OR's?

2005-11-15 Thread Scott Marlowe
On Tue, 2005-11-15 at 13:12, Bill McGonigle wrote:
 I have a query that's making the planner do the wrong thing (for my 
 definition of wrong) and I'm looking for advice on what to tune to make 
 it do what I want.
 
 The query consists or SELECT'ing a few fields from a table for a large 
 number of rows.  The table has about seventy thousand rows and the user 
 is selecting some subset of them.  I first do a SELECT...WHERE to 
 determine the unique identifiers I want (works fine) and then I do a 
 SELECT WHERE IN giving the list of id's I need additional data on 
 (which I see from EXPLAIN just gets translated into a very long list of 
 OR's).
 
 Everything works perfectly until I get to 65301 rows.  At 65300 rows, 
 it does an index scan and takes 2197.193 ms.  At 65301 rows it switches 
 to a sequential scan and takes 778951.556 ms.  Values known not to 
 affect this are: work_mem, effective_cache_size.  Setting 
 random_page_cost from 4 to 1 helps (79543.214 ms) but I'm not really 
 sure what '1' means, except it's relative.  Of course, setting 
 'enable_seqscan false' helps immensely (2337.289 ms) but that's as 
 inelegant of a solution as I've found - if there were other databases 
 on this install that wouldn't be the right approach.
 
 Now I can break this down into multiple SELECT's in code, capping each 
 query at 65300 rows, and that's a usable workaround, but academically 
 I'd like to know how to convince the planner to do it my way.  It's 
 making a bad guess about something but I'm not sure what.  I didn't see 
 any hard-coded limits grepping through the source (though it is close 
 to the 16-bit unsigned boundry - probably coincidental) so if anyone 
 has ideas or pointers to how I might figure out what's going wrong that 
 would be helpful.

OK, there IS a point at which switching to a sequential scan will be
fast, i.e. when you're getting everything in the table.  But the
database is picking a switchover point that is too low.

First, we need to know if the statistics are giving the query planner a
good enough idea of how many rows it's really gonna get versus how many
it expects.

Do an explain <your query here> and see how many rows it thinks it's
gonna get.  Since you've actually run it, you know how many it really is
going to get, so there's no need for an explain analyze <your query
here> just yet.

Now, as long as the approximation is pretty close, fine.  But if it's
off by factors, then we need to increase the statistics target on that
column, with:

ALTER TABLE name ALTER columnname SET STATISTICS xxx

where xxx is the new number.  The default is set in your postgresql.conf
file, and is usually pretty low, say 10.  You can go up to 1000, but
that makes query planning take longer.  Try some incremental increase to
say 20 or 40 or even 100, and run analyze on that table then do an
explain on it again until the estimate is close.
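
So the loop is something like this (table and column names are
placeholders for your own):

   ALTER TABLE mytable ALTER mycolumn SET STATISTICS 100;
   ANALYZE mytable;
   EXPLAIN SELECT ... FROM mytable WHERE mycolumn IN (...);
   -- compare the estimated rows to the real count, repeat as needed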

Once the estimate is close, you can change random_page_cost to get the
query planner to switch at the right time.  Change the number of in()
values and play with random_page_cost and see where that sweet spot
is.  Note that what seems right on a single table for a single user may
not be best as you increase load or access other tables.
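
You can experiment per session without touching postgresql.conf,
something like:

   SET random_page_cost = 2;
   EXPLAIN SELECT ... ;  -- retry with different values and IN() list sizes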

random_page_cost represents the cost of a random access relative to a
sequential access.  As long as your data fits in RAM, the difference is
pretty much none (i.e. random_page_cost=1), so don't set it too low, or
accessing REALLY large data sets could become REALLY slow, as it uses
indexes when it should have been sequentially scanning.

Also, check what you've got effective_cache_size set to.  This tells
postgresql how much memory your kernel is using for cache, and so lets
it estimate how likely it is that the data your current query needs is
already in there.

Also, read this:

http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Hardware/OS recommendations for large databases ( 5TB)

2005-11-15 Thread James Mello
Unless there was a way to guarantee consistency, it would be hard at
best to make this work. Convergence on large data sets across boxes is
non-trivial, and diffing databases is difficult at best. Unless there
was some form of automated way to ensure consistency, going 8 ways into
separate boxes is *very* hard. I do suppose that if you have fancy
storage (EMC, Hitachi) you could do BCV or Shadow copies. But in terms
of commodity stuff, I'd have to agree with Merlin. 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of William Yu
Sent: Tuesday, November 15, 2005 10:57 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Hardware/OS recommendations for large databases (
5TB)

Merlin Moncure wrote:
You could instead buy 8 machines that total 16 cores, 128GB RAM and
  
 It's hard to say what would be better.  My gut says the 5u box would 
 be a lot better at handling high cpu/high concurrency problems...like 
 your typical business erp backend.  This is pure speculation of 
 course...I'll defer to the experts here.

In this specific case (data warehouse app), multiple machines is the
better bet. Load data on 1 machine, copy to other servers and then use a
middleman to spread out SQL statements to each machine.

I was going to suggest pgpool as the middleman but I believe it's
limited to 2 machines max at this time. I suppose you could daisy chain
pgpools running on every machine.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Alex Turner
On 11/15/05, Luke Lonergan [EMAIL PROTECTED] wrote:
 Adam,

  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of
  Claus Guttesen
  Sent: Tuesday, November 15, 2005 12:29 AM
  To: Adam Weisberg
  Cc: pgsql-performance@postgresql.org
  Subject: Re: [PERFORM] Hardware/OS recommendations for large
  databases ( 5TB)
 
   Does anyone have recommendations for hardware and/or OS to
  work with
   around 5TB datasets?
 
  Hardware-wise I'd say dual core opterons. One
  dual-core-opteron performs better than two single-core at the
  same speed. Tyan makes some boards that have four sockets,
  thereby giving you 8 cpu's (if you need that many). Sun and
  HP also makes nice hardware although the Tyan board is more
  competetive priced.
 
  OS wise I would choose the FreeBSD amd64 port but
  partititions larger than 2 TB needs some special care, using
  gpt rather than disklabel etc., tools like fsck may not be
  able to completely check partitions larger than 2 TB. Linux
  or Solaris with either LVM or Veritas FS sounds like candidates.

 I agree - you can get a very good one from www.acmemicro.com or
 www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
 RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
 on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read
 performance on these (with tuning) on Linux using the xfs filesystem,
 which is one of the most critical factors for large databases.


Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
for most applications this system will be IO bound, and you will see a
nice lot of drive failures in the first year of operation with
consumer grade drives.

Spend your money on better Disks, and don't bother with Dual Core IMHO
unless you can prove the need for it.

Alex

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Alex Turner
Not at random access in RAID 10 they aren't, and anyone with their
head screwed on right is using RAID 10.  The 9500S will still beat the
Areca cards at RAID 10 database access pattern.

Alex.

On 11/15/05, Dave Cramer [EMAIL PROTECTED] wrote:
 Luke,

 Have you tried the areca cards, they are slightly faster yet.

 Dave

 On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote:





 I agree - you can get a very good one from www.acmemicro.com or

 www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA

 RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM

 on a Tyan 2882 motherboard.  We get about 400MB/s sustained disk read

 performance on these (with tuning) on Linux using the xfs filesystem,

 which is one of the most critical factors for large databases.




 Note that you want to have your DBMS use all of the CPU and disk channel

 bandwidth you have on each query, which takes a parallel database like

 Bizgres MPP to achieve.




 Regards,


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Performance PG 8.0 on dual opteron / 4GB / 3ware

2005-11-15 Thread Luke Lonergan



Joost,

On 11/15/05 11:51 AM, Joost Kraaijeveld [EMAIL PROTECTED] wrote:

On Tue, 2005-11-15 at 12:41 -0700, Steve Wampler wrote:
 Joost Kraaijeveld wrote:
  If I understand correctly (I have 4GB ram):
 
  [EMAIL PROTECTED]:~/tmp$ dd if=/dev/zero of=bigfile bs=8k count=1000000
  1000000+0 records in
  1000000+0 records out
  8192000000 bytes transferred in 304.085269 seconds (26939812 bytes/sec)
 
  Which looks suspicious: 26308 MB/sec???

 Eh? That looks more like ~25.7 MB/sec, assuming 1MB = 1024*1024 bytes.
Oooops. This calculation error is not typical for my testing (I think ;-)).

Summarizing the two facts of note: the write result is 1/4 of what you should be getting, and you are running 1 driver behind the firmware.

You might update your driver, rerun the test, and if you still have the slow result, verify that your filesystem isn't fragmented (multiple undisciplined apps on the same filesystem will do that).
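
If the filesystem is XFS, for instance, something like this will report a fragmentation factor (-r opens the device read-only; substitute your actual device):

   xfs_db -r -c frag /dev/sdXN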

WAL on a separate disk, on a separate controller? What is the write performance there?

Regards,

- Luke





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-15 Thread Luke Lonergan



James,


On 11/15/05 11:07 AM, James Mello [EMAIL PROTECTED] wrote:

Unless there was a way to guarantee consistency, it would be hard at
best to make this work. Convergence on large data sets across boxes is
non-trivial, and diffing databases is difficult at best. Unless there
was some form of automated way to ensure consistency, going 8 ways into
separate boxes is *very* hard. I do suppose that if you have fancy
storage (EMC, Hitachi) you could do BCV or Shadow copies. But in terms
of commodity stuff, I'd have to agree with Merlin.

It's a matter of good software that handles the distribution / parallel query optimization / distributed transactions and management features. Combine that with a gigabit ethernet switch and it works - we routinely get 50x speedup over SMP on OLAP / Decision Support workloads.

Regards,

- Luke 





Re: [PERFORM] Performance PG 8.0 on dual opteron / 4GB / 3ware

2005-11-15 Thread Joost Kraaijeveld
Hi Luke,

On Tue, 2005-11-15 at 22:07 -0800, Luke Lonergan wrote:

 You might update your driver, 
I will do that (but I admit that I am not looking forward to it. When I
was young and did not make money with my computer, I liked challenges
like compiling kernels and not being able to boot the computer. Not any
more :-)).


 
 WAL on a separate disk, on a separate controller?  What is the write
 performance there?
WAL is on a separate disk and a separate controller, write performance:

[EMAIL PROTECTED]:/tmp$ dd if=/dev/zero of=bigfile bs=8k count=1000000
1000000+0 records in
1000000+0 records out
8192000000 bytes transferred in 166.499230 seconds (49201429 bytes/sec)

The quest continues...


-- 
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: [EMAIL PROTECTED]
web: www.askesis.nl 



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly