[PERFORM] Open request for benchmarking input

2005-11-26 Thread Jeff Frost
Did you folks see this article on Slashdot, with a fellow requesting input on 
what sort of benchmarks to run to get a good PostgreSQL vs. MySQL dataset? 
Perhaps this would be a good opportunity for us to get some good benchmarking 
done.  Here's the article link and top text:


http://ask.slashdot.org/article.pl?sid=05/11/26/0317213

 David Lang asks: "With the release of MySQL 5.0, PostgreSQL 8.1, and the flap 
over Oracle purchasing InnoDB, the age-old question of performance is coming 
up again. I've got some boxes that were purchased for a data warehouse project 
that isn't going to be installed for a month or two, and could probably 
squeeze some time in to do some benchmarks on the machines. However, the 
question is: what should be done that's reasonably fair to both MySQL and 
PostgreSQL? We all know that careful selection of the benchmark can seriously 
skew the results, and I want to avoid that (in fact I would consider it close 
to ideal if the results came out that each database won in some tests). I 
would also not like to spend time generating the benchmarks only to have the 
losing side accuse me of being unfair. So, for both MySQL and PostgreSQL 
advocates, what would you like to see in a series of benchmarks?"


 "The hardware I have available is as follows:

* 2x dual Opteron 8G ram, 2x144G 15Krpm SCSI
* 2x dual Opteron 8G ram, 2x72G 15Krpm SCSI
* 1x dual Opteron 16G ram, 2x36G 15Krpm SCSI 16x400G 7200rpm SATA

I would prefer to use Debian Sarge as the base install of the systems (with 
custom-built kernels), but would compile the databases from source rather than 
using binary packages.


For my own interests, I would like to at least cover the following bases: 
32-bit vs. 64-bit vs. 64-bit kernel + 32-bit user-space; data warehouse type 
tests (data >> memory); and web prefs test (active data << RAM)


What specific benchmarks should be run, and what other things should be 
tested? Where should I go for assistance on tuning each database, evaluating 
the benchmark results, and re-tuning them?"


---
Jeff Frost, Owner 
<[EMAIL PROTECTED]>

Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954



Re: [PERFORM] Open request for benchmarking input

2005-11-26 Thread Qingqing Zhou

"Jeff Frost" <[EMAIL PROTECTED]> wrote
>
> Did you folks see this article on Slashdot, with a fellow requesting input 
> on what sort of benchmarks to run to get a good PostgreSQL vs. MySQL 
> dataset? Perhaps this would be a good opportunity for us to get some good 
> benchmarking done.
>  "The hardware I have available is as follows:
>
> * 2x dual Opteron 8G ram, 2x144G 15Krpm SCSI
> * 2x dual Opteron 8G ram, 2x72G 15Krpm SCSI
> * 1x dual Opteron 16G ram, 2x36G 15Krpm SCSI 16x400G 7200rpm SATA
>

I see this as a good chance to evaluate and boost PostgreSQL performance in 
general.

My two concerns:
(1) How long will David Lang spend on the benchmarking? We need *continuous* 
feedback after each round of tuning. This is important, and Mark Wong has done 
a great job on this.
(2) The hardware configuration may not show the full potential of PostgreSQL. 
For example, so far PostgreSQL has not paid much attention to reducing I/O 
cost, so a stronger RAID setup would definitely benefit PostgreSQL performance.

>
> For my own interests, I would like to at least cover the following bases: 
> 32-bit vs. 64-bit vs. 64-bit kernel + 32-bit user-space; data warehouse type 
> tests (data >> memory); and web prefs test (active data << RAM)
>

Don't forget TPC-C (data > memory, with intensive updates). So the benchmarks 
in my mind include TPC-C, TPC-H, and TPC-W.

Regards,
Qingqing 





Re: [PERFORM] Open request for benchmarking input

2005-11-26 Thread Luke Lonergan
Jeff, Qingqing,

On 11/26/05 10:57 AM, "Qingqing Zhou" <[EMAIL PROTECTED]> wrote:

> 
> "Jeff Frost" <[EMAIL PROTECTED]> wrote
>> 
>> Did you folks see this article on Slashdot, with a fellow requesting input
>> on what sort of benchmarks to run to get a good PostgreSQL vs. MySQL
>> dataset? Perhaps this would be a good opportunity for us to get some good
>> benchmarking done.
>>  "The hardware I have available is as follows:
>> 
>> * 2x dual Opteron 8G ram, 2x144G 15Krpm SCSI
>> * 2x dual Opteron 8G ram, 2x72G 15Krpm SCSI
>> * 1x dual Opteron 16G ram, 2x36G 15Krpm SCSI 16x400G 7200rpm SATA
>> 

I suggest specifying a set of basic system / HW benchmarks to baseline the
hardware before each benchmark is run.  The lack of such a baseline has proven
to be a major issue with most performance tests.  My pick for I/O is bonnie++.
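
Something along these lines would do as a starting point - just a minimal
sketch with made-up paths and labels, assuming bonnie++ is installed and
DATA_DIR sits on the array under test:

    #!/usr/bin/env python
    # Hypothetical baseline sketch: record the environment, then run
    # bonnie++ against the array under test.  The directory, size, user,
    # and label are placeholders -- adjust per machine.
    import subprocess

    DATA_DIR = "/mnt/raid/bench"    # a directory on the array being tested
    LABEL = "opteron-2x144G-scsi"   # free-form machine label for the report

    # Capture kernel and hardware info so results can be compared across runs.
    subprocess.call("uname -a >> baseline-%s.txt" % LABEL, shell=True)

    # Use a file size of at least 2x RAM (16g on the 8GB boxes) so the test
    # is not satisfied from the page cache.  -u names the user to run as;
    # bonnie++ refuses to run as root without it.
    subprocess.call(["bonnie++", "-d", DATA_DIR, "-s", "16g",
                     "-u", "nobody", "-m", LABEL])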

Your equipment gives you the opportunity to benchmark all 5 machines running
together as a cluster - this is important for measuring the maturity of
solutions for high-performance warehousing.  Greenplum can provide you a
license for Bizgres MPP for this purpose.

> (2) The hardware configuration may not show the full potential of PostgreSQL.
> For example, so far PostgreSQL has not paid much attention to reducing I/O
> cost, so a stronger RAID setup would definitely benefit PostgreSQL
> performance.

The 16x SATA drives should be great, provided you have a high-performance
RAID adapter configured properly.  You should be able to get 800MB/s of
sequential scan performance using a card like the 3Ware 9550SX.  I've also
heard that the Areca cards are good (how good?).  The I/O configuration must
be validated, though - I've seen as low as 25MB/s from a misconfigured system.
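
A quick sanity check before trusting any database numbers is to time a raw
sequential read of a file much larger than RAM, so the page cache can't
flatter the result.  A rough sketch, with a made-up path:

    #!/usr/bin/env python
    # Rough sequential-read throughput check: read a large file in 1MB
    # chunks and report MB/s.  PATH is a placeholder and must point at a
    # pre-created file much larger than RAM.
    import time

    PATH = "/mnt/raid/bigfile"
    CHUNK = 1024 * 1024

    f = open(PATH, "rb")
    total = 0
    start = time.time()
    while True:
        buf = f.read(CHUNK)
        if not buf:
            break
        total += len(buf)
    elapsed = time.time() - start
    f.close()
    mb = total / (1024.0 * 1024.0)
    print("read %.0f MB in %.1f s: %.1f MB/s" % (mb, elapsed, mb / elapsed))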

>> For my own interests, I would like to at least cover the following bases:
>> 32-bit vs. 64-bit vs. 64-bit kernel + 32-bit user-space; data warehouse type
>> tests (data >> memory); and web prefs test (active data << RAM)
>> 
> 
> Don't forget TPC-C (data > memory, with intensive updates). So the benchmarks
> in my mind include TPC-C, TPC-H, and TPC-W.

I agree with Qingqing, though I think the OSDL DBT-3 (very similar to TPC-H)
is sufficient for data warehousing.

This is a fairly ambitious project - one problem I see is that MySQL may not
run all of these benchmarks, particularly DBT-3.  Also - would the rules
allow for mixing and matching pluggable features of the DBMS?  InnoDB versus
MyISAM?

- Luke   





Re: [PERFORM] Open request for benchmarking input

2005-11-26 Thread Ron

At 03:15 PM 11/26/2005, Luke Lonergan wrote:


> I suggest specifying a set of basic system / HW benchmarks to baseline the
> hardware before each benchmark is run.  The lack of such a baseline has
> proven to be a major issue with most performance tests.  My pick for I/O is
> bonnie++.
>
> Your equipment gives you the opportunity to benchmark all 5 machines running
> together as a cluster - this is important for measuring the maturity of
> solutions for high-performance warehousing.  Greenplum can provide you a
> license for Bizgres MPP for this purpose.
...and detailed config / tuning specs as well for it, or everyone is 
probably wasting their time.  For instance, it seems fairly clear 
that the default 8KB block size and default read-ahead size are both 
pessimal, at least for non-OLTP-like apps.  In addition, there's been 
a reasonable amount of evidence that xfs should be the file system of 
choice for pg.


Things like optimal RAID stripe size, how to allocate tables to 
various IO HW, and what levels of RAID to use for each RAID set also 
have to be defined.
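
For the read-ahead piece specifically, a small sketch of how the setting
could be recorded and changed between runs (the device name is a
placeholder; blockdev ships with util-linux and needs root):

    #!/usr/bin/env python
    # Record and adjust block-device read-ahead between benchmark runs so
    # each result can be tied to the exact setting that produced it.
    import subprocess

    DEV = "/dev/sda"  # placeholder: the device backing the data directory

    # Print the current read-ahead, in 512-byte sectors.
    subprocess.call(["blockdev", "--getra", DEV])

    # Raise read-ahead to 8MB (16384 sectors) for the sequential-heavy
    # runs, then re-run the workload and compare against the default.
    subprocess.call(["blockdev", "--setra", "16384", DEV])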




> The 16x SATA drives should be great, provided you have a high-performance
> RAID adapter configured properly.  You should be able to get 800MB/s of
> sequential scan performance using a card like the 3Ware 9550SX.  I've also
> heard that the Areca cards are good (how good?).  The I/O configuration must
> be validated, though - I've seen as low as 25MB/s from a misconfigured
> system.
The Areca cards, particularly with 1-2GB of buffer cache, are the 
current commodity RAID controller performance leader.  Better 
performance can be had from vendors like Xyratex, but it will cost 
much more.



Ron





Re: [PERFORM] Open request for benchmarking input

2005-11-26 Thread Andreas Pflug

Qingqing Zhou wrote:

> "Jeff Frost" <[EMAIL PROTECTED]> wrote
>
>> "The hardware I have available is as follows:
>>
>>    * 2x dual Opteron 8G ram, 2x144G 15Krpm SCSI
>>    * 2x dual Opteron 8G ram, 2x72G 15Krpm SCSI
>>    * 1x dual Opteron 16G ram, 2x36G 15Krpm SCSI 16x400G 7200rpm SATA
>
> (2) The hardware configuration may not show the full potential of PostgreSQL.


These boxes don't look like they were designed for a DB server. The first 
ones are very CPU bound, and the third may be a good choice for very large 
amounts of streamed data, but it is not optimal for TP random access.

Hopefully, when publicly visible benchmarks are performed, machines will be 
used that comply with common engineering knowledge, ignoring those guys 
who still believe that sequential performance is the most important 
issue in disk subsystems for a DBMS.


Regards,
Andreas



Re: [PERFORM] Open request for benchmarking input

2005-11-26 Thread David Lang

Ok, I've subscribed (hopefully list volume won't kill me :-)

I'm covering several things in this message since I didn't receive the 
prior messages in the thread.


First off, these benchmarks are not being sponsored by my employer; they 
need the machines burned in, and so I'm going to use them for the tests 
while burning them in. I can spend a little official time on this, 
justifying it as learning the proper config/tuning settings for our 
project, but not too much. I'm also deliberately not using my work e-mail 
and am not mentioning the company name, so please respect this and keep 
the two separate (some of you have dealt with us formally, others will 
over the next few months).


This means no remote access for people (but I am willing to run tests and 
send configs around); in fact, the machines will not have Internet access 
for the duration of the tests.


It also means I'm doing almost all of the configuration work for this on my 
own time (nights and weekends). The machines will not be moved to 
production for a couple of months, so we should be able to go back 
and forth with questions and answers (albeit somewhat slowly, with me 
checking in every night) while whatever tests are agreed on run during the 
day. Once we get past the system tuning and start doing different tests, 
it would probably be helpful if people could send me scripts to run that I 
can just let loose.
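
To give an idea of the form that's easiest for me to leave running
unattended, something like this minimal sketch (the database name and query
file names here are made up):

    #!/usr/bin/env python
    # Minimal unattended benchmark driver: run each SQL file through psql,
    # time it, and append the results to a log.  Names are placeholders.
    import subprocess
    import time

    DBNAME = "bench"
    QUERIES = ["q01.sql", "q02.sql", "q03.sql"]

    log = open("results.log", "a")
    for q in QUERIES:
        start = time.time()
        rc = subprocess.call(["psql", "-d", DBNAME, "-q", "-f", q])
        elapsed = time.time() - start
        log.write("%s rc=%d %.2fs\n" % (q, rc, elapsed))
        log.flush()
    log.close()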


I don't have any money to pay for benchmark suites, so if things like the 
TPC benchmarks cost money to run, I won't be able to do them.


To clarify the hardware:

I have 5 machines total to work with, and this includes the client machines 
that make the queries (I may be able to get hold of 2-3 more, but they are 
similar configs).


None of these have dual-core processors; the CPUs are 246 or 252 
Opterons (I'll have to double-check which is in which machine - I think the 
large-disk machine has 246s and the others 252s).


I have access to a gig-E switch on a fairly idle network to use to 
connect these machines.


The large-disk machine has 3ware 9500-series 8-port SATA controllers in it, 
with battery backup. In our official dealings with Greenplum we attempted 
to do a set of benchmarks on that machine, but the timing was horrible - I 
was too busy when they worked with us on it - and we never did figure out 
the best settings to use for this machine.


Part of the reason I posted this to /. rather than just contacting you and 
the MySQL folks directly is that I would like to see a reasonable set of 
benchmarks agreed to, and then have people with different hardware than I 
have run the same sets of tests. I know the tuning will be different for 
different hardware, but if we can have a bunch of people run similar tests 
we should learn a lot.


David Lang



Re: [PERFORM] Open request for benchmarking input

2005-11-26 Thread David Lang

By the way, this is the discussion that prompted me to start this project:
http://lwn.net/Articles/161323/

David Lang



Re: [PERFORM] Open request for benchmarking input

2005-11-26 Thread David Lang
> These boxes don't look like they were designed for a DB server. The first 
> ones are very CPU bound, and the third may be a good choice for very large 
> amounts of streamed data, but it is not optimal for TP random access.


I don't know what you mean when you say that the first ones are CPU bound; 
they have far more CPU than they do disk I/O.


However, I will agree that they were not designed to be DB servers - they 
weren't. They happen to be the machines that I have available.


They only have a pair of disks each, which would not be reasonable for 
most production DB uses, and they have far more CPU than is normally 
recommended. So I'll have to run RAID 0 instead of 0+1 (or not use RAID), 
which would be unacceptable in a production environment, but it can still 
give some useful info.


The 5th box _was_ purchased to be a DB server, but one to store and 
analyse large amounts of log data, so a large amount of data storage was 
more important than raw DB performance (although we did max out the RAM at 
16G to try and make up for it). It was a deliberate price/performance 
tradeoff: this machine ran ~$20k, but a similar capacity with SCSI drives 
would have been FAR more expensive (IIRC, 4x or more).


> Hopefully, when publicly visible benchmarks are performed, machines will be 
> used that comply with common engineering knowledge, ignoring those guys 
> who still believe that sequential performance is the most important 
> issue in disk subsystems for a DBMS.


Are you saying that I shouldn't do any benchmarks because the machines 
aren't what you would consider good enough?


If so, I disagree with you, and think that benchmarks should be done on even 
worse machines, but should also be done on better machines. (Are you 
volunteering to provide time on better machines for benchmarks?)


Not everyone will buy a lot of high-end hardware before they start using 
a database. In fact, most companies will start with a database on lower-end 
hardware and then move to better hardware as their requirements grow. I'm 
willing to bet that what I have available is better than the starting point 
for most places.


Postgres needs to work on the low-end stuff as well as the high-end stuff, 
or people will write their apps to work with things that DO run on low-end 
hardware, and then spend much more money than is needed scaling the 
hardware up rather than re-writing their apps.


Part of the reason I made the post on /. to start this was the hope that a 
reasonable set of benchmarks could be hammered out, and then more people 
than just me could run them to get a wider range of results.


David Lang



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-26 Thread David Lang

> Another thought - I priced out a maxed out machine with 16 cores and
> 128GB of RAM and 1.5TB of usable disk - $71,000.
>
> You could instead buy 8 machines that total 16 cores, 128GB RAM and 28TB
> of disk for $48,000, and it would be 16 times faster in scan rate, which
> is the most important factor for large databases.  The size would be 16
> rack units instead of 5, and you'd have to add a GigE switch for $1500.
>
> Scan rate for above SMP: 200MB/s
>
> Scan rate for above cluster: 3,200MB/s
>
> You could even go dual core and double the memory on the cluster and
> you'd about match the price of the "god box".
>
> - Luke


Luke, I assume you are talking about using the Greenplum MPP for this 
(otherwise I don't know how you are combining all the different systems).


If you are, then you are overlooking one very significant factor: the cost 
of the MPP software. At $10K per CPU, the cluster has an extra $160K in 
software costs, which is more than double the hardware costs.


If money is no object then go for it, but if it is, then your comparison 
would be (ignoring software maintenance costs) the 16-core, 128G-RAM system 
vs. ~3 small systems totaling 6 cores and 48G of RAM.


Yes, if scan speed is the bottleneck you still win with the small systems, 
but for most other uses the large system would win easily. And in any case, 
it's not the open-and-shut case that you keep presenting it as.


David Lang



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-26 Thread Luke Lonergan
For data warehousing it's pretty well open and shut.  To use all CPUs and I/O 
channels on each query, you will need MPP.

Has anyone done the math on the original post?  5TB takes how long to scan 
once?  If you want to wait less than a couple of days just for a seq scan, 
you'd better be in the multi-GB-per-second range.
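
Back of the envelope, using numbers from earlier in the thread: 5TB is
roughly 5,000,000MB, so at 200MB/s a single scan takes 25,000 seconds
(about 7 hours); at a misconfigured 25MB/s it's over two days; at 3,200MB/s
it drops to about 26 minutes.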

- Luke
--
Sent from my BlackBerry Wireless Device




Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-26 Thread David Lang

On Sun, 27 Nov 2005, Luke Lonergan wrote:

> For data warehousing it's pretty well open and shut.  To use all CPUs and 
> I/O channels on each query, you will need MPP.
>
> Has anyone done the math on the original post?  5TB takes how long to 
> scan once?  If you want to wait less than a couple of days just for a 
> seq scan, you'd better be in the multi-GB-per-second range.


If you truly need to scan the entire database then you are right; however, 
indexes should be able to cut the amount you need to scan drastically.


David Lang
