[PERFORM] Databases optimization

2011-09-11 Thread Hany ABOU-GHOURY
Hi,


I have a database cluster running PostgreSQL 8.2

and I have a new Linux virtualized database environment running PostgreSQL
9.0.


My question is: how do I ensure that the database schemas stay performant and
scalable, that the databases are optimized, and that everything is fully migrated?

Thanks in advance!

Hany


[PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Igor Chudov
I have been a MySQL user for years, including owning a few
multi-gigabyte databases for my websites, and using it to host
algebra.com (about 12 GB database).

I have had my ups and downs with MySQL. The ups were ease of use and
decent performance for small databases such as algebra.com. The downs
were things like twenty hour REPAIR TABLE operations on a 35 GB
table, etc.

Right now I have a personal (one user) project to create a 5-10
Terabyte data warehouse. The largest table will consume the most space
and will take, perhaps, 200,000,000 rows.

I want to use it to obtain valuable business intelligence and to make
money.

I expect it to grow, never shrink, and to be accessed via batch
queries. I do not need batch queries to be super fast; for example,
an hour per query would be just fine.

However, while an hour is fine, two weeks per query is NOT fine.

I have a server with about 18 TB of storage and 48 GB of RAM, and 12
CPU cores.

My initial plan was to use MySQL, InnoDB, and deal with problems as
they arise. Perhaps, say, I would implement my own joining
procedures.

After reading some disparaging stuff about InnoDB performance on large
datasets, however, I am getting cold feet. I have a general feeling
that, perhaps, I will not be able to succeed with MySQL, or, perhaps,
with either MySQL or Postgres.

I do not know much about Postgres, but I am very eager to learn and
see if I can use it for my purposes more effectively than MySQL.

I cannot shell out $47,000 per CPU for Oracle for this project.

To be more specific, the batch queries that I would do, I hope,
would either use small JOINS of a small dataset to a large dataset, or
just SELECTS from one big table.

So... Can Postgres support a 5-10 TB database with the use pattern
stated above?

Thanks!

i


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Scott Marlowe
On Sun, Sep 11, 2011 at 6:35 AM, Igor Chudov ichu...@gmail.com wrote:
 I have a server with about 18 TB of storage and 48 GB of RAM, and 12
 CPU cores.

1 or 2 fast cores is plenty for what you're doing.  But the drive
array and how it's configured etc are very important.  There's a huge
difference between 10 2TB 7200RPM SATA drives in a software RAID-5 and
36 500G 15kRPM SAS drives in a RAID-10 (SW or HW would both be ok for
data warehouse.)

 I do not know much about Postgres, but I am very eager to learn and
 see if I can use it for my purposes more effectively than MySQL.
 I cannot shell out $47,000 per CPU for Oracle for this project.
 To be more specific, the batch queries that I would do, I hope,

Hopefully, if need be, you can spend some small percentage of that on
a fast IO subsystem.

 would either use small JOINS of a small dataset to a large dataset, or
 just SELECTS from one big table.
 So... Can Postgres support a 5-10 TB database with the use pattern
 stated above?

I use it on a ~3TB DB and it works well enough.  Fast IO is the key
here.  Lots of drives in RAID-10 or HW RAID-6 if you don't do a lot of
random writing.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread pasman pasmański
For 10 TB table and 3hours, disks should have a transfer about 1GB/s (seqscan).

2011/9/11, Scott Marlowe scott.marl...@gmail.com:
 On Sun, Sep 11, 2011 at 6:35 AM, Igor Chudov ichu...@gmail.com wrote:
 I have a server with about 18 TB of storage and 48 GB of RAM, and 12
 CPU cores.

 1 or 2 fast cores is plenty for what you're doing.  But the drive
 array and how it's configured etc are very important.  There's a huge
 difference between 10 2TB 7200RPM SATA drives in a software RAID-5 and
 36 500G 15kRPM SAS drives in a RAID-10 (SW or HW would both be ok for
 data warehouse.)

 I do not know much about Postgres, but I am very eager to learn and
 see if I can use it for my purposes more effectively than MySQL.
 I cannot shell out $47,000 per CPU for Oracle for this project.
 To be more specific, the batch queries that I would do, I hope,

 Hopefully if needs be you can spend some small percentage of that for
 a fast IO subsystem is needed.

 would either use small JOINS of a small dataset to a large dataset, or
 just SELECTS from one big table.
 So... Can Postgres support a 5-10 TB database with the use pattern
 stated above?

 I use it on a ~3TB DB and it works well enough.  Fast IO is the key
 here.  Lots of drives in RAID-10 or HW RAID-6 if you don't do a lot of
 random writing.

 --
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance



-- 

pasman

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Igor Chudov
On Sun, Sep 11, 2011 at 7:52 AM, Scott Marlowe scott.marl...@gmail.com wrote:

 On Sun, Sep 11, 2011 at 6:35 AM, Igor Chudov ichu...@gmail.com wrote:
  I have a server with about 18 TB of storage and 48 GB of RAM, and 12
  CPU cores.

 1 or 2 fast cores is plenty for what you're doing.


I need those cores to perform other tasks, like image manipulation with
imagemagick, XML forming and parsing etc.


  But the drive
 array and how it's configured etc are very important.  There's a huge
 difference between 10 2TB 7200RPM SATA drives in a software RAID-5 and
 36 500G 15kRPM SAS drives in a RAID-10 (SW or HW would both be ok for
 data warehouse.)


Well, right now, my server has twelve 7,200 RPM 2TB hard drives in a RAID-6
configuration.

They are managed by a 3WARE 9750 RAID CARD.

I would say that I am not very concerned with linear relationship of read
speed to disk speed. If that stuff is somewhat slow, it is OK with me.

What I want to avoid is severe degradation of performance due to size (time
complexity greater than O(1)), disastrous REPAIR TABLE operations etc.


 I do not know much about Postgres, but I am very eager to learn and
  see if I can use it for my purposes more effectively than MySQL.
  I cannot shell out $47,000 per CPU for Oracle for this project.
  To be more specific, the batch queries that I would do, I hope,

 Hopefully if needs be you can spend some small percentage of that for
 a fast IO subsystem is needed.



I am actually open for suggestions here.


   would either use small JOINS of a small dataset to a large dataset, or
  just SELECTS from one big table.
  So... Can Postgres support a 5-10 TB database with the use pattern
  stated above?

 I use it on a ~3TB DB and it works well enough.  Fast IO is the key
 here.  Lots of drives in RAID-10 or HW RAID-6 if you don't do a lot of
 random writing.


I do not plan to do a lot of random writing. My current design is that my
perl scripts write to a temporary table every week, and then I do INSERT..ON
DUPLICATE KEY UPDATE.

By the way, does that INSERT UPDATE functionality or something like this
exist in Postgres?

i


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Igor Chudov
2011/9/11 pasman pasmański pasma...@gmail.com

 For 10 TB table and 3hours, disks should have a transfer about 1GB/s
 (seqscan).



I have 6 Gb/s disk drives, so it should not be too far off; maybe 5 hours for a
seqscan.

i


 2011/9/11, Scott Marlowe scott.marl...@gmail.com:
  On Sun, Sep 11, 2011 at 6:35 AM, Igor Chudov ichu...@gmail.com wrote:
  I have a server with about 18 TB of storage and 48 GB of RAM, and 12
  CPU cores.
 
  1 or 2 fast cores is plenty for what you're doing.  But the drive
  array and how it's configured etc are very important.  There's a huge
  difference between 10 2TB 7200RPM SATA drives in a software RAID-5 and
  36 500G 15kRPM SAS drives in a RAID-10 (SW or HW would both be ok for
  data warehouse.)
 
  I do not know much about Postgres, but I am very eager to learn and
  see if I can use it for my purposes more effectively than MySQL.
  I cannot shell out $47,000 per CPU for Oracle for this project.
  To be more specific, the batch queries that I would do, I hope,
 
  Hopefully if needs be you can spend some small percentage of that for
  a fast IO subsystem is needed.
 
  would either use small JOINS of a small dataset to a large dataset, or
  just SELECTS from one big table.
  So... Can Postgres support a 5-10 TB database with the use pattern
  stated above?
 
  I use it on a ~3TB DB and it works well enough.  Fast IO is the key
  here.  Lots of drives in RAID-10 or HW RAID-6 if you don't do a lot of
  random writing.
 
  --
  Sent via pgsql-performance mailing list (
 pgsql-performance@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-performance
 


 --
 
 pasman

 --
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance



Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Claudio Freire
On Sun, Sep 11, 2011 at 3:59 PM, Igor Chudov ichu...@gmail.com wrote:
 Well, right now, my server has twelve 7,200 RPM 2TB hard drives in a RAID-6
 configuration.
 They are managed by a 3WARE 9750 RAID CARD.

 I would say that I am not very concerned with linear relationship of read
 speed to disk speed. If that stuff is somewhat slow, it is OK with me.

With Raid 6 you'll have abysmal performance on write operations.
In data warehousing, there's lots of writes to temporary files, for
sorting and stuff like that.

You should either migrate to raid 10, or set up a separate array for
temporary files, perhaps raid 0.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Andy Colson

On 09/11/2011 07:35 AM, Igor Chudov wrote:

I have been a MySQL user for years, including owning a few
multi-gigabyte databases for my websites, and using it to host
algebra.com (about 12 GB database).

I have had my ups and downs with MySQL. The ups were ease of use and
decent performance for small databases such as algebra.com. The downs
were things like twenty hour REPAIR TABLE operations on a 35 GB
table, etc.

Right now I have a personal (one user) project to create a 5-10
Terabyte data warehouse. The largest table will consume the most space
and will take, perhaps, 200,000,000 rows.

I want to use it to obtain valuable business intelligence and to make
money.

I expect it to grow, never shrink, and to be accessed via batch
queries. I do not care for batch queries to be super fast, for example
an hour per query would be just fine.

However, while an hour is fine, two weeks per query is NOT fine.

I have a server with about 18 TB of storage and 48 GB of RAM, and 12
CPU cores.

My initial plan was to use MySQL, InnoDB, and deal with problems as
they arise. Perhaps, say, I would implement my own joining
procedures.

After reading some disparaging stuff about InnoDB performance on large
datasets, however, I am getting cold feet. I have a general feeling
that, perhaps, I will not be able to succeed with MySQL, or, perhaps,
with either MySQL and Postgres.

I do not know much about Postgres, but I am very eager to learn and
see if I can use it for my purposes more effectively than MySQL.

I cannot shell out $47,000 per CPU for Oracle for this project.

To be more specific, the batch queries that I would do, I hope,
would either use small JOINS of a small dataset to a large dataset, or
just SELECTS from one big table.

So... Can Postgres support a 5-10 TB database with the use pattern
stated above?

Thanks!

i



That is a scale or two larger than I have experience with.  I converted my
website database from mysql to PG, and it has several db's between 1 and 10
gig.  There are parts of the website that were faster with mysql, and there are
parts faster with PG.  In one spot, because PG has superior join support on
select statements, I was able to change the code to generate a single more
complicated sql statement, vs. mysql, which had to fire off several simpler
statements.  It's a search screen where you can type in 15-ish different
options.  I was able to generate a single sql statement which joins eight-odd
tables with plenty of where clauses.  PG runs it in the blink of an eye.  It's
astonishing compared to the pain of mysql.  If you ever have to write your own
join, or your own lookup function, that's a failure of your database.

One spot that was slower was a batch insert of data.  It's not so much slower
that it was a problem.  I use COPY on PG vs prepared inserts on mysql.  It was
pretty close, but mysql still won.

Seeing as you can setup and test both databases, have you considered a trial 
run?

Things to watch for:


I think the same amount of data will use more disk space in PG than in mysql.

Importing data into PG should use COPY and multiple connections at the same 
time.

PG will only use multiple cores if you use multiple connections (each connection
uses one core).

Huge result sets (like a select statement that returns 1,000,000 rows) will be 
slow.

PG is a much fuller database than mysql, and as such you can influence its join 
types, and function calls. (table scan vs index, immutable function vs stable, 
perl function vs sql).  So if at first it appears slow, you have a million 
options.  I think the only option you have in mysql is to pull the data back 
and code it yourself.

Upgrading to major versions of PG may or may not be painful.  (mysql sometimes 
works seamlessly between versions, it appears brilliant.  But I have had 
problems with an update, and when it goes bad, you don't have a lot of options).
 In the past PG's only method of upgrade was a full backup of old, restore in 
new.  Things have gotten better, there is new pg_upgrade support (still kinda 
new though), and there is some 3rd party replication support where you 
replicate your 9.0 database to a new 9.1 database, and at some point you 
promote the new 9.1 database as the new master.  Or something like that.  I've 
only read posts about it, never done it.  But with that much data, you'll need 
an upgrade plan.

All in all, if I can summarize my personal view: mysql is fast at the expense
of safety and usability.  (mysql still cannot do updates with subselects.)
PG is safe and usable at the expense of speed, and you won't be disappointed by
the speed.

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Igor Chudov
On Sun, Sep 11, 2011 at 9:16 AM, Claudio Freire klaussfre...@gmail.com wrote:

 On Sun, Sep 11, 2011 at 3:59 PM, Igor Chudov ichu...@gmail.com wrote:
  Well, right now, my server has twelve 7,200 RPM 2TB hard drives in a
 RAID-6
  configuration.
  They are managed by a 3WARE 9750 RAID CARD.
 
  I would say that I am not very concerned with linear relationship of read
  speed to disk speed. If that stuff is somewhat slow, it is OK with me.

 With Raid 6 you'll have abysmal performance on write operations.
 In data warehousing, there's lots of writes to temporary files, for
 sorting and stuff like that.

 You should either migrate to raid 10, or set up a separate array for
 temporary files, perhaps raid 0.


Thanks. I will rebuild the RAID array early next week and I will see if I
have a Raid 10 option with that card.

Quantitatively, what would you say is the write speed difference between
RAID 10 and RAID 6?


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Andy Colson

On 09/11/2011 08:59 AM, Igor Chudov wrote:



I do not plan to do a lot of random writing. My current design is that my perl 
scripts write to a temporary table every week, and then I do INSERT..ON 
DUPLICATE KEY UPDATE.

By the way, does that INSERT UPDATE functionality or something like this exist 
in Postgres?

i


You have two options:

1) write a function like:
create function doinsert(_id integer, _value text) returns void as $$
begin
  update thetable set value = _value where id = _id;
  if not found then
    insert into thetable(id, value) values (_id, _value);
  end if;
end;
$$ language plpgsql;

2) use two sql statements:
-- update the existing
update realTable set value = (select value from tmp where tmp.id = realTable.id)
where exists (select value from tmp where tmp.id = realTable.id);

-- insert the missing
insert into realTable(id, value)
select id, value from tmp
where not exists (select 1 from realTable where tmp.id = realTable.id);
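
For completeness, a minimal usage sketch (the table definition and the id
values here are just made up for illustration):

create table thetable (id integer primary key, value text);

select doinsert(42, 'first value');   -- id 42 is new, so this inserts
select doinsert(42, 'second value');  -- id 42 now exists, so this updates

-- If you go with the two-statement variant, run both statements in one
-- transaction (BEGIN ... COMMIT) so the update and the insert work
-- against the same snapshot of realTable.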


-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Claudio Freire
On Sun, Sep 11, 2011 at 4:16 PM, Andy Colson a...@squeakycode.net wrote:
 Upgrading to major versions of PG may or may not be painful.  (mysql
 sometimes works seamlessly between versions, it appears brilliant.  But I
 have had problems with an update, and when it goes bad, you dont have a lot
 of options).  In the past PG's only method of upgrade was a full backup of
 old, restore in new.  Things have gotten better, there is new pg_upgrade
 support (still kinda new though), and there is some 3rd party replication
 support where you replicate your 9.0 database to a new 9.1 database, and at
 some point you promote the new 9.1 database as the new master.  Or something
 like that.  I've only read posts about it, never done it.  But with that
 much data, you'll need an upgrade plan.

I have used slony to do database migration. It is a pain to set up,
but it saves you hours of downtime.
Basically, you replicate your 9.0 database into a 9.1 slave while the
9.0 is still hot and working, so you only have a very small downtime.
It's an option, but it's a lot of work to set up, only warranted if
you really cannot afford the downtime.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Andy Colson

On 09/11/2011 09:21 AM, Igor Chudov wrote:



On Sun, Sep 11, 2011 at 9:16 AM, Claudio Freire klaussfre...@gmail.com wrote:

On Sun, Sep 11, 2011 at 3:59 PM, Igor Chudov ichu...@gmail.com wrote:
  Well, right now, my server has twelve 7,200 RPM 2TB hard drives in a 
RAID-6
  configuration.
  They are managed by a 3WARE 9750 RAID CARD.
 
  I would say that I am not very concerned with linear relationship of read
  speed to disk speed. If that stuff is somewhat slow, it is OK with me.

With Raid 6 you'll have abysmal performance on write operations.
In data warehousing, there's lots of writes to temporary files, for
sorting and stuff like that.

You should either migrate to raid 10, or set up a separate array for
temporary files, perhaps raid 0.


Thanks. I will rebuild the RAID array early next week and I will see if I have 
a Raid 10 option with that card.

Quantitatively, what would you say is the write speed difference between RAID 
10 and RAID 6?



Note that using RAID 10, while faster, cuts your usable space in half. 12 2TB 
drives in raid 10 == 6 drives * 2TB == 12 TB total space.  That's not big 
enough, is it?

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Claudio Freire
On Sun, Sep 11, 2011 at 4:21 PM, Igor Chudov ichu...@gmail.com wrote:
 Quantitatively, what would you say is the write speed difference between
 RAID 10 and RAID 6?

https://support.nstein.com/blog/archives/73

There you can see a comparison with 4 drives, and raid 10 is twice as fast.
Since raid 5/6 doesn't scale write performance at all (it performs as
a single drive), it's quite expected. 12 drives would probably be
around 6 times as fast as raid 6.

You definitely should do some benchmarks to confirm, though.

And Andy is right, you'll have a lot less space. If raid 10 doesn't
give you enough room, just leave two spare drives for a raid 0
temporary partition. That will be at least twice as fast as doing
temporary tables on the raid 6.

You'll obviously have to get creative, tons of options.
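
For reference, pointing temporary files at a separate array is just a
tablespace setting; a rough sketch (the mount point and tablespace name are
placeholders):

-- assumes the raid 0 array is mounted at /mnt/pg_temp and writable by
-- the postgres user
create tablespace temp_space location '/mnt/pg_temp';

-- per session, or set it in postgresql.conf as the default
set temp_tablespaces = 'temp_space';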

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Marti Raudsepp
On Sun, Sep 11, 2011 at 17:23, Andy Colson a...@squeakycode.net wrote:
 On 09/11/2011 08:59 AM, Igor Chudov wrote:
 By the way, does that INSERT UPDATE functionality or something like this 
 exist in Postgres?
 You have two options:
 1) write a function like:
 create function doinsert(_id integer, _value text) returns void as
 2) use two sql statements:

Unfortunately both of these options have caveats. Depending on your
I/O speed, you might need to use multiple loader threads to saturate
the write bandwidth.

However, neither option is safe from race conditions. If you need to
load data from multiple threads at the same time, they won't see each
other's inserts (until commit) and thus cause unique violations. If
you could somehow partition their operation by some key, so threads
are guaranteed not to conflict each other, then that would be perfect.
The 2nd option given by Andy is probably faster.

You *could* code a race-condition-safe function, but that would be a
no-go on a data warehouse, since each call needs a separate
subtransaction which involves allocating a transaction ID.
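
For reference, such a function is roughly the merge_db example from the
PL/pgSQL docs (table and column names assumed from Andy's sketch):

create function merge_row(_id integer, _value text) returns void as $$
begin
  loop
    -- first try to update the key
    update thetable set value = _value where id = _id;
    if found then
      return;
    end if;
    -- not there, so try to insert; a concurrent insert of the same key
    -- raises unique_violation
    begin
      insert into thetable(id, value) values (_id, _value);
      return;
    exception when unique_violation then
      -- do nothing, loop and retry the update.  This begin/exception
      -- block is the per-call subtransaction mentioned above.
    end;
  end loop;
end;
$$ language plpgsql;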



Which brings me to another important point: don't do lots of small
write transactions, SAVEPOINTs or PL/pgSQL subtransactions. Besides
being inefficient, they introduce a big maintenance burden. In
PostgreSQL's MVCC, each tuple contains a reference to the 32-bit
transaction ID that inserted it (xmin). After hitting the maximum
32-bit value transaction ID, the number wraps around. To prevent old
rows from appearing as new, a vacuum freeze process will run after
passing autovacuum_freeze_max_age transactions (200 million by
default) to update all old rows in your database. Using fewer
transaction IDs means it runs less often.

On small databases, this is usually not important. But on a 10TB data
warehouse, rewriting a large part of your database totally kills
performance for any other processes.
This is detailed in the documentation:
http://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
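
If you want to keep an eye on it, the per-database transaction ID age is
visible in the catalogs, e.g.:

-- age() here is measured in transactions; an anti-wraparound vacuum is
-- forced once it passes autovacuum_freeze_max_age (200 million by default)
select datname, age(datfrozenxid) as xid_age
  from pg_database
 order by xid_age desc;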

Regards,
Marti

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Ogden

On Sep 11, 2011, at 9:21 AM, Igor Chudov wrote:

 
 
 On Sun, Sep 11, 2011 at 9:16 AM, Claudio Freire klaussfre...@gmail.com 
 wrote:
 On Sun, Sep 11, 2011 at 3:59 PM, Igor Chudov ichu...@gmail.com wrote:
  Well, right now, my server has twelve 7,200 RPM 2TB hard drives in a RAID-6
  configuration.
  They are managed by a 3WARE 9750 RAID CARD.
 
  I would say that I am not very concerned with linear relationship of read
  speed to disk speed. If that stuff is somewhat slow, it is OK with me.
 
 With Raid 6 you'll have abysmal performance on write operations.
 In data warehousing, there's lots of writes to temporary files, for
 sorting and stuff like that.
 
 You should either migrate to raid 10, or set up a separate array for
 temporary files, perhaps raid 0.
 
 Thanks. I will rebuild the RAID array early next week and I will see if I 
 have a Raid 10 option with that card.
 
 Quantitatively, what would you say is the write speed difference between RAID 
 10 and RAID 6?
 

As someone who migrated a RAID 5 installation to RAID 10, I am getting far 
better read and write performance on heavy calculation queries. Writing on the 
RAID 5 really made things crawl. For lots of writing, I think RAID 10 is the 
best. It should also be noted that I changed my filesystem from ext3 to XFS - 
this is something you can look into as well. 

Ogden



Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread J Sisson
On Sun, Sep 11, 2011 at 1:36 PM, Ogden li...@darkstatic.com wrote:

 As someone who migrated a RAID 5 installation to RAID 10, I am getting far
 better read and write performance on heavy calculation queries. Writing on
 the RAID 5 really made things crawl. For lots of writing, I think RAID 10 is
 the best. It should also be noted that I changed my filesystem from ext3 to
 XFS - this is something you can look into as well.

 Ogden

 RAID 10 on XFS here, too, both in OLTP and Data-warehousing scenarios.  Our
largest OLTP is ~375 GB, and PostgreSQL performs admirably (we converted
from MSSQL to PostgreSQL, and we've had more issues with network bottlenecks
since converting (where MSSQL was always the bottleneck before)).  Now that
we have fiber interconnects between our two main datacenters, I'm actually
having to work again haha.

But yeah, we tried quite a few file systems, and XFS **for our workloads**
performed better than everything else we tested, and RAID 10 is a given if
you do any significant writing.


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread J Sisson
Sorry, meant to send this to the list.

For really big data-warehousing, this document really helped us:

http://pgexperts.com/document.html?id=49


[PERFORM] PostgreSQL performance tweaking on new hardware

2011-09-11 Thread Ogden
I want to thank the members of this list who helped me benchmark and conclude
that RAID 10 on an XFS filesystem was the way to go over what we had prior.
PostgreSQL we have been using with Perl for the last 8 years and it has been 
nothing but outstanding for us. Things have definitely worked out much better 
and the writes are much much faster. 

Since I want the maximum performance from our new servers, I want to make sure 
the configuration is what is recommended. Things are running fine and queries 
that would take seconds prior now only take one or two. I have read a lot of 
guides on tweaking PostgreSQL as well as a book, however, I would like someone 
to just review the settings I have and let me know if it's too crazy. It's for 
a considerably heavy write database with a lot of calculation queries 
(percentages, averages, sums, etc). 

This is my setup:

2 x Intel E5645 (12 Core CPU total)
64 GB Ram
RAID 10 (/var/lib/pgsql lives on its own RAID controller) on XFS
PostgreSQL 9.0.4 on Debian Squeeze 
Database size about 200Gb. 

And in postgresql.conf:

max_connections = 200   
shared_buffers = 8GB
temp_buffers = 128MB 
work_mem = 40MB
maintenance_work_mem = 1GB

wal_buffers = 16MB 

effective_cache_size = 48GB

seq_page_cost = 1.0
random_page_cost = 1.1
cpu_tuple_cost = 0.1
cpu_index_tuple_cost = 0.05
cpu_operator_cost = 0.01
default_statistics_target = 1000

With these settings, output from free -m (Megabytes):

             total       used       free     shared    buffers     cached
Mem:         64550      56605       7945          0          0      55907
-/+ buffers/cache:        697      63852
Swap:         7628          6       7622

top shows:
Swap:  7812088k total, 6788k used,  7805300k free, 57343264k cached


Any suggestions would be awesome. 

Thank you

Ogden
-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Scott Marlowe
2011/9/11 pasman pasmański pasma...@gmail.com:
 For 10 TB table and 3hours, disks should have a transfer about 1GB/s 
 (seqscan).

Random data point.  Our primary servers were built for OLTP with 48
cores and 32 15kSAS drives.  We started out on Arecas but the
Supermicro 1Us we were using didn't provide enough cooling and the
Arecas were burning out after 2 to 4 months, so on those machines, we
pulled the Arecas and replaced them with simple LSI SAS non-RAID
cards.  Both were RAID-10, the latter with linux software RAID.

With the Arecas the OLTP performance is outstanding, garnering us
~8500tps at 40 to 50 threads.  However, sequential performance was
just so-so, at around read / write speeds of 500/350MB/s.  The SW
RAID-10 can read AND write at right around 1GB/s.  What it lacks in
transactional throughput it more than makes up for in sequential read
/ write performance.

Another data point.  We had a big Oracle installation at my last job,
and OLAP queries were killing it midday, so I built a simple
replication system to grab rows from the big iron Oracle SUN box and
shove into a single core P IV 2.xGHz machine with 4 120G SATA drives
in SW RAID-10.

That machine handily beat the big iron Oracle machine at OLAP queries,
running in 20 minutes what was taking well over an hour for the big
Oracle machine to do, even during its (Oracle machine) off peak load
times.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL performance tweaking on new hardware

2011-09-11 Thread Scott Marlowe
On Sun, Sep 11, 2011 at 1:50 PM, Ogden li...@darkstatic.com wrote:
 I want to thank members on this list which helped me benchmark and conclude 
 that RAID 10 on a XFS filesystem was the way to go over what we had prior. 
 PostgreSQL we have been using with Perl for the last 8 years and it has been 
 nothing but outstanding for us. Things have definitely worked out much better 
 and the writes are much much faster.

 Since I want the maximum performance from our new servers, I want to make 
 sure the configuration is what is recommended. Things are running fine and 
 queries that would take seconds prior now only take one or two. I have read a 
 lot of guides on tweaking PostgreSQL as well as a book, however, I would like 
 someone to just review the settings I have and let me know if it's too crazy. 
 It's for a considerably heavy write database with a lot of calculation 
 queries (percentages, averages, sums, etc).

 This is my setup:

 2 x Intel E5645 (12 Core CPU total)
 64 GB Ram
 RAID 10 (/var/lib/pgsql lives on it's own RAID controller) on XFS
 PostgreSQL 9.0.4 on Debian Squeeze
 Database size about 200Gb.

 And in postgresql.conf:

 max_connections = 200
 shared_buffers = 8GB
 temp_buffers = 128MB
 work_mem = 40MB
 maintenance_work_mem = 1GB

 wal_buffers = 16MB

 effective_cache_size = 48GB

 seq_page_cost = 1.0
 random_page_cost = 1.1
 cpu_tuple_cost = 0.1
 cpu_index_tuple_cost = 0.05
 cpu_operator_cost = 0.01
 default_statistics_target = 1000

 With these settings, output from free -m (Megabytes):

             total       used       free     shared    buffers     cached
 Mem:         64550      56605       7945          0          0      55907
 -/+ buffers/cache:        697      63852
 Swap:         7628          6       7622

 top shows:
 Swap:  7812088k total,     6788k used,  7805300k free, 57343264k cached


 Any suggestions would be awesome.

Well, what's your workload like?  If you'd like to smooth out lots of
heavy writing, then look at cranking up checkpoint_segments, increase
checkpoint timeout to 2h, and play with checkpoint completion target.
If you write a lot of the same rows over and over, then keep it down
in the 0.5 range.  If you tend to write all unique rows, then closer
to 1.0 is better.  We run at 0.8.  As you increase checkpoint
completion target, you'll increase the amount of writes that have to
happen twice to the storage array, so unless you're 100% sure you
don't write to the same blocks / tuples a lot, keep it below 1.0.

Also if you're NOT using a battery backed caching RAID controller look
into upgrading to one.
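
One way to see whether checkpoints are actually hurting before you change
anything is the bgwriter stats:

-- checkpoints_req counts checkpoints forced by filling checkpoint_segments;
-- a lot of those (versus checkpoints_timed) usually means
-- checkpoint_segments is too low for the write load
select checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_backend
  from pg_stat_bgwriter;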

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Databases optimization

2011-09-11 Thread Maciek Sakrejda
I doubt you'll get much useful feedback because your question is too
broad for a mailing list answer. If you're looking for basic
performance guidelines, I recommend Greg Smith's PostgreSQL 9.0 High
Performance [1] (disclaimer: I used to work with Greg and got a free
copy), although I don't think he spent much time on running
virtualized (which certainly could affect things). Then if you have
*specific* hardware or query questions, this list is a great resource.

[1]: http://www.2ndquadrant.com/books/postgresql-9-0-high-performance/
---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] RAID Controller (HP P400) beat by SW-RAID?

2011-09-11 Thread Anthony Presley
We've currently got PG 8.4.4 running on a whitebox hardware set up, with (2)
5410 Xeon's, and 16GB of RAM.  It's also got (4) 7200RPM SATA drives, using
the onboard IDE controller and ext3.

A few weeks back, we purchased two refurb'd HP DL360's G5's, and were hoping
to set them up with PG 9.0.2, running replicated.  These machines have (2)
5410 Xeon's, 36GB of RAM, (6) 10k SAS drives, and are using the HP SA P400i
with 512MB of BBWC.  PG is running on an ext4 (noatime) partition, and the
drives are configured as RAID 1+0 (seems with this controller, I cannot do
JBOD).  I've spent a few hours going back and forth benchmarking the new
systems, and have set up the DWC, and the accelerator cache using hpacucli.
 I've tried accelerator caches of 25/75, 50/50, and 75/25.

To start with, I've set the relevant parameters in postgresql.conf the
same on the new config as the old:

  max_connections = 150
  shared_buffers = 6400MB (have tried as high as 20GB)
  work_mem = 20MB (have tried as high as 100MB)
  effective_io_concurrency = 6
  fsync = on
  synchronous_commit = off
  wal_buffers = 16MB
  checkpoint_segments = 30  (have tried 200 when I was loading the db)
  random_page_cost = 2.5
  effective_cache_size = 10240MB  (have tried as high as 16GB)

First thing I noticed is that it takes the same amount of time to load the
db (about 40 minutes) on the new hardware as the old hardware.  I was really
hoping with the faster, additional drives and a hardware RAID controller,
that this would be faster.  The database is only about 9GB with pg_dump
(about 28GB with indexes).

Using pgfouine I've identified about 10 problematic SELECT queries that
take anywhere from .1 seconds to 30 seconds on the old hardware.  Running
these same queries on the new hardware is giving me results in the .2 to 66
second range.  I.e., it's twice as slow.

I've tried increasing the shared_buffers, and some other parameters
(work_mem), but haven't yet seen the new hardware perform even at the same
speed as the old hardware.

I was really hoping that with hardware RAID that something would be faster
(loading times, queries, etc...).  What am I doing wrong?

About the only thing left that I know to try is to drop the RAID1+0 and go
to RAID0 in hardware, and do RAID1 in software.  Any other thoughts?

Thanks!


--
Anthony


Re: [PERFORM] Databases optimization

2011-09-11 Thread Hany ABOU-GHOURY
Thanks Maciek.

I really do not know where to start or how to explain my question.
I am a newbie to Postgres. I will try to get more information from the
development team and SAs.

Cheers
Hany


On Mon, Sep 12, 2011 at 10:22 AM, Maciek Sakrejda msakre...@truviso.com wrote:

 I doubt you'll get much useful feedback because your question is too
 broad for a mailing list answer. If you're looking for basic
 performance guidelines, I recommend Greg Smith's PostgreSQL 9.0 High
 Performance [1] (disclaimer: I used to work with Greg and got a free
 copy), although I don't think he spent much time on running
 virtualized (which certainly could affect things). Then if you have
 *specific* hardware or query questions, this list is a great resource.

 [1]: http://www.2ndquadrant.com/books/postgresql-9-0-high-performance/
 ---
 Maciek Sakrejda | System Architect | Truviso

 1065 E. Hillsdale Blvd., Suite 215
 Foster City, CA 94404
 (650) 242-3500 Main
 www.truviso.com



Re: [PERFORM] RAID Controller (HP P400) beat by SW-RAID?

2011-09-11 Thread Stephen Frost
* Anthony Presley (anth...@resolution.com) wrote:
 I was really hoping that with hardware RAID that something would be faster
 (loading times, queries, etc...).  What am I doing wrong?

ext3 and ext4 do NOT perform identically out of the box..  You might be
running into the write barriers problem here with ext4 forcing the RAID
controllers to push commits all the way to the hard drive before
returning (thus making the BBWC next to useless).

You might try w/ ext3 on the new system instead.

Also, the p800's are definitely better than the p400's, but I don't know
that it's the controller that's really the issue here..

Thanks,

Stephen




Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Stephen Frost
* Igor Chudov (ichu...@gmail.com) wrote:
 Right now I have a personal (one user) project to create a 5-10
 Terabyte data warehouse. The largest table will consume the most space
 and will take, perhaps, 200,000,000 rows.

I run data-warehouse databases on that order (current largest single
instance is ~4TB running under 9.0.4).  If the largest table is only
200M rows, PG should handle that quite well.  Our data is partitioned by
month and each month is about 200M records and simple queries can run in
15-20 minutes (with a single thread), with complex windowing queries
(split up and run in parallel) finishing in a couple of hours. 

 However, while an hour is fine, two weeks per query is NOT fine.

What's really, really, really useful are two things: EXPLAIN, and this
mailing list. :)  Seriously, run EXPLAIN on your queries before you run
them and see if how the query is going to be executed makes sense.
Here's a real easy hint: if it says External Sort and has big numbers,
come talk to us here- that's about one of the worst things you can
possibly do.  Of course, PG's going to avoid doing that, but you may
have written a query (unintentionally) which forces PG to do a sort, or
something else.
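
As a trivial example (the table names here are made up), checking the plan
first is just:

-- plan only; nothing is executed
explain
select d.region, sum(f.amount)
  from big_fact f
  join small_dim d on d.id = f.dim_id
 group by d.region;

-- EXPLAIN ANALYZE additionally runs the query and reports actual times;
-- in that output, "Sort Method: external merge  Disk: ..." is the
-- on-disk sort referred to above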

 I have a server with about 18 TB of storage and 48 GB of RAM, and 12
 CPU cores.

If you partition up your data and don't mind things running in different
transactions, you can definitely get a speed boost with PG by running
things in parallel.  PG will handle that very well, in fact, if two
queries are running against the same table, PG will actually combine
them and only actually read the data from disk once.

 I cannot shell out $47,000 per CPU for Oracle for this project.

The above data warehouse was migrated from an Oracle-based system. :)

 To be more specific, the batch queries that I would do, I hope,
 would either use small JOINS of a small dataset to a large dataset, or
 just SELECTS from one big table.

Make sure that you set your 'work_mem' correctly- PG will use that to
figure out if it can hash the small table (you want that to happen,
trust me..).  If you do end up having sorts, it'll also use the work_mem
value to figure out how much memory to use for sorting.

 So... Can Postgres support a 5-10 TB database with the use pattern
 stated above?

Yes, certainly.

Thanks,

Stephen




Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Igor Chudov
On Sun, Sep 11, 2011 at 6:01 PM, Stephen Frost sfr...@snowman.net wrote:

 * Igor Chudov (ichu...@gmail.com) wrote:
  Right now I have a personal (one user) project to create a 5-10
  Terabyte data warehouse. The largest table will consume the most space
  and will take, perhaps, 200,000,000 rows.

 I run data-warehouse databases on that order (current largest single
 instance is ~4TB running under 9.0.4).  If the largest table is only
 200M rows, PG should handle that quite well.  Our data is partitioned by
 month and each month is about 200M records and simple queries can run in
 15-20 minutes (with a single thread), with complex windowing queries
 (split up and run in parallel) finishing in a couple of hours.



Which brings up a question.

Can I partition data by month (or quarter), without that month being part of
PRIMARY KEY?

If this question sounds weird, I am asking because MySQL enforces this,
which does not fit my data.

If I can keep my primary key to be the ID that I want (which comes with
data), but still partition it by month, I will be EXTREMELY happy.

 However, while an hour is fine, two weeks per query is NOT fine.

 What's really, really, really useful are two things: EXPLAIN, and this
 mailing list. :)  Seriously, run EXPLAIN on your queries before you run
 them and see if how the query is going to be executed makes sense.
 Here's a real easy hint: if it says External Sort and has big numbers,
 come talk to us here- that's about one of the worst things you can
 possibly do.  Of course, PG's going to avoid doing that, but you may
 have written a query (unintentionally) which forces PG to do a sort, or
 something else.


Very good, thanks


  I have a server with about 18 TB of storage and 48 GB of RAM, and 12
  CPU cores.

 If you partition up your data and don't mind things running in different
 transactions, you can definitely get a speed boost with PG by running
 things in parallel.  PG will handle that very well, in fact, if two
 queries are running against the same table, PG will actually combine
 them and only actually read the data from disk once.

  I cannot shell out $47,000 per CPU for Oracle for this project.

 The above data warehouse was migrated from an Oracle-based system. :)


I am wondering, why?


  To be more specific, the batch queries that I would do, I hope,
  would either use small JOINS of a small dataset to a large dataset, or
  just SELECTS from one big table.

 Make sure that you set your 'work_mem' correctly- PG will use that to
 figure out if it can hash the small table (you want that to happen,
 trust me..).  If you do end up having sorts, it'll also use the work_mem
 value to figure out how much memory to use for sorting.


I could, say, set work_mem to 30 GB? (64 bit linux)


   So... Can Postgres support a 5-10 TB database with the use pattern
  stated above?

 Yes, certainly.


that's great to know.

i


Thanks,

Stephen





Re: [PERFORM] RAID Controller (HP P400) beat by SW-RAID?

2011-09-11 Thread Tomas Vondra
Dne 12.9.2011 00:44, Anthony Presley napsal(a):
 We've currently got PG 8.4.4 running on a whitebox hardware set up,
 with (2) 5410 Xeon's, and 16GB of RAM.  It's also got (4) 7200RPM
 SATA drives, using the onboard IDE controller and ext3.
 
 A few weeks back, we purchased two refurb'd HP DL360's G5's, and
 were hoping to set them up with PG 9.0.2, running replicated.  These
 machines have (2) 5410 Xeon's, 36GB of RAM, (6) 10k SAS drives, and
 are using the HP SA P400i with 512MB of BBWC.  PG is running on an
 ext4 (noatime) partition, and they drives configured as RAID 1+0
 (seems with this controller, I cannot do JBOD).  I've spent a few
 hours going back and forth benchmarking the new systems, and have set
 up the DWC, and the accelerator cache using hpacucli.  I've tried
 accelerator caches of 25/75, 50/50, and 75/25.

What is an 'accelerator cache'? Is that the cache on the controller?
Then give 100% to the write cache - the read cache does not need to be
protected by the battery, the page cache at the OS level can do the same
service.

Provide more details about the ext3/ext4 - there are various data modes
(writeback, ordered, journal), various other settings (barriers, stripe
size, ...) that matter.

According to the benchmark I've done a few days back, the performance
difference between ext3 and ext4 is rather small, when comparing equally
configured file systems (i.e. data=journal vs. data=journal) etc.

With read-only workload (e.g. just SELECT statements), the config does
not matter (e.g. journal is just as fast as writeback).

See for example these comparisons

   read-only workload: http://bit.ly/q04Tpg
   read-write workload: http://bit.ly/qKgWgn

The ext4 is usually a bit faster than equally configured ext3, but the
difference should not be 100%.

 To start with, I've set the relevant parameters in postgresql.conf
 the same on the new config as the old:
 
 max_connections = 150
 shared_buffers = 6400MB (have tried as high as 20GB)
 work_mem = 20MB (have tried as high as 100MB)
 effective_io_concurrency = 6
 fsync = on
 synchronous_commit = off
 wal_buffers = 16MB
 checkpoint_segments = 30  (have tried 200 when I was loading the db)
 random_page_cost = 2.5
 effective_cache_size = 10240MB  (have tried as high as 16GB)
 
 First thing I noticed is that it takes the same amount of time to
 load the db (about 40 minutes) on the new hardware as the old
 hardware.  I was really hoping with the faster, additional drives and
 a hardware RAID controller, that this would be faster.  The database
 is only about 9GB with pg_dump (about 28GB with indexes).
 
 Using pgfouine I've identified about 10 problematic SELECT queries 
 that take anywhere from .1 seconds to 30 seconds on the old
 hardware. Running these same queries on the new hardware is giving me
 results in the .2 to 66 seconds.  IE, it's twice as slow.
 
 I've tried increasing the shared_buffers, and some other parameters 
 (work_mem), but haven't yet seen the new hardware perform even at
 the same speed as the old hardware.

In that case some of the assumptions are wrong. For example the new RAID
is slow for some reason. Bad stripe size, slow controller, ...

Do the basic hw benchmarking, i.e. use bonnie++ to benchmark the disk,
etc. Only if this provides expected results (i.e. the new hw performs
better) it makes sense to mess with the database.

Tomas

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Claudio Freire
On Mon, Sep 12, 2011 at 1:16 AM, Igor Chudov ichu...@gmail.com wrote:
 I could, say, set work_mem to 30 GB? (64 bit linux)

I don't think you'd want that. Remember, work_mem is the amount of
memory *per sort*.
Queries can request several times that much memory, once per sort they
need to perform.

You can set it really high, but not 60% of your RAM - that wouldn't be wise.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] RAID Controller (HP P400) beat by SW-RAID?

2011-09-11 Thread Alan Hodgson
On September 11, 2011 03:44:34 PM Anthony Presley wrote:
 First thing I noticed is that it takes the same amount of time to load the
 db (about 40 minutes) on the new hardware as the old hardware.  I was
 really hoping with the faster, additional drives and a hardware RAID
 controller, that this would be faster.  The database is only about 9GB
 with pg_dump (about 28GB with indexes).

Loading the DB is going to be CPU-bound (on a single core), unless your disks
really suck, which they don't. Most of the time will be spent building 
indexes.

I don't know offhand why the queries are slower, though, unless you're not 
getting as much cached before testing as on the older box.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] RAID Controller (HP P400) beat by SW-RAID?

2011-09-11 Thread Scott Marlowe
On Sun, Sep 11, 2011 at 4:44 PM, Anthony Presley anth...@resolution.com wrote:
 We've currently got PG 8.4.4 running on a whitebox hardware set up, with (2)
 5410 Xeon's, and 16GB of RAM.  It's also got (4) 7200RPM SATA drives, using
 the onboard IDE controller and ext3.
 A few weeks back, we purchased two refurb'd HP DL360's G5's, and were hoping
 to set them up with PG 9.0.2, running replicated.  These machines have (2)
 5410 Xeon's, 36GB of RAM, (6) 10k SAS drives, and are using the HP SA P400i
 with 512MB of BBWC.  PG is running on an ext4 (noatime) partition, and they

Two issues here.  One is that the onboard controller and disks on the
old machine might not be obeying fsync properly, giving a speed boost
at the expense of crash safeness.  Two is that the P400 has gotten
pretty horrible performance reviews on this list in the past.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Stephen Frost
* Igor Chudov (ichu...@gmail.com) wrote:
 Can I partition data by month (or quarter), without that month being part of
 PRIMARY KEY?

The way partitioning works in PG is by using CHECK constraints.  Not
sure if you're familiar with those (not sure if MySQL has them), so
here's a quick example:

Create a parent table.  Then create two tables which inherit from that
parent table (this is more of an implementation detail than anything
else, the parent table is always empty, it's just there to be the
single, combined, table that you run your select queries against).  On
each of the two 'child' tables, create a CHECK constraint.  On table1,
you do: 
  alter table table1 add check (date < '2000-01-01');
On table2, you do:
  alter table table2 add check (date >= '2000-01-01');

Once those are done, you can query against the 'parent' table with
something like:
select * from parent where date = '2010-01-01';

And PG will realize it only has to look at table2 to get the results for
that query.  This means the partitioning can be more-or-less any check
constraint that will be satisfied by the data in the table (and PG will
check/enforce this) and that PG can figure out will eliminate a partition
from possibly having the data that matches the request.

Technically, this means that you could have all kinds of different ways
your data is split across the partitions, but remember that all the
constraints have to actually be TRUE. :)  Typically, people do split
based on the PK, but it's not required (realize that PG doesn't support
cross-table PKs, so if you don't have CHECK constraints which make sure
that the tables don't cover the same PK value, you could end up with
duplicate values across the tables...).
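
Strung together, a bare-bones monthly layout (all names purely illustrative)
looks something like:

create table measurements (
  id     bigint  not null,
  mdate  date    not null,
  value  numeric
);

-- the children hold the data; the parent stays empty
create table measurements_2011_08 (
  check ( mdate >= '2011-08-01' and mdate < '2011-09-01' )
) inherits (measurements);

create table measurements_2011_09 (
  check ( mdate >= '2011-09-01' and mdate < '2011-10-01' )
) inherits (measurements);

-- the PK has to be declared on each child; PG won't enforce uniqueness
-- across partitions for you
alter table measurements_2011_08 add primary key (id);
alter table measurements_2011_09 add primary key (id);

-- with constraint_exclusion = partition (the default), this only scans
-- measurements_2011_09
select * from measurements where mdate >= '2011-09-01';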

 If this question sounds weird, I am asking because MySQL enforces this,
 which does not fit my data.

That part is a little strange..

 If I can keep my primary key to be the ID that I want (which comes with
 data), but still partition it by month, I will be EXTREMELY happy.

As I said above, the actual PK is going to be independent and in the
base/child tables.  That said, yes, you could have the PK in each table
be whatever you want and you use month to partition the 'main' table.
You then have to come up with some other way to make sure your PK is
enforced, however, or figure out a way to deal with things if it's not.
Based on what you've been describing, I'm afraid you'd have to actually
search all the partitions for a given ID on an update, to figure out if
you're doing an UPDATE or an INSERT...  Unless, of course, the month is
included in the PK somewhere, or is in the incoming data and you can be
100% confident that the incoming data is never wrong.. :)

 I am wondering, why?

Cost, and we had a real hard time (this was a while ago..) getting
Oracle to run decently on Linux, and the Sun gear was just too damn
expensive.  Also, ease of maintenance- it takes a LOT less effort to
keep a PG database set up and running smoothly than an Oracle one, imv.

 I could, say, set work_mem to 30 GB? (64 bit linux)

You can, but you have to be careful with it, because PG will think it
can use 30GB for EACH sort in a given query, and in EACH hash in a given
query.  What I would recommend is setting the default to something like
256MB and then looking at specific queries and bumping it up for those
queries when it's clear that it'll help the query and won't cause the
system to go into swap.  Note that you can set work_mem for a given
session after you connect to the database, just do:

set work_mem = '1GB';

in your session before running other queries.  Doing that won't impact
other sessions.

Thanks,

Stephen




Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Stephen Frost
* Claudio Freire (klaussfre...@gmail.com) wrote:
 I don't think you'd want that. Remember, work_mem is the amount of
 memory *per sort*.
 Queries can request several times that much memory, once per sort they
 need to perform.
 
 You can set it really high, but not 60% of your RAM - that wouldn't be wise.

Oh, I dunno..  It's only used by the planner, so sometimes you have to
bump it up, especially when PG thinks the number of rows returned from
something will be a lot more than it really will be. :)

/me has certain queries where it's been set to 100GB... ;)

I agree that it shouldn't be the default, however.  That's asking for
trouble.  Do it for the specific queries that need it.

Thanks,

Stephen




Re: [PERFORM] Databases optimization

2011-09-11 Thread J Sisson
On Sun, Sep 11, 2011 at 5:22 PM, Maciek Sakrejda msakre...@truviso.com wrote:

 performance guidelines, I recommend Greg Smith's PostgreSQL 9.0 High
 Performance [1] (disclaimer: I used to work with Greg and got a free
 copy)

I'll second that.  PostgreSQL 9.0 High Performance is an excellent resource
(I recommend it even for non-PostgreSQL admins because it goes so in-depth
on Linux tuning), so whether you get it for free or not, it's worth the time
it takes to read and absorb the info.

I've never run PostgreSQL virtualized, but I can say that if it's anything
like
running SQL Server virtualized, it's not a terribly good idea.


Re: [PERFORM] Postgres for a data warehouse, 5-10 TB

2011-09-11 Thread Ondrej Ivanič
Hi,

On 12 September 2011 12:28, Stephen Frost sfr...@snowman.net wrote:
 Once those are done, you can query against the 'parent' table with
 something like:
 select * from parent where date = '2010-01-01';

 And PG will realize it only has to look at table2 to get the results for
 that query.  This means the partitioning can be more-or-less any check
 constraint that will be satisfied by the data in the table (and PG will
 check/enforce this) and that PG can figure out will eliminate a partition
 from possibly having the data that matches the request.

Theory is nice but there are few gotchas (in 8.4) :

- planner can use constant expressions only. You will get scans across
all partitions when you use a function (like now(), an immutable function
with constant arguments), a sub query (like part_col = (select x from
...) .. ), or anything else which can't be evaluated to a constant during
query planning (see the small sketch below).

- partition constraints are not pushed down to joins (assuming tables
partitioned by primary key):
select ... from X left join Y on X.primary_key = Y.primary_key
where part_col = ... and X.primary_key >= ... and X.primary_key < ...
must be rewritten like
select ... from X
left join Y on X.primary_key = Y.primary_key
           and Y.primary_key >= ... and Y.primary_key < ...
where X.primary_key >= ... and X.primary_key < ...
in order to avoid scanning the entire Y table (instead of only its relevant
partitions)

- ORDER BY / LIMIT X issue fixed in 9.1 (Allow inheritance table scans
to return meaningfully-sorted results)

Moreover, all queries should have a WHERE clause on the column which is used
for partitioning, otherwise partitioning is not very useful (though it can
still simplify data management -- drop partition vs. delete from X where
part_col between A and B).

-- 
Ondrej Ivanic
(ondrej.iva...@gmail.com)

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] RAID Controller (HP P400) beat by SW-RAID?

2011-09-11 Thread mark


From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Anthony Presley
Sent: Sunday, September 11, 2011 4:45 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] RAID Controller (HP P400) beat by SW-RAID?

We've currently got PG 8.4.4 running on a whitebox hardware set up, with
(2) 5410 Xeon's, and 16GB of RAM.  It's also got (4) 7200RPM SATA drives,
using the onboard IDE controller and ext3. 

A few weeks back, we purchased two refurb'd HP DL360's G5's, and were
hoping to set them up with PG 9.0.2, running replicated.  These machines
have (2) 5410 Xeon's, 36GB of RAM, (6) 10k SAS drives, and are using the HP
SA P400i with 512MB of BBWC.  PG is running on an ext4 (noatime) partition,
and they drives configured as RAID 1+0 (seems with this controller, I cannot
do JBOD).  I've spent a few hours going back and forth benchmarking the new
systems, and have set up the DWC, and the accelerator cache using hpacucli.
 I've tried accelerator caches of 25/75, 50/50, and 75/25.



I would start off by recommending a more current version of 9.0, like 9.0.4,
since you are building a new box. The rumor mill says 9.0.5 and 9.1.0 might
be out soon (days?), but that is just the rumor mill. Don't bank on it.


What kernel are you on ?

Long time HP user here, for better and worse... so here are a few other
little things I recommend. 

Check the bios power management. Make sure it is set where you want it.
(IIRC the G5s have this, I know G6s and G7s do). This can help with nasty
latency problems if the box has been idle for a while then needs to start
doing work.

The p400i is not a great card, compared to more modern one, but you should
be able to beat the old setup with what you have. Faster clocked cpu's more
spindles, faster RPM spindles. 

Assuming the battery is working, with XFS or ext4 you can use nobarrier
mount option and you should see some improvement. 


Make sure the raid card's firmware is current. I can't stress this enough.
HP fixed a nasty bug with Raid 1+0 a few months ago where you could eat your
data... They also seem to be fixing a lot of other bugs along the way as
well. So do yourself a big favor and make sure that firmware is current. It
might just head off headache down the road.

Also make sure you have an 8.10.? (IIRC the version number right) or better
version of hpacucli... there have been some fixes to that utility as well.
IIRC most of the fixes in this have been around recognizing newer cards
(812s and 410s) but some interface bugs have been fixed as well.  You may
need new packages for HP health. (I don't recall the official name, but new
versions of hpacucli might not play well with old versions of hp health.)

It's HP, so they have a new version about every month for firmware and their
cli utility... that's HP for us.

Anyways that is my fast input.

Best of luck,


-Mark


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance