Re: [PERFORM] pgsql_tmp (Temporary tablespace)

2012-11-28 Thread suhas.basavaraj12
Hi,


Can I delete the contents of this folder? I have observed a couple of times
that this folder got cleaned automatically.
Which backend process deletes the data from this folder? Any idea?

Rgrds
Suhas










Re: [PERFORM] pgsql_tmp (Temporary tablespace)

2012-11-28 Thread Albe Laurenz
suhas.basavaraj12 wrote:
 Can I delete the contents of this folder? I have observed a couple of
 times that this folder got cleaned automatically.

These files are in use and you should not delete them.
If you need them to go right now, cancel the queries that
create temporary files.

If any files are left over after PostgreSQL is shut down
(I don't know if that can happen), it should be safe to
delete them.

 Which backend process deletes the data from this folder? Any idea?

I'm not sure, but probably the one that created them.
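
If you just want to keep an eye on temporary file usage rather than poke
around in the directory, the statistics collector can tell you. A sketch,
assuming 9.2 or later (where pg_stat_database grew temp_files/temp_bytes
counters):

  -- cumulative temporary file activity per database since the last stats reset
  SELECT datname, temp_files, temp_bytes
    FROM pg_stat_database
   ORDER BY temp_bytes DESC;

Setting log_temp_files = 0 in postgresql.conf additionally logs every
temporary file as it is released, together with the statement that caused it.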

Yours,
Laurenz Albe




[PERFORM] Database design - best practice

2012-11-28 Thread Niels Kristian Schjødt
Hi,

I'm on the hunt for some solid knowledge on a theoretical level about the 
performance of postgresql. My question is regarding best practices, and how 
architectural decisions might influence the performance. First a little 
background:

The setup:
I have a database which holds information on used cars. The database has 
mainly 3 tables of interest for this case:
A cars table, an adverts table and a sellers table. One car has many adverts 
and one seller has many adverts. One advert belongs to one car and one seller.
The database is powering a website for searching used cars. When searching for 
used cars, the cars table is mainly used, and a lot of the columns should be 
directly available for searching, e.g. color, mileage, price, 
has_automatic_transmission etc.

So my main concern is actually about the cars table, since this one currently 
has a lot of columns (151 - I expect that's quite a lot?), and a lot of data (4 
mil. rows, and growing). Now you might start by thinking this could sound like 
a regular need for some normalization, but wait a second and let me explain :-)
The columns in this table are for the most part very short strings, integers, 
decimals or booleans. Take for example has_automatic_transmission (boolean): I 
can't see why it would make sense to put that into a separate table and join in 
the values. Or the mileage or the price, as other examples. The cars table used 
for search is indexed quite a lot.

The questions:
Having the above setup in mind, what impact on performance, in terms of read 
performance and write performance, does it have, whether I do the following:
1) In general, would the read and/or the write on the database be 
faster if I serialized some of the non-searched columns in the table into a 
single text column instead of, let's say, 20 booleans?
2) Let's say I'm updating a timestamp in a single one of the 151 columns 
in the cars table. The update statement is using the id to find the car. Would 
the write performance of that UPDATE be affected if the table had fewer 
columns?
3) When adding a new column to the table, I know that it becomes slower 
the more rows are in the table, but what about the width of the table? Does 
that affect the performance when adding new columns?
4) In general what performance downsides do you get when adding a lot 
of columns to one table instead of having them in separate tables?
5) Is it significantly faster to select * from a table with 20 columns, 
than selecting the same 20 in a table with 150 columns?

Hope there are some good answers out there :-)



[PERFORM] Optimize update query

2012-11-28 Thread Niels Kristian Schjødt
Hi, I have these update queries that run very often and take too long for us 
to reach the throughput we are aiming at. However, the update 
query is very simple, and I can't figure out any way to improve the situation. 
The query looks like this:

UPDATE adverts SET last_observed_at = '2012-11-28 00:02:30.265154', 
data_source_id = '83d024a57bc2958940f3ca281bddcbf4' WHERE adverts.id IN ( 
1602382, 4916432, 3221246, 4741057, 3853335, 571429, 3222740, 571736, 3544903, 
325378,5774338, 5921451, 4295768, 3223170, 5687001, 4741966, 325519, 580867, 
325721, 4412200, 4139598, 325567, 1616653,1616664, 6202007, 3223748, 325613, 
3223764, 325615, 4296536, 3854595, 4971428, 3224146, 5150522, 4412617, 
5073048,325747, 325771, 1622154, 5794384, 5736581, 1623767, 5686945, 3224627, 
5073009, 3224747, 3224749, 325809, 5687051,3224811, 5687052, 4917824, 5073013, 
3224816, 3224834, 4297331, 1623907, 325864, 1623947, 6169706, 325869, 
325877,3225074, 3225112, 325893, 325912, 3225151, 3225184, 3225175, 1624659, 
325901, 4033926, 325904, 325911, 4412835,1624737, 5073004, 5921434, 325915, 
3225285, 3225452, 4917672, 1624984, 3225472, 325940, 5380611, 325957, 
5073258,3225500, 1625002, 5923489, 4413009, 325952, 3961122, 363 ) ;

An explain outputs me the following:

Update on adverts  (cost=0.12..734.27 rows=95 width=168)
  ->  Index Scan using adverts_pkey on adverts  (cost=0.12..734.27 rows=95 
width=168)
Index Cond: (id = ANY 
('{1602382,4916432,3221246,4741057,3853335,571429,3222740,571736,3544903,325378,5774338,5921451,4295768,3223170,5687001,4741966,325519,580867,325721,4412200,4139598,325567,1616653,1616664,6202007,3223748,325613,3223764,325615,4296536,3854595,4971428,3224146,5150522,4412617,5073048,325747,325771,1622154,5794384,5736581,1623767,5686945,3224627,5073009,3224747,3224749,325809,5687051,3224811,5687052,4917824,5073013,3224816,3224834,4297331,1623907,325864,1623947,6169706,325869,325877,3225074,3225112,325893,325912,3225151,3225184,3225175,1624659,325901,4033926,325904,325911,4412835,1624737,5073004,5921434,325915,3225285,3225452,4917672,1624984,3225472,325940,5380611,325957,5073258,3225500,1625002,5923489,4413009,325952,3961122,363}'::integer[]))

So as you can see, it's already pretty optimized, it's just not enough :-) So 
what can I do? The two columns last_observed_at and data_source_id each have an 
index, and those indexes are needed elsewhere, so I can't delete them.

PS. I'm on postgres 9.2 on a server with 32gb ram, 8 cores and two 3T disks in 
a software raid 1 setup.

Is the only way out of this really a SSD disk?



Re: [PERFORM] Database design - best practice

2012-11-28 Thread Willem Leenen

Niels,

  I can't see why it would make sense to put that into a separate table and 
join in the values  
You don't normalize for performance. People DEnormalize for performance.


Questions: (AFAIK)

1) This is the way to disaster. Get yourself a book on RDBMS design, for example 
by Celko. Do NOT go against the RDBMS rules, as here with rule #1: atomic 
values in a column. 
 
2) This is not the big fish you are after. First benchmark your setup 
and compare the results with your desired performance level. First quantify 
your problem, if there is any, before using tricks.

3) A row will need more memory when it is wider, this may be amplified during 
hash joins. 

4)  People DEnormalize for performance. 

5)  Is it significantly faster to select * from a table with 20 columns, than 
selecting the same 20 in a table with 150 columns? 

I know the answer, but I encourage you to simply test this. I have seen lots 
of urban legends about performance (including dropping referential 
integrity because that would supposedly make a difference). 
Of course, when it's a full table scan and it is ALL disk reads (or ALL 
memory reads) you can simply calculate it too. But just get into the habit of 
testing for learning.


My advice:
- know what performance you need.
- test whether you have it, varying table content and system load.
- do not tamper with the RDBMS rules, this will haunt you.
- if you have the latest postgres version, you can use covering indexes: tables 
aren't accessed at all, bypassing most of your questions (see the sketch 
below). Check with peers if you've got the indexes right.
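
A minimal sketch of the covering-index idea, assuming PostgreSQL 9.2 (where
index-only scans were introduced) and guessing at column names:

  -- an index that contains every column the query needs
  CREATE INDEX cars_brand_price_idx ON cars (brand, price);

  -- 9.2 can answer this from the index alone; look for
  -- "Index Only Scan using cars_brand_price_idx" in the plan
  EXPLAIN SELECT brand, price FROM cars WHERE brand = 'Volvo';

Note that the heap is only skipped for pages the visibility map marks as
all-visible, so a recently vacuumed table benefits the most.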

Regards,
Willem




Re: [PERFORM] Database design - best practice

2012-11-28 Thread Niels Kristian Schjødt
Thanks for the advice.

Currently I see a lot of I/O related to update/inserts, so I'm trying to track 
down these guys at first. In relation to question 2, I read somewhere in the 
documentation that because of MVCC, the whole row has to be rewritten even 
though I just update one single column in that row. Hence if the table is wider 
(has more columns), the update will be slower. Does this match your 
understanding?


Re: [PERFORM] Database design - best practice

2012-11-28 Thread Willem Leenen

 In relation to question 2, I read somewhere in the documentation that 
because of MVCC, the whole row has to be rewritten even though I just 
update one single column in that row. Hence if the table is wider (has 
more columns), the update will be slower. Does this match your 
understanding?

No. I don't count the number of rows, but the number of blocks (pages) that are 
modified, which are 8K each. 

My advice would be to first establish a solution direction by diagnosing the 
problem. My experience is that most solutions are not obscure at all.




Re: [PERFORM] Database design - best practice

2012-11-28 Thread Bèrto ëd Sèra
Hi Kristian,

  I can't see why it would make sense to put that into a separate table and
 join in the values 
 You don't normalize for performance. People DEnormalize for performance.

Yes. In short, you seem more of a developer than an RDBMS guy. This is
not a personal fault, but it's a *very* dangerous state to be in and
you should address the problem asap. Erase from your head all you
could possibly know in terms of putting it into a file and read very
basic texts about normal forms. Like this:

http://en.wikipedia.org/wiki/Database_normalization

As already said by Willem, learn to test your stuff. There is a
\timing command in psql, use it.

For example (addressing your other post), you want to check how long it takes to
UPDATE adverts
SET
  last_observed_at = '2012-11-28 00:02:30.265154',
  data_source_id ='83d024a57bc2958940f3ca281bddcbf4'
WHERE
  adverts.id IN ( 1602382, 4916432, .. 363 ) ;

as opposed to

UPDATE adverts
SET
  last_observed_at = '2012-11-28 00:02:30.265154',
  data_source_id ='83d024a57bc2958940f3ca281bddcbf4'
WHERE
  adverts.id = 1602382 OR
  adverts.id = 4916432 OR
   ..
  adverts.id = 363;

My 5 pence
Bèrto
-- 
==
If Pac-Man had affected us as kids, we'd all be running around in a
darkened room munching pills and listening to repetitive music.




Re: [PERFORM] Optimize update query

2012-11-28 Thread Shaun Thomas

On 11/28/2012 06:57 AM, Niels Kristian Schjødt wrote:

Before I go crazy here... you really need to tell us what "not enough"
means. You didn't provide an EXPLAIN ANALYZE, so we don't know what your
actual performance is. But I have my suspicions.
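
For the record, something like this (a sketch built from the statement in
your mail, with the id list shortened) is what we'd want to see, since it
shows actual timings and buffer traffic instead of estimates:

  -- ANALYZE really executes the update, so wrap it if you want to discard it
  BEGIN;
  EXPLAIN (ANALYZE, BUFFERS)
  UPDATE adverts
     SET last_observed_at = '2012-11-28 00:02:30.265154',
         data_source_id = '83d024a57bc2958940f3ca281bddcbf4'
   WHERE adverts.id IN (1602382, 4916432, 3221246);
  ROLLBACK;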


So as you can see, it's already pretty optimized, it's just not
enough :-) So what can I do? the two columns last_observed_at and
data_source_id has an index, and it is needed elsewhere, so I can't
delete those.


Ok, so part of your problem is that you're tying an advertising system
directly to the database for direct updates. That's a big no-no. Any
time you got a huge influx of views, there would be a logjam. You need
to decouple this so you can use a second tool to load the database in
larger batches. You'll get much higher throughput this way.
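
One common shape for that, just as a sketch (the staging table and the file
path are made up, the target columns are from your query): have the crawler
write its observations to a file or queue, bulk-load them, and apply them in
a single statement:

  CREATE TEMP TABLE observed_adverts (
      advert_id      integer,
      observed_at    timestamp,
      data_source_id varchar
  );

  -- bulk load; use \copy from psql if the file lives on the client
  COPY observed_adverts FROM '/tmp/observed_adverts.csv' WITH (FORMAT csv);

  UPDATE adverts a
     SET last_observed_at = o.observed_at,
         data_source_id   = o.data_source_id
    FROM observed_adverts o
   WHERE a.id = o.advert_id;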

If you absolutely must use this approach, you're going to have to beef
up your hardware.


PS. I'm on postgres 9.2 on a server with 32gb ram, 8 cores and two 3T
disks in a software raid 1 setup.


This is not sufficient for a high-bandwidth stream of updates. Not even
close. Even if those 3T disks are 7200 RPM, and even in RAID-1, you're
going to have major problems with concurrent reads and writes. You need
to do several things:

1. Move your transaction logs (pg_xlog) to another pair of disks
entirely. Do not put these on the same disks as your data if you need
high write throughput.
2. Get a better disk architecture. You need 10k, or 15k RPM disks.
Starting with 6 or more of them in a RAID-10 would be a good beginning.

You never told us your postgresql.conf settings, so I'm just going with
very generic advice. Essentially, you're expecting too much for too
little. That machine would have been low-spec three years ago, and
unsuited to database use simply due to the 2-disk RAID.


Is the only way out of this really a SSD disk?


No. There are many, many steps you can and should take before going this
route. You need to know the problem you're solving before making
potentially expensive hardware decisions.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [PERFORM] Optimize update query

2012-11-28 Thread Marcin Mirosław
On 28.11.2012 15:07, Shaun Thomas wrote:
 On 11/28/2012 06:57 AM, Niels Kristian Schjødt wrote:
 
 Before I go crazy, here... you really need to tell us what not enough
 means. You didn't provide an explain analyze, so we don't know what your
 actual performance is. But I have my suspicions.
 
 So as you can see, it's already pretty optimized, it's just not
 enough :-) So what can I do? the two columns last_observed_at and
 data_source_id has an index, and it is needed elsewhere, so I can't
 delete those.
 
 Ok, so part of your problem is that you're tying an advertising system
 directly to the database for direct updates. That's a big no-no. Any
 time you got a huge influx of views, there would be a logjam. You need
 to decouple this so you can use a second tool to load the database in
 larger batches. You'll get much higher throughput this way.

+1, SQL databases have a limited number of inserts/updates per second. Even
with high-end hardware you won't have more than XXX operations per
second. As Thomas said, you should feed something like a NoSQL database
from the www server and use another tool to do aggregation and batch inserts
into PostgreSQL. It will scale much better.

Marcin




Re: [PERFORM] Database design - best practice

2012-11-28 Thread Kevin Grittner
Niels Kristian Schjødt wrote:

 So my main concern is actually about the cars table, since this
 one currently has a lot of columns (151 - I expect thats quite a
 lot?),

That's pretty wide, but not outrageous.

 and a lot of data (4 mil. rows, and growing).

That's not a big deal. It's not unusual to have hundreds of
millions of rows in a PostgreSQL table. Properly indexed, that
should perform fine on queries. Sometimes partitioning rows into
sub-tables helps, but you didn't really mention anything which
suggests that would be helpful for you.

 Now you might start by thinking, this could sound like a regular
 need for some normalization

On the contrary, what you describe sounds well normalized. Breaking
off attributes of a car into separate tables would not advance
that.

 The columns in this table are for the most part very short strings,
 integers, decimals or booleans. So take for example
 has_automatic_transmission (boolean) I can't see why it would
 make sense to put that into a separate table and join in the
 values. Or the mileage or the price as another example. The cars
 table used for search is indexed quite a lot.

On the face of it, it sounds like you should have some one-column
indexes on the columns most useful for selection (based on
frequency of use and how selective a selection on the column tends
to be).

You might benefit from a technique called vertical partitioning
-- where you split off less frequently referenced columns and/or
columns which are updated more often into sibling tables, with
the same primary key as the car table. That can sometimes buy some
performance at the expense of programming complexity and more
difficulty maintaining data integrity. I wouldn't go there without
evidence that your performance is not adequate without it.
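
As a rough sketch of what that could look like (the column names here are
invented for illustration):

  -- the narrow, heavily indexed cars table stays as-is; rarely read or
  -- frequently rewritten attributes move to a sibling table with the same key
  CREATE TABLE car_details (
      car_id           integer PRIMARY KEY REFERENCES cars (id),
      long_description text,
      last_crawled_at  timestamp
  );

  -- joined in only when those attributes are actually needed
  SELECT c.*, d.long_description
    FROM cars c
    JOIN car_details d ON d.car_id = c.id
   WHERE c.id = 12345;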

-Kevin




Re: [PERFORM] Database design - best practice

2012-11-28 Thread Vitalii Tymchyshyn
Let me be the devil's advocate here :)
First of all, even if you read any basics about normalization, don't take
it to heart :) Think.
Know that each normalization/denormalization step has its cons and pros.
E.g. in the NoSQL world they often don't normalize much.
What's interesting with PostgreSQL is that it is quite well suited for
NoSQL-like scenarios.
First of all, each unfilled (null) data column takes 1 bit only. This, BTW,
leads to the interesting consequence that performance-wise it can be better to
have a null/true boolean than false/true. Especially if you've got a lot of
false.
So, PostgreSQL should be fine with tens, possibly hundreds of data columns with
most columns empty. A record of 151 null columns would take header +
roundup(151/8) = 19 bytes. Not much. NoSQL stores usually put column names into
records and this costs more.
Any null columns at the end of a record take no space at all (so you can
think about reordering your columns to put the least used at the record end).
Adding a column with null as default is a cheap operation that does not require
a table scan.
You can have partial indexes to speed things up, like create index on car
(car_id) where (has_automatic_transmission);
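
If you want to see the effect of the null bitmap for yourself, a quick
throwaway experiment (just a sketch) is to compare row sizes with
pg_column_size():

  CREATE TEMP TABLE width_test (a boolean, b boolean, c boolean, d boolean);
  INSERT INTO width_test VALUES (NULL, NULL, NULL, NULL), (true, true, true, true);

  -- the all-NULL row should come out smaller: its columns only occupy
  -- bits in the null bitmap, not bytes in the data area
  SELECT a, pg_column_size(width_test.*) AS row_bytes FROM width_test;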

On the other side, when you normalize you need to join. Instead of select *
from car where has_automatic_transmission (which will use the index above), you
will have to do select * from car where id in (select id from
car_with_automatic_transmission). The plan is much more complex here. It
will be slower.

The main plus of normalization for you is that right now you work with the
record as a whole, so if there is a lot of information in there that is rarely
used, you pay for its access every time, both on selects and updates.

So, as a conclusion, I agree with the others that you should check. But
remember, joining two tables with millions of records is never cheap :)

Best regards, Vitalii Tymchyshyn




Re: [PERFORM] Optimize update query

2012-11-28 Thread Willem Leenen


I assume that SQL databases (banks? telecom?) can handle a used car shop. No 
need for an unstructured data tool.



 +1, sql databases has limited number of inserts/updates per second. Even
 with highend hardware you won't have more than XXX operations per
 second. As Thomas said, you should feed something like nosql database
 from www server and use other tool to do aggregation and batch inserts
 to postgresql. It will scale much better.
 
 Marcin

  

Re: [PERFORM] Savepoints in transactions for speed?

2012-11-28 Thread Mike Blackwell
On Tue, Nov 27, 2012 at 7:16 PM, Claudio Freire klaussfre...@gmail.com wrote:

 Updates, are faster if batched, if your business logic allows it,
 because it creates less bloat and creates more opportunities for with
 HOT updates. I don't think it applies to inserts, though, and I
 haven't heard it either.


Ah.  That must have been what I'd half-remembered.  Thanks for the
clarification.

Mike


Re: [PERFORM] Savepoints in transactions for speed?

2012-11-28 Thread Willem Leenen


Commit records are written to disk after each transaction, so transaction size 
has an impact on performance. 
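
As a small illustration (a throwaway sketch, best run with \timing in psql):
the batch below generates a single commit record, whereas the same 10,000
rows inserted one-by-one under autocommit would each have to wait for their
own flush:

  CREATE TABLE commit_test (id integer, observed_at timestamp);

  BEGIN;
  INSERT INTO commit_test
  SELECT g, now() FROM generate_series(1, 10000) AS g;
  COMMIT;  -- one flush for the whole batch

  DROP TABLE commit_test;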


Re: [PERFORM] Optimize update query

2012-11-28 Thread Niels Kristian Schjødt
Okay guys,

Thanks for all the great help and advice already! Let me just clarify some 
things, to make my question a little easier to answer :-)
My site is a search engine for used cars - not just a car shop with a few 
hundred cars.
The update query you are looking at is an update that is executed once a day in 
chunks for all active adverts, so we know they are still for sale (one car can 
be advertised in several places, hence several adverts). So it's not a 
constant stream, but it has a fairly high volume, especially at night time.

A compressed version of my .conf looks like this (note: there are some tweaks at 
the end of the file):
  data_directory = '/var/lib/postgresql/9.2/main'
  hba_file = '/etc/postgresql/9.2/main/pg_hba.conf' 
  ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
  external_pid_file = '/var/run/postgresql/9.2-main.pid' 
  listen_addresses = '192.168.0.2, localhost'
  port = 5432
  max_connections = 1000 
  unix_socket_directory = '/var/run/postgresql'
  wal_level = hot_standby
  synchronous_commit = off
  archive_mode = on
  archive_command = 'rsync -a %p 
postgres@192.168.0.4:/var/lib/postgresql/9.2/wals/%f /dev/null' 
  max_wal_senders = 1 
  wal_keep_segments = 32
  logging_collector = on 
  log_min_messages = debug1 
  log_min_error_statement = debug1
  log_min_duration_statement = 0
  log_checkpoints = on
  log_connections = on
  log_disconnections = on
  log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
  log_lock_waits = on
  log_temp_files = 0
  datestyle = 'iso, mdy' 
  lc_messages = 'C'
  lc_monetary = 'en_US.UTF-8'
  lc_numeric = 'en_US.UTF-8' 
  lc_time = 'en_US.UTF-8' 
  default_text_search_config = 'pg_catalog.english' 
  default_statistics_target = 100
  maintenance_work_mem = 1GB
  checkpoint_completion_target = 0.7
  effective_cache_size = 22GB
  work_mem = 160MB
  wal_buffers = 4MB
  checkpoint_segments = 16
  shared_buffers = 7680MB

# All the log stuff is mainly temporary requirement for pgBadger
# The database has been tuned with pgtuner

You might be familiar with New Relic; I use that for quite a lot of 
monitoring. This is what I see at night time (a lot of I/O), which is why I 
went to play around with pgBadger to get some insights at the database level:
https://rpm.newrelic.com/public/charts/h2dtedghfsv

This shows me that by far the most time-consuming queries are updates (in 
general). On average a query like the one I showed you takes 1.3 sec (but often 
it takes several minutes - which makes me wonder). So correct me if I'm wrong 
here: my theory is that I have too many too-slow update queries, which then 
often end up in a situation where they wait for each other to finish, hence 
the sometimes VERY long execution times. So my basic idea here is that if I 
could reduce the cost of the updates, then I could get a higher throughput 
overall.

Here is a sample of the pgBadger analysis (queries that took up the most time):

Rank 1 - total 1d15h28m38.71s - 948,711 calls - 0.15s avg.
COMMIT;

Rank 2 - total 1d2h17m55.43s - 401,002 calls - 0.24s avg.
INSERT INTO car_images ( car_id, created_at, image, updated_at ) 
VALUES ( '', '', '', '' ) returning id;

Rank 3 - total 23h18m33.68s - 195,093 calls - 0.43s avg.
SELECT DISTINCT cars.id FROM cars LEFT OUTER JOIN adverts ON 
adverts.car_id = cars.id LEFT OUTER JOIN sellers ON sellers.id = 
adverts.seller_id WHERE cars.sales_state = '' AND cars.year = 0 
AND cars.engine_size = 0.0 AND ( ( cars.id IS NOT NULL AND 
cars.brand = '' AND cars.model_name = '' AND cars.fuel = '' AND 
cars.km = 0 AND cars.price = 0 AND sellers.kind = '' ) ) LIMIT 0;

Rank 4 - total 22h45m26.52s - 3,374,133 calls - 0.02s avg.
SELECT adverts.* FROM adverts WHERE ( source_name = '' AND md5 ( url ) = 
md5 ( '' ) ) LIMIT 0;

Rank 5 - total 10h31m37.18s - 29,671 calls - 1.28s avg.
UPDATE adverts SET last_observed_at = '', data_source_id = '' WHERE 
adverts.id IN ( ... ) ;

Rank 6 - total 7h18m40.65s - 396,393 calls - 0.07s avg.
UPDATE cars SET updated_at = '' WHERE cars.id = 0;

Rank 7 - total 7h6m7.87s - 241,294 calls - 0.11s avg.
UPDATE cars SET images_count = COALESCE ( images_count, 0 ) + 0 WHERE 
cars.id = 0;

Rank 8 - total 6h56m11.78s - 84,571 calls - 0.30s avg.
INSERT INTO failed_adverts ( active_record_object_class, 
advert_candidate, created_at, exception_class, exception_message, 
from_rescraper, last_retried_at, retry_count, source_name, 
stack_trace, updated_at, url ) VALUES ( NULL, '', '', '', '', NULL, NULL, 
'', '', '', '', '' ) returning id;

Rank 9 - total 5h47m25.45s - 188,402 calls - 0.11s avg.
INSERT INTO adverts ( availability_state, car_id, created_at, 
data_source_id, deactivated_at, first_extraction, last_observed_at, 
price, seller_id, source_id, source_name, updated_at, url ) VALUES ( 
'', '', '', '', NULL, '', '', '', '', '', '', '', '' ) returning id;

Rank 10 - total 3h4m26.86s - 166,235 calls - 0.07s avg.
UPDATE adverts SET deactivated_at = '', availability_state = '', 
updated_at = '' WHERE adverts.id = 0;

(Yes I'm already 

Re: [PERFORM] Optimize update query

2012-11-28 Thread Bèrto ëd Sèra
max_connections = 1000 looks bad... why not a pooler in place?
Cheers
Bèrto

On 28 November 2012 16:19, Niels Kristian Schjødt
nielskrist...@autouncle.com wrote:
 max_connections = 1000



-- 
==
If Pac-Man had affected us as kids, we'd all be running around in a
darkened room munching pills and listening to repetitive music.




Re: [PERFORM] Optimize update query

2012-11-28 Thread Shaun Thomas

On 11/28/2012 10:19 AM, Niels Kristian Schjødt wrote:


https://rpm.newrelic.com/public/charts/h2dtedghfsv


Doesn't this answer your question?

That iowait is crushing your server into the ground. It's no surprise
updates are taking several seconds. That update you sent us *should*
execute on the order of only a few milliseconds.

So I'll reiterate that you *must* move your pg_xlog location elsewhere.
You've got row lookup bandwidth conflicting with writes. There are a
couple other changes you should probably make to your config:


checkpoint_segments = 16


This is not enough for the workload you describe. Every time the
database checkpoints, all of those changes in pg_xlog are applied to the
backend data files. You should set these values:

checkpoint_segments = 100
checkpoint_timeout = 10m
checkpoint_completion_target = 0.9

This will reduce your overall write workload, and make it less active.
Too many checkpoints massively reduce write throughput. With the
settings you have, it's probably checkpointing constantly while your
load runs. Start with this, but experiment with increasing
checkpoint_segments further.

If you check your logs now, you probably see a ton of "checkpoint
starting: xlog" in there. That's very bad. It should say "checkpoint
starting: time", meaning it's keeping up with your writes naturally.
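
You don't even need the logs for a first look; pg_stat_bgwriter keeps
running counters (a sketch):

  -- checkpoints_req = checkpoints forced by filling checkpoint_segments,
  -- checkpoints_timed = checkpoints triggered by checkpoint_timeout.
  -- A healthy setup is dominated by timed checkpoints.
  SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;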


work_mem = 160MB


This is probably way too high. work_mem is used for every sort operation in
a query. So each connection could have several of these allocated, thus
starving your system of memory, which will reduce what's available for the
page cache. Change it to 8MB, and increase it in small increments if necessary.


So correct me if I'm wrong here: my theory is, that I have too many
too slow update queries, that then often end up in a situation, where
they wait for each other to finish, hence the sometimes VERY long
execution times.


Sometimes this is the case, but for you, you're running into IO
contention, not lock contention. Your 3TB RAID-1 is simply insufficient
for this workload.

If you check your logs after making the changes I've suggested, take a
look at your checkpoint sync times. That will tell you how long it took
the kernel to physically commit those blocks to disk and get a
confirmation back from the controller. If those take longer than a
second or two, you're probably running into controller buffer overflows.
You have a large amount of RAM, so you should also make these two kernel
changes to sysctl.conf:

vm.dirty_ratio = 10
vm.dirty_background_ratio = 1

Then run this:

sysctl -p

This will help prevent large IO write spikes caused when the kernel
decides to write out dirty memory. That can make checkpoints take
minutes to commit in some cases, which basically stops all write traffic
to your database entirely.

That should get you going, anyway. You still need more/better disks so
you can move your pg_xlog directory. With your write load, that will
make a huge difference.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [PERFORM] Optimize update query

2012-11-28 Thread Shaun Thomas

On 11/28/2012 11:44 AM, Niels Kristian Schjødt wrote:


Thanks a lot - on the server I already have one additional SSD 250gb
disk, that I don't use for anything at the moment.


God. An SSD would actually be better for your data, as it follows
more random access patterns, and xlogs are more sequential. But it's
better than nothing.

And yes, you'd be better off with a RAID-1 of two of these SSDs, because
the xlogs are critical to database health. You have your archived copy
due to the rsync, which helps. But if you had a crash, there could
potentially be a need to replay unarchived transaction logs, and you'd
end up with some data loss.


BTW. as you might have seen from the .conf I have a second slave
server with the exact same setup, which currently runs as a hot
streaming replication slave. I might ask a stupid question here, but
this does not affect the performance of the master does it?


Only if you're using synchronous replication. From what I saw in the
config, that isn't the case.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [PERFORM] Database design - best practice

2012-11-28 Thread Jeff Janes
On Wed, Nov 28, 2012 at 4:41 AM, Niels Kristian Schjødt
nielskrist...@autouncle.com wrote:

 So my main concern is actually about the cars table, since this one currently 
 has a lot of columns (151 - I expect thats quite a lot?), and a lot of data 
 (4 mil. rows, and growing). Now you might start by thinking, this could sound 
 like a regular need for some normalization, but wait a second and let me 
 explain :-)

If you have 151 single-valued pieces of information, then that is what
you have.  You can't tell whether something is normalized or not by
counting the columns.

 The columns in this table is for the most very short stings, integers, 
 decimals or booleans. So take for an example has_automatic_transmission 
 (boolean) I can't see why it would make sense to put that into a separate 
 table and join in the values.

I can't see why that would make sense, either.  Nor do I think that
doing so would increase the level of normalization.  What rule of
normalization would be served by creating gratuitous joins?

 Or the milage or the price as another example. The cars table used for search 
 is indexed quite a lot.

How useful are the indices?

 The questions:
 Having the above setup in mind, what impact on performance, in terms of read 
 performance and write performance, does it have, whether I do the following:
 1) In general would the read and/or the write on the database be 
 faster, if I serialized some of the not searched columns in the table into a 
 single text columns instead of let's say 20 booleans?

Probably not. And could make it much worse, depending on how you
serialize it.  For example, if you use hstore or json, now the column
names for each of the 20 booleans are repeated in every row, rather
than being metadata stored only once.  But try it and see.

 2) Lets say I'm updating a timestamp in a single one of the 151 
 columns in the cars table. The update statement is using the id to find the 
 car. Would the write performance of that UPDATE be affected, if the table had 
 fewer columns?

Yes, but probably not by much.  The biggest effect will be on whether
the timestamp column is indexed.  If it is, then updating it means
that all other indexes on the table will also need to be updated.  If
it is not indexed, then the update can be a HOT update.
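
You can check after the fact how many updates actually took the HOT path;
a sketch against the statistics views:

  -- n_tup_hot_upd close to n_tup_upd means most updates avoided
  -- touching the indexes
  SELECT relname, n_tup_upd, n_tup_hot_upd
    FROM pg_stat_user_tables
   WHERE relname = 'cars';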


 3) When adding a new column to the table i know that it becomes 
 slower the more rows is in the table, but what about the width of the table 
 does that affect the performance when adding new columns?

Adding a new column to a table is pretty much instantaneous if the
default value is NULL.

 4) In general what performance downsides do you get when adding a lot 
 of columns to one table instead of having them in separate tables?

This question cannot be answered in general.  If every time you use
the main table you have to join it to the separate table, then
performance will be bad.  If you almost never have to join to the
separate table, then performance will be better.

 5) Is it significantly faster to select * from a table with 20 
 columns, than selecting the same 20 in a table with 150 columns?

If the extra 130 columns are mostly null, the difference will be very
small.  Or, if the where clause is such that you only do a single-row
lookup on a primary key column, for example, the difference will also
be small.

Cheers,

Jeff




[PERFORM] NEED REPLICATION SOLUTION -POSTGRES 9.1

2012-11-28 Thread suhas.basavaraj12
Hi, 

We are planning to migrate our production databases to different
servers. We have around 8 servers with 8 different clusters. We are planning
to shuffle databases, consolidate them into 7 clusters and migrate to new
remote servers.
We cannot use streaming replication as we are migrating different databases
from different clusters to one single cluster. This will result in
huge downtime as the data is huge.
We need expert advice on this scenario. Can we reduce the downtime in any way?

Rgrds 
Suhas







Re: [PERFORM] NEED REPLICATION SOLUTION -POSTGRES 9.1

2012-11-28 Thread Claudio Freire
On Wed, Nov 28, 2012 at 3:12 PM, suhas.basavaraj12 suha...@verse.in wrote:
 We are planning to migrate  our production  databases to different
 servers.We have around  8 servers with 8 different clusters.We are planning
 t shuffle databases and make them as 7 cluster and migrate to new remote
 servers .
 We cannot use streaming replication as we are migrating different databases
 from different clusters to one single cluster . This  will be resulting in
 huge downtime  as data is huge .
 Need expert advice on this scenario.Can we reduce downtime in any way ..??

One time we had to do something like this (on a smaller scale), and we used Slony.




Re: [PERFORM] NEED REPLICATION SOLUTION -POSTGRES 9.1

2012-11-28 Thread Rick Otten
I recommend SymmetricDS - http://www.symmetricds.org






[PERFORM] Comparative tps question

2012-11-28 Thread John Lister
Hi, I've just been benchmarking a new box I've got, and running pgbench 
yields what I thought was a slow tps count. It is difficult to find 
comparisons online of other benchmark results, so I'd like to see if I have 
the box set up reasonably well.


I know Oracle et al. prohibit publishing benchmark results, but I was surprised 
that there don't seem to be any PostgreSQL ones out there.


Anyway, the machine is a Dell R720 with the data on a raid 10 using 8x 
intel 320 SSDs and a mirrored pair of 15k SAS HDDs configured for the 
pg_xlog, both on a dell H710 raid controller. In addition it has 64Gb of 
1600Mhz memory and 2x E5-2650 processors (with HT = 32 cores). The arrays 
are all set up with XFS and tweaked as best I could. The drives are 160Gb 
and overprovisioned by another 15%.


I'm running postgresql 9.1 on ubuntu 12.04

bonnie++ (using defaults) shows about 600MB/s sequential read/write IO 
on the main data array, this doesn't seem too bad although the specs 
show over 200MB/s should be achievable per drive.


pgbench (using a scaling factor of 100 with 100 clients and 25 threads) 
gives an average of about 7200tps.


Does this look acceptable? Instinctively it feels on the low side, 
although I noted that a couple of blogs 
(http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/ 
and 
http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html) 
show around 1500tps for a single SSD, so maybe this is what is expected.


The interesting param differences from the postgresql conf are:
shared_buffers=6Gb
work_mem=64Mb
max_stack_depth=4Mb
random_page_cost=1.1
cpu_tuple_cost=0.1
cpu_index_tuple_cost=0.05
cpu_operator_cost=0.025
effective_cache_size=40Gb

I'd be happy to provide any other configs, etc assuming the tps values 
are way off the expected.


Thanks

John

ps. the number of safe ssds available in the uk seems to be rather 
limited, hence the intel 320s, which probably aren't as fast as modern 
drives.





Re: [PERFORM] Comparative tps question

2012-11-28 Thread Merlin Moncure
On Wed, Nov 28, 2012 at 12:37 PM, John Lister john.lis...@kickstone.com wrote:
 Hi, I've just been benchmarking a new box I've got and running pgbench
 yields what I thought was a slow tps count. It is dificult to find
 comparisons online of other benchmark results, I'd like to see if I have the
 box set up reasonably well.

 I know oracle, et al prohibit benchmark results, but was surprised that
 there doesn't seem to be any postgresql ones out there..

 Anyway, the machine is a Dell R720 with the data on a raid 10 using 8x intel
 320 SSDs and a mirrored pair of 15k SAS HDDs configured for the pg_xlog,
 both on a dell H710 raid controller, in addition it has 64Gb of 1600Mhz
 memory and 2x E5-2650 processors (with HT=32 cores). The arrays are all
 setup with XFS on and tweaked as I could. The drives are 160Gb and
 overprovisioned by another 15%.

 I'm running postgresql 9.1 on ubuntu 12.04

 bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on
 the main data array, this doesn't seem too bad although the specs show over
 200MB/s should be achievable per drive.

Probably this limitation is coming from the SATA bus. It shouldn't be a
problem in practice.   Can you report bonnie++ seek performance?
Another possibility is that the RAID controller is introducing overhead
here.
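
Something along these lines would capture the seek figure -- a sketch only; 
the path, file size (about 2x RAM) and user are placeholders:

bonnie++ -d /mnt/pgdata/bench -s 128g -n 0 -u postgres
# run as root; -u drops privileges to the postgres user
# the "Random Seeks" column of the output is the number of interest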

 pgbench (using a scaling factor of 100 with 100 clients and 25 threads)
 gives an average of about 7200tps.

 Does this look acceptable? Instinctively it feels on the low side, although
 I noted that a couple of blogs show
 (http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/
 and
 http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html)
 show around 1500tps for a single ssd, so maybe this is what is expected.

 The interesting param differences from the postgresql conf are:
 shared_buffers=6Gb
 work_mem=64Mb
 max_stack_depth=4Mb
 random_page_cost=1.1
 cpu_tuple_cost=0.1
 cpu_index_tuple_cost=0.05
 cpu_operator_cost=0.025
 effective_cache_size=40Gb

*) None of the above settings will influence storage-bound pgbench
results.  Influential settings are fsync, synchronous_commit,
wal_sync_method, wal_level, full_page_writes, wal_buffers,
wal_writer_delay, and commit_delay.  These settings basically manage
various tradeoffs, especially in the sense of safety vs. performance.
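
For example, something like this would show how much of the gap is commit 
latency -- a sketch only; "bench" is a placeholder, and whether relaxing 
synchronous_commit is acceptable depends on whether you can afford to lose 
the last few commits on a crash (the data stays consistent either way):

psql -c "ALTER DATABASE bench SET synchronous_commit = off"
pgbench -c 100 -j 25 -T 300 bench   # re-run and compare tps with the earlier figure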

 I'd be happy to provide any other configs, etc assuming the tps values are
 way off the expected.

*) The very first thing we need to check is whether we are storage-bound
(check I/O wait) and, if so, where the bottleneck is.  It could be on the
WAL or heap volume.  Another possibility is that we're lock-bound, which
is a completely different issue to deal with.

So we want to see top, iostat, vmstat, etc. while the test is happening.
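
A minimal way to capture that alongside a run, assuming the sysstat tools 
are installed ("bench" is a placeholder database):

iostat -x 1 > iostat.log &            # per-device utilisation and await times
vmstat 1 > vmstat.log &               # overall cpu, i/o wait and memory pressure
pgbench -c 100 -j 25 -T 300 bench
kill %1 %2                            # stop the background collectors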

*) Another interesting test to run is a read-only test via pgbench -S at a
large scaling factor (ideally at least 2x RAM).
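
Roughly, for a 64GB box -- the scale is approximate (about 15MB per scale 
unit), and "bench" is a placeholder:

pgbench -i -s 10000 bench             # ~150GB of data, comfortably over 2x RAM
pgbench -S -c 100 -j 25 -T 300 bench  # select-only run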

merlin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Savepoints in transactions for speed?

2012-11-28 Thread Jeff Davis
On Tue, 2012-11-27 at 22:16 -0300, Claudio Freire wrote:
 Updates are faster if batched, if your business logic allows it,
 because it creates less bloat and creates more opportunities for HOT
 updates. I don't think it applies to inserts, though, and I
 haven't heard it either.

Huge updates (e.g. UPDATE with no WHERE clause) are less likely to
benefit from HOT. HOT has two main optimizations:

1. Remove dead tuples faster without waiting for VACUUM -- this only
works if the transaction that updated/deleted the tuple actually
finished (otherwise the tuple can't be removed yet), so it only benefits
the *next* update to come along. But if it's one big update, then VACUUM
is probably just as good at cleaning up the space.

2. Doesn't make new index entries for the new tuple; reuses the old
index entries -- this only works if the new tuple version stays on the
same page, but large updates tend to fill pages up (because of the
buildup of dead tuples) and force new tuples onto new pages.

HOT is essentially designed for lots of small updates, which didn't
perform well before PG 8.3.
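
One way to see how much HOT is actually helping is to watch the per-table 
update counters -- a sketch; the database and table names are placeholders:

psql -d mydb -c "SELECT relname, n_tup_upd, n_tup_hot_upd
                   FROM pg_stat_user_tables
                  WHERE relname = 'my_table'"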

Batching of inserts/updates/deletes has a big benefit over separate
transactions, but only up to a point, after which it levels off. I'm not
sure exactly when that point is, but after that, the downsides of
keeping a transaction open (like inability to remove the previous
version of an updated tuple) take over.

Regards,
Jeff Davis



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Savepoints in transactions for speed?

2012-11-28 Thread Jeff Davis
On Tue, 2012-11-27 at 16:04 -0600, Mike Blackwell wrote:
 I need to delete about 1.5 million records from a table and reload it
 in one transaction.  The usual advice when loading with inserts seems
 to be group them into transactions of around 1k records.  Committing
 at that point would leave the table in an inconsistent state.  Would
 issuing a savepoint every 1k or so records negate whatever downside
 there is to keeping a transaction open for all 1.5 million records, or
 just add more overhead?

A large transaction isn't really a big problem for postgres, and 1.5M
records should be processed quickly anyway.

The main problem with a long-running delete or update transaction is
that the dead tuples (deleted tuples or the old version of an updated
tuple) can't be removed until the transaction finishes. That can cause
temporary bloat, but 1.5M records shouldn't be noticeable. 

Adding subtransactions into the mix won't help, but probably won't hurt,
either. The transaction will still run just as long, and you still can't
delete the tuples ahead of time (unless you abort a subtransaction). If
you *do* use subtransactions, make sure to release them as quickly as
you create them (don't just use ROLLBACK TO, that still leaves the
savepoint there); having 1500 open subtransactions might cause
performance problems elsewhere.
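
A sketch of that pattern -- the table names and batch size here are made up:

psql -d mydb <<'SQL'
BEGIN;
DELETE FROM big_table;                      -- the full reload starts here
SAVEPOINT batch;
INSERT INTO big_table SELECT * FROM staging_copy LIMIT 1000;
RELEASE SAVEPOINT batch;                    -- release right away, don't just ROLLBACK TO
-- ... repeat the SAVEPOINT / RELEASE pair for each further batch ...
COMMIT;
SQL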

Regards,
Jeff Davis



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize update query

2012-11-28 Thread Niels Kristian Schjødt
Hi, I have started to implement your suggestions. I have a small error so far, 
though. The vm.dirty_writeback_ratio = 1 command returns: 
  error: vm.dirty_writeback_ratio is an unknown key
I'm on Ubuntu 12.04.


On 28/11/2012 at 17.54, Shaun Thomas stho...@optionshouse.com wrote:

 On 11/28/2012 10:19 AM, Niels Kristian Schjødt wrote:
 
 https://rpm.newrelic.com/public/charts/h2dtedghfsv
 
 Doesn't this answer your question?
 
 That iowait is crushing your server into the ground. It's no surprise updates 
 are taking several seconds. That update you sent us *should* execute on the 
 order of only a few milliseconds.
 
 So I'll reiterate that you *must* move your pg_xlog location elsewhere. 
 You've got row lookup bandwidth conflicting with writes. There are a couple 
 other changes you should probably make to your config:
 
 checkpoint_segments = 16
 
 This is not enough for the workload you describe. Every time the database 
 checkpoints, all of those changes in pg_xlog are applied to the backend data 
 files. You should set these values:
 
 checkpoint_segments = 100
 checkpoint_timeout = 10m
 checkpoint_completion_target = 0.9
 
 This will reduce your overall write workload, and make it less active. Too 
 many checkpoints massively reduce write throughput. With the settings you 
 have, it's probably checkpointing constantly while your load runs. Start with 
 this, but experiment with increasing checkpoint_segments further.
 
 If you check your logs now, you probably see a ton of "checkpoint starting: 
 xlog" entries in there. That's very bad. It should say "checkpoint starting: 
 time", meaning it's keeping up with your writes naturally.
 
 work_mem = 160MB
 
 This is probably way too high. work_mem is used for every sort operation in a 
 query. So each connection could have several of these allocated, thus 
 starving your system of memory, which will reduce what's available for the 
 page cache. Change it to 8MB, and increase it in small increments if necessary.
 
 So correct me if I'm wrong here: my theory is, that I have too many
 too slow update queries, that then often end up in a situation, where
 they wait for each other to finish, hence the sometimes VERY long
 execution times.
 
 Sometimes this is the case, but for you, you're running into IO contention, 
 not lock contention. Your 3TB RAID-1 is simply insufficient for this workload.
 
 If you check your logs after making the changes I've suggested, take a look 
 at your checkpoint sync times. That will tell you how long it took the kernel 
 to physically commit those blocks to disk and get a confirmation back from 
 the controller. If those take longer than a second or two, you're probably 
 running into controller buffer overflows. You have a large amount of RAM, so 
 you should also make these two kernel changes to sysctl.conf:
 
 vm.dirty_ratio = 10
 vm.dirty_writeback_ratio = 1
 
 Then run this:
 
 sysctl -p
 
 This will help prevent large IO write spikes caused when the kernel decides 
 to write out dirty memory. That can make checkpoints take minutes to commit 
 in some cases, which basically stops all write traffic to your database 
 entirely.
 
 That should get you going, anyway. You still need more/better disks so you 
 can move your pg_xlog directory. With your write load, that will make a huge 
 difference.
 
 -- 
 Shaun Thomas
 OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
 312-444-8534
 stho...@optionshouse.com
 
 __
 
 See http://www.peak6.com/email_disclaimer/ for terms and conditions related 
 to this email



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Savepoints in transactions for speed?

2012-11-28 Thread Claudio Freire
On Wed, Nov 28, 2012 at 8:28 PM, Jeff Davis pg...@j-davis.com wrote:

 The main problem with a long-running delete or update transaction is
 that the dead tuples (deleted tuples or the old version of an updated
 tuple) can't be removed until the transaction finishes. That can cause
 temporary bloat, but 1.5M records shouldn't be noticeable.

Not really that fast if you have indices (and who doesn't have a PK or two).

I've never been able to update 2M rows in one transaction in
reasonable time (read: less than several hours) without dropping
indices. Doing it in batches is way faster if you can't drop the
indices, and if you can leverage HOT updates.
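
For what it's worth, a sketch of batching by id range from the shell -- the 
database, table and column names are invented; each psql call commits on its 
own, so dead tuples can be cleaned up between batches:

for start in $(seq 0 100000 1900000); do
  psql -d mydb -c "UPDATE big_table SET touched_at = now()
                    WHERE id > $start AND id <= $start + 100000"
done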


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Optimize update query

2012-11-28 Thread Mark Kirkwood

In later kernels these have been renamed:


Welcome to Ubuntu 12.04.1 LTS (GNU/Linux 3.2.0-32-generic x86_64)

$ sysctl -a|grep dirty
vm.dirty_background_ratio = 5
vm.dirty_background_bytes = 0
vm.dirty_ratio = 10
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000

You have the option of specifying either a ratio, or - more usefully for 
machines with a lot of RAM - bytes.
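
For example -- the values are only illustrative and should be tuned to the 
amount of RAM and the controller cache at hand; setting a _bytes value zeroes 
the corresponding _ratio (and vice versa):

sudo sysctl -w vm.dirty_background_bytes=268435456   # start background writeback at 256MB dirty
sudo sysctl -w vm.dirty_bytes=1073741824             # block writers once 1GB of dirty data accumulates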


Regards

Mark

P.S.: People on this list usually prefer it if you *bottom* post (i.e. 
reply underneath the original).



On 29/11/12 16:32, Niels Kristian Schjødt wrote:

Hi, I have started to implement your suggestions. I have a small error so far though. 
The vm.dirty_writeback_ratio = 1 command returns:
   error: vm.dirty_writeback_ratio is an unknown key
I'm on ubuntu 12.04






--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance