Re: [PERFORM] pgsql_tmp( Temporary tablespace)
Hi,

Can I delete the contents of this folder? I have observed a couple of times that this folder got cleaned automatically. Which backend process deletes the data from this folder? Any idea?

Rgrds
Suhas
Re: [PERFORM] pgsql_tmp( Temporary tablespace)
suhas.basavaraj12 wrote:
> Can I delete the contents of this folder? I have observed a couple of times that this folder got cleaned automatically.

These files are in use and you should not delete them. If you need them to go right now, cancel the queries that create the temporary files. If any are left over when PostgreSQL is shut down (I don't know if that can happen), it should be safe to delete them.

> Which backend process deletes the data from this folder? Any idea?

I'm not sure, but probably the one that created them.

Yours,
Laurenz Albe
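If you just want to see how much temporary-file activity is happening without touching the directory, the statistics views can help. A minimal sketch, assuming PostgreSQL 9.2 or later (temp_files and temp_bytes were added to pg_stat_database in 9.2):

    -- Per-database temporary file usage since statistics were last reset
    SELECT datname,
           temp_files,
           pg_size_pretty(temp_bytes) AS temp_size
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;

Setting log_temp_files = 0 (as one poster later in this digest does) additionally logs every temporary file a query creates.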
[PERFORM] Database design - best practice
Hi,

I'm on the hunt for some solid knowledge on a theoretical level about the performance of PostgreSQL. My question is regarding best practices, and how architectural decisions might influence performance. First a little background:

The setup: I have a database which holds information on used cars. The database has mainly three tables of interest for this case: a cars table, an adverts table and a sellers table. One car has many adverts and one seller has many adverts. One advert belongs to one car and one seller. The database is powering a website for searching used cars. When searching for used cars, the cars table is mainly used, and a lot of the columns should be directly available for searching, e.g. color, mileage, price, has_automatic_transmission etc.

So my main concern is actually about the cars table, since this one currently has a lot of columns (151 - I expect that's quite a lot?) and a lot of data (4 million rows, and growing). Now you might start by thinking this sounds like a regular need for some normalization, but wait a second and let me explain :-) The columns in this table are for the most part very short strings, integers, decimals or booleans. So take for example has_automatic_transmission (boolean): I can't see why it would make sense to put that into a separate table and join in the values. Or the mileage or the price as another example. The cars table used for search is indexed quite a lot.

The questions: Having the above setup in mind, what impact on performance, in terms of read performance and write performance, does it have whether I do the following:

1) In general, would reads and/or writes on the database be faster if I serialized some of the not-searched columns in the table into a single text column instead of, let's say, 20 booleans?
2) Let's say I'm updating a timestamp in a single one of the 151 columns in the cars table. The update statement is using the id to find the car. Would the write performance of that UPDATE be affected if the table had fewer columns?
3) When adding a new column to the table, I know that it becomes slower the more rows are in the table, but what about the width of the table - does that affect the performance when adding new columns?
4) In general, what performance downsides do you get when adding a lot of columns to one table instead of having them in separate tables?
5) Is it significantly faster to select * from a table with 20 columns than selecting the same 20 in a table with 150 columns?

Hope there are some good answers out there :-)
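A minimal sketch of the schema as described above, for reference while reading the replies - the table and column names are illustrative, not the poster's actual DDL:

    -- Illustrative only: one car has many adverts, one seller has many adverts,
    -- and the wide cars table carries the searchable attributes directly.
    CREATE TABLE cars (
        id                         serial PRIMARY KEY,
        color                      text,
        mileage                    integer,
        price                      numeric(10,2),
        has_automatic_transmission boolean
        -- ... roughly 145 further short columns
    );

    CREATE TABLE sellers (
        id   serial PRIMARY KEY,
        name text
    );

    CREATE TABLE adverts (
        id               serial PRIMARY KEY,
        car_id           integer NOT NULL REFERENCES cars(id),
        seller_id        integer NOT NULL REFERENCES sellers(id),
        last_observed_at timestamptz
    );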
[PERFORM] Optimize update query
Hi, I have these update queries that run very often and take too long, preventing us from reaching the throughput we are aiming at. However, the update query is very simple, and I can't figure out any way to improve the situation. The query looks like this:

UPDATE adverts
SET last_observed_at = '2012-11-28 00:02:30.265154',
    data_source_id = '83d024a57bc2958940f3ca281bddcbf4'
WHERE adverts.id IN ( 1602382, 4916432, 3221246, 4741057, 3853335, 571429, 3222740, 571736, 3544903, 325378, 5774338, 5921451, 4295768, 3223170, 5687001, 4741966, 325519, 580867, 325721, 4412200, 4139598, 325567, 1616653, 1616664, 6202007, 3223748, 325613, 3223764, 325615, 4296536, 3854595, 4971428, 3224146, 5150522, 4412617, 5073048, 325747, 325771, 1622154, 5794384, 5736581, 1623767, 5686945, 3224627, 5073009, 3224747, 3224749, 325809, 5687051, 3224811, 5687052, 4917824, 5073013, 3224816, 3224834, 4297331, 1623907, 325864, 1623947, 6169706, 325869, 325877, 3225074, 3225112, 325893, 325912, 3225151, 3225184, 3225175, 1624659, 325901, 4033926, 325904, 325911, 4412835, 1624737, 5073004, 5921434, 325915, 3225285, 3225452, 4917672, 1624984, 3225472, 325940, 5380611, 325957, 5073258, 3225500, 1625002, 5923489, 4413009, 325952, 3961122, 363 );

EXPLAIN outputs the following:

Update on adverts (cost=0.12..734.27 rows=95 width=168)
  -> Index Scan using adverts_pkey on adverts (cost=0.12..734.27 rows=95 width=168)
       Index Cond: (id = ANY ('{1602382,4916432,3221246,4741057,3853335,571429,3222740,571736,3544903,325378,5774338,5921451,4295768,3223170,5687001,4741966,325519,580867,325721,4412200,4139598,325567,1616653,1616664,6202007,3223748,325613,3223764,325615,4296536,3854595,4971428,3224146,5150522,4412617,5073048,325747,325771,1622154,5794384,5736581,1623767,5686945,3224627,5073009,3224747,3224749,325809,5687051,3224811,5687052,4917824,5073013,3224816,3224834,4297331,1623907,325864,1623947,6169706,325869,325877,3225074,3225112,325893,325912,3225151,3225184,3225175,1624659,325901,4033926,325904,325911,4412835,1624737,5073004,5921434,325915,3225285,3225452,4917672,1624984,3225472,325940,5380611,325957,5073258,3225500,1625002,5923489,4413009,325952,3961122,363}'::integer[]))

So as you can see, it's already pretty optimized, it's just not enough :-) So what can I do? The two columns last_observed_at and data_source_id both have indexes, and those are needed elsewhere, so I can't delete them.

PS. I'm on Postgres 9.2 on a server with 32 GB RAM, 8 cores and two 3 TB disks in a software RAID 1 setup. Is the only way out of this really an SSD disk?
Re: [PERFORM] Database design - best practice
Niels,

> I can't see why it would make sense to put that into a separate table and join in the values

You don't normalize for performance. People DEnormalize for performance.

Questions (AFAIK):

1) This is the way to disaster. Get yourself a book on RDBMS design, for example from Celko. Do NOT go against the flow of the RDBMS rules, as here with rule #1, atomic values in a column.
2) This is not the big fish you are after. First benchmark your setup and compare the results with your desired performance level. Quantify your problem, if there is any, before using tricks.
3) A row will need more memory when it is wider; this may be amplified during hash joins.
4) People DEnormalize for performance.
5) > Is it significantly faster to select * from a table with 20 columns, than selecting the same 20 in a table with 150 columns?
I know the answer, but I encourage you to simply test this. I have seen lots of urban legends about performance (including dropping referential integrity because that would supposedly make a difference). Of course, when it's a full table scan and it is all disk reads (or all memory reads) you can simply calculate it too. But just get into the habit of testing in order to learn.

My advice:
- Know what performance you need.
- Test whether you have it, varying table content and system load.
- Do not tamper with the RDBMS rules; this will haunt you.
- If you have the latest Postgres version, you can use covering indexes: tables aren't accessed at all, bypassing most of your questions. Check with peers whether you've got the indexes right.

Regards,
Willem
Re: [PERFORM] Database design - best practice
Thanks for the advice. Currently I see a lot of I/O related to updates/inserts, so I'm trying to track down these guys first.

In relation to question 2, I read somewhere in the documentation that because of MVCC, the whole row has to be rewritten even though I just update one single column in that row. Hence, if the table is wider (has more columns), the update will be slower. Does this match your understanding?
Re: [PERFORM] Database design - best practice
> In relation to question 2, I read somewhere in the documentation that because of MVCC, the whole row has to be rewritten even though I just update one single column in that row. Hence if the table is wider (has more columns), the update will be slower. Does this match your understanding?

No. I don't count the number of rows, but the number of blocks (pages) that are modified, which are 8K each.

My advice would be to first establish a solution direction by diagnosing the problem. My experience is that most solutions are not obscure at all.
Re: [PERFORM] Database design - best practice
Hi Kristian,

>> I can't see why it would make sense to put that into a separate table and join in the values
> You don't normalize for performance. People DEnormalize for performance.

Yes. In short, you seem more of a developer than an RDBMS guy. This is not a personal fault, but it's a *very* dangerous state to be in and you should address the problem asap. Erase from your head all you think you know (put it into a file somewhere) and read very basic texts about normal forms. Like this: http://en.wikipedia.org/wiki/Database_normalization

As already said by Willem, learn to test your stuff. There is a \timing command in psql; use it. For example (addressing your other post), you want to check how long it takes to run

UPDATE adverts SET last_observed_at = '2012-11-28 00:02:30.265154', data_source_id = '83d024a57bc2958940f3ca281bddcbf4' WHERE adverts.id IN ( 1602382, 4916432, .. 363 );

as opposed to

UPDATE adverts SET last_observed_at = '2012-11-28 00:02:30.265154', data_source_id = '83d024a57bc2958940f3ca281bddcbf4' WHERE adverts.id = 1602382 OR adverts.id = 4916432 OR .. adverts.id = 363;

My 5 pence
Bèrto
Re: [PERFORM] Optimize update query
On 11/28/2012 06:57 AM, Niels Kristian Schjødt wrote:

Before I go crazy here... you really need to tell us what "not enough" means. You didn't provide an EXPLAIN ANALYZE, so we don't know what your actual performance is. But I have my suspicions.

> So as you can see, it's already pretty optimized, it's just not enough :-) So what can I do? The two columns last_observed_at and data_source_id have indexes, and they are needed elsewhere, so I can't delete those.

Ok, so part of your problem is that you're tying an advertising system directly to the database for direct updates. That's a big no-no. Any time you got a huge influx of views, there would be a logjam. You need to decouple this so you can use a second tool to load the database in larger batches. You'll get much higher throughput this way.

If you absolutely must use this approach, you're going to have to beef up your hardware.

> PS. I'm on postgres 9.2 on a server with 32gb ram, 8 cores and two 3T disks in a software raid 1 setup.

This is not sufficient for a high-bandwidth stream of updates. Not even close. Even if those 3T disks are 7200 RPM, and even in RAID-1, you're going to have major problems with concurrent reads and writes. You need to do several things:

1. Move your transaction logs (pg_xlog) to another pair of disks entirely. Do not put these on the same disks as your data if you need high write throughput.
2. Get a better disk architecture. You need 10k or 15k RPM disks. Starting with 6 or more of them in a RAID-10 would be a good beginning.

You never told us your postgresql.conf settings, so I'm just going with very generic advice. Essentially, you're expecting too much for too little. That machine would have been low-spec three years ago, and is unsuited to database use simply due to the 2-disk RAID.

> Is the only way out of this really an SSD disk?

No. There are many, many steps you can and should take before going this route. You need to know the problem you're solving before making potentially expensive hardware decisions.

--
Shaun Thomas
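For reference, one way to produce the EXPLAIN ANALYZE being asked for without keeping the changes is to wrap the statement in a transaction and roll it back - a sketch using the query from the original post, with the id list abbreviated:

    -- BUFFERS (available since 9.0) also shows how much block I/O the statement caused.
    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE adverts
    SET last_observed_at = '2012-11-28 00:02:30.265154',
        data_source_id   = '83d024a57bc2958940f3ca281bddcbf4'
    WHERE id IN (1602382, 4916432, 3221246);  -- abbreviated id list
    ROLLBACK;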
Re: [PERFORM] Optimize update query
On 28.11.2012 15:07, Shaun Thomas wrote:
> You need to decouple this so you can use a second tool to load the database in larger batches. You'll get much higher throughput this way.

+1. SQL databases have a limited number of inserts/updates per second. Even with high-end hardware you won't get more than XXX operations per second. As Thomas said, you should feed something like a NoSQL store from the www server and use another tool to do aggregation and batch inserts to PostgreSQL. It will scale much better.

Marcin
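As an illustration of what batching can look like on the SQL side - a sketch only; a real pipeline would more likely use COPY or a staging table, and while the car_images column list matches the pgBadger output quoted later in this thread, the value types are assumptions:

    -- One multi-row INSERT is one statement and (by default) one transaction,
    -- which is far cheaper than the same rows inserted one at a time.
    INSERT INTO car_images (car_id, image, created_at, updated_at)
    VALUES
        (1, 'img_1.jpg', now(), now()),
        (2, 'img_2.jpg', now(), now()),
        (3, 'img_3.jpg', now(), now());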
Re: [PERFORM] Database design - best practice
Niels Kristian Schjødt wrote:
> So my main concern is actually about the cars table, since this one currently has a lot of columns (151 - I expect that's quite a lot?),

That's pretty wide, but not outrageous.

> and a lot of data (4 million rows, and growing).

That's not a big deal. It's not unusual to have hundreds of millions of rows in a PostgreSQL table. Properly indexed, that should perform fine on queries. Sometimes partitioning rows into sub-tables helps, but you didn't really mention anything which suggests that would be helpful for you.

> Now you might start by thinking, this could sound like a regular need for some normalization

On the contrary, what you describe sounds well normalized. Breaking off attributes of a car into separate tables would not advance that.

> The columns in this table are for the most part very short strings, integers, decimals or booleans. So take for example has_automatic_transmission (boolean): I can't see why it would make sense to put that into a separate table and join in the values. Or the mileage or the price as another example. The cars table used for search is indexed quite a lot.

On the face of it, it sounds like you should have some one-column indexes on the columns most useful for selection (based on frequency of use and how selective a selection on the column tends to be).

You might benefit from a technique called vertical partitioning -- where you split off less frequently referenced columns and/or columns which are updated more often into sibling tables, with the same primary key as the cars table. That can sometimes buy some performance at the expense of programming complexity and more difficulty maintaining data integrity. I wouldn't go there without evidence that your performance is not adequate without it.

-Kevin
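A sketch of what such vertical partitioning could look like - table and column names are illustrative, with the sibling table sharing the cars primary key as described:

    -- Frequently searched columns stay in cars; rarely used or frequently
    -- updated columns move to a sibling table keyed by the same id.
    CREATE TABLE car_details (
        car_id       integer PRIMARY KEY REFERENCES cars(id),
        description  text,
        last_seen_at timestamptz
    );

    -- Joined back in only when those columns are actually needed:
    SELECT c.*, d.description
    FROM cars c
    JOIN car_details d ON d.car_id = c.id
    WHERE c.id = 1602382;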
Re: [PERFORM] Database design - best practice
Let me be devil's advocate here :)

First of all, even if you read any basics about normalization, don't take it to heart :) Think. Know that each normalization/denormalization step has its cons and pros. E.g. in the NoSQL world they often don't normalize much.

What's interesting with PostgreSQL is that it is suited quite well for NoSQL-like scenarios. First of all, each unfilled (null) data column takes 1 bit only. This, BTW, leads to the interesting consequence that performance-wise it can be better to have a null/true boolean than false/true, especially if you've got a lot of false values. So PostgreSQL should be fine with tens, possibly hundreds of data columns, with most columns empty. A record of 151 null columns would take header + roundup(151/8) = 19 bytes. Not much. NoSQL stores usually put column names into the records, and this costs more. Any null columns at the end of a record take no space at all (so you can think about reordering your columns to put the least used ones at the end of the record). Adding a column with null as the default is a cheap operation that does not require a table scan.

You can have partial indexes to speed things up, like:

create index on car (car_id) where (has_automatic_transmission);

On the other side, when you normalize you need to join. Instead of "select * from car where has_automatic_transmission" (which will use the index above), you will have "select * from car where id in (select id from car_with_automatic_transmission)". The plan is much more complex here; it will be slower.

The main plus of normalization for you is that otherwise you work with the record as a whole, so if there is a lot of information in there that is rarely used, you pay for accessing it every time, on both selects and updates.

So, as a conclusion, I agree with the others that you should check. But remember, joining two tables with millions of records is never cheap :)

Best regards,
Vitalii Tymchyshyn
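If you want to check the row-width arithmetic above against real data, pg_column_size() reports the size of a value, including whole rows - a rough sketch against the cars table from this thread:

    -- Approximate size in bytes of a few complete rows, NULL columns included;
    -- mostly-NULL wide rows come out much smaller than 151 populated columns would.
    SELECT id, pg_column_size(cars.*) AS row_bytes
    FROM cars
    LIMIT 10;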
Re: [PERFORM] Optimize update query
I assume that SQL databases (banks? telecom?) can handle a used car shop. No need for an unstructured data tool.

> +1. SQL databases have a limited number of inserts/updates per second. Even with high-end hardware you won't get more than XXX operations per second. As Thomas said, you should feed something like a NoSQL store from the www server and use another tool to do aggregation and batch inserts to PostgreSQL. It will scale much better.
> Marcin
Re: [PERFORM] Savepoints in transactions for speed?
On Tue, Nov 27, 2012 at 7:16 PM, Claudio Freire klaussfre...@gmail.com wrote:
> Updates are faster if batched, if your business logic allows it, because it creates less bloat and creates more opportunities for HOT updates. I don't think it applies to inserts, though, and I haven't heard it either.

Ah. That must have been what I'd half-remembered. Thanks for the clarification.

Mike
Re: [PERFORM] Savepoints in transactions for speed?
Commit marks are written to disk after each transaction, so transaction size has an impact on performance.
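A minimal sketch of that trade-off, using the adverts table from this thread: one commit record is flushed for a whole batch rather than per statement, and, where a small window of data loss on crash is acceptable, synchronous_commit can be relaxed for just that transaction:

    -- One COMMIT (one WAL flush) covers the whole batch:
    BEGIN;
    UPDATE adverts SET last_observed_at = now() WHERE id = 1602382;
    UPDATE adverts SET last_observed_at = now() WHERE id = 4916432;
    -- ... more statements ...
    COMMIT;

    -- Optionally trade durability for speed for this transaction only:
    BEGIN;
    SET LOCAL synchronous_commit TO off;
    UPDATE adverts SET last_observed_at = now() WHERE id = 3221246;
    COMMIT;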
Re: [PERFORM] Optimize update query
Okay guys,

Thanks for all the great help and advice already! Let me just clear some things up, to make my question a little easier to answer :-)

My site is a search engine for used cars - not just a car shop with a few hundred cars. The update query you are looking at is executed once a day, in chunks, for all active adverts, so we know they are still for sale (one car can be advertised in several places, hence several adverts). So it's not a constant stream, but it has a fairly high volume, especially at night time.

A compressed version of my .conf looks like this (note: there are some tweaks at the end of the file):

data_directory = '/var/lib/postgresql/9.2/main'
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/9.2-main.pid'
listen_addresses = '192.168.0.2, localhost'
port = 5432
max_connections = 1000
unix_socket_directory = '/var/run/postgresql'
wal_level = hot_standby
synchronous_commit = off
archive_mode = on
archive_command = 'rsync -a %p postgres@192.168.0.4:/var/lib/postgresql/9.2/wals/%f /dev/null'
max_wal_senders = 1
wal_keep_segments = 32
logging_collector = on
log_min_messages = debug1
log_min_error_statement = debug1
log_min_duration_statement = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
log_lock_waits = on
log_temp_files = 0
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
effective_cache_size = 22GB
work_mem = 160MB
wal_buffers = 4MB
checkpoint_segments = 16
shared_buffers = 7680MB
# All the log stuff is mainly a temporary requirement for pgBadger
# The database has been tuned with pgtuner

You might be familiar with New Relic; I use it for quite a lot of monitoring. This is what I see at night time (a lot of I/O): https://rpm.newrelic.com/public/charts/h2dtedghfsv

So I went to play around with pgBadger to get some insight at the database level. This shows me that by far the most time-consuming queries are updates (in general). On average a query like the one I showed you takes 1.3 s (but it often takes several minutes - which makes me wonder). So correct me if I'm wrong here: my theory is that I have too many too-slow update queries, which then often end up in a situation where they wait for each other to finish, hence the sometimes VERY long execution times. So my basic idea is that if I could reduce the cost of the updates, I could get a higher throughput overall.

Here is a sample of the pgBadger analysis - queries that took up the most time:

Rank 1 - total 1d15h28m38.71s - 948,711 executions - avg 0.15s:
  COMMIT;
Rank 2 - total 1d2h17m55.43s - 401,002 executions - avg 0.24s:
  INSERT INTO car_images ( car_id, created_at, image, updated_at ) VALUES ( '', '', '', '' ) returning id;
Rank 3 - total 23h18m33.68s - 195,093 executions - avg 0.43s:
  SELECT DISTINCT cars.id FROM cars LEFT OUTER JOIN adverts ON adverts.car_id = cars.id LEFT OUTER JOIN sellers ON sellers.id = adverts.seller_id WHERE cars.sales_state = '' AND cars.year = 0 AND cars.engine_size = 0.0 AND ( ( cars.id IS NOT NULL AND cars.brand = '' AND cars.model_name = '' AND cars.fuel = '' AND cars.km = 0 AND cars.price = 0 AND sellers.kind = '' ) ) LIMIT 0;
Rank 4 - total 22h45m26.52s - 3,374,133 executions - avg 0.02s:
  SELECT adverts.* FROM adverts WHERE ( source_name = '' AND md5 ( url ) = md5 ( '' ) ) LIMIT 0;
Rank 5 - total 10h31m37.18s - 29,671 executions - avg 1.28s:
  UPDATE adverts SET last_observed_at = '', data_source_id = '' WHERE adverts.id IN ( ... ) ;
Rank 6 - total 7h18m40.65s - 396,393 executions - avg 0.07s:
  UPDATE cars SET updated_at = '' WHERE cars.id = 0;
Rank 7 - total 7h6m7.87s - 241,294 executions - avg 0.11s:
  UPDATE cars SET images_count = COALESCE ( images_count, 0 ) + 0 WHERE cars.id = 0;
Rank 8 - total 6h56m11.78s - 84,571 executions - avg 0.30s:
  INSERT INTO failed_adverts ( active_record_object_class, advert_candidate, created_at, exception_class, exception_message, from_rescraper, last_retried_at, retry_count, source_name, stack_trace, updated_at, url ) VALUES ( NULL, '', '', '', '', NULL, NULL, '', '', '', '', '' ) returning id;
Rank 9 - total 5h47m25.45s - 188,402 executions - avg 0.11s:
  INSERT INTO adverts ( availability_state, car_id, created_at, data_source_id, deactivated_at, first_extraction, last_observed_at, price, seller_id, source_id, source_name, updated_at, url ) VALUES ( '', '', '', '', NULL, '', '', '', '', '', '', '', '' ) returning id;
Rank 10 - total 3h4m26.86s - 166,235 executions - avg 0.07s:
  UPDATE adverts SET deactivated_at = '', availability_state = '', updated_at = '' WHERE adverts.id = 0;

(Yes I'm already
Re: [PERFORM] Optimize update query
On 28 November 2012 16:19, Niels Kristian Schjødt nielskrist...@autouncle.com wrote:
> max_connections = 1000

max_connections = 1000 looks bad... why not put a pooler in place?

Cheers
Bèrto
Re: [PERFORM] Optimize update query
On 11/28/2012 10:19 AM, Niels Kristian Schjødt wrote:
> https://rpm.newrelic.com/public/charts/h2dtedghfsv

Doesn't this answer your question? That iowait is crushing your server into the ground. It's no surprise updates are taking several seconds. That update you sent us *should* execute on the order of only a few milliseconds.

So I'll reiterate that you *must* move your pg_xlog location elsewhere. You've got row lookup bandwidth conflicting with writes. There are a couple of other changes you should probably make to your config:

> checkpoint_segments = 16

This is not enough for the workload you describe. Every time the database checkpoints, all of those changes in pg_xlog are applied to the backend data files. You should set these values:

checkpoint_segments = 100
checkpoint_timeout = 10m
checkpoint_completion_target = 0.9

This will reduce your overall write workload and make it less bursty. Too many checkpoints massively reduce write throughput. With the settings you have, it's probably checkpointing constantly while your load runs. Start with this, but experiment with increasing checkpoint_segments further.

If you check your logs now, you probably see a ton of "checkpoint starting: xlog" in there. That's very bad. It should say "checkpoint starting: time", meaning it's keeping up with your writes naturally.

> work_mem = 160MB

This is probably way too high. work_mem is used for every sort operation in a query, so each connection could have several of these allocated, starving your system of memory and reducing what's available for the page cache. Change it to 8MB, and increase it in small increments if necessary.

> So correct me if I'm wrong here: my theory is that I have too many too-slow update queries, which then often end up in a situation where they wait for each other to finish, hence the sometimes VERY long execution times.

Sometimes this is the case, but for you, you're running into IO contention, not lock contention. Your 3TB RAID-1 is simply insufficient for this workload.

If you check your logs after making the changes I've suggested, take a look at your checkpoint sync times. That will tell you how long it took the kernel to physically commit those blocks to disk and get a confirmation back from the controller. If those take longer than a second or two, you're probably running into controller buffer overflows.

You have a large amount of RAM, so you should also make these two kernel changes to sysctl.conf:

vm.dirty_ratio = 10
vm.dirty_writeback_ratio = 1

Then run this:

sysctl -p

This will help prevent large IO write spikes caused when the kernel decides to write out dirty memory. That can make checkpoints take minutes to commit in some cases, which basically stops all write traffic to your database entirely.

That should get you going, anyway. You still need more/better disks so you can move your pg_xlog directory. With your write load, that will make a huge difference.

--
Shaun Thomas
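One way to check how often checkpoints are being forced by xlog volume rather than by the timeout, without digging through the logs - a sketch against the pg_stat_bgwriter statistics view:

    -- checkpoints_req counts checkpoints forced by checkpoint_segments being
    -- exhausted; if it dwarfs checkpoints_timed, checkpoint_segments is too low.
    SELECT checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,
           buffers_backend
    FROM pg_stat_bgwriter;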
Re: [PERFORM] Optimize update query
On 11/28/2012 11:44 AM, Niels Kristian Schjødt wrote:
> Thanks a lot - on the server I already have one additional SSD 250gb disk, that I don't use for anything at the moment.

God. An SSD would actually be better for your data, as it follows more random access patterns, while xlogs are more sequential. But it's better than nothing.

And yes, you'd be better off with a RAID-1 of two of these SSDs, because the xlogs are critical to database health. You have your archived copy due to the rsync, which helps. But if you had a crash, there could potentially be a need to replay unarchived transaction logs, and you'd end up with some data loss.

> BTW, as you might have seen from the .conf, I have a second slave server with the exact same setup, which currently runs as a hot streaming replication slave. I might ask a stupid question here, but this does not affect the performance of the master, does it?

Only if you're using synchronous replication. From what I saw in the config, that isn't the case.

--
Shaun Thomas
Re: [PERFORM] Database design - best practice
On Wed, Nov 28, 2012 at 4:41 AM, Niels Kristian Schjødt nielskrist...@autouncle.com wrote:
> So my main concern is actually about the cars table, since this one currently has a lot of columns (151 - I expect that's quite a lot?), and a lot of data (4 million rows, and growing). Now you might start by thinking this could sound like a regular need for some normalization, but wait a second and let me explain :-)

If you have 151 single-valued pieces of information, then that is what you have. You can't tell whether something is normalized or not by counting the columns.

> The columns in this table are for the most part very short strings, integers, decimals or booleans. So take for example has_automatic_transmission (boolean): I can't see why it would make sense to put that into a separate table and join in the values.

I can't see why that would make sense, either. Nor do I think that doing so would increase the level of normalization. What rule of normalization would be served by creating gratuitous joins?

> Or the mileage or the price as another example. The cars table used for search is indexed quite a lot.

How useful are the indices?

> 1) In general would the read and/or the write on the database be faster, if I serialized some of the not-searched columns in the table into a single text column instead of let's say 20 booleans?

Probably not. And it could make things much worse, depending on how you serialize it. For example, if you use hstore or json, the column names for each of the 20 booleans are now repeated in every row, rather than being metadata stored only once. But try it and see.

> 2) Let's say I'm updating a timestamp in a single one of the 151 columns in the cars table. The update statement is using the id to find the car. Would the write performance of that UPDATE be affected, if the table had fewer columns?

Yes, but probably not by much. The biggest effect will be whether the timestamp column is indexed. If it is, then updating it means that all other indexes on the table will also need to be updated. If it is not indexed, then the update can be a HOT update.

> 3) When adding a new column to the table I know that it becomes slower the more rows are in the table, but what about the width of the table - does that affect the performance when adding new columns?

Adding a new column to a table is pretty much instantaneous if the default value is NULL.

> 4) In general what performance downsides do you get when adding a lot of columns to one table instead of having them in separate tables?

This question cannot be answered in general. If every time you use the main table you have to join it to the separate table, then performance will be bad. If you almost never have to join to the separate table, then performance will be better.

> 5) Is it significantly faster to select * from a table with 20 columns, than selecting the same 20 in a table with 150 columns?

If the extra 130 columns are mostly null, the difference will be very small. Or, if the WHERE clause is such that you only do a single-row lookup on a primary key column, for example, the difference will also be small.

Cheers,
Jeff
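A rough way to see whether updates on a table are actually taking the HOT path mentioned above - a sketch using the standard statistics views, with the table names from this thread:

    -- n_tup_hot_upd close to n_tup_upd means most updates avoided index maintenance.
    SELECT relname, n_tup_upd, n_tup_hot_upd
    FROM pg_stat_user_tables
    WHERE relname IN ('cars', 'adverts');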
[PERFORM] NEED REPLICATION SOLUTION -POSTGRES 9.1
Hi,

We are planning to migrate our production databases to different servers. We have around 8 servers with 8 different clusters. We are planning to shuffle databases, make them into 7 clusters, and migrate to new remote servers. We cannot use streaming replication, as we are migrating different databases from different clusters into one single cluster. This will result in huge downtime, as the data is huge. Need expert advice on this scenario. Can we reduce downtime in any way?

Rgrds
Suhas
Re: [PERFORM] NEED REPLICATION SOLUTION -POSTGRES 9.1
On Wed, Nov 28, 2012 at 3:12 PM, suhas.basavaraj12 suha...@verse.in wrote:
> We cannot use streaming replication as we are migrating different databases from different clusters to one single cluster. This will result in huge downtime as the data is huge. Need expert advice on this scenario. Can we reduce downtime in any way?

One time we had to do something like this (on a smaller scale), we used Slony.
Re: [PERFORM] NEED REPLICATION SOLUTION -POSTGRES 9.1
I recommend SymmetricDS - http://www.symmetricds.org
[PERFORM] Comparative tps question
Hi,

I've just been benchmarking a new box I've got, and running pgbench yields what I thought was a slow tps count. It is difficult to find comparisons of other benchmark results online; I'd like to see if I have the box set up reasonably well. I know Oracle et al. prohibit publishing benchmark results, but was surprised that there don't seem to be any PostgreSQL ones out there.

Anyway, the machine is a Dell R720 with the data on a RAID 10 using 8x Intel 320 SSDs, and a mirrored pair of 15k SAS HDDs configured for the pg_xlog, both on a Dell H710 RAID controller. In addition it has 64GB of 1600MHz memory and 2x E5-2650 processors (with HT = 32 cores). The arrays are all set up with XFS and tweaked as well as I could. The drives are 160GB and overprovisioned by another 15%. I'm running PostgreSQL 9.1 on Ubuntu 12.04.

bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on the main data array. This doesn't seem too bad, although the specs show over 200MB/s should be achievable per drive.

pgbench (using a scaling factor of 100 with 100 clients and 25 threads) gives an average of about 7200 tps. Does this look acceptable? Instinctively it feels on the low side, although I noted that a couple of blogs (http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/ and http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html) show around 1500 tps for a single SSD, so maybe this is what is expected.

The interesting parameter differences from the postgresql.conf are:

shared_buffers = 6GB
work_mem = 64MB
max_stack_depth = 4MB
random_page_cost = 1.1
cpu_tuple_cost = 0.1
cpu_index_tuple_cost = 0.05
cpu_operator_cost = 0.025
effective_cache_size = 40GB

I'd be happy to provide any other configs, etc., assuming the tps values are way off the expected.

Thanks
John

PS. The number of safe SSDs available in the UK seems to be rather limited, hence the Intel 320s, which probably aren't as fast as modern drives.
Re: [PERFORM] Comparative tps question
On Wed, Nov 28, 2012 at 12:37 PM, John Lister john.lis...@kickstone.com wrote:
> bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on the main data array, this doesn't seem too bad although the specs show over 200MB/s should be achievable per drive.

Probably this limitation is coming from the SATA bus. It shouldn't be a problem in practice. Can you report bonnie++ seek performance? Another possibility is that the RAID controller is introducing overhead here.

> The interesting parameter differences from the postgresql.conf are: shared_buffers, work_mem, max_stack_depth, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost, effective_cache_size

*) None of the above settings will influence storage-bound pgbench results. Influential settings are fsync, synchronous_commit, wal_sync_method, wal_level, full_page_writes, wal_buffers, wal_writer_delay, and commit_delay. These settings are basically managing various trade-offs, especially in the sense of safety vs. performance.

> I'd be happy to provide any other configs, etc., assuming the tps values are way off the expected.

*) The very first thing we need to check is whether we are storage bound (check I/O wait) and, if so, where the bind-up is. It could be on the WAL or heap volume. Another possibility is that we're lock bound, which is a completely different issue to deal with. So we want to see top, iostat, vmstat, etc. while the test is happening.

*) Another interesting test to run is a large scaling factor (ideally at least 2x RAM) read-only test via pgbench -S.

merlin
Re: [PERFORM] Savepoints in transactions for speed?
On Tue, 2012-11-27 at 22:16 -0300, Claudio Freire wrote:
> Updates are faster if batched, if your business logic allows it, because it creates less bloat and creates more opportunities for HOT updates. I don't think it applies to inserts, though, and I haven't heard it either.

Huge updates (e.g. UPDATE with no WHERE clause) are less likely to benefit from HOT. HOT has two main optimizations:

1. Remove dead tuples faster without waiting for VACUUM -- this only works if the transaction that updated/deleted the tuple actually finished (otherwise the tuple can't be removed yet), so it only benefits the *next* update to come along. But if it's one big update, then VACUUM is probably just as good at cleaning up the space.

2. Don't make new index entries for the new tuple; reuse the old index entries -- this only works if the update is on the same page, but large updates tend to fill pages up (because of the buildup of dead tuples) and force new tuples to go to new pages.

HOT is essentially designed for lots of small updates, which didn't perform well before PG 8.3.

Batching of inserts/updates/deletes has a big benefit over separate transactions, but only up to a point, after which it levels off. I'm not sure exactly when that point is, but after that, the downsides of keeping a transaction open (like inability to remove the previous version of an updated tuple) take over.

Regards,
Jeff Davis
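Related to the same-page requirement in point 2: leaving some free space in each heap page increases the chance that an updated row version fits on the same page. A sketch only - 90 is an arbitrary illustrative value, and the adverts table name is taken from the other thread in this digest:

    -- Reserve ~10% free space per page for future row versions; this applies to
    -- newly written pages, so existing data only benefits after a rewrite
    -- (e.g. VACUUM FULL or CLUSTER).
    ALTER TABLE adverts SET (fillfactor = 90);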
Re: [PERFORM] Savepoints in transactions for speed?
On Tue, 2012-11-27 at 16:04 -0600, Mike Blackwell wrote: I need to delete about 1.5 million records from a table and reload it in one transaction. The usual advice when loading with inserts seems to be to group them into transactions of around 1k records. Committing at that point would leave the table in an inconsistent state. Would issuing a savepoint every 1k or so records negate whatever downside there is to keeping a transaction open for all 1.5 million records, or just add more overhead? A large transaction isn't really a big problem for PostgreSQL, and 1.5M records should be processed quickly anyway. The main problem with a long-running delete or update transaction is that the dead tuples (deleted tuples or the old version of an updated tuple) can't be removed until the transaction finishes. That can cause temporary bloat, but 1.5M records shouldn't be noticeable. Adding subtransactions into the mix won't help, but probably won't hurt, either. The transaction will still run just as long, and you still can't delete the tuples ahead of time (unless you abort a subtransaction). If you *do* use subtransactions, make sure to release them as quickly as you create them (don't just use ROLLBACK TO, that still leaves the savepoint there); having 1500 open subtransactions might cause performance problems elsewhere. Regards, Jeff Davis -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
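To illustrate the "release as quickly as you create them" advice, a minimal sketch (the one-column table and generated data are made up; a real reload would presumably use COPY or application inserts):

    -- hypothetical table, e.g. CREATE TABLE items (id int);
    BEGIN;
    DELETE FROM items;
    SAVEPOINT batch;
    INSERT INTO items SELECT g FROM generate_series(1, 1000) AS g;
    RELEASE SAVEPOINT batch;     -- drop the subtransaction as soon as the batch succeeds
    SAVEPOINT batch;
    INSERT INTO items SELECT g FROM generate_series(1001, 2000) AS g;
    RELEASE SAVEPOINT batch;
    -- on error: ROLLBACK TO SAVEPOINT batch undoes only the failed batch
    COMMIT;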
Re: [PERFORM] Optimize update query
Hi, I have started to implement your suggestions. I have hit a small error so far, though. The vm.dirty_writeback_ratio = 1 command returns: error: "vm.dirty_writeback_ratio" is an unknown key. I'm on Ubuntu 12.04. On 28/11/2012 at 17.54, Shaun Thomas stho...@optionshouse.com wrote: On 11/28/2012 10:19 AM, Niels Kristian Schjødt wrote: https://rpm.newrelic.com/public/charts/h2dtedghfsv Doesn't this answer your question? That iowait is crushing your server into the ground. It's no surprise updates are taking several seconds. That update you sent us *should* execute on the order of only a few milliseconds. So I'll reiterate that you *must* move your pg_xlog location elsewhere. You've got row lookup bandwidth conflicting with writes. There are a couple of other changes you should probably make to your config: checkpoint_segments = 16 This is not enough for the workload you describe. Every time the database checkpoints, all of those changes in pg_xlog are applied to the backend data files. You should set these values: checkpoint_segments = 100 checkpoint_timeout = 10m checkpoint_completion_target = 0.9 This will reduce your overall write workload and spread it out over time. Too many checkpoints massively reduce write throughput. With the settings you have, it's probably checkpointing constantly while your load runs. Start with this, but experiment with increasing checkpoint_segments further. If you check your logs now, you probably see a ton of "checkpoint starting: xlog" in there. That's very bad. It should say "checkpoint starting: time", meaning it's keeping up with your writes naturally. work_mem = 160MB This is probably way too high. work_mem is used for every sort operation in a query, so each connection could have several of these allocated, starving your system of memory that would otherwise be available for page cache. Change it to 8MB, and increase it in small increments if necessary. So correct me if I'm wrong here: my theory is that I have too many, too-slow update queries that then often end up waiting for each other to finish, hence the sometimes VERY long execution times. Sometimes this is the case, but for you, you're running into IO contention, not lock contention. Your 3TB RAID-1 is simply insufficient for this workload. If you check your logs after making the changes I've suggested, take a look at your checkpoint sync times. That will tell you how long it took the kernel to physically commit those blocks to disk and get a confirmation back from the controller. If those take longer than a second or two, you're probably running into controller buffer overflows. You have a large amount of RAM, so you should also make these two kernel changes to sysctl.conf: vm.dirty_ratio = 10 vm.dirty_writeback_ratio = 1 Then run this: sysctl -p This will help prevent large IO write spikes caused when the kernel decides to write out dirty memory. That can make checkpoints take minutes to commit in some cases, which basically stops all write traffic to your database entirely. That should get you going, anyway. You still need more/better disks so you can move your pg_xlog directory. With your write load, that will make a huge difference. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. 
| Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
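For reference, the suggestions above gathered into config form. The values are taken from the message; the sysctl key is written here as vm.dirty_background_ratio, which appears to be what was intended, since vm.dirty_writeback_ratio does not exist as a tunable (see the follow-up below) -- treat that substitution as an assumption:

    # postgresql.conf -- checkpoint and memory settings suggested above
    checkpoint_segments = 100
    checkpoint_timeout = 10min
    checkpoint_completion_target = 0.9
    work_mem = 8MB                    # raise in small increments only if needed

    # /etc/sysctl.conf -- limit dirty-page buildup before the kernel flushes
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 1

    # apply the sysctl changes without rebooting
    sudo sysctl -p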
Re: [PERFORM] Savepoints in transactions for speed?
On Wed, Nov 28, 2012 at 8:28 PM, Jeff Davis pg...@j-davis.com wrote: The main problem with a long-running delete or update transaction is that the dead tuples (deleted tuples or the old version of an updated tuple) can't be removed until the transaction finishes. That can cause temporary bloat, but 1.5M records shouldn't be noticeable. Not really that fast if you have indices (and who doesn't have a PK or two). I've never been able to update 2M rows in one transaction in a reasonable time (read: less than several hours) without dropping indices. Doing it in batches is way faster if you can't drop the indices and if you can leverage HOT updates. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
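A sketch of the drop-then-rebuild approach mentioned above (the table and index names are invented; only secondary indexes touching the updated column would be dropped, the primary key stays, and any remaining indexes still incur maintenance because a full-table update is mostly not HOT):

    BEGIN;
    DROP INDEX IF EXISTS listings_updated_at_idx;   -- skip maintaining this index during the update
    UPDATE listings SET updated_at = now();          -- the bulk update
    CREATE INDEX listings_updated_at_idx ON listings (updated_at);  -- rebuild once, afterwards
    COMMIT;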
Re: [PERFORM] Optimize update query
In later kernels these have been renamed: Welcome to Ubuntu 12.04.1 LTS (GNU/Linux 3.2.0-32-generic x86_64)

$ sysctl -a | grep dirty
vm.dirty_background_ratio = 5
vm.dirty_background_bytes = 0
vm.dirty_ratio = 10
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000

You have the option of specifying either a ratio or - more usefully for machines with a lot of RAM - bytes. Regards, Mark P.S.: People on this list usually prefer it if you *bottom* post (i.e. reply underneath the original). On 29/11/12 16:32, Niels Kristian Schjødt wrote: Hi, I have started to implement your suggestions. I have hit a small error so far, though. The vm.dirty_writeback_ratio = 1 command returns: error: "vm.dirty_writeback_ratio" is an unknown key. I'm on Ubuntu 12.04. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
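On a machine with as much RAM as the one in this thread, the byte-based knobs give finer control, since even the minimum 1% ratio of 64GB is a lot of dirty data to flush at once. An illustrative sketch (the thresholds are made-up starting points rather than recommendations from the thread; setting a *_bytes value overrides its *_ratio counterpart):

    # /etc/sysctl.conf -- illustrative byte-based dirty limits for a large-RAM host
    vm.dirty_background_bytes = 268435456     # begin background writeback at ~256 MB dirty
    vm.dirty_bytes = 1073741824               # throttle writers once ~1 GB is dirty

    # apply without rebooting
    sudo sysctl -p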