Re: [PERFORM] Index scan is not working, why??
please provide non-default config options on this host plus the same from a host which is using an index scan. Also, PostgreSQL version, OS, and all of the other stuff that is asked for in this document: http://wiki.postgresql.org/wiki/SlowQueryQuestions. It is impossible to say why the query planner might be choosing a particular plan without any insight whatsoever as to how the server is configured.

On Wed, Oct 20, 2010 at 10:25 PM, AI Rumman rumman...@gmail.com wrote:

I don't know why a seq scan is running on the following query where the same query is giving an index scan on other servers:

explain analyze select * from act where act.acttype in ('Meeting','Call','Task');

                                  QUERY PLAN
 Seq Scan on act  (cost=0.00..13386.78 rows=259671 width=142) (actual time=0.013..484.572 rows=263639 loops=1)
   Filter: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text = 'Call'::text) OR ((acttype)::text = 'Task'::text))
 Total runtime: 732.956 ms
(3 rows)

The above query gives an index scan on other servers, and even if I rewrite the query as follows I get an index scan:

explain analyze select * from act where act.acttype = 'Meeting' or act.acttype = 'Call';

                                  QUERY PLAN
 Bitmap Heap Scan on act  (cost=17.98..1083.80 rows=2277 width=142) (actual time=1.901..9.722 rows=4808 loops=1)
   Recheck Cond: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text = 'Call'::text))
   ->  BitmapOr  (cost=17.98..17.98 rows=2281 width=0) (actual time=1.262..1.262 rows=0 loops=1)
         ->  Bitmap Index Scan on act_acttype_idx  (cost=0.00..8.99 rows=1141 width=0) (actual time=0.790..0.790 rows=3181 loops=1)
               Index Cond: ((acttype)::text = 'Meeting'::text)
         ->  Bitmap Index Scan on act_acttype_idx  (cost=0.00..8.99 rows=1141 width=0) (actual time=0.469..0.469 rows=1630 loops=1)
               Index Cond: ((acttype)::text = 'Call'::text)
 Total runtime: 14.227 ms
(8 rows)

\d act
Table "public.act"

      Column      |          Type          |                 Modifiers
------------------+------------------------+-------------------------------------------
 actid            | integer                | not null default 0
 subject          | character varying(250) | not null
 semodule         | character varying(20)  |
 acttype          | character varying(200) | not null
 date_start       | date                   | not null
 due_date         | date                   |
 time_start       | character varying(50)  |
 time_end         | character varying(50)  |
 sendnotification | character varying(3)   | not null default '0'::character varying
 duration_hours   | character varying(2)   |
 duration_minutes | character varying(200) |
 status           | character varying(200) |
 eventstatus      | character varying(200) |
 priority         | character varying(200) |
 location         | character varying(150) |
 notime           | character varying(3)   | not null default '0'::character varying
 visibility       | character varying(50)  | not null default 'all'::character varying
 recurringtype    | character varying(200) |
 end_date         | date                   |
 end_time         | character varying(50)  |
Indexes:
    "act_pkey" PRIMARY KEY, btree (actid)
    "act_acttype_idx" btree (acttype)
    "act_date_start_idx" btree (date_start)
    "act_due_date_idx" btree (due_date)
    "act_eventstatus_idx" btree (eventstatus)
    "act_status_idx" btree (status)
    "act_subject_idx" btree (subject)
    "act_time_start_idx" btree (time_start)

Any idea please?
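The plan itself already hints at the answer: the filter matches ~263k rows, so if that is a large fraction of the table, a sequential scan is the cheaper plan and the planner is right to pick it. A quick way to check the actual selectivity — a sketch using the table and column names from the query above, not something from the original thread:

```sql
-- How many rows does act hold, and how many match the IN filter?
-- A seq scan is expected when the matching fraction is large (say > 10-20%).
SELECT count(*) AS total_rows,
       sum(CASE WHEN acttype IN ('Meeting','Call','Task')
                THEN 1 ELSE 0 END) AS matching_rows
FROM act;
```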
Re: [PERFORM] Index scan is not working, why??
On Thu, Oct 21, 2010 at 1:51 AM, Samuel Gendler sgend...@ideasculptor.com wrote: please provide non-default config options on this host plus the same from a host which is using an index scan. Also, PostgreSQL version, OS, and all of the other stuff that is asked for in this document: http://wiki.postgresql.org/wiki/SlowQueryQuestions. It is impossible to say why the query planner might be choosing a particular plan without any insight whatsoever as to how the server is configured.

I know it's mentioned in that wiki doc, but the DDL for the table and its indexes, or the output of \d tablename, is quite useful and should be included as well.

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Periodically slow inserts
Hello,

We are using PostgreSQL for storing data and full-text search indexes for the website of a daily newspaper. We are very happy overall with the results, but we have one weird behaviour that we would like to solve.

The problem is when we index objects into the full-text search part of the database (which is a DELETE and then an INSERT into a specific table), the INSERT sometimes takes a long time (from 10s to 20s), but the same insert (and many other similar ones) is fast (below 0.2s).

This slowness comes regularly, about every 200 objects indexed, regardless of the frequency of the inserts. If I reindex one object every 5 seconds for one hour, or one object every second for 10 minutes, I get the same kind of results: around 0.5% of the time, indexing takes more than 10s.

The positive point is that this slowness doesn't block the rest of the queries to the database, but it's still painful to have to wait (even if only once in a while) for 10s or 20s when the end-user hits the save button. This slowness is associated with very high IO load on the operating system.

I tried playing with checkpoint parameters (making them more frequent or less frequent), but I didn't notice any significant difference.

Do you have any hint on how to smooth the process, so we don't have this regular huge slowdown?

If you want more details about the setup:

- the server is a Xen virtual machine with 8Gb of memory, disks being 15000 rpm SAS disks on RAID 1, and the CPU being one core of a Nehalem processor (but CPU load is low anyway).

- the database schema is like:

CREATE TABLE sesql_index (
    classname varchar(255),
    id integer,
    created_at timestamp,
    modified_at timestamp,
    created_by integer,
    modified_by integer,
    workflow_state integer,
    site_id integer,
    title_text text,
    title_tsv tsvector,
    subtitle_text text,
    subtitle_tsv tsvector,
    fulltext_text text,
    fulltext_tsv tsvector,
    authors integer[],
    folders integer[],
    [...]
    indexed_at timestamp DEFAULT NOW(),
    PRIMARY KEY (classname, id)
);

CREATE TABLE sesql_author (CHECK (classname = 'Author'),
    PRIMARY KEY (classname, id)) INHERITS (sesql_index);

CREATE TABLE sesql_program (CHECK (classname = 'Program'),
    PRIMARY KEY (classname, id)) INHERITS (sesql_index);

CREATE TABLE sesql_default (CHECK (classname = 'Slideshow'
    OR classname = 'Book' OR classname = 'Article'
    OR classname = 'Publication' OR classname = 'Forum'),
    PRIMARY KEY (classname, id)) INHERITS (sesql_index);

(with a few other similar tables for different objects)

Inserts/deletes are done directly into the child tables; searches are done either on the master table (sesql_index) or on the child tables depending on the use case (but search works fine anyway).

In addition to that we have several indexes, created on each child table:

CREATE INDEX sesql_default_classname_index ON sesql_default (classname);
CREATE INDEX sesql_default_id_index ON sesql_default (id);
CREATE INDEX sesql_default_created_at_index ON sesql_default (created_at);
CREATE INDEX sesql_default_modified_at_index ON sesql_default (modified_at);
CREATE INDEX sesql_default_created_by_index ON sesql_default (created_by);
CREATE INDEX sesql_default_modified_by_index ON sesql_default (modified_by);
CREATE INDEX sesql_default_workflow_state_index ON sesql_default (workflow_state);
CREATE INDEX sesql_default_site_id_index ON sesql_default (site_id);
CREATE INDEX sesql_default_publication_date_index ON sesql_default (publication_date);
CREATE INDEX sesql_default_authors_index ON sesql_default USING GIN (authors);
CREATE INDEX sesql_default_folders_index ON sesql_default USING GIN (folders);

And the heavy ones: for each fulltext field we have two columns, the text and the tsv, with an index on the tsv, and the tsv itself is updated via a trigger:

CREATE INDEX sesql_default_fulltext_index ON sesql_default USING GIN (fulltext_tsv);

CREATE TRIGGER sesql_default_fulltext_update BEFORE INSERT OR UPDATE
    ON sesql_default FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(fulltext_tsv, 'public.lem_french', fulltext_text);

Thanks a lot for reading me until here ;)

Regards,

-- Gaël Le Mignot - g...@pilotsystems.net Pilot Systems - 9, rue Desargues - 75011 Paris Tel : +33 1 44 53 05 55 - www.pilotsystems.net Gérez vos contacts et vos newsletters : www.cockpit-mailing.com
Re: [PERFORM] Periodically slow inserts
Hi,

There are a lot of details missing about your system: http://wiki.postgresql.org/wiki/SlowQueryQuestions

Cheers, Ken

On Thu, Oct 21, 2010 at 02:25:44PM +0200, Gael Le Mignot wrote: Hello, We are using PostgreSQL for storing data and full-text search indexes for the website of a daily newspaper. We are very happy overall with the results, but we have one weird behaviour that we would like to solve. ...
Re: [PERFORM] Periodically slow inserts
We are using PostgreSQL for storing data and full-text search indexes for the website of a daily newspaper. We are very happy overall with the results, but we have one weird behaviour that we would like to solve.

I think there's a lot of missing info worth knowing:

1) Checkpoint logs? Enable them, maybe the slowness happens at checkpoints: log_checkpoints=true
2) How many rows does each table contain?
3) HW: how many disks you have, and which controller you're using (and: does it use a BBU?)

The more you tell the list, the better help you'll get...
Re: [PERFORM] Index scan is not working, why??
AI Rumman wrote: I don't know why a seq scan is running on the following query where the same query is giving an index scan on other servers: explain analyze select * from act where act.acttype in ('Meeting','Call','Task'); QUERY PLAN Seq Scan on act (cost=0.00..13386.78 rows=259671 width=142) (actual time=0.013..484.572 rows=263639 loops=1) Filter: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text = 'Call'::text) OR ((acttype)::text = 'Task'::text)) Total runtime: 732.956 ms (3 rows)

AI, what percentage of the rows fits the above criteria? How big are your histograms?

-- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
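Both questions can be answered from the catalogs. A sketch, assuming the table and column names from the quoted query (the statistics target of 100 is just an example value):

```sql
-- What does the planner believe about the column's value distribution?
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'act' AND attname = 'acttype';

-- The histogram/MCV size is a per-column setting; raise it and
-- re-analyze if the estimates look off.
ALTER TABLE act ALTER COLUMN acttype SET STATISTICS 100;
ANALYZE act;
```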
[PERFORM] New wiki page on write reliability
Now that some of my recent writing has gone from NDA protected to public sample, I've added a new page to the PostgreSQL wiki that provides a good starting set of resources to learn about an ever popular topic here, how write cache problems can lead to database corruption: http://wiki.postgresql.org/wiki/Reliable_Writes

Bruce also has a presentation he's been working on that adds pictures showing the flow of data through the various cache levels, to help people visualize the whole thing, that should get added into there once he's finished tweaking it.

I'd like to get some feedback from the members of this list about what's still missing after this expanded data dump. Ultimately I'd like to get this page to be an authoritative enough resource that the Reliability section of the official documentation could point back to this as a recommendation for additional information. So much of this material requires singling out specific vendors and staying up to date with hardware changes, both things that the official docs are not a good place for.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, PostgreSQL 9.0 High Performance: http://www.2ndquadrant.com/books
Re: [PERFORM] New wiki page on write reliability
On 10-10-21 10:08 AM, Greg Smith wrote: Now that some of my recent writing has gone from NDA protected to public sample, I've added a new page to the PostgreSQL wiki that provides a good starting set of resources to learn about an ever popular topic here, how write cache problems can lead to database corruption: http://wiki.postgresql.org/wiki/Reliable_Writes Bruce also has a presentation he's been working on that adds pictures showing the flow of data through the various cache levels, to help people visualize the whole thing, that should get added into there once he's finished tweaking it. I'd like to get some feedback from the members of this list about what's still missing after this expanded data dump. Ultimately I'd like to get this page to be an authoritative enough resource that the Reliability section of the official documentation could point back to this as a recommendation for additional information. So much of this material requires singling out specific vendors and staying up to date with hardware changes, both things that the official docs are not a good place for.

Looks like a good start. I think a warning about turning fsync off, the dangers of async_commit, and the potential problems with disabling full_page_writes might be worth mentioning on this page, unless you want to leave that buried in the attached references.

-- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Re: [PERFORM] Periodically slow inserts
Gael Le Mignot g...@pilotsystems.net writes: The problem is when we index objects into the full-text search part of the database (which is a DELETE and then an INSERT into a specific table), the INSERT sometimes takes a long time (from 10s to 20s), but the same insert (and many other similar ones) is fast (below 0.2s). This slowness comes regularly, about every 200 objects indexed, regardless of the frequency of the inserts.

Hm. You didn't say which PG version you're using, but if it's >= 8.4, I think this may be caused by GIN's habit of queuing index insertions until it's accumulated a reasonable-size batch: http://www.postgresql.org/docs/9.0/static/gin-implementation.html#GIN-FAST-UPDATE

While you can turn that off, I think that doing so will reduce the index's search efficiency over time. It might be better to schedule regular vacuums on the table so that the work is done by vacuum rather than by foreground queries.

regards, tom lane
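For the archives, the two options Tom describes look roughly like this — the index and table names are taken from the schema posted earlier in the thread, so treat this as a sketch rather than a tested recipe:

```sql
-- Option 1: turn off the pending-list batching on the GIN index, paying
-- a little on every INSERT instead of a big flush every ~200 inserts.
ALTER INDEX sesql_default_fulltext_index SET (fastupdate = off);

-- Option 2: keep fastupdate on, but flush the pending list in the
-- background by vacuuming the table regularly (e.g. from cron):
VACUUM ANALYZE sesql_default;
```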
Re: [PERFORM] Periodically slow inserts
does it use a BBU?) Sorry, this was supposed to read "do you have cache on the controller?"; of course a battery can't change the performance... but you got it anyway...
Re: [PERFORM] Periodically slow inserts
2010-10-21 16:39:15 CEST LOG: checkpoint complete: wrote 365 buffers (11.9%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.403 s, sync=21.312 s, total=21.829 s

I'm no expert, but isn't 21s to sync 365 buffers a big amount of time?
Re: [PERFORM] New wiki page on write reliability
Brad Nicholson wrote: I think a warning about turning fsync off, the dangers of async_commit, and the potential problems with disabling full_page_writes might be worth mentioning on this page, unless you want to leave that buried in the attached references.

Good idea to highlight that. What I just did here was point out which of the references covered that specific topic, which is as good as I could do for now. It's hard for me to justify spending time writing more about those when they are covered in the attached references, and I personally can't do it because of my publishing agreement. The fact that the information about this topic is what ended up being released as the sample material from my book is not coincidence--I wanted to be able to share what I'd done here as a free community resource because this topic is so important to me. But I can't go much further than what I've already put up there myself.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [PERFORM] Periodically slow inserts
2010/10/21 Leonardo Francalanci m_li...@yahoo.it: 2010-10-21 16:39:15 CEST LOG: checkpoint complete: wrote 365 buffers (11.9%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.403 s, sync=21.312 s, total=21.829 s I'm no expert, but isn't 21s to sync 365 buffers a big amount of time?

It is. I suggest looking at /proc/meminfo for the dirty buffers, and at the results of 'iostat -x 2' running for a few moments.

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
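Concretely, Cédric's suggestion amounts to something like the following (Linux only; iostat comes from the sysstat package):

```shell
# How much dirty data is the kernel holding back? A large "Dirty:" value
# that collapses during the stall points at the checkpoint sync phase.
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Per-device utilisation and write latencies, 5 samples 2 seconds apart.
command -v iostat >/dev/null 2>&1 && iostat -x 2 5 \
  || echo "iostat not found: install the sysstat package"
```

Running the iostat sampler while a slow INSERT is in flight shows whether one device is pinned at 100% utilisation during the stall.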
Re: [PERFORM] Periodically slow inserts
Gael Le Mignot wrote: The delay is the delay of the sync part of the checkpoints: 2010-10-21 16:39:15 CEST LOG: checkpoint complete: wrote 365 buffers (11.9%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.403 s, sync=21.312 s, total=21.829 s Maybe there is something I misunderstood, but aren't the checkpoints supposed to run smoothly over the checkpoint_completion_target interval?

Well, first off you have to get the checkpoints spaced out in time enough for that to work. Both checkpoint_segments and possibly checkpoint_timeout may also need to be increased in order for the checkpoint write spreading code to work. When I start seeing long sync times, I'll usually shoot for 64 segments and 10 minutes for the timeout to give that smoothing work some room to do what it's supposed to.

However, only the main checkpoint writes are spread over time. The hope is that by the time the sync phase starts, the operating system will have already written most of them out. Sometimes, particularly in servers with lots of RAM for caching writes, this doesn't happen. In that case, you can have gigabytes of data queued up at the beginning of the sync phase--which is not spread out at all.

We are currently working on a spread sync feature for PostgreSQL that makes this problem better on platforms/filesystems where it's possible to improve behavior (you can't do anything about this issue on ext3 for example). I'll be talking about that development at the PgWest conference in two weeks: https://www.postgresqlconference.org/content/righting-your-writes and hope to submit a patch with a first version of this feature to the November development CommitFest, in hopes of it making it into version 9.1. If you can't come up with any solution here and need help with your current version sooner than that, we've already backported this improvement all the way to 8.3; drop me an off-list note if you want to discuss the consulting services we have available in this area.

If you're lucky, just adjusting the segment and timeout values may be enough for you.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, PostgreSQL 9.0 High Performance http://www.2ndquadrant.com/books
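In postgresql.conf terms, the starting point Greg describes looks something like this (the values are his stated rules of thumb, not universal advice):

```
# Space checkpoints far enough apart that the spread-write logic has room
checkpoint_segments = 64
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9

# Keep watching the write=/sync= breakdown in the logs
log_checkpoints = on
```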
Re: [PERFORM] Periodically slow inserts
Hello Tom!

Thu, 21 Oct 2010 10:55:48 -0400, you wrote: Gael Le Mignot g...@pilotsystems.net writes: The problem is when we index objects into the full-text search part of the database (which is a DELETE and then an INSERT into a specific table), the INSERT sometimes takes a long time (from 10s to 20s), but the same insert (and many other similar ones) is fast (below 0.2s). This slowness comes regularly, about every 200 objects indexed, regardless of the frequency of the inserts. Hm. You didn't say which PG version you're using, but if it's >= 8.4, I think this may be caused by GIN's habit of queuing index insertions until it's accumulated a reasonable-size batch: http://www.postgresql.org/docs/9.0/static/gin-implementation.html#GIN-FAST-UPDATE While you can turn that off, I think that doing so will reduce the index's search efficiency over time. It might be better to schedule regular vacuums on the table so that the work is done by vacuum rather than foreground queries.

Thanks for your feedback. It seems to be related: at least, if I increase the work_mem variable, the slowness becomes bigger (up to 1 minute for a work_mem of 8MB) but much less frequent (around 0.05% instead of 0.5% of the requests for 8MB instead of 1MB).

So a big work_mem and a regular vacuum would do the trick, I think. Does autovacuum trigger the GIN index vacuuming too, or does it require a manual vacuum?

Regards,

-- Gaël Le Mignot - g...@pilotsystems.net Pilot Systems - 9, rue Desargues - 75011 Paris Tel : +33 1 44 53 05 55 - www.pilotsystems.net Gérez vos contacts et vos newsletters : www.cockpit-mailing.com
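If autovacuum turns out not to run often enough on this table (the delete-then-insert pattern creates dead rows quickly), it can be made more aggressive per table. The thresholds below are illustrative only; the table name is taken from the schema posted earlier in the thread:

```sql
-- Vacuum sesql_default after ~1% of its rows are dead, instead of the
-- default 20%; this also flushes the GIN pending list more often.
ALTER TABLE sesql_default SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 100
);
```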
Re: [PERFORM] New wiki page on write reliability
Greg Smith wrote: Now that some of my recent writing has gone from NDA protected to public sample, I've added a new page to the PostgreSQL wiki that provides a good starting set of resources to learn about an ever popular topic here, how write cache problems can lead to database corruption: http://wiki.postgresql.org/wiki/Reliable_Writes Bruce also has a presentation he's been working on that adds pictures showing the flow of data through the various cache levels, to help people visualize the whole thing, that should get added into there once he's finished tweaking it.

My presentation is done and is now on the wiki too: http://momjian.us/main/writings/pgsql/hw_selection.pdf

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [PERFORM] Periodically slow inserts
Gael Le Mignot wrote: Hello, We are using PostgreSQL for storing data and full-text search indexes for the website of a daily newspaper. We are very happy overall with the results, but we have one weird behaviour that we would like to solve. The problem is when we index objects into the full-text search part of the database (which is a DELETE and then an INSERT into a specific table), the INSERT sometimes takes a long time (from 10s to 20s), but the same insert (and many other similar ones) is fast (below 0.2s).

Have you tried with strace, just to see where the time is spent?

-- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 http://www.vmsinfo.com The Leader in Integrated Media Intelligence Solutions
Re: [PERFORM] BBU Cache vs. spindles
On 10/20/2010 09:45 PM, Scott Marlowe wrote: On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com wrote: On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: Ben Chobot wrote: On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else?

A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're still spinning media. Raid card cache is way faster, and while it's best at bursty writes, it sounds like bursty writes are precisely what you have.

Totally agree! BBU first, more spindles second.

Agreed. Note that while you can get incredible burst performance from a battery backed cache, due to both caching and writing out of order, once the throughput begins to saturate at the speed of the disk array, the BBU cache is only re-ordering really, as it will eventually fill up faster than the disks can take the writes, and you'll settle in at some percentage of the max TPS you get for a short benchmark run. It's vitally important that once you put a BBU cache in place, you run a very long running transactional test (pgbench is a simple one to start with) that floods the IO subsystem, so you see what your average throughput is with the WAL and data store getting flooded. I know on my system pgbench runs of a few minutes can be 3 or 4 times faster than runs that last for the better part of an hour.

Thanks for all the replies. This is what I suspected, but since I can't just buy one of everything to try, I wanted a sanity-check before spending the $$$. I am not too worried about saturating the controller cache, as the current much lower spec machine can handle the sustained load just fine and the bursts are typically only 1-3 seconds long, spaced a minute or more apart.

Cheers, Steve
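The kind of long, flooding run Scott describes could be sketched as follows (the scale factor and duration are arbitrary examples, not values from the thread; pick a scale large enough that the data set exceeds RAM so the cache has to drain to disk):

```shell
pgbench -i -s 2000 bench      # initialize a data set well larger than RAM
pgbench -c 16 -T 3600 bench   # 16 clients for a full hour, not a quick burst
```

Compare the reported TPS against a short 5-minute run; a large gap means the short run was mostly measuring the write cache, not the disks.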
Re: [PERFORM] Periodically slow inserts
Gael Le Mignot g...@pilotsystems.net writes: Thu, 21 Oct 2010 10:55:48 -0400, you wrote: I think this may be caused by GIN's habit of queuing index insertions until it's accumulated a reasonable-size batch: http://www.postgresql.org/docs/9.0/static/gin-implementation.html#GIN-FAST-UPDATE So a big work_mem and a regular vacuum would do the trick, I think. Does autovacuum trigger the GIN index vacuuming too, or does it require a manual vacuum?

Autovacuum will handle it too.

regards, tom lane
Re: [PERFORM] BBU Cache vs. spindles
Scott Marlowe wrote: On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com wrote: On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: Ben Chobot wrote: On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're still spinning media. Raid card cache is way faster, and while it's best at bursty writes, it sounds like bursty writes are precisely what you have. Totally agree! BBU first, more spindles second. Agreed. Note that while you can get incredible burst performance from a battery backed cache, due to both caching and writing out of order, once the throughput begins to saturate at the speed of the disk array, the BBU cache is only re-ordering really, as it will eventually fill up faster than the disks can take the writes, and you'll settle in at some percentage of the max TPS you get for a short benchmark run. It's vitally important that once you put a BBU cache in place, you run a very long running transactional test (pgbench is a simple one to start with) that floods the IO subsystem, so you see what your average throughput is with the WAL and data store getting flooded. I know on my system pgbench runs of a few minutes can be 3 or 4 times faster than runs that last for the better part of an hour.

With a BBU you can turn off full_page_writes, which should decrease the WAL traffic. However, I don't see this mentioned in our documentation. Should I add it?

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian br...@momjian.us wrote: With a BBU you can turn off full_page_writes

My understanding is that that is not without risk. What happens if the WAL is written, there is a commit, but the data page has not yet been written to the controller? Don't we still have a torn page?

-Kevin
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: With a BBU you can turn off full_page_writes My understanding is that that is not without risk. What happens if the WAL is written, there is a commit, but the data page has not yet been written to the controller? Don't we still have a torn page?

I don't see how full_page_writes affects pages not yet written to the controller. full_page_writes is designed to guard against a partial write to a device. I don't think the raid cache can be partially written to, and the cache will not be cleared until the drive has fully written the data to disk.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian wrote: With a BBU you can turn off full_page_writes, which should decrease the WAL traffic. However, I don't see this mentioned in our documentation. Should I add it?

What I would like to do is beef up the documentation with some concrete examples of how to figure out if your cache and associated write path are working reliably or not. It should be possible to include "does this handle full page writes correctly?" in that test suite. Until we have something like that, I'm concerned that bugs in filesystem or controller handling may make full_page_writes unsafe even with a BBU, and we'd have no way for people to tell if that's true or not.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian br...@momjian.us wrote: full_page_writes is designed to guard against a partial write to a device. I don't think the raid cache can be partially written to. So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? -Kevin
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: full_page_writes is designed to guard against a partial write to a device. I don't think the raid cache can be partially written to. So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? Sure, that happens. But if the BBU has gotten an fsync call after the 8K write, it shouldn't return success until after all 8K are in its cache. That's why full_page_writes should be safe on a system with a BBU, as Bruce is suggesting. But I'd like to see some independent proof of that fact, that includes some targeted tests users can run, before we start recommending that practice. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [PERFORM] Slow count(*) again...
On 2010-10-21 06:47, Scott Carey wrote: On a wimpy disk, I/O bound for sure. But my disks go 1000MB/sec. No query can go fast enough for them. The best I've gotten is 800MB/sec, on a wide row (average 800 bytes). Most tables go 300MB/sec or so. And with 72GB of RAM, many scans are in-memory anyway. Is it cpu or io bound while doing it? Can you scan it faster using time cat relation-oid.* > /dev/null A single SSD with supercapacitor will go about 500MB/sec by itself next spring. I will easily be able to build a system with 2GB/sec I/O for under $10k. What filesystem are you using? Readahead? Can you try to check the filesystem fragmentation of the table using filefrag? -- Jesper
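Jesper's suggestion can be turned into a quick script. This is a hedged sketch, not a real benchmark: the file below is a 64MB stand-in created on the spot, so the snippet is self-contained; on a real system you would instead point it at the file reported by `SELECT pg_relation_filepath('act');` under `$PGDATA`, and the commented `filefrag` call (Linux, ext4/xfs) checks fragmentation.

```shell
# Sketch: measure raw sequential read speed of a relation's underlying
# file, bypassing the executor, to see whether a scan is I/O- or
# CPU-bound. REL_FILE is a stand-in for $PGDATA/base/<db>/<relfilenode>.
REL_FILE=/tmp/seqscan_test.dat

# Create a 64MB stand-in file so this sketch runs anywhere.
dd if=/dev/zero of="$REL_FILE" bs=1M count=64 2>/dev/null

# dd reports throughput on stderr when it finishes.
dd if="$REL_FILE" of=/dev/null bs=1M

# Fragmentation check (may need root on some systems):
# filefrag "$REL_FILE"

rm -f "$REL_FILE"
```

Comparing this raw figure against the rate a `SELECT` achieves shows how much CPU overhead the executor adds.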
Re: [PERFORM] BBU Cache vs. spindles
Greg Smith g...@2ndquadrant.com wrote: Kevin Grittner wrote: So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? Sure, that happens. But if the BBU has gotten an fsync call after the 8K write, it shouldn't return success until after all 8K are in its cache. I'm not concerned about an fsync after the controller has it; I'm concerned about a system crash in the middle of writing an 8K page to the controller. Other than the expected *size* of the window of time during which you're vulnerable, what does a BBU caching controller buy you in this regard? Can't the OS rearrange the writes of disk sectors after the 8K page is written to the OS cache so that the window might occasionally be rather large? -Kevin
Re: [PERFORM] How does PG know if data is in memory?
Kevin Grittner wrote: gnuo...@rcn.com wrote: -- None of that is anything for amateurs to play with. Not to jam a stick in anybody's eye, but shouldn't database pros not be amateurs? While many PostgreSQL installations are managed by professional DBAs, or programmers or consultants with a deep enough grasp of the issues to tune a knob like that appropriately, PostgreSQL is also used in environments without such staff. In fact, there is pressure to make PostgreSQL easier to configure for exactly that reason. If we add more knobs which are this hard to tune correctly, we would risk inundation with complaints from people who tried to use it and made things worse. Agreed. Here is a blog entry that explains some of the tradeoffs of adding knobs: http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009 -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
[PERFORM] Experiences with running PostgreSQL on Blue Arc Network Storage
Hello, Does anyone have any experience with running postgreSQL on Blue Arc's network storage products? In particular, we are interested in the the Titan and Mercury series: http://www.bluearc.com/data-storage-products/titan-3000-network-storage-system.shtml http://www.bluearc.com/data-storage-products/mercury-network-storage-system.shtml -- Tim Goodaire 416-673-4126 tgood...@ca.afilias.info Database Team Lead, Afilias Canada Corp.
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: Greg Smith g...@2ndquadrant.com wrote: Kevin Grittner wrote: So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? Sure, that happens. But if the BBU has gotten an fsync call after the 8K write, it shouldn't return success until after all 8K are in its cache. I'm not concerned about an fsync after the controller has it; I'm concerned about a system crash in the middle of writing an 8K page to the controller. Other than the expected *size* of the window of time during which you're vulnerable, what does a BBU caching controller buy you in this regard? Can't the OS rearrange the writes of disk sectors after the 8K page is written to the OS cache so that the window might occasionally be rather large? If the write to the controller fails, the page is not flushed and PG does not continue: the fsync never happens, and hence PG stops. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian br...@momjian.us wrote: I assume we send a full 8k to the controller, and a failure during that write is not registered as a write. On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? -Kevin
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: I assume we send a full 8k to the controller, and a failure during that write is not registered as a write. On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? Now that is an interesting question. We write 8k to the kernel, but the kernel doesn't have to honor those write sizes, so while we probably can't get a partial 512-byte block written to disk with a BBU (that isn't cleaned up by the BBU on reboot), we could get some 512-byte blocks of an 8k write written and others not. I agree you are right and a BBU does not mean you can safely turn off full_page_writes. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
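The torn-page scenario Bruce concedes here is easy to model. A minimal sketch follows (plain Python, nothing to do with the actual PostgreSQL page format), assuming 512-byte atomic sectors and a crash after seven of the sixteen sector writes that make up one 8K page:

```python
# Sketch: an 8K page handed to the kernel can reach the device as
# independent 512-byte sector writes, so a crash mid-write leaves a
# "torn" page holding a mix of old and new sectors.
SECTOR = 512
PAGE = 8192

old_page = b"A" * PAGE   # on-disk image before the write
new_page = b"B" * PAGE   # image PostgreSQL asked to write

def crash_during_write(old, new, sectors_completed):
    """Model a power loss after only some sectors reached stable storage."""
    done = sectors_completed * SECTOR
    return new[:done] + old[done:]

torn = crash_during_write(old_page, new_page, sectors_completed=7)

# The page now matches neither the old nor the new image: it is torn.
assert torn != old_page and torn != new_page
print(torn.count(b"B"[0]), "new bytes,", torn.count(b"A"[0]), "old bytes")
```

This is exactly the state full_page_writes exists to repair: the first full-page image after a checkpoint lets recovery overwrite the torn page wholesale instead of trusting its contents.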
Re: [PERFORM] Index scan is not working, why??
-Original Message- From: AI Rumman [mailto:rumman...@gmail.com] Sent: Thursday, October 21, 2010 1:25 AM To: pgsql-performance@postgresql.org Subject: Index scan is not working, why?? I don't know why seq scan is running on the following query where the same query is giving index scan on other servers: explain analyze select * from act where act.acttype in ( 'Meeting','Call','Task'); QUERY PLAN -- -- Seq Scan on act (cost=0.00..13386.78 rows=259671 width=142) (actual time=0.013..484.572 rows=263639 loops=1) Filter: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text = 'Call'::text) OR ((acttype)::text = 'Task'::text)) Total runtime: 732.956 ms (3 rows) The above query is giving index scan on other servers and even if I rewrite the query as follows I get an index scan: explain analyze select * from act where act.acttype = 'Meeting' or act.acttype = 'Call'; QUERY PLAN -- -- -- Bitmap Heap Scan on act (cost=17.98..1083.80 rows=2277 width=142) (actual time=1.901..9.722 rows=4808 loops=1) Recheck Cond: (((acttype)::text = 'Meeting'::text) OR ((acttype)::text = 'Call'::text)) - BitmapOr (cost=17.98..17.98 rows=2281 width=0) (actual time=1.262..1.262 rows=0 loops=1) - Bitmap Index Scan on act_acttype_idx (cost=0.00..8.99 rows=1141 width=0) (actual time=0.790..0.790 rows=3181 loops=1) Index Cond: ((acttype)::text = 'Meeting'::text) - Bitmap Index Scan on act_acttype_idx (cost=0.00..8.99 rows=1141 width=0) (actual time=0.469..0.469 rows=1630 loops=1) Index Cond: ((acttype)::text = 'Call'::text) Total runtime: 14.227 ms (8 rows) An index scan is not always preferable to a seq scan; it depends on the selectivity of your query. When retrieving a substantial portion of a big table, a seq scan is usually faster, which is why the optimizer chooses it. Your queries (and possibly the data sets in the tables on the different servers) are not the same. Your first query (which uses a seq scan) returns 259,671 rows, which is probably a substantial part of the whole table.
Your second query (which uses index scan) returns only 4808 rows, which makes index access less costly in this case. Regards, Igor Neyman
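Igor's selectivity argument can be illustrated with a toy cost model. This is a hedged sketch: the two constants echo the default `seq_page_cost`/`random_page_cost` settings, but the formulas are deliberately simplified and are not the planner's actual arithmetic, and `rows_per_page` is an assumed row density, not a measured one.

```python
# Toy cost model showing why the planner flips from index scan to seq
# scan as selectivity rises. NOT the real PostgreSQL cost equations.
SEQ_PAGE_COST = 1.0      # default seq_page_cost
RANDOM_PAGE_COST = 4.0   # default random_page_cost

def seq_scan_cost(table_pages):
    # A seq scan reads every page once, sequentially.
    return table_pages * SEQ_PAGE_COST

def index_scan_cost(matching_rows, rows_per_page=50):
    # Toy assumption: a bitmap scan visits each heap page that holds a
    # match once, at random-access cost; index descent is ignored and
    # rows_per_page=50 is an assumed density for ~142-byte rows.
    heap_pages = matching_rows / rows_per_page
    return heap_pages * RANDOM_PAGE_COST

table_pages = 13_000  # roughly the table size implied by the plan above

# ~260k matching rows (the IN query): random access to most of the
# heap costs more than one sequential pass, so the seq scan wins.
assert index_scan_cost(260_000) > seq_scan_cost(table_pages)

# ~5k matching rows (the two-value OR query): few pages, index wins.
assert index_scan_cost(5_000) < seq_scan_cost(table_pages)
```

The crossover point moves with `random_page_cost`, which is one reason the same query can plan differently on servers with different settings or statistics.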
Re: [PERFORM] BBU Cache vs. spindles
On Thursday 21 October 2010 21:42:06 Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: I assume we send a full 8k to the controller, and a failure during that write is not registered as a write. On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? At least on linux pages can certainly get written out in 4kb batches if you're under memory pressure. Andres
Re: [PERFORM] Slow count(*) again...
On Oct 21, 2010, at 11:13 AM, Jesper Krogh wrote: On 2010-10-21 06:47, Scott Carey wrote: On a wimpy disk, I/O bound for sure. But my disks go 1000MB/sec. No query can go fast enough for them. The best I've gotten is 800MB/sec, on a wide row (average 800 bytes). Most tables go 300MB/sec or so. And with 72GB of RAM, many scans are in-memory anyway. Is it cpu or io bound while doing it? I/O bound with the fio benchmark tool if 16K blocks or greater, CPU bound with 8K blocks or smaller. CentOS 5.5. CPU bound with postgres. Can you scan it faster using time cat relation-oid.* > /dev/null I'm not sure what you mean. In psql, select * piped to /dev/null is VERY CPU bound because of all the formatting. I haven't toyed with COPY. Do you mean the actual files? 'dd' tests from actual files are similar to fio, but not as consistent and hard to add concurrency. That is faster than postgres. A single SSD with supercapacitor will go about 500MB/sec by itself next spring. I will easily be able to build a system with 2GB/sec I/O for under $10k. What filesystem are you using? Readahead? Can you try to check the filesystem fragmentation of the table using filefrag? XFS, defragmented once a day. Readahead 40960 (20MB, 1MB per spindle). Two RAID 10 arrays of 10 disks each (2 hot spares), software raid-0 tying those together (md, 1MB blocks). Two Adaptec 5805 (or 5085, the external SAS one). A third raid card for the OS/xlog with 4x10krpm sas drives internal. Fragmentation quickly takes this down a lot, as do small files and concurrent activity, since it's only enough spindles for ~2000 iops. But it's almost all large reporting queries on partitioned tables (500,000 partitions). A few smaller tables are starting to cause too many seeks so those might end up on a smaller, high iops tablespace later. Over time the disks have filled up and there is a significant slowdown in sequential transfer at the end of the partition -- 600MB/sec max.
That is still CPU bound on most scans, but postgres can go that fast on some scans. Off topic: Other interesting features is how this setup causes the system tables to bloat by factors of 2x to 8x each week, and requires frequent vacuum full + reindex on several of them else they become 1.5GB in size. Nothing like lots of temp table work + hour long concurrent transactions to make the system catalog bloat. I suppose with 8.4 many temp tables could be replaced using WITH queries, but in other cases analyzing a temp table is the only way to get a sane query plan.
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? I'd like a 10 minute argument please. I started to write something to refute this, only to clarify in my head the sequence of events that leads to the most questionable result, where I feel a bit less certain than I did before of the safety here. Here is the worst case I believe you're describing:

1) Transaction is written to the WAL and sync'd; client receives COMMIT. Since full_page_writes is off, the data in the WAL consists only of the delta of what changed on the page.
2) 8K database page is written to OS cache
3) PG calls fsync to force the database block out
4) OS writes first 4K block of the change to the BBU write cache. Worst case, this fills the cache, and it takes a moment for some random writes to process before it has space to buffer again (makes this more likely to happen, but it's not required to see the failure case here)
5) Sudden power interruption, second half of the page write is lost
6) Server restarts
7) That 4K write is now replayed from the battery's cache

At this point, you now have a torn 8K page, with 1/2 old and 1/2 new data. Without a full page write in the WAL, is it always possible to restore its original state now? In theory, I think you can. Since the delta in the WAL should be overwriting all of the bytes that changed between the old and new version of the page, applying it on top of any of the four possible states here:

1) None of the data was written to the database page yet
2) The first 4K of data was written out
3) The second 4K of data was written out
4) All 8K was actually written out

should lead to the same result: an 8K page that includes the change that was in the WAL but not yet on disk at the point when the crash happened.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
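Greg's claim above, that a delta which overwrites every changed byte at a fixed offset converges from any torn state, can be checked with a small simulation. This is a hedged sketch on a toy 16-byte "page", not real WAL record replay:

```python
# Sketch: IF a WAL record carries every changed byte at a fixed page
# offset, replaying it on top of any of the four torn states yields
# the same final page.
PAGE = 16
old = bytearray(b"OLDDATAOLDDATA__")   # page before the change
new = bytearray(b"OLDDATANEWDATA__")   # page after the change

# Delta: every byte that differs, as (offset, new_byte) pairs.
delta = [(i, new[i]) for i in range(PAGE) if old[i] != new[i]]

def replay(page, delta):
    out = bytearray(page)
    for off, byte in delta:
        out[off] = byte
    return bytes(out)

half = PAGE // 2
states = [
    bytes(old),                      # 1) nothing written yet
    bytes(new[:half] + old[half:]),  # 2) first half written
    bytes(old[:half] + new[half:]),  # 3) second half written
    bytes(new),                      # 4) fully written
]

# All four torn states converge to the new page after replay.
results = {replay(s, delta) for s in states}
assert results == {bytes(new)}
```

The simulation holds precisely because every byte of the page is either its old or its new value, and the delta touches every byte where those differ; the next message in the thread shows why real WAL records break that precondition.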
Re: [PERFORM] BBU Cache vs. spindles
Greg Smith g...@2ndquadrant.com writes: At this point, you now have a torn 8K page, with 1/2 old and 1/2 new data. Right. Without a full page write in the WAL, is it always possible to restore its original state now? In theory, I think you do. Since the delta in the WAL should be overwriting all of the bytes that changed between the old and new version of the page, applying it on top of any four possible states here: You've got entirely too simplistic a view of what the delta might be, I fear. In particular there are various sorts of changes that involve inserting the data carried in the WAL record and shifting pre-existing data around to make room, or removing an item and moving remaining data around. If you try to replay that type of action against a torn page, you'll get corrupted results. We could possibly redefine the WAL records so that they weren't just the minimum amount of data but carried every byte that'd changed on the page, and then I think what you're envisioning would work. But the records would be a lot bulkier. It's not clear to me that this would be a net savings over the current design, particularly not if there's a long interval between checkpoints. regards, tom lane
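Tom's counterexample is just as easy to demonstrate. In this hedged sketch, `insert_and_shift` is a toy stand-in for a shift-type WAL record (not PostgreSQL's actual item-insertion logic): it assumes the page's pre-image, so replaying it against a torn page produces garbage rather than the intended result.

```python
# Sketch: a WAL record meaning "insert these bytes at offset N and
# shift the rest right" is only correct against the exact pre-image.
def insert_and_shift(page, off, data):
    """Toy shift-type record: insert data at off, truncate to page size."""
    return (page[:off] + data + page[off:])[: len(page)]

old = b"ABCDEFGH"
record_off, record_data = 2, b"xy"

# Intended result of replaying the record on the clean pre-image.
new = insert_and_shift(old, record_off, record_data)   # b"ABxyCDEF"

# Crash leaves the first half new and the second half old: torn page.
torn = new[:4] + old[4:]                               # b"ABxyEFGH"

# Replaying the shift record on the torn page does NOT reproduce `new`;
# it double-inserts the data and loses bytes off the end of the page.
replayed = insert_and_shift(torn, record_off, record_data)
assert replayed != new
```

This is why, as Tom says, only records carrying every changed byte at a fixed offset (or full-page images) are safe to replay over a possibly-torn page.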
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? Forgot to address this part. The troublesome sequence if you don't have a BBU is:

1) WAL data is written to the OS cache
2) PG calls fsync
3) Data is transferred into the drive's volatile, non battery-backed cache
4) Drive lies about data being on disk, says fsync is done
5) That 8K data page is written out to the OS cache, also with fsync, then onto the drive. It says it has that too.
6) Due to current disk head location, 4KB of the data page gets written out before it gets to the WAL data
7) System crashes

Now you're dead. You've just torn a data page, but not written any of the data to the WAL necessary to reconstruct any valid version of that page. I think Kevin's point here may be that if your fsync isn't reliable, you're always in trouble. But if your fsync is good, even torn pages should be repairable by the deltas written to the WAL, as I described in the message I just sent before this one. That's true regardless of whether you achieved non-lying fsync with a BBU or by turning a drive's write cache off. There's nothing really special about the BBU beyond it being the most common form of reliable write cache that works. You get the same properties at a slower rate with a drive that's configured to never lie about writes. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books