Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 2:46 AM, Robert Haas wrote: > On further review, I've changed my mind. Making synchronous_commit > trump synchronous_replication is appealing conceptually, but it's > going to lead to some weird corner cases. For example, a transaction > that drops a non-temporary relation always commits synchronously; and > 2PC also ignores synchronous_commit. In the case where > synchronous_commit=off and synchronous_replication=on, we'd either > have to decide that these sorts of transactions aren't going to > replicate synchronously (which would give synchronous_commit a rather > long reach into areas it doesn't currently touch) or else that it's OK > for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE > foo requires sync commit AND sync rep. That's pretty weird. > > What makes more sense to me after having thought about this more > carefully is to simply make a blanket rule that when > synchronous_replication=on, synchronous_commit has no effect. That is > easy to understand and document. I'm inclined to think it's OK to let > synchronous_replication have this effect even if max_wal_senders=0 or > synchronous_standby_names=''; you shouldn't turn > synchronous_replication on just for kicks, and I don't think we want > to complicate the test in RecordTransactionCommit() more than > necessary. We should, however, adjust the logic so that a transaction > which has not written WAL can still commit asynchronously, because > such a transaction has only touched temp or unlogged tables and so > it's not important for it to make it to the standby, where that data > doesn't exist anyway. In the first place, I think that it's complicated to keep those two parameters separately. What about merging them to one parameter? What I'm thinking is to remove synchronous_replication and to increase the valid values of synchronous_commit from on/off to async/local/remote/both. Each value works as follows. async = (synchronous_commit = off && synchronous_replication = off) "async" makes a transaction do local WAL flush and replication asynchronously. local = (synchronous_commit = on && synchronous_replication = off) "local" makes a transaction wait for only local WAL flush. remote = (synchronous_commit = off && synchronous_replication = on) "remote" makes a transaction wait for only replication. Local WAL flush is performed by walwriter. This is useless in 9.1 because we always must wait for local WAL flush when we wait for replication. But in the future, if we'll be able to send WAL before WAL write (i.e., send WAL from wal_buffers), this might become useful. In 9.1, it seems reasonable to remove this value. both = (synchronous_commit = on && synchronous_replication = on) "both" makes a transaction wait for local WAL flush and replication. Thought? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 2:52 AM, Robert Haas wrote: > On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas wrote: >>> - /* Let the master know that we received some data. >>> */ >>> - XLogWalRcvSendReply(); >>> - XLogWalRcvSendHSFeedback(); >>> >>> This change completely eliminates the difference between write_location >>> and flush_location in pg_stat_replication. If this change is reasoable, we >>> should get rid of write_location from pg_stat_replication since it's >>> useless. >>> If not, this change should be reverted. I'm not sure whether monitoring >>> the difference between write and flush locations is useful. But I guess that >>> someone thought so and that code was added. >> >> I could go either way on this but clearly we need to do one or the other. > > I'm not really sure why this was part of the synchronous replication > patch, but after mulling it over I think it's probably right to rip > out write_location completely. There shouldn't ordinarily be much of > a gap between write location and flush location, so it's probably not > worth the extra network overhead to keep track of it. We might need > to re-add some form of this in the future if we have a version of > synchronous replication that only waits for confirmation of receipt > rather than for confirmation of flush, but we don't have that in 9.1, > so why bother? > > Barring objections, I'll go do that. I agree to get rid of write_location. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,
On Fri, Mar 18, 2011 at 1:17 AM, Robert Haas wrote: >> Sorry, I've not been able to understand the point well yet. We should >> just use elog(ERROR) instead? But since ERROR in startup process >> is treated as FATAL, I'm not sure whether it's worth using ERROR >> instead. Or you meant another things? > > Yeah, I think he's saying that an ERROR in the startup process is > better than a FATAL, even though the effect is the same. We've already been using FATAL all over the recovery code. We should s/FATAL/ERROR/g there (at least readRecoveryCommandFile)? > On the substantive issue, I don't think we have any consensus that > forbidding this combination of parameters is the right thing to do > anyway. Both Simon and I voted against that, and Tom's point has to > do only with style. Similarly, I voted for flipping the default for > pause_at_recovery_target to off, rather than on, but no one else has > bought into that suggestion either. Unless we get some more votes in > favor of doing one of those things, I think we should focus on the > actual must-fix issue here, which is properly documenting the way it > works now (i.e. adding the parameter to recovery.conf.sample with > appropriate documentation of the current behavior). I agree to flip the default to false, whether we forbid that combination of settings. > One thing I'm not quite clear on is what happens if we reach the > recovery target before we reach consistency. i.e. create restore > point, flush xlog, abnormal shutdown, try to recover to named restore > point. Is there any possibility that we can end up paused before Hot > Standby has actually started up. Because that would be fairly useless > and annoying. Good catch! In that case, the same situation as (3) would happen. I think that recovery should ignore pause_at_recovery_target until it reaches consistent point. If we do so, when recovery target is ahead of consistent point, recovery just ends in inconsistent point and throws FATAL error. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote: > 2. If a query cancel interrupt is received (pg_cancel_backend or ^C), > then cancel the sync rep wait and issue a warning before acknowledging > the commit. When I saw this commit, I noticed that the WARNING doesn't have an errcode(). It seems like it should -- this is the kind of thing that the client is likely to care about, and may want to handle specially. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FK constraints "NOT VALID" by default?
On Thu, Mar 17, 2011 at 5:29 PM, Andrew Dunstan wrote: >> Is this really intended? > > I sure hope not. That's a bug. Not sure if it's a psql bug or a backend bug, but it's definitely a bug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FK constraints "NOT VALID" by default?
On 03/17/2011 05:22 PM, Alvaro Herrera wrote: I just ran this quick test in HEAD: and was very surprised to see that the foreign key is marked as NOT VALID: Is this really intended? I sure hope not. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] really lazy vacuums?
On Thu, Mar 17, 2011 at 4:02 PM, Jesper Krogh wrote: > On the 1 bit per page the "best case" would be 341 times better than above > reducing the size of the visibiility map on a 10GB table to around 152KB > which > is extremely small (and thus also awesome) But the consequenses of a single > update would mean that you loose visibilllity map benefit on 341 tuples in > one shot. True, but I'm not really sure that matters very much. Keep in mind also that would increase the frequency with which visibility map bits would need to be flipped, which would carry its own costs. > Worst case situations are, where we approach the 4 tuples per page, before > we hit toast where the ratio of space reduction in 1 bit per tuple would be: > 1:(2048x8) ~ 1:16384 and the 1 bit per page is 4 times better. > In the 1 bit per tuple a visibillity map of a 10GB relation would be around > 610KB > 1 bit per page would then drop it to around 160KB. > > > Can we drag out some average-case numbers on row-size in the heap > from some real production systems? Well, unless you went to some significantly more complicated data structure, you'd need to allow room for the maximum number of tuples per page on every page, whether the slots were all in use or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] FK constraints "NOT VALID" by default?
I just ran this quick test in HEAD: alvherre=# create table study (id int primary key); NOTICE: CREATE TABLE / PRIMARY KEY creará el índice implícito «study_pkey» para la tabla «study» CREATE TABLE alvherre=# insert into study select a from generate_series(1, 100) as a; INSERT 0 100 alvherre=# create table studyform (id int primary key, study_id int not null references study); NOTICE: CREATE TABLE / PRIMARY KEY creará el índice implícito «studyform_pkey» para la tabla «studyform» CREATE TABLE alvherre=# insert into studyform select a, 1 + a * random() from generate_series(1, 10) a; INSERT 0 10 and was very surprised to see that the foreign key is marked as NOT VALID: alvherre=# \d studyform Tabla «public.studyform» Columna │ Tipo │ Modificadores ──┼─┼─── id │ integer │ not null study_id │ integer │ not null Índices: "studyform_pkey" PRIMARY KEY, btree (id) Restricciones de llave foránea: "studyform_study_id_fkey" FOREIGN KEY (study_id) REFERENCES study(id) NOT VALID Is this really intended? -- Álvaro Herrera -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
Rados*aw Smogura wrote: > I have implemented initial concept of 2nd level cache. Idea is to > keep some segments of shared memory for special buffers (e.g. > indices) to prevent overwrite those by other operations. I added > those functionality to nbtree index scan. > > I tested this with doing index scan, seq read, drop system > buffers, do index scan and in few places I saw performance > improvements, but actually, I'm not sure if this was just "random" > or intended improvement. I've often wondered about this. In a database I developed back in the '80s it was clearly a win to have a special cache for index entries and other special pages closer to the database than the general cache. A couple things have changed since the '80s (I mean, besides my waistline and hair color), and PostgreSQL has many differences from that other database, so I haven't been sure it would help as much, but I have wondered. I can't really look at this for a couple weeks, but I'm definitely interested. I suggest that you add this to the next CommitFest as a WIP patch, under the Performance category. https://commitfest.postgresql.org/action/commitfest_view/open > There is few places to optimize code as well, and patch need many > work, but may you see it and give opinions? For something like this it makes perfect sense to show "proof of concept" before trying to cover everything. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] really lazy vacuums?
On 2011-03-17 15:02, Robert Haas wrote: On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh wrote: Is it obvious that the visibillity map bits should track complete pages and not individual tuples? If the visibillity map tracks at page-level the benefit would fall on "slim tables" where you squeeze 200 tuples into each page and having an update rate of 1% would lower the likelyhood even more. (it may be that for slim tables the index-only-scans are not as benefitial as to wide tables). I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but say it's 200. If you track visibility info per tuple rather than per page, then the size of the visibility map is going to expand by a factor of 200. That might decrease contention, but otherwise it's a bad thing - the whole point of having the visibility map in the first place is that it's much, much smaller than the heap. If it were the same size as the heap, we could just read the heap. What the map attempts to accomplish is to allow us, by reading a small number of pages, to check whether the tuples we're thinking of reading are likely to be all-visible without actually looking at them. Yes, that was sort of the math I was trying to make. I do allthough belive that you have a way better feeling about it. But according to this: http://wiki.postgresql.org/wiki/FAQ#How_much_database_disk_space_is_required_to_store_data_from_a_typical_text_file.3F The bulk row-overhead is around 24bytes, which will with 1 bit per row give a size reduction of 1:(24x8) ~1:192, worstcase... that gives at best 341 tuples/page (where each tuple, does not contain any data at all). With that ratio, the visibillitymap of a relation of 10GB would fill 52MB on disk (still worst case) and that by itself would by all means be awesome. (with that small tuples a 10GB relation would have around 42 billion tuples). On the 1 bit per page the "best case" would be 341 times better than above reducing the size of the visibiility map on a 10GB table to around 152KB which is extremely small (and thus also awesome) But the consequenses of a single update would mean that you loose visibilllity map benefit on 341 tuples in one shot. Worst case situations are, where we approach the 4 tuples per page, before we hit toast where the ratio of space reduction in 1 bit per tuple would be: 1:(2048x8) ~ 1:16384 and the 1 bit per page is 4 times better. In the 1 bit per tuple a visibillity map of a 10GB relation would be around 610KB 1 bit per page would then drop it to around 160KB. Can we drag out some average-case numbers on row-size in the heap from some real production systems? I may have gotten something hugely wrong in above calculations and/or have missed some important points. -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] 2nd Level Buffer Cache
Hi, I have implemented initial concept of 2nd level cache. Idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent overwrite those by other operations. I added those functionality to nbtree index scan. I tested this with doing index scan, seq read, drop system buffers, do index scan and in few places I saw performance improvements, but actually, I'm not sure if this was just "random" or intended improvement. There is few places to optimize code as well, and patch need many work, but may you see it and give opinions? Regards, Radek diff --git a/.gitignore b/.gitignore index 3f11f2e..6542e35 100644 --- a/.gitignore +++ b/.gitignore @@ -22,3 +22,4 @@ lcov.info /GNUmakefile /config.log /config.status +/nbproject/private/ \ No newline at end of file diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 2796445..0229f5a 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -508,7 +508,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access) if (blkno != P_NEW) { /* Read an existing block of the relation */ - buf = ReadBuffer(rel, blkno); + buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND); LockBuffer(buf, access); _bt_checkpage(rel, buf); } @@ -548,7 +548,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access) blkno = GetFreeIndexPage(rel); if (blkno == InvalidBlockNumber) break; - buf = ReadBuffer(rel, blkno); + buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND); if (ConditionalLockBuffer(buf)) { page = BufferGetPage(buf); diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index dadb49d..2922711 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -22,6 +22,7 @@ BufferDesc *BufferDescriptors; char *BufferBlocks; int32 *PrivateRefCount; +BufferLevelDesc *bufferLevels; /* * Data Structures: @@ -72,8 +73,7 @@ int32 *PrivateRefCount; void InitBufferPool(void) { - bool foundBufs, -foundDescs; + bool foundBufs, foundDescs, foundBufferLevels = false; BufferDescriptors = (BufferDesc *) ShmemInitStruct("Buffer Descriptors", @@ -83,19 +83,38 @@ InitBufferPool(void) ShmemInitStruct("Buffer Blocks", NBuffers * (Size) BLCKSZ, &foundBufs); - if (foundDescs || foundBufs) +bufferLevels = (BufferLevelDesc*) +ShmemInitStruct("Buffer Levels Descriptors Table", + sizeof(BufferLevelDesc) * BUFFER_LEVEL_SIZE, +&foundBufferLevels); + if (foundDescs || foundBufs || foundBufferLevels) { /* both should be present or neither */ - Assert(foundDescs && foundBufs); + Assert(foundDescs && foundBufs && foundBufferLevels); /* note: this path is only taken in EXEC_BACKEND case */ } else { BufferDesc *buf; +BufferLevelDesc *bufferLevelDesc; + int i; - + buf = BufferDescriptors; +/* Initialize buffer levels. */ +//1st Level - Default +bufferLevelDesc = bufferLevels; +bufferLevelDesc->index = 0; +bufferLevelDesc->super = BUFFER_LEVEL_END_OF_LIST; +bufferLevelDesc->lower = BUFFER_LEVEL_END_OF_LIST; + +//2nd Level - For indices +bufferLevelDesc++; +bufferLevelDesc->index = 1; +bufferLevelDesc->super = BUFFER_LEVEL_END_OF_LIST; +bufferLevelDesc->lower = 0; + /* * Initialize all the buffer headers. */ @@ -117,6 +136,10 @@ InitBufferPool(void) */ buf->freeNext = i + 1; +/* Assign buffer level. */ +//TODO Currently hardcoded - +buf->buf_level = ( 0.3 * NBuffers > i ) ? BUFFER_LEVEL_DEFAULT : BUFFER_LEVEL_2ND; + buf->io_in_progress_lock = LWLockAssign(); buf->content_lock = LWLockAssign(); } diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 1f89e52..867bae0 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -47,7 +47,8 @@ #include "storage/standby.h" #include "utils/rel.h" #include "utils/resowner.h" - +#include "catalog/pg_type.h" +#include "funcapi.h" /* Note: these two macros only work on shared buffers, not local ones! */ #define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) @@ -85,7 +86,7 @@ static volatile BufferDesc *PinCountWaitBuf = NULL; static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence, ForkNumber forkNum, BlockNumber blockNum, ReadBufferMode mode, BufferAccessStrategy strategy, - bool *hit); + bool *hit, BufferLevel bufferLevel); static bool PinBuffer(vol
Re: [HACKERS] Allowing multiple concurrent base backups
On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao wrote: > On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas > wrote: >> Hmm, good point. It's harmless, but creating the history file in the first >> place sure seems like a waste of time. > > The attached patch changes pg_stop_backup so that it doesn't create > the backup history file if archiving is not enabled. > > When I tested the multiple backups, I found that they can have the same > checkpoint location and the same history file name. > > > $ for ((i=0; i<4; i++)); do > pg_basebackup -D test$i -c fast -x -l test$i & > done > > $ cat test0/backup_label > START WAL LOCATION: 0/2B0 (file 00010002) > CHECKPOINT LOCATION: 0/2E8 > START TIME: 2011-02-01 12:12:31 JST > LABEL: test0 > > $ cat test1/backup_label > START WAL LOCATION: 0/2B0 (file 00010002) > CHECKPOINT LOCATION: 0/2E8 > START TIME: 2011-02-01 12:12:31 JST > LABEL: test1 > > $ cat test2/backup_label > START WAL LOCATION: 0/2B0 (file 00010002) > CHECKPOINT LOCATION: 0/2E8 > START TIME: 2011-02-01 12:12:31 JST > LABEL: test2 > > $ cat test3/backup_label > START WAL LOCATION: 0/2B0 (file 00010002) > CHECKPOINT LOCATION: 0/2E8 > START TIME: 2011-02-01 12:12:31 JST > LABEL: test3 > > $ ls archive/*.backup > archive/00010002.00B0.backup > > > This would cause a serious problem. Because the backup-end record > which indicates the same "START WAL LOCATION" can be written by the > first backup before the other finishes. So we might think wrongly that > we've already reached a consistency state by reading the backup-end > record (written by the first backup) before reading the last required WAL > file. > > /* > * Force a CHECKPOINT. Aside from being necessary to prevent > torn > * page problems, this guarantees that two successive backup > runs will > * have different checkpoint positions and hence different > history > * file names, even if nothing happened in between. > * > * We use CHECKPOINT_IMMEDIATE only if requested by user (via > passing > * fast = true). Otherwise this can take awhile. > */ > RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | > (fast ? CHECKPOINT_IMMEDIATE > : 0)); > > This problem happens because the above code (in do_pg_start_backup) > actually doesn't ensure that the concurrent backups have the different > checkpoint locations. ISTM that we should change the above or elsewhere > to ensure that. Or we should include backup label name in the backup-end > record, to prevent a recovery from reading not-its-own backup-end record. > > Thought? This patch is on the 9.1 open items list, but I don't understand it well enough to know whether it's correct. Can someone else pick it up? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Flex output missing from 9.1a4 tarballs?
2011/3/16 Devrim GÜNDÜZ : > On Tue, 2011-03-15 at 22:00 -0400, Tom Lane wrote: >> >> > My only hesitation about this is that it seems like sync rep and >> > collation support are both still pretty broken. Should we just not >> > worry about that for alpha? >> >> FWIW, collations are probably still several days away from being >> noticeably less broken than they were in alpha4. I have mixed >> feelings >> about whether an alpha5 right now is useful. > > Fair enough. Looks like we will skip next week, too. Yeah, probably best. I think we're making good progress, though, so I propose we wrap alpha5 on Monday, March 28. I don't think we'll be all the way there yet by then, but I think we will have made enough progress that it makes sense to get another snapshot out the door and into the hands of testers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] why is max standby delay only 35 minutes?
On tis, 2011-03-15 at 20:44 +0200, Peter Eisentraut wrote: > Consequently, I propose the attached patch. I didn't find any > relevant documentation references that would need updating. Applied. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Fix various possible problems with synchronous replication.
On 17 March 2011 17:55, Robert Haas wrote: > On Thu, Mar 17, 2011 at 1:24 PM, Thom Brown wrote: >> errdetail("The transaction has already been committed locally but >> might have not been replicated to the standby."))); >> errdetail("The transaction has committed locally, but may not have >> replicated to the standby."))); >> >> Could we have these saying precisely the same thing? > > Yeah. Which is better? Personally I prefer the 2nd. It reads better somehow. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Thu, 2011-03-17 at 13:46 -0400, Robert Haas wrote: > On Fri, Mar 11, 2011 at 5:46 AM, Fujii Masao wrote: > > On Fri, Mar 11, 2011 at 5:04 AM, Robert Haas wrote: > >>>if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 > >>> || > >>> SyncRepRequested()) > >>> > >>> Whenever synchronous_replication is TRUE, we disable synchronous_commit. > >>> But, before disabling that, we should check also max_wal_senders and > >>> synchronous_standby_names? Otherwise, synchronous_commit can > >>> be disabled unexpectedly even in non replication case. > >> > >> Yeah, that's bad. At the risk of repeating myself, I don't think this > >> code should be checking SyncRepRequested() in the first place. If the > >> user has turned off synchronous_commit, then we should just commit > >> asynchronously, even if sync rep is otherwise in force. Otherwise, > >> this if statement is going to get really complicated. The logic is > >> already at least mildly wrong here anyway: clearly we do NOT need to > >> commit synchronously if the transaction has not written xlog, even if > >> sync rep is enabled. > > > > Yeah, not to wait for replication when synchronous_commit is disabled > > seems to be more reasonable. > > On further review, I've changed my mind. Making synchronous_commit > trump synchronous_replication is appealing conceptually, but it's > going to lead to some weird corner cases. For example, a transaction > that drops a non-temporary relation always commits synchronously; and > 2PC also ignores synchronous_commit. In the case where > synchronous_commit=off and synchronous_replication=on, we'd either > have to decide that these sorts of transactions aren't going to > replicate synchronously (which would give synchronous_commit a rather > long reach into areas it doesn't currently touch) or else that it's OK > for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE > foo requires sync commit AND sync rep. That's pretty weird. > > What makes more sense to me after having thought about this more > carefully is to simply make a blanket rule that when > synchronous_replication=on, synchronous_commit has no effect. That is > easy to understand and document. I'm inclined to think it's OK to let > synchronous_replication have this effect even if max_wal_senders=0 or > synchronous_standby_names=''; you shouldn't turn > synchronous_replication on just for kicks, and I don't think we want > to complicate the test in RecordTransactionCommit() more than > necessary. We should, however, adjust the logic so that a transaction > which has not written WAL can still commit asynchronously, because > such a transaction has only touched temp or unlogged tables and so > it's not important for it to make it to the standby, where that data > doesn't exist anyway. Agree to that. Not read your other stuff yet, will do that later. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Fix various possible problems with synchronous replication.
On Thu, Mar 17, 2011 at 1:24 PM, Thom Brown wrote: > errmsg("canceling the wait for replication and terminating connection > due to administrator command") > errmsg("canceling wait for synchronous replication due to user request") > > Should that first one then also say "synchronous replication"? I could go either way. Clearly if it's asynchronous replication, we wouldn't be waiting. But you're certainly right that we should be consistent. > errdetail("The transaction has already been committed locally but > might have not been replicated to the standby."))); > errdetail("The transaction has committed locally, but may not have > replicated to the standby."))); > > Could we have these saying precisely the same thing? Yeah. Which is better? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas wrote: >> - /* Let the master know that we received some data. */ >> - XLogWalRcvSendReply(); >> - XLogWalRcvSendHSFeedback(); >> >> This change completely eliminates the difference between write_location >> and flush_location in pg_stat_replication. If this change is reasoable, we >> should get rid of write_location from pg_stat_replication since it's useless. >> If not, this change should be reverted. I'm not sure whether monitoring >> the difference between write and flush locations is useful. But I guess that >> someone thought so and that code was added. > > I could go either way on this but clearly we need to do one or the other. I'm not really sure why this was part of the synchronous replication patch, but after mulling it over I think it's probably right to rip out write_location completely. There shouldn't ordinarily be much of a gap between write location and flush location, so it's probably not worth the extra network overhead to keep track of it. We might need to re-add some form of this in the future if we have a version of synchronous replication that only waits for confirmation of receipt rather than for confirmation of flush, but we don't have that in 9.1, so why bother? Barring objections, I'll go do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 11, 2011 at 5:46 AM, Fujii Masao wrote: > On Fri, Mar 11, 2011 at 5:04 AM, Robert Haas wrote: >>> if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || >>> SyncRepRequested()) >>> >>> Whenever synchronous_replication is TRUE, we disable synchronous_commit. >>> But, before disabling that, we should check also max_wal_senders and >>> synchronous_standby_names? Otherwise, synchronous_commit can >>> be disabled unexpectedly even in non replication case. >> >> Yeah, that's bad. At the risk of repeating myself, I don't think this >> code should be checking SyncRepRequested() in the first place. If the >> user has turned off synchronous_commit, then we should just commit >> asynchronously, even if sync rep is otherwise in force. Otherwise, >> this if statement is going to get really complicated. The logic is >> already at least mildly wrong here anyway: clearly we do NOT need to >> commit synchronously if the transaction has not written xlog, even if >> sync rep is enabled. > > Yeah, not to wait for replication when synchronous_commit is disabled > seems to be more reasonable. On further review, I've changed my mind. Making synchronous_commit trump synchronous_replication is appealing conceptually, but it's going to lead to some weird corner cases. For example, a transaction that drops a non-temporary relation always commits synchronously; and 2PC also ignores synchronous_commit. In the case where synchronous_commit=off and synchronous_replication=on, we'd either have to decide that these sorts of transactions aren't going to replicate synchronously (which would give synchronous_commit a rather long reach into areas it doesn't currently touch) or else that it's OK for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE foo requires sync commit AND sync rep. That's pretty weird. What makes more sense to me after having thought about this more carefully is to simply make a blanket rule that when synchronous_replication=on, synchronous_commit has no effect. That is easy to understand and document. I'm inclined to think it's OK to let synchronous_replication have this effect even if max_wal_senders=0 or synchronous_standby_names=''; you shouldn't turn synchronous_replication on just for kicks, and I don't think we want to complicate the test in RecordTransactionCommit() more than necessary. We should, however, adjust the logic so that a transaction which has not written WAL can still commit asynchronously, because such a transaction has only touched temp or unlogged tables and so it's not important for it to make it to the standby, where that data doesn't exist anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,
On Thu, Mar 17, 2011 at 2:47 AM, Fujii Masao wrote: > On Wed, Mar 16, 2011 at 11:27 PM, Tom Lane wrote: >> Fujii Masao writes: >>> How should recovery work when pause_at_recovery_target is >>> enabled but hot standby is disabled? We have three choices: >> >>> 1. Forbit those settings, i.e., throw FATAL error. Tom dislikes this >>> idea. >> >> No, I didn't say that. I said not to write elog(FATAL). > > Oh, sorry. > >> If the >> combination is nonsensical then it's fine to forbid it, but you don't >> need FATAL for that. In particular, attempting to change to a >> disallowed setting after system startup should not result in crashing >> the postmaster. And it won't, if you just use the normal error level >> for complaining about an invalid GUC setting. > > Sorry, I've not been able to understand the point well yet. We should > just use elog(ERROR) instead? But since ERROR in startup process > is treated as FATAL, I'm not sure whether it's worth using ERROR > instead. Or you meant another things? Yeah, I think he's saying that an ERROR in the startup process is better than a FATAL, even though the effect is the same. On the substantive issue, I don't think we have any consensus that forbidding this combination of parameters is the right thing to do anyway. Both Simon and I voted against that, and Tom's point has to do only with style. Similarly, I voted for flipping the default for pause_at_recovery_target to off, rather than on, but no one else has bought into that suggestion either. Unless we get some more votes in favor of doing one of those things, I think we should focus on the actual must-fix issue here, which is properly documenting the way it works now (i.e. adding the parameter to recovery.conf.sample with appropriate documentation of the current behavior). One thing I'm not quite clear on is what happens if we reach the recovery target before we reach consistency. i.e. create restore point, flush xlog, abnormal shutdown, try to recover to named restore point. Is there any possibility that we can end up paused before Hot Standby has actually started up. Because that would be fairly useless and annoying. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
hom wrote: > I try to known how a database is implemented and I have been > reading PG source codes for a month. That's ambitious. find -name '*.h' -or -name '*.c' \ | egrep -v '^\./src/test/.+/tmp_check/' \ | xargs cat | wc -l 1059144 Depending on how you do the math, that's about 50,000 lines of code per day to get through it in the time you mention. > Is there any article or some way could help understand the source > code ? Your best bet would be to follow links from the Developers tab on the main PostgreSQL web site: http://www.postgresql.org/developer/ In particular the Developer FAQ page: http://wiki.postgresql.org/wiki/Developer_FAQ And the "Coding" links: http://www.postgresql.org/developer/coding may help. Before reading code in a directory, be sure to read any README file(s) in that directory carefully. It helps to read this list. In spite of reviewing all of that myself, it was rather intimidating when I went to work on a major patch 14 months ago. Robert Haas offered some good advice which served me well in that effort -- divide the effort in to a series of incremental steps, each of which deals with a small enough portion of the code to get your head around. As you work in any one narrow area, it becomes increasingly clear; with that as a base you can expand your scope. When you're working in the code, it is tremendously helpful to use an editor with ctags support (or similar IDE functionality). I hope this is helpful. Good luck. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
hom wrote: > Hi, > > I try to known how a database is implemented and I have been reading > PG source codes for a month. > > Now, I only know a little about how PG work. :( > > I just know PG work like this but I don't know why PG work like this. :( :( > > even worse, I feel I can better understand the source code. it may be > that I could't split the large module into small piece which may help > to understand. > > Is there any article or some way could help understand the source code ? I assume you have looked at these places: http://wiki.postgresql.org/wiki/Developer_FAQ http://www.postgresql.org/developer/coding -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [BUGS] BUG #5842: Memory leak in PL/Python when taking slices of results
Excerpts from Marko Kreen's message of jue mar 17 12:01:13 -0300 2011: > On Thu, Mar 17, 2011 at 4:40 AM, Robert Haas wrote: > > On Fri, Mar 11, 2011 at 6:02 AM, Bruce Momjian wrote: > >> What has been done with this report/fix? > > > > AFAIK, nothing. Added to 9.1 open items list. > > The patch seems to do the right thing. Looking into this. AFAICT this needs to be backported. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [BUGS] BUG #5842: Memory leak in PL/Python when taking slices of results
On Thu, Mar 17, 2011 at 4:40 AM, Robert Haas wrote: > On Fri, Mar 11, 2011 at 6:02 AM, Bruce Momjian wrote: >> What has been done with this report/fix? > > AFAIK, nothing. Added to 9.1 open items list. The patch seems to do the right thing. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] I am confused after reading codes of PostgreSQL three week
Hi, I try to known how a database is implemented and I have been reading PG source codes for a month. Now, I only know a little about how PG work. :( I just know PG work like this but I don't know why PG work like this. :( :( even worse, I feel I can better understand the source code. it may be that I could't split the large module into small piece which may help to understand. Is there any article or some way could help understand the source code ? Thanks for help ~ -- Best Wishes! hom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] volatile markings to silence compilers
Martijn van Oosterhout writes: > It appears the issue is mostly that the compiler is unable to prove > that the variables aren't changed. IME, older versions of gcc will warn about any variable that's assigned more than once, even if those assignments are before the setjmp call. Presumably this is stricter than necessary, but I don't know enough details of gcc's register usage to be sure. > My point is, are we hopeful this problem will ever go away? Since we're trying to silence the warning in existing and even obsolete compilers, whether it gets improved in future compilers is not terribly relevant. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rectifying wrong Date outputs
Tom Lane wrote: > what we should first do is see what Oracle does in such cases, > because the main driving factor for these functions is Oracle > compatibility not what might seem sane in a vacuum. +1 -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rectifying wrong Date outputs
Excerpts from Robert Haas's message of jue mar 17 11:09:56 -0300 2011: > On Thu, Mar 17, 2011 at 9:46 AM, Alvaro Herrera > wrote: > > Keep in mind that the datetime stuff was abandoned by the maintainer > > some years ago with quite some rough edges. Some of it has been fixed, > > but a lot of bugs remain. Looks like this is one of those places and it > > seems appropriate to spend some time fixing it. Since it would involve > > a behavior change, it should only go to 9.2, of course. > > I wouldn't object to fixing the problem with # of digits > # of Ys in > 9.1, if the fix is simple and clear-cut. I think we are still > accepting patches to make minor tweaks, like the tab-completion patch > I committed yesterday. It also doesn't bother me tremendously if we > push it off, but I don't think that anyone's going to be too sad if > TO_DATE('01-jan-2010', 'DD-MON-YYY') starts returning something more > sensible than 3010-01-01. If it can be delivered quickly and it is simple, sure. But anything more involved should respect the release schedule. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rectifying wrong Date outputs
Robert Haas writes: > On Thu, Mar 17, 2011 at 9:46 AM, Alvaro Herrera > wrote: >> Keep in mind that the datetime stuff was abandoned by the maintainer >> some years ago with quite some rough edges. Some of it has been fixed, >> but a lot of bugs remain. Looks like this is one of those places and it >> seems appropriate to spend some time fixing it. Since it would involve >> a behavior change, it should only go to 9.2, of course. > I wouldn't object to fixing the problem with # of digits > # of Ys in > 9.1, if the fix is simple and clear-cut. I think we are still > accepting patches to make minor tweaks, like the tab-completion patch > I committed yesterday. It also doesn't bother me tremendously if we > push it off, but I don't think that anyone's going to be too sad if > TO_DATE('01-jan-2010', 'DD-MON-YYY') starts returning something more > sensible than 3010-01-01. Agreed, it's certainly not too late for bug fixes in 9.1. I agree that this isn't something we would want to tweak in released branches, but 9.1 isn't there yet. Having said that, it's not entirely clear to me what sane behavior is here. Personally I would expect that an n-Ys format spec would consume at most n digits from the input. Otherwise how are you going to use to_date to pick apart strings that don't have any separators? So I think the problem is actually upstream of the behavior complained of here. However, what we should first do is see what Oracle does in such cases, because the main driving factor for these functions is Oracle compatibility not what might seem sane in a vacuum. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rectifying wrong Date outputs
On Thu, Mar 17, 2011 at 9:46 AM, Alvaro Herrera wrote: > Excerpts from Piyush Newe's message of jue mar 17 02:30:06 -0300 2011: >> Sorry for creating the confusion. The table drawn was PostgreSQL vs EDB >> Advanced Server. >> Thanks Burce for clarification. >> >> For the 1-digit, 2-digit & 3-digit Year inputs, as I said, I didn't see any >> document in PG which will explain what would be the century considered if it >> is not given. If I missed out it somewhere please let me know. > > Keep in mind that the datetime stuff was abandoned by the maintainer > some years ago with quite some rough edges. Some of it has been fixed, > but a lot of bugs remain. Looks like this is one of those places and it > seems appropriate to spend some time fixing it. Since it would involve > a behavior change, it should only go to 9.2, of course. I wouldn't object to fixing the problem with # of digits > # of Ys in 9.1, if the fix is simple and clear-cut. I think we are still accepting patches to make minor tweaks, like the tab-completion patch I committed yesterday. It also doesn't bother me tremendously if we push it off, but I don't think that anyone's going to be too sad if TO_DATE('01-jan-2010', 'DD-MON-YYY') starts returning something more sensible than 3010-01-01. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] volatile markings to silence compilers
On Thu, Mar 17, 2011 at 12:36 AM, Bruce Momjian wrote: > Looking over the release notes, we have added a few 'volatile' storage > specifications to variables which are near longjump/TRY blocks to > silence compilers. I am worried that these specifications don't clearly > identify their purpose. Can we rename these to use a macro for > 'volatile' that will make their purpose clearer and perhaps their > removal one day easier? I don't particularly see the point of this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] really lazy vacuums?
On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh wrote: > Is it obvious that the visibillity map bits should track complete > pages and not individual tuples? If the visibillity map tracks at > page-level the benefit would fall on "slim tables" where you squeeze > 200 tuples into each page and having an update rate of 1% would > lower the likelyhood even more. (it may be that for slim tables the > index-only-scans are not as benefitial as to wide tables). I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but say it's 200. If you track visibility info per tuple rather than per page, then the size of the visibility map is going to expand by a factor of 200. That might decrease contention, but otherwise it's a bad thing - the whole point of having the visibility map in the first place is that it's much, much smaller than the heap. If it were the same size as the heap, we could just read the heap. What the map attempts to accomplish is to allow us, by reading a small number of pages, to check whether the tuples we're thinking of reading are likely to be all-visible without actually looking at them. > In collaboration with a vacuuming discussion, I dont know if it > is there allready but how about "opportunistic vacuuming". Say > you have a page what due to changes in one of the tuples are > being written out, will it, while being written out anyway get the > other tuples on the page vacuumed? The really lazy kind of vacuuming I'm talking about could be done this way. Regular vacuuming cannot, because you can't actually prune a dead tuple until you've scanned all the indexes for references to that CTID. The obvious place to do this would be the background writer: if the page is dirty anyway and we're about to evict it, we could decide to (1) set hint bits, (2) set visibility map bits, (3) freeze tuples that need freezing, and (4) identify dead tuples and reclaim the space they use, but not the associated line pointer. Sadly, I'm not sure this would help much. If we have, say, a 4MB relation, you're not even going to notice it when vacuum comes along and does its thing. Even a vacuum freeze is chump change. The problem is with big relations, like say 1GB+. Processing the whole table at once can have a material impact on system performance, so it'd be nice to do some work incrementally. But it's likely that doing it opportunistically as you evict things from shared buffers is only going to help here and there. Even if you optimistically assumed that we could opportunistically do 10% of the vacuuming that way, that's still not much of a dent. And I think it'd probably be less than that in most real-world cases. A further problem is that the background writer is already a pretty busy process, and giving it more things to do isn't very appealing from the standpoint of overall system performance. > It actually dont have to hook into the process directly to benefit > the IO-usage, if it just can get the opportunity to do it before > the page gets evicted from the OS-cache, then it would save a > second read on that page, but it seems way harder to do something > sane around that assumption. Yeah. > Really lazy vacuums would "only" benefit "really static tables" ? where > vacuuming is not that big a problem in the first place. I think the benefit would be tables that are either quite large (where the ability to do this incrementally would be an advantage over regular VACUUM) or insert-only (where we currently have no way to get PD_ALL_VISIBLE bits set without user intervention). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rectifying wrong Date outputs
Excerpts from Piyush Newe's message of jue mar 17 02:30:06 -0300 2011: > Sorry for creating the confusion. The table drawn was PostgreSQL vs EDB > Advanced Server. > Thanks Burce for clarification. > > For the 1-digit, 2-digit & 3-digit Year inputs, as I said, I didn't see any > document in PG which will explain what would be the century considered if it > is not given. If I missed out it somewhere please let me know. Keep in mind that the datetime stuff was abandoned by the maintainer some years ago with quite some rough edges. Some of it has been fixed, but a lot of bugs remain. Looks like this is one of those places and it seems appropriate to spend some time fixing it. Since it would involve a behavior change, it should only go to 9.2, of course. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Thu, Mar 17, 2011 at 8:24 AM, Heikki Linnakangas wrote: > Hmm, so setting synchronous_standby_names to '' takes effect immediately, > but other changes to it don't apply to already-blocked commits. That seems a > bit inconsistent. Perhaps walwriter should store the parsed list of > standby-names in shared memory, not just a boolean. I don't think this is necessary. In general, the current or potential WAL sender processes are responsible for working out among themselves whose job it is to release waiters, and doing it. As long as synchronous_standby_names is non-empty, then either (1) there are one or more standbys connected that can take on the role of synchronous standby, and whoever does will release waiters or (2) there are no standbys connected that can take on the role of synchronous standbys, in which case no waiters should be released until one connects. But when synchronous_standby_names becomes completely empty, that doesn't mean "wait until a standby connects whose application name is in the empty set and make him the synchronous standby" but rather "synchronous replication is administratively disabled, don't wait in the first place". So we just need a Boolean flag. > +1 otherwise. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Thu, Mar 17, 2011 at 2:08 AM, Fujii Masao wrote: > This occurs to me; we should ensure that, in shutdown case, walwriter > should exit after all the backends have gone out? I'm not sure if it's worth > thinking of the case, but what if synchronous_standby_names is unset > and config file is reloaded after smart shutdown is requested? In this > case, the reload cannot wake up the waiting backends since walwriter > has already exited. This behavior looks a bit inconsistent. I agree we need to fix smart shutdown. I think that's going to have to be a separate patch, however; it's got more problems than this patch can fix without expanding into a monster. > In the patch, in order to read the latest value, you take a light-weight lock. > But I wonder why taking a lock can ensure that the value is up-to-date. A lock acquisition acts as a memory sequence point. > + * WAL writer calls this as needed to update the shared sync_standbys_needed > > Typo: s/sync_standbys_needed/sync_standbys_defined Fixed. > + * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases. > > Typo: the reference to SYNC_REP_MUST_DISCONNECT is not required. Fixed. > + * So in this case we issue a NOTICE (which some clients may > > Typo: s/NOTICE/WARNING Fixed. > + if (ProcDiePending) > + { > + ereport(WARNING, > + (errcode(ERRCODE_ADMIN_SHUTDOWN), > + errmsg("canceling the wait for > replication and terminating > connection due to administrator command"), > + errdetail("The transaction has > already been committed locally > but might have not been replicated to the standby."))); > + whereToSendOutput = DestNone; > + LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); > + MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE; > + SHMQueueDelete(&(MyProc->syncRepLinks)); > > SHMQueueIsDetached() should be checked before calling SHMQueueDelete(). > Walsender can already delete the backend from the queue before reaching here. Fixed. But come to think of it, doesn't this mean SyncRepCleanupAtProcExit() needs to repeat the test after acquiring the lock? > + if (QueryCancelPending) > + { > + QueryCancelPending = false; > + ereport(WARNING, > + (errmsg("canceling wait for > synchronous replication due to user request"), > + errdetail("The transaction has > committed locally, but may not > have replicated to the standby."))); > + LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); > + MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE; > + SHMQueueDelete(&(MyProc->syncRepLinks)); > + LWLockRelease(SyncRepLock); > > Same as above. Fixed. > + if (!PostmasterIsAlive(true)) > + { > + whereToSendOutput = DestNone; > + proc_exit(1); > > proc_exit() should not be called at that point because it leads PANIC. Fixed, although I'm not sure it matters. > I think that it's better to check ProcDiePending, QueryCancelPending > and PostmasterIsAlive *before* waiting on the latch, not after. Because > those events can occur before reaching there, and it's not worth waiting > for 60 seconds to detect them. Not necessary. Inspired by one of your earlier patches, I made die() and StatementCancelHandler() set the latch. We could still wait up to 60 s to detect postmaster death, but that's a very rare situation and it's not worth slowing down the common case for it, especially since there is a race condition no matter what. Thanks for the review! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company sync-rep-wait-fixes-v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] really lazy vacuums?
2011/3/17 Robert Haas : > On Wed, Mar 16, 2011 at 6:36 PM, Jim Nasby wrote: >> One way to look at this is that any system will have a limit on how quickly >> it can vacuum everything. If it's having trouble dedicating enough IO to >> vacuum, then autovac is going to have a long list of tables that it wants to >> vacuum. When you're in that situation, you want to get to the next table >> that needs vacuuming as quickly as possible, so if you've run through the >> first heap scan and found only a limited number of dead tuples, it doesn't >> make sense to spend a bunch of time scanning indexes and making a second >> heap scan (though, IIRC the second scan doesn't hit the entire heap; it only >> hits the tuples that were remembered as being dead). > > I mostly agree with this, but you also can't postpone vacuuming > indefinitely just because you're too busy; that's going to blow up in > your face. > >> Of course, going along the lines of an autovac-based tuning mechanism, you >> have to question how a table would show up for autovac if there's not >> actually a number of dead tuples. One scenario is freezing (though I'm not >> sure if your super-lazy vacuum could freeze tuples or not). Another is >> inserts. That might become a big win; you might want to aggressively scan a >> table that gets data loaded into it in order to set hint/all visible bits. > > Right. Really-lazy vacuum could freeze tuples. Unlike regular > vacuum, it can also sensibly be done incrementally. One thing I was > thinking about is counting the number of times that we fetched a tuple > that was older than RecentGlobalXmin and had a committed xmin and an > invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's > happening a lot, it probably means that some vacuuming would speed > things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you > could work out some formula where you do a variable amount of > super-lazy vacuuming depending on the number of such tuple fetches. > The trick would be to avoid overdoing it (so that you swamp the I/O > system) or underdoing it (so that the system never converges). It > would be really nice (for this and for other things) if we had some > way of measuring the I/O saturation of the system, so that we could > automatically adjust the aggressiveness of background processes > Yes. I am thinking of something like that (the IO saturation measurement) to let the background writer try to work on hint bit when it does not have so much to do, if IO ressources are ok. > > Note also that if and when we get index-only scans, making sure the > PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get > set is going to be a lot more important. > >> From a manual standpoint, ISTM that super-lazy vac would be extremely useful >> for dealing with hint bits after a bulk insert to a table that also has some >> update activity. Using a regular vacuum in that case would result in a lot >> of extra work to deal with the small number of dead tuples. > > I can see that. > >> Perhaps it would be useful to write a script that analyzed the output of >> vacuum verbose looking for tables where a super-lazy vacuum would have made >> sense (assuming vacuum verbose provides the needed info). If we had such a >> script we could ask folks to run it and see how much super-lazy vacuuming >> would help in the real world. > > I'm a bit doubtful about this part. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On 16.03.2011 19:35, Robert Haas wrote: 3. If synchronous_standby_names is changed to '' by editing postgresql.conf and issuing pg_ctl reload, then cancel all waits in progress and wake everybody up. As I mentioned before, reloading the config file from within the waiting backend (which can't safely throw an error) seems risky, so what I did instead is made WAL writer responsible for handling this. Nobody's allowed to wait for sync rep unless a global shared memory flag is set, and the WAL writer process is responsible for setting and clearing this flag when the config file is reloaded. This has basically no performance cost; WAL writer only ever does any extra work at all with this code when it receives a SIGHUP, and even then the work is trivial except in the case where synchronous_standby_names has changed from empty to non-empty or visca versa. The advantage of putting this in WAL writer rather than, say, bgwriter is that WAL writer doesn't have nearly as many jobs to do and they don't involve nearly as much I/O, so the chances of a long delay due to the process being busy are much less. Hmm, so setting synchronous_standby_names to '' takes effect immediately, but other changes to it don't apply to already-blocked commits. That seems a bit inconsistent. Perhaps walwriter should store the parsed list of standby-names in shared memory, not just a boolean. +1 otherwise. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] really lazy vacuums?
Robert Haas wrote: Right. Really-lazy vacuum could freeze tuples. Unlike regular vacuum, it can also sensibly be done incrementally. One thing I was thinking about is counting the number of times that we fetched a tuple that was older than RecentGlobalXmin and had a committed xmin and an invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's happening a lot, it probably means that some vacuuming would speed things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you could work out some formula where you do a variable amount of super-lazy vacuuming depending on the number of such tuple fetches. The trick would be to avoid overdoing it (so that you swamp the I/O system) or underdoing it (so that the system never converges). It would be really nice (for this and for other things) if we had some way of measuring the I/O saturation of the system, so that we could automatically adjust the aggressiveness of background processes accordingly. Note also that if and when we get index-only scans, making sure the PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get set is going to be a lot more important. Is it obvious that the visibillity map bits should track complete pages and not individual tuples? If the visibillity map tracks at page-level the benefit would fall on "slim tables" where you squeeze 200 tuples into each page and having an update rate of 1% would lower the likelyhood even more. (it may be that for slim tables the index-only-scans are not as benefitial as to wide tables). In collaboration with a vacuuming discussion, I dont know if it is there allready but how about "opportunistic vacuuming". Say you have a page what due to changes in one of the tuples are being written out, will it, while being written out anyway get the other tuples on the page vacuumed? It actually dont have to hook into the process directly to benefit the IO-usage, if it just can get the opportunity to do it before the page gets evicted from the OS-cache, then it would save a second read on that page, but it seems way harder to do something sane around that assumption. Really lazy vacuums would "only" benefit "really static tables" ? where vacuuming is not that big a problem in the first place. -- Jesper - Demonstrating totally lack of insight I would assume. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] volatile markings to silence compilers
On Thu, Mar 17, 2011 at 12:36:59AM -0400, Bruce Momjian wrote: > Looking over the release notes, we have added a few 'volatile' storage > specifications to variables which are near longjump/TRY blocks to > silence compilers. I am worried that these specifications don't clearly > identify their purpose. Can we rename these to use a macro for > 'volatile' that will make their purpose clearer and perhaps their > removal one day easier? The question is, are they wrong? The longjmp manpage says: The values of automatic variables are unspecified after a call to longjmp() if they meet all the following criteria: · they are local to the function that made the corresponding setjmp(3) call; · their values are changed between the calls to setjmp(3) and longjmp(); and · they are not declared as volatile. It appears the issue is mostly that the compiler is unable to prove that the variables aren't changed. It's hard because the buffer created by setjmp() doesn't expire. We know that after PG_END_TRY() the buffer won't be used, but apparently the compiler doesn't. My point is, are we hopeful this problem will ever go away? Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle signature.asc Description: Digital signature