Re: [HACKERS] Function and view to retrieve WAL receiver status
On Sun, Dec 13, 2015 at 10:15 PM, Michael Paquier wrote:
> On Mon, Dec 14, 2015 at 3:09 PM, Gurjeet Singh wrote:
> > On Dec 13, 2015 9:56 PM, "Michael Paquier" wrote:
> >> If the node has no WAL receiver active, a tuple with NULL values is
> >> returned instead.
> >
> > IMO, in the absence of a WAL receiver the SRF (and the view) should not
> > return any rows.
>
> The whole point is to not use a SRF in this case: there is always at
> most one WAL receiver.

The function, maybe. But emitting an all-NULLs row from a view seems
counter-intuitive, at least when looking at it in the context of a
relational database.

--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] Function and view to retrieve WAL receiver status
On Dec 13, 2015 9:56 PM, "Michael Paquier" wrote:
>
> If the node has no WAL receiver active, a tuple with NULL values is
> returned instead.

IMO, in the absence of a WAL receiver the SRF (and the view) should not
return any rows.
Re: [HACKERS] Limit GIST_MAX_SPLIT_PAGES to XLR_MAX_BLOCK_ID
(adding Heikki, since that macro was touched by his commit 04e298b8)

Does my previous description of the problem make sense, or am I fretting
over something that's a non-issue?

Best regards,

On Sun, Sep 20, 2015 at 7:03 PM, Gurjeet Singh wrote:
> The Gin code respects XLR_MAX_BLOCK_ID when calling
> XLogEnsureRecordSpace(), but it appears that the Gist code does not
> try to limit its block-id consumption to XLR_MAX_BLOCK_ID.
>
> GIST_MAX_SPLIT_PAGES is hard-coded at 75, but XLogEnsureRecordSpace()
> would reject a request of more than 32 (XLR_MAX_BLOCK_ID).
>
> The attached patch redefines GIST_MAX_SPLIT_PAGES so that in case of a
> split, gistplacetopage() now throws an error when the block-ids needed
> exceed 32.
>
> I have used Min(75, XLR_MAX_BLOCK_ID) as the macro expansion, but I
> believe it can be set to plain XLR_MAX_BLOCK_ID.
>
> --
> Gurjeet Singh http://gurjeet.singh.im/

--
Gurjeet Singh http://gurjeet.singh.im/
[HACKERS] Limit GIST_MAX_SPLIT_PAGES to XLR_MAX_BLOCK_ID
The Gin code respects XLR_MAX_BLOCK_ID when calling
XLogEnsureRecordSpace(), but it appears that the Gist code does not try
to limit its block-id consumption to XLR_MAX_BLOCK_ID.

GIST_MAX_SPLIT_PAGES is hard-coded at 75, but XLogEnsureRecordSpace()
would reject a request of more than 32 (XLR_MAX_BLOCK_ID).

The attached patch redefines GIST_MAX_SPLIT_PAGES so that in case of a
split, gistplacetopage() now throws an error when the block-ids needed
exceed 32.

I have used Min(75, XLR_MAX_BLOCK_ID) as the macro expansion, but I
believe it can be set to plain XLR_MAX_BLOCK_ID.

--
Gurjeet Singh http://gurjeet.singh.im/

gist_limit_blockid_to_XLR_MAX_BLOCK_ID.patch
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication slot restart_lsn initialization
On Mon, Aug 10, 2015 at 7:12 AM, Andres Freund wrote:
> On 2015-07-07 09:42:54 -0700, Gurjeet Singh wrote:
> > /*
> > + * Grab and save an LSN value to prevent WAL recycling past that point.
> > + */
> > +void
> > +ReplicationSlotRegisterRestartLSN()
> > +{
>
> Didn't like that description and function name very much. What does
> 'grabbing' mean here? Should probably mention that it works on the
> currently active slot and modifies it.

In your version, I don't see a comment that refers to the fact that it
works on the currently active (global variable) slot.

> It's now ReplicationSlotReserveWal()

Okay.

> > +    ReplicationSlot *slot = MyReplicationSlot;
> > +
> > +    Assert(slot != NULL);
> > +    Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
> > +
> > +    /*
> > +     * The replication slot mechanism is used to prevent removal of required
> > +     * WAL. As there is no interlock between this and checkpoints required, WAL
> > +     * segment could be removed before ReplicationSlotsComputeRequiredLSN() has
> > +     * been called to prevent that. In the very unlikely case that this happens
> > +     * we'll just retry.
> > +     */
>
> You removed some punctuation in that sentence converting a sentence in
> bad english into one without the original meaning ;). See the attached
> for a new version.

Your version looks better.

> > +/*
> >  * Flush all replication slots to disk.
> >  *
> >  * This needn't actually be part of a checkpoint, but it's a convenient
> > @@ -876,7 +942,7 @@ StartupReplicationSlots(void)
> >  }
> >
> >  /*
> > - * Manipulation of ondisk state of replication slots
> > + * Manipulation of on-disk state of replication slots
> >  *
> >  * NB: none of the routines below should take any notice whether a slot is the
> >  * current one or not, that's all handled a layer above.
> > diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
> > index 9a2793f..01b376a 100644
> > --- a/src/backend/replication/slotfuncs.c
> > +++ b/src/backend/replication/slotfuncs.c
> > @@ -40,6 +40,7 @@ Datum
> >  pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
> >  {
> >      Name        name = PG_GETARG_NAME(0);
> > +    bool        immediately_reserve = PG_GETARG_BOOL(1);
> >      Datum       values[2];
> >      bool        nulls[2];
> >      TupleDesc   tupdesc;
> > @@ -58,10 +59,28 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
> >      /* acquire replication slot, this will check for conflicting names */
> >      ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT);
> >
> > -    values[0] = NameGetDatum(&MyReplicationSlot->data.name);
> > +    if (immediately_reserve)
> > +    {
> > +        /* Allocate restart-LSN, if the user asked for it */
> > +        ReplicationSlotRegisterRestartLSN();
> > +
> > +        /* Write this slot to disk */
> > +        ReplicationSlotMarkDirty();
> > +        ReplicationSlotSave();
> >
> > -    nulls[0] = false;
> > -    nulls[1] = true;
> > +        values[0] = NameGetDatum(&MyReplicationSlot->data.name);
> > +        values[1] = LSNGetDatum(MyReplicationSlot->data.restart_lsn);
> > +
> > +        nulls[0] = false;
> > +        nulls[1] = false;
> > +    }
> > +    else
> > +    {
> > +        values[0] = NameGetDatum(&MyReplicationSlot->data.name);
> > +
> > +        nulls[0] = false;
> > +        nulls[1] = true;
> > +    }
>
> I moved
>     values[0] = NameGetDatum(&MyReplicationSlot->data.name);
>     nulls[0] = false;
> to outside the conditional block, there seems no reason to have it in
> there?

The assignment to values[0] is being done twice, so we can do away with
the one in the else part of the if condition.

Also, there was a typo in my patch [1] and it seems to have made it into
the commit. Tom seems to have just fixed it in commit 750fc78b.

Best regards,

[1]: I alerted you to it in a personal email, but it probably fell
through the cracks.

--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] 64-bit XIDs again
On Jul 30, 2015 2:23 PM, "Tom Lane" wrote:
>
> Gavin Flower writes:
> > On 31/07/15 02:24, Heikki Linnakangas wrote:
> >> There is a big downside to expanding xmin/xmax to 64 bits: it takes
> >> space. More space means more memory needed for caching, more memory
> >> bandwidth, more I/O, etc.
>
> > I think having a special case to save 32 bits per tuple would cause
> > unnecessary complications, and the savings are minimal compared to the
> > size of current modern storage devices and the typical memory used in
> > serious database servers.
>
> I think the argument that the savings are minimal is pretty thin.
> It all depends on how wide your tables are --- but on a narrow table, say
> half a dozen ints, the current tuple size is 24 bytes header plus the same
> number of bytes of data. We'd be going up to 32 bytes header which makes
> for a 16% increase in physical table size. If your table is large,
> claiming that 16% doesn't hurt is just silly.
>
> But the elephant in the room is on-disk compatibility. There is
> absolutely no way that we can just change xmin/xmax to 64 bits without a
> disk format break. However, if we do something like what Heikki is
> suggesting, it's at least conceivable that we could convert incrementally
> (ie, if you find a page with the old header format, assume all tuples in
> it are part of epoch 0; and do not insert new tuples into it unless there
> is room to convert the header to new format ... but I'm not sure what we
> do about tuple deletion if the old page is totally full and we need to
> write an xmax that's past 4G).

Can we safely relegate the responsibility of tracking the per-block
epoch to a relation fork?
Re: [HACKERS] "A huge debt of gratitude" - Michael Stonebraker
On Jul 22, 2015 12:07 PM, "Jolly Chen" wrote:
>
> Hey everyone,
>
> You have probably heard that Mike Stonebraker recently won the Turing
> award. A recording of his award lecture is available at:
> https://www.youtube.com/watch?v=BbGeKi6T6QI
>
> It is an entertaining talk overall. If you fast forward to about the
> 1:07 mark, he makes some comments about postgres.
>
> Here's my rough transcription:
>
> "The abstract data type system in postgres has been added to a lot of
> relational database systems. It's kind of de facto table stakes for
> relational databases these days, essentially intact. That idea was
> really a good one. It was mentioned in the citation for my Turing award
> winning. However, serendipity played a huge role, which is, the biggest
> impact of postgres by far came from two Berkeley students that I'll
> affectionately call Grumpy and Sleepy. They converted the academic
> postgres prototype from QUEL to SQL in 1995. This was in parallel to the
> commercial activity. And then a pick-up team of volunteers, none of whom
> have anything to do with me or Berkeley, have been shepherding that open
> source system ever since 1995. The system that you get off the web for
> postgres comes from this pick-up team. It is open source at its best and
> I want to just mention that I have nothing to do with that and that
> collection of folks we all owe a huge debt of gratitude to, because they
> have robustized that code line and made it so it really works."
>
> Thank you all so much for your hard work over the last twenty years!!
>
> Affectionately,
>
> Grumpy

Thank you! And a big thanks to the stewards of the project.

Sincerely,
Re: [HACKERS] ALTER TABLE .. ADD PRIMARY KEY .. USING INDEX has dump-restore hazard
On Tue, Jul 21, 2015 at 9:23 AM, Robert Haas wrote:
> rhaas=# create unique index on foo (a collate "C");
> CREATE INDEX
> rhaas=# alter table foo add primary key using index foo_a_idx;
> ALTER TABLE
>
> Now dump and restore this database. Then:
>
> Notice that the collation specifier is gone. Oops.

The intent of the 'USING INDEX' feature was to work around the problem
that PRIMARY KEY indexes cannot be reindexed concurrently to reduce
bloat. Considering that the feature doesn't work [1] (at least as shown
in the example in the docs [2]) if there are any foreign keys
referencing the table, there's scope for improvement.

There was a proposal to support reindexing the primary key index, but I
am not sure where that stands. If that's already in, or soon to be in
core, we may put a deprecation notice around USING INDEX. OTOH, if
reindexing primary keys is too difficult, we may want to support index
replacement via a top-level ALTER TABLE subcommand that works in
CONCURRENT mode, and not recommend the method shown in the docs (i.e.,
deprecate the USING INDEX syntax).

[1]: The DROP CONSTRAINT part of the command fails if there are any
foreign keys pointing to this table.
[2]: http://www.postgresql.org/docs/9.4/static/sql-altertable.html

--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] ALTER TABLE .. ADD PRIMARY KEY .. USING INDEX has dump-restore hazard
On Wed, Jul 22, 2015 at 7:34 AM, Robert Haas wrote:
> On Tue, Jul 21, 2015 at 8:34 PM, Michael Paquier wrote:
> > On Wed, Jul 22, 2015 at 1:23 AM, Robert Haas wrote:
> >> Notice that the collation specifier is gone. Oops.
> >
> > As it is not possible to specify directly a constraint for a PRIMARY
> > KEY expression, what about switching dumpConstraint to have it use
> > first a CREATE INDEX query with the collation and then use ALTER TABLE
> > to attach the constraint to it? I am noticing that we already fetch
> > the index definition in indxinfo via pg_get_indexdef. Thoughts?
>
> I guess the question on my mind is "Why is it that you can't do this
> directly when creating a primary key, but you can do it when turning
> an existing index into a primary key?"
>
> If there's a good reason for prohibiting doing this when creating a
> primary key, then presumably it shouldn't be allowed when turning an
> index into a primary key either. If there's not, then why not extend
> the PRIMARY KEY syntax to allow it?

+1.

I think in the short term we can treat this as a bug, and "fix" it by
disallowing an index with attributes that cannot be present in an index
created by a PRIMARY KEY constraint. The collation attribute on one of
the keys may be just one of many such attributes.

In the long term, we may want to allow collations in primary keys, but
that will be a feature ideally suited for a major version release.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue
On Thu, Jun 25, 2015 at 8:43 PM, Brendan Jurd wrote:
> On Fri, 26 Jun 2015 at 06:03 Gurjeet Singh wrote:
>
>> s/proportion/fraction/
>
> I think of these as synonymous -- do you have any particular reason to
> prefer "fraction"? I don't feel strongly about it either way, so I'm
> quite happy to go with fraction if folks find that more expressive.

It just feels better to me in this context. If the number of times each
is used in the Postgres code is any measure, 'fraction' wins hands down:
"proportion": 33, "fraction": 620. I don't feel strongly about it,
either. I can leave it up to the committer to decide.

>> + * The caller must hold (at least) shared AsyncQueueLock.
>>
>> A possibly better wording: The caller must hold AsyncQueueLock in (at
>> least) shared mode.
>
> Yes, that is more accurate.

OK.

>> Unnecessary whitespace changes in pg_proc.h for existing functions.
>
> I did group the asynchronous notification functions together, which
> seemed reasonable as there are now three of them, and changed the
> tabbing between the function name and namespace ID to match, as is done
> elsewhere in pg_proc.h. I think those changes improve readability, but
> again I don't feel strongly about it.

Fair enough.

>> +DESCR("get the current usage of the asynchronous notification queue");
>>
>> A possibly better wording: get the fraction of the asynchronous
>> notification queue currently in use
>
> I have no objections to your wording.

OK. Please send a new patch with the changes you agree to, and I can
mark it ready for committer.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] More logging for autovacuum
On Tue, Jul 7, 2015 at 11:20 AM, Sawada Masahiko wrote:
> On Sat, Jul 4, 2015 at 2:30 PM, Amit Kapila wrote:
> > On Thu, Jul 2, 2015 at 4:41 AM, Gurjeet Singh wrote:
> >> log_min_messages acts as a single gate for everything headed for the
> >> server logs; controls for per-background process logging do not
> >> exist. If one wants to see DEBUG/INFO messages for just the
> >> Autovacuum (or checkpointer, bgwriter, etc.), they have to set
> >> log_min_messages to that level, but the result would be a lot of
> >> clutter from other processes to grovel through, to see the messages
> >> of interest.
> >
> > I think that will be quite helpful. During the patch development of
> > parallel sequential scan, it was quite helpful to see the LOG
> > messages of bgworkers, however one of the recent commits (91118f1a)
> > have changed those to DEBUG1, now if I have to enable all DEBUG1
> > messages, then what I need will be difficult to find in all the log
> > messages. Having control of separate logging for background tasks
> > will serve such a purpose.
> >
> >> The facilities don't yet exist, but it'd be nice if such parameters
> >> when unset (ie NULL) pick up the value of log_min_messages. So by
> >> default, the users will get the same behaviour as today, but can
> >> choose to tweak per background-process logging when needed.
> >
> > Will this proposal allow user to see all the messages by all the
> > background workers or will it be per background task. Example in some
> > cases user might want to see the messages by all bgworkers and in
> > some cases one might want to see just messages by AutoVacuum and it's
> > workers.

I want the logging to be controlled per background process, so that the
user can pick and choose how much detail they want from each process. In
the absence of such a setting for a background process, the global
log_min_messages should control the logging.
> > I think here designing the user interface is an important work, if
> > others also agree that such an option is useful; could you please
> > elaborate your proposal?
>
> +1

I would like this feature to be as simple as the current
log_min_messages; currently a superuser can do ALTER USER X SET
log_min_messages = Y, and control logging for specific users or
databases or a combination of those.

In the same vein, setting autovacuum.log_min_messages in
postgresql.conf, or a corresponding ALTER SYSTEM, should be enough to
use a different setting of log_min_messages inside the autovacuum
launcher and its worker processes. I am using the autovacuum process as
an example, but the same is applicable to other background processes as
well, e.g. checkpointer.log_min_messages.

As for the implementation details: upon startup, and after every reload,
the autovacuum launcher process will look for
autovacuum.log_min_messages being defined; if yes, it'd set the global
variable log_min_messages in code to that value, and if not defined in
any of the conf files, the global variable in code would be assigned
whatever is assigned to the GUC parameter log_min_messages.
DefineCustomXXVariable() seems incapable of supporting this behaviour,
so we might have to invent a new one.

This will allow us to keep the current behaviour as the default when
autovacuum.log_min_messages is not defined, and provide finer control
over autovacuum's logging by defining this variable when needed. The
same goes for the checkpointer, bgwriter, WAL writer, and bgworkers.

BTW, the facility we invent needn't be limited to log_min_messages, but
that's just what we need now.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
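A sketch of what the proposed knobs might look like in postgresql.conf.
None of these per-process parameters exist today; the names are
hypothetical and only illustrate the proposal above:

```
# Global default, as today
log_min_messages = warning

# Hypothetical per-background-process overrides; any parameter left
# unset would fall back to the global log_min_messages.
autovacuum.log_min_messages   = debug1
checkpointer.log_min_messages = info
bgwriter.log_min_messages     = warning
```

With such settings, only the autovacuum launcher and its workers would
emit DEBUG1 detail, while the rest of the server stays at the default
verbosity.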
Re: [HACKERS] replication slot restart_lsn initialization
On Tue, Jul 7, 2015 at 6:49 AM, Andres Freund wrote:
> On 2015-07-07 06:41:55 -0700, Gurjeet Singh wrote:
> > There seems to be a misplaced not operator ! in that if statement, as
> > well. That sucks :( The MacOS gcc binary is actually clang, and its
> > output is too noisy [1], which is probably why I might have missed
> > the warning, if any.
>
> Which version of clang is it? With newer ones you can individually
> disable nearly all of the warnings. I regularly use clang, and most of
> the time it compiles master without warnings.

$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)
Target: x86_64-apple-darwin14.4.0
Thread model: posix

I see that -Wno-parentheses-equality might help, but I am not looking
forward to maintaining OS-specific flags for
clang-that-pretends-to-be-gcc in my shell scripts [2].

[2] https://github.com/gurjeet/pgd

Attached is a patch that introduces SlotIsPhysical/SlotIsLogical
macros. This patch, if accepted, goes on top of the v4 patch.

> > > > pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
> > > > {
> > > >     Name        name = PG_GETARG_NAME(0);
> > > > +   bool        activate = PG_GETARG_BOOL(1);
> > >
> > > Don't like 'activate' much as a name. How about
> > > 'immediately_reserve'?
> >
> > I still like 'activate', but okay. How about 'reserve_immediately'
> > instead?
>
> Activate is just utterly wrong. A slot isn't inactive before. And
> 'active' already is used for slots that are currently in use
> (i.e. connected to).
>
> > Also, do you want this name change just in the C code, or in the
> > pg_proc and docs as well?
>
> All.

Patch v4 attached.

On a side note, I see that the pg_create_*_replication_slot() functions
do not behave transactionally; that is, rolling back a transaction does
not undo the slot creation.
Should we prevent these, and the corresponding drop functions, from
being called inside an explicit transaction? PreventTransactionChain()
is geared towards serving just the utility commands. Do we protect
against calling such functions in a transaction block, or from user
functions? How?

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/

physical_repl_slot_activate_restart_lsn.v4.patch
Description: Binary data

slot_is_physical_or_logical_macros.patch
Description: Binary data
Re: [HACKERS] replication slot restart_lsn initialization
On Tue, Jul 7, 2015 at 4:59 AM, Andres Freund wrote:
> On 2015-06-10 13:13:41 -0700, Gurjeet Singh wrote:
> > +    /*
> > +     * Log an xid snapshot for logical replication. It's not needed
> > +     * for physical slots as it is done in BGWriter on a regular
> > +     * basis.
> > +     */
> > +    if (!slot->data.persistency == RS_PERSISTENT)
> > +    {
> > +        /* make sure we have enough information to start */
> > +        flushptr = LogStandbySnapshot();
> > +
> > +        /* and make sure it's fsynced to disk */
> > +        XLogFlush(flushptr);
> > +    }
>
> Huh? The slot->data.persistency == RS_PERSISTENT check seems pretty
> much entirely random to me.

There seems to be a misplaced not operator ! in that if statement, as
well. That sucks :( The MacOS gcc binary is actually clang, and its
output is too noisy [1], which is probably why I might have missed the
warning, if any.

[1]: I am particularly annoyed by these:

note: remove extraneous parentheses around the comparison to silence
this warning
note: use '=' to turn this equality comparison into an assignment

E.g.: if (((opaque)->btpo_next == 0))

I'll see what I can do about these.

> I mean physical slots can (and normally are) persistent as well?
> Instead check whether it's a database specific slot.

Agreed, the slot being database-specific is the right indicator.

> The reasoning why it's not helpful for physical slots is wrong. The
> point is that a standby snapshot at that location will simply not help
> physical replication, it's only going to read ones after the location
> at which the base backup starts (i.e. the location from the backup
> label).
>
> > pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
> > {
> >     Name        name = PG_GETARG_NAME(0);
> > +   bool        activate = PG_GETARG_BOOL(1);
>
> Don't like 'activate' much as a name. How about 'immediately_reserve'?

I still like 'activate', but okay. How about 'reserve_immediately'
instead?

Also, do you want this name change just in the C code, or in the pg_proc
and docs as well?
> Other than that it's looking good to me.

Will send a new patch after your feedback on the 'activate' renaming.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] More logging for autovacuum
On Fri, Aug 17, 2007 at 3:14 PM, Alvaro Herrera wrote:
> Gregory Stark wrote:
> >
> > I'm having trouble following what's going on with autovacuum and I'm
> > finding the existing logging insufficient. In particular that it's
> > only logging vacuum runs *after* the vacuum finishes makes it hard to
> > see what vacuums are running at any given time. Also, I want to see
> > what is making autovacuum decide to forgo vacuuming a table and the
> > log with that information is at DEBUG2.
>
> So did this idea go anywhere?

Assuming the thread stopped here, I'd like to rekindle the proposal.

log_min_messages acts as a single gate for everything headed for the
server logs; controls for per-background process logging do not exist.
If one wants to see DEBUG/INFO messages for just the Autovacuum (or
checkpointer, bgwriter, etc.), they have to set log_min_messages to that
level, but the result would be a lot of clutter from other processes to
grovel through, to see the messages of interest.

The facilities don't yet exist, but it'd be nice if such parameters,
when unset (ie NULL), pick up the value of log_min_messages. So by
default, users will get the same behaviour as today, but can choose to
tweak per background-process logging when needed.

Absent such a feature, one hack is to set the desired log_min_messages
value in the conf file and send SIGHUP to just the process of interest,
then revert the conf file; but it's a hack.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Fri, Jun 26, 2015 at 9:45 PM, Amit Kapila wrote:
> Sometime back on one of the PostgreSQL blogs [1], there was discussion
> about the performance of drop/truncate table for large values of
> shared_buffers, and it seems that as the value of shared_buffers
> increases, the performance of drop/truncate table becomes worse. I
> think those are not often used operations, so it never became a
> priority to look into improving them if possible.
>
> I have looked into it and found that the main reason for such
> behaviour is that for those operations it traverses the whole of
> shared_buffers, and it seems to me that we don't need that, especially
> for not-so-big tables. We can optimize that path by looking into the
> buffer mapping table for the pages that exist in shared_buffers, for
> the case when the table size is less than some threshold (say 25%) of
> shared buffers.
>
> Attached patch implements the above idea and I found that performance
> doesn't dip much with the patch, even with large values of
> shared_buffers. I have also attached the script and sql file used to
> take the performance data.

+1 for the effort to improve this.

With your technique added, there are 3 possible ways the search can
happen: a) scan NBuffers and scan the list of relations, b) scan
NBuffers and bsearch the list of relations, and c) scan the list of
relations and then invalidate blocks of each fork from shared buffers.
Would it be worth finding one technique that can serve decently from the
low end of shared_buffers to the high end?

On the patch:

There are multiple naming styles being used in
DropForkSpecificBuffers(); my_name and myName. Given this is a new
function, it'd help to be consistent.

s/blk_count/blockNum/

s/new//, for e.g. newTag, because there's no corresponding tag/oldTag
variable in the function.

s/blocksToDel/blocksToDrop/. BTW, we never pass anything other than the
total number of blocks in the fork, so we may as well call it just
numBlocks.
s/traverse_buf_freelist/scan_shared_buffers/, because when it is true,
we scan the whole of shared_buffers.

s/rel_count/rel_num/

Reduce the indentation/tabs in the header comments of
DropForkSpecificBuffers(). But I see there's precedent in neighboring
functions, so this may be okay.

Doing pfree() of num_blocks, num_fsm_blocks and num_vm_blocks in one
place (instead of two, at different indentation levels) would help
readability.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue
Patch reviewed following the instructions on
https://wiki.postgresql.org/wiki/Reviewing_a_Patch

# Submission review

- Is the patch in a patch format which has context? Yes.
  Note to other reviewers: `git diff --patience` yields a patch better
  suited for readability.
- Does it apply cleanly to the current git master? Yes.
- Does it include reasonable tests, necessary doc patches, etc?
  Doc patch - Yes. Tests - Yes.

# Usability review

- Does the patch actually implement the feature? Yes.
- Do we want that? Yes; see the discussion on the mailing list.
- Do we already have it? No.
- Does it follow the SQL spec, or the community-agreed behavior? Yes.
  It seems to implement the behavior agreed upon in the mailing list.
- Does it include pg_dump support (if applicable)? N/A
- Are there dangers? None that I could spot.
- Have all the bases been covered? There's room for an additional test
  which tests for a non-zero return value.

# Feature test

- Does the feature work as advertised? Yes. Build configured with
  '--enable-debug --enable-cassert CFLAGS=-O0'. With a slightly
  aggressive notifications-in-a-loop script I was able to see a
  non-zero return value:

  Session 1: listen ggg; begin;
  Session 2: while sleep 0.1; do echo 'notify ggg; select
  pg_notification_queue_usage();' ; done | psql

- Are there corner cases the author has failed to consider? No.
- Are there any assertion failures or crashes? The patch exposes an
  already available function to the SQL interface, and rearranges code,
  so it doesn't look like the patch can induce an assertion failure or
  a crash.
- Performance review: Not evaluated, since it's not a performance
  patch, and it doesn't seem to impact any performance-critical code
  path, either.

Additional notes:

The patch updates the docs of another function (pg_listening_channels),
but the update is an improvement so we can let it slide :)

+ proportion of the queue that is currently occupied by pending
notifications.
s/proportion/fraction/

+ * The caller must hold (at least) shared AsyncQueueLock.

A possibly better wording: The caller must hold AsyncQueueLock in (at
least) shared mode.

Unnecessary whitespace changes in pg_proc.h for existing functions.

+DESCR("get the current usage of the asynchronous notification queue");

A possibly better wording: get the fraction of the asynchronous
notification queue currently in use

On Thu, Jun 18, 2015 at 10:47 AM, Merlin Moncure wrote:
> On Wed, Jun 17, 2015 at 6:15 PM, Brendan Jurd wrote:
> > Posting v2 of the patch, incorporating some helpful suggestions from
> > Merlin.
>
> Based on a perfunctory scan of the code, I think this is gonna make it,
> unless you draw some objections based on lack of necessity.
>
> merlin

--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue
I don't see this in the CF app; can you please add it there?

Best regards,

On Wed, Jun 17, 2015 at 3:31 AM, Brendan Jurd wrote:
> Hello hackers,
>
> I present a patch to add a new built-in function
> pg_notify_queue_saturation().
>
> The purpose of the function is to allow users to monitor the health of
> their notification queue. In certain cases, a client connection
> listening for notifications might get stuck inside a transaction, and
> this would cause the queue to keep filling up, until finally it reaches
> capacity and further attempts to NOTIFY error out.
>
> The current documentation under LISTEN explains this possible gotcha,
> but doesn't really suggest a useful way to address it, except to
> mention that warnings will show up in the log once you get to 50%
> saturation of the queue. Unless you happen to be eyeballing the logs
> when it happens, that's not a huge help. The choice of 50% as a
> threshold is also very much arbitrary, and by the time you hit 50% the
> problem has likely been going on for quite a while. If you want your
> nagios (or whatever) to, say, alert you when the queue goes over 5% or
> 1%, your options are limited and awkward.
>
> The patch has almost no new code. It makes use of the existing logic
> for the 50% warning. I simply refactored that logic into a separate
> function asyncQueueSaturation, and then added
> pg_notify_queue_saturation to make that available in SQL.
>
> I am not convinced that pg_notify_queue_saturation is the best possible
> name for this function, and am very much open to other suggestions.
>
> The patch includes documentation, a regression test and an isolation
> test.
>
> Cheers,
> BJ

--
Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] replication slot restart_lsn initialization
On Wed, Jun 10, 2015 at 9:10 AM, Gurjeet Singh wrote: > > I am in the process of writing up a doc patch, and will submit that as > well in a short while. > Please find attached the patch with the doc update. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ physical_repl_slot_activate_restart_lsn.v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication slot restart_lsn initialization
On Wed, Jun 10, 2015 at 8:36 AM, Andres Freund wrote: > On 2015-06-10 08:24:23 -0700, Gurjeet Singh wrote: > > On Wed, Jun 10, 2015 at 8:07 AM, Andres Freund > wrote: > > > That doesn't look right to me. Why is this code logging a standby > > > snapshot for physical slots? > > > > > > > This is the new function I referred to above. The logging of the snapshot > > is in 'not RecoveryInProgress()' case, meaning it's running in primary > and > > not in a standby. > > It's still done uselessly for physical slots? > I missed the comments on LogStandbySnapshot(). Attached is the new patch which does the snapshot logging only for a logical replication slot. I am in the process of writing up a doc patch, and will submit that as well in a short while. I have removed a few #include statements which seemed unnecessary. These changes are not relevant to the patch, so I can readily revert those parts if deemed necessary. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ physical_repl_slot_activate_restart_lsn.v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication slot restart_lsn initialization
On Wed, Jun 10, 2015 at 8:07 AM, Andres Freund wrote: > On 2015-06-10 08:00:28 -0700, Gurjeet Singh wrote: > > > pg_create_logical_replication_slot() prevents LSN from being > > recycled that by looping (worst case 2 times) until there's no > > conflict with the checkpointer recycling the segment. So I have used > > the same approach. > > There's no need to change anything for logical slots? Or do you just > mean that you moved the existing code? > Yes, I turned the code from logical replication into a function and used it from logical and physical replication. > > > > /* > > + * Grab and save an LSN value to prevent WAL recycling past that point. > > + */ > > +void > > +ReplicationSlotRegisterRestartLSN() > > +{ ... > > + /* > > + * Let's start with enough information if we can, so log a > standby > > + * snapshot and start decoding at exactly that position. > > + */ > > + if (!RecoveryInProgress()) > > + { > > + XLogRecPtr flushptr; > > + > > + /* start at current insert position */ > > + slot->data.restart_lsn = GetXLogInsertRecPtr(); > > + > > + /* make sure we have enough information to start */ > > + flushptr = LogStandbySnapshot(); > > + > > + /* and make sure it's fsynced to disk */ > > + XLogFlush(flushptr); > > + } > > + else > > + slot->data.restart_lsn = GetRedoRecPtr(); > > + > > That doesn't look right to me. Why is this code logging a standby > snapshot for physical slots? > This is the new function I referred to above. The logging of the snapshot is in 'not RecoveryInProgress()' case, meaning it's running in primary and not in a standby. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] replication slot restart_lsn initialization
On Tue, May 5, 2015 at 5:53 PM, Andres Freund wrote:
>
> > Was there any consideration for initializing restart_lsn to the latest
> > WAL write pointer when a slot is created? Or for allowing an optional
> > parameter in pg_create_(physical|logical)_replication_slot() for
> > specifying the restart_lsn at slot creation?
>
> I've been wondering about allowing for the latter alternative. I could
> have used it a couple times. The former doesn't make much sense to me,
> it could be too far *ahead* in many cases actually. A patch for this
> would be fairly trivial.

Attached is the patch that takes the former approach (initialize restart_lsn when the slot is created). I think it's better than the latter approach (depending on the user to specify an LSN), because the LSN the user specifies may have already been recycled.

pg_create_logical_replication_slot() prevents the LSN from being recycled by looping (worst case, 2 times) until there's no conflict with the checkpointer recycling the segment. So I have used the same approach.

The function pg_create_physical_replication_slot() now has an additional boolean parameter 'activate' which the user can use to allocate restart_lsn as part of the creation process.

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/

physical_repl_slot_activate_restart_lsn.patch
Description: Binary data
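To illustrate the proposed interface (a sketch only; the 'activate' parameter exists in the attached patch, not in core PostgreSQL, and 'standby_1' is a placeholder slot name):

```sql
-- With the patch applied: reserve WAL (set restart_lsn) at creation time.
SELECT * FROM pg_create_physical_replication_slot('standby_1', true);

-- Verify; without activation, restart_lsn stays NULL until a walsender
-- actually starts using the slot.
SELECT slot_name, restart_lsn FROM pg_replication_slots;
```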
[HACKERS] Interval arithmetic should emit interval in canonical format
It's hard to argue that the current behaviour is wrong, but it's worth a try.

First, I'd appreciate the "official" reasons why Postgres prefers to keep interval values in non-canonical form, like '1 day -23:37:00' instead of '00:23:00'. I understand it has something to do with a year/month/day not being exactly 365 days/30 days/24 hours, and/or operations involving interval and 'timestamp with time zone'. But since it's not explicitly spelled out in the docs or in code (at least I didn't find it in the obvious places), I'm seeking an explanation here. I understand that the answers may obviate the change in behaviour I am requesting below.

The interval arithmetic operations may also yield non-canonical values, and IMHO the 'interval op interval' or 'interval op scalar' expressions should yield an interval in canonical form. For example:

postgres=# select '6 days 00:16:00'::interval - '5 days 23:53:00'::interval as result;
     result
-----------------
 1 day -23:37:00

postgres=# select '6 days 00:16:00'::interval + '5 days 23:53:00'::interval as result;
      result
------------------
 11 days 24:09:00

I cannot think of a use case where the above results are any better than emitting '00:23:00' and '12 days 00:09:00', respectively. We may not be able to turn every interval datum into canonical form, but at least the intervals produced as a result of interval operators can be converted to canonical form to reduce surprises for users. I may even go as far as proposing rounding up 24 hours into a day, but not rounding up days into months or months into years.

I was surprised by the presence of a non-canonical interval in a sorted-by-interval result set. The intervals were computed within the query, using 'timestamp without time zone' values in a table.

# select ...
     result
-----------------
 ...
 00:23:00
 00:23:00
 1 day -23:37:00
 00:23:00
 00:22:00
 ...

The ordering above demonstrates that Postgres _does_ consider '1 day -23:37:00' == '00:23:00', so it seems pointless to confuse the user by showing two different representations of the same datum.

This also increases the code complexity required in applications/ORMs to parse interval data's text representation.

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/ EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company
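For what it's worth, the existing justify_hours()/justify_interval() functions can be used to canonicalize such results explicitly; the request above is essentially for the operators to do this by default. A sketch:

```sql
-- Explicitly canonicalize the results of interval arithmetic:
SELECT justify_interval('6 days 00:16:00'::interval - '5 days 23:53:00'::interval);
-- should yield 00:23:00 rather than '1 day -23:37:00'

SELECT justify_hours('6 days 00:16:00'::interval + '5 days 23:53:00'::interval);
-- should yield 12 days 00:09:00 rather than '11 days 24:09:00'
```

Note that justify_interval() also folds 30-day runs into months, which goes further than the days-but-not-months rounding proposed above.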
[HACKERS] Re: Patch to send transaction commit/rollback stats to the stats collector unconditionally.
On Wed, Jul 2, 2014 at 3:49 PM, Kevin Grittner wrote:
> In preparing to push the patch, I noticed I hadn't responded to this:
>
> Gurjeet Singh wrote:
>> Kevin Grittner wrote:
>>> I have reviewed this patch, and think we should do what the patch
>>> is trying to do, but I don't think the submitted patch would
>>> actually work.
>>
>> Just curious, why do you think it won't work.
>
> Because you didn't touch this part of the function:
>
>     /*
>      * Send partial messages.  If force is true, make sure that any pending
>      * xact commit/abort gets counted, even if no table stats to send.
>      */
>     if (regular_msg.m_nentries > 0 ||
>         (force && (pgStatXactCommit > 0 || pgStatXactRollback > 0)))
>         pgstat_send_tabstat(&regular_msg);
>
> The statistics would not actually be sent unless a table had been
> accessed or it was forced because the connection was closing.

I sure did! In fact, that was the one and only line of code that was changed. It effectively bypassed the 'force' check if there were any transaction stats to report.

 	/*
-	 * Send partial messages.  If force is true, make sure that any pending
-	 * xact commit/abort gets counted, even if no table stats to send.
+	 * Send partial messages.  Make sure that any pending xact commit/abort gets
+	 * counted, even if no table stats to send.
 	 */
 	if (regular_msg.m_nentries > 0 ||
-		(force && (pgStatXactCommit > 0 || pgStatXactRollback > 0)))
+		force || (pgStatXactCommit > 0 || pgStatXactRollback > 0))
 		pgstat_send_tabstat(&regular_msg);
 	if (shared_msg.m_nentries > 0)
 		pgstat_send_tabstat(&shared_msg);

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/ EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company
Re: [HACKERS] Proposing pg_hibernate
On Sat, Jun 7, 2014 at 6:48 AM, Cédric Villemain wrote: > Le lundi 3 février 2014 19:18:54 Gurjeet Singh a écrit : > >> Possible enhancements: >> - Ability to save/restore only specific databases. >> - Control how many BlockReaders are active at a time; to avoid I/O >> storms. FWIW, this has been implemented. By default, only one database is restored at a time. >>- Be smart about lowered shared_buffers across the restart. >> - Different modes of reading like pg_prewarm does. >> - Include PgFincore functionality, at least for Linux platforms. > > Please note that pgfincore is working on any system where PostgreSQL > prefetch is working, exactly like pg_prewarm. This includes linux, BSD and > many unix-like. It *is not* limited to linux. > > I never had a single request for windows, but windows does provides an API > for that too (however I have no windows offhand to test). > > Another side note is that currently BSD (at least freeBSD) have a more > advanced mincore() syscall than linux and offers a better analysis (dirty > status is known) and they implemented posix_fadvise... I have never used pgfincore, and have analyzed it solely based on the examples provided, second-hand info, and some code reading, so the following may be wrong; feel free to correct. The UI of pgfincore suggests that to save a snapshot of an object, pgfincore reads all the segments of the object and queries the OS cache. This may take a lot of time on big databases. If this is done at shutdown time, the time to finish shutdown will be proportional to the size of the database, rather than being proportional to the amount of data files in OS cache. > > There is a previous thread about that hibernation feature. Mitsuru IWASAKI > did a patch, and it triggers some interesting discussions. 
> Some notes in this thread are outdated now, but it's worth having a look at it:
>
> http://www.postgresql.org/message-id/20110504.231048.113741617.iwas...@jp.freebsd.org
>
> https://commitfest.postgresql.org/action/patch_view?id=549

Thanks for sharing these. I agree with Greg's observation there that shared buffers are becoming an increasingly smaller subset of the RAM available on modern machines. But until it can be done in a platform-independent way, I doubt it will ever be accepted in Postgres. Even if it's accepted, it would have to be off by default because of the slow shutdown mentioned above.

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/ EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company
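For contrast with pg_hibernate's automatic save/restore, the manual warm-up that core offers via the pg_prewarm extension (mentioned above) requires naming each relation to load; 'my_table' below is a placeholder name:

```sql
-- Manual warm-up with contrib's pg_prewarm (PostgreSQL 9.4+):
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('my_table', 'buffer');    -- read into shared_buffers
SELECT pg_prewarm('my_table', 'prefetch');  -- async prefetch into OS cache only
```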
Re: [HACKERS] Proposing pg_hibernate
On Sun, Jun 15, 2014 at 2:51 AM, Amit Kapila wrote: > On Thu, Jun 12, 2014 at 9:31 AM, Gurjeet Singh wrote: >> >> I don't have intimate knowledge of recovery but I think the above >> assessment of recovery's operations holds true. If you still think >> this is a concern, can you please provide a bit firm example using >> which I can visualize the problem you're talking about. > > Okay, let me try with an example: > > Assume No. of shared buffers = 5 > Before Crash: > > 1. Pages in shared buffers numbered 3, 4, 5 have write operations > performed on them, followed by lot of reads which makes their > usage_count as 5. > 2. Write operation happened on pages in shared buffers numbered > 1, 2. Usage_count of these buffers is 1. > 3. Now one read operation needs some different pages and evict > pages in shared buffers numbered 1 and 2 and read the required > pages, so buffers 1 and 2 will have usage count as 1. > 4. At this moment shutdown initiated. > 5. Bgwriter saved just buffers 1 and 2 and crashed. > > After Crash: > > 1. Recovery will read in pages on which operations happened in > step-1 and 2 above. > 2. Buffer loader (pg_hibernator) will load buffers on which operations > happened in step-3, so here it might needs to evict buffers which are > corresponding to buffers of step-1 before crash. So what this > essentially means is that pg_hibernator can lead to eviction of more > useful pages. Granted, you have demonstrated that the blocks restored by pg_hibernator can cause eviction of loaded-by-recovery blocks. But, one can argue that pg_hibernator brought the shared-buffer contents to to a state that is much closer to the pre-shutdown state than the recovery would have restored them to. I think this supports the case for pg_hibernator, that is, it is doing what it is supposed to do: restore shared-buffers to pre-shutdown state. I agree that there's uncertainty as to which buffers will be cleared, and hence which blocks will be evicted. 
So pg_hibernator may cause eviction of blocks that had higher usage count before the shutdown, because they may have a lower/same usage count as other blocks' buffers after recovery. There's not much that can be done for that, because usage count information is not saved anywhere on disk, and I don't think it's worth saving just for pg_hibernator's sake. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
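Although usage counts are not persisted anywhere, they can at least be inspected on a running server with the contrib pg_buffercache extension, for example:

```sql
-- Per-relation buffer counts and average usage_count (current database only);
-- requires the contrib pg_buffercache extension.
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT c.relname,
       count(*) AS buffers,
       round(avg(b.usagecount), 2) AS avg_usage_count
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```

Comparing this output before shutdown and after a restore would show how closely the restored cache state matches the saved one.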
[HACKERS] Re: Patch to send transaction commit/rollback stats to the stats collector unconditionally.
On Tue, Jul 1, 2014 at 10:05 AM, Kevin Grittner wrote:
> Tom Lane wrote:
>
>> If we're going to do it like this, then I think the force flag
>> should be considered to do nothing except override the clock
>> check, which probably means it shouldn't be tested in the initial
>> if() at all.
>
> That makes sense, and is easily done.

Attached is the patch, to save you a few keystrokes :)

> The only question left is
> how far back to take the patch.  I'm inclined to only apply it to
> master and 9.4.  Does anyone think otherwise?

Considering this a bug fix, I'd vote for it to be applied to all supported releases. But since it may cause an unforeseen performance penalty, I think it should be applied only as far back as the introduction of the PGSTAT_STAT_INTERVAL throttle. The throttle was implemented in 641912b, which AFAICT was part of 8.3. So I guess all the supported releases it is.

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/ EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3ab1428..c7f41a5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -753,7 +753,8 @@ pgstat_report_stat(bool force)
 	/* Don't expend a clock check if nothing to do */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-		!have_function_stats && !force)
+		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		!have_function_stats)
 		return;
 
 	/*
@@ -817,11 +818,11 @@
 	}
 
 	/*
-	 * Send partial messages.  If force is true, make sure that any pending
-	 * xact commit/abort gets counted, even if no table stats to send.
+	 * Send partial messages.  Make sure that any pending xact commit/abort
+	 * gets counted, even if there are no table stats to send.
 	 */
 	if (regular_msg.m_nentries > 0 ||
-		(force && (pgStatXactCommit > 0 || pgStatXactRollback > 0)))
+		pgStatXactCommit > 0 || pgStatXactRollback > 0)
 		pgstat_send_tabstat(&regular_msg);
 	if (shared_msg.m_nentries > 0)
 		pgstat_send_tabstat(&shared_msg);
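The counters at issue here feed the xact_commit/xact_rollback columns of the pg_stat_database view, which is where otherwise-unreported commits and aborts would go missing:

```sql
-- Observe the transaction counters reported via the stats collector:
SELECT datname, xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = current_database();
```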
[HACKERS] Re: Patch to send transaction commit/rollback stats to the stats collector unconditionally.
On Sun, Jun 29, 2014 at 11:58 AM, Kevin Grittner wrote:
> I have reviewed this patch, and think we should do what the patch
> is trying to do, but I don't think the submitted patch would
> actually work.

Just curious: why do you think it won't work? The discussion is a bit old, but I'm sure I would have tested the patch before submitting it.

> I have attached a suggested patch which I think
> would work.  Gurjeet, could you take a look at it?

The patch, when considered together with Tom's suggestion upthread, looks good to me.

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/ EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Mon, Jun 23, 2014 at 12:49 PM, Tom Lane wrote: > Gurjeet Singh writes: >> While I'd love to reduce the number of future installations without >> this fix in place, I respect the decision to honor project policy. At >> the same time, this change does not break anything. It introduces new >> environment variables which change the behaviour, but behaves the old >> way in the absence of those variables. > > Uh, no, it doesn't. We removed the dependence on -DLINUX_OOM_SCORE_ADJ. > If a packager is expecting that to still work in 9.4, he's going to be > unpleasantly surprised, because the system will silently fail to do what > he's expecting: it will run all the backend processes at no-OOM-kill > priority, which is likely to be bad. True. I didn't think from a packager's perspective. > It is possible to make packages that will work either way, along the lines > of the advice I sent to the Red Hat guys: > https://bugzilla.redhat.com/show_bug.cgi?id=1110969 > > but I think it's a bit late in the cycle to be telling packagers to do > that for 9.4. Understandable. >> BTW, does the project publish the feature-freeze deadlines and other >> dates somewhere (apart from on this list). > > It's usually in the dev meeting minutes > https://wiki.postgresql.org/wiki/PgCon_2014_Developer_Meeting#9.5_Schedule Thanks. But it doesn't spell out the specific dates, or even the month of feature-freeze. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Mon, Jun 23, 2014 at 11:52 AM, Tom Lane wrote:
> Gurjeet Singh writes:
>> would it be possible to include this in 9.4 as well?
>
> While this is clearly an improvement over what we had before, it's
> impossible to argue that it's a bug fix, and we are way past the 9.4
> feature freeze deadline. In particular, packagers who've already done
> their 9.4 development work might be blindsided by us slipping this into
> 9.4 release. So while I wouldn't have a problem with putting this change
> into 9.4 from a technical standpoint, it's hard to argue that it'd meet
> project norms from a development-process standpoint.

While I'd love to reduce the number of future installations without this fix in place, I respect the decision to honor project policy.

At the same time, this change does not break anything. It introduces new environment variables which change the behaviour, but behaves the old way in the absence of those variables. So I guess it's not changing the default behaviour in an incompatible way.

BTW, does the project publish the feature-freeze deadlines and other dates somewhere (apart from on this list)? I was looking the other day and didn't find any reference. Either the commitfest app or the 'Developer' section of the wiki [1] seems a good candidate for this kind of information.

[1]: https://wiki.postgresql.org/wiki/Development_information

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Wed, Jun 18, 2014 at 8:15 PM, Tom Lane wrote:
> Gurjeet Singh writes:
>
>> Please find attached the patch. It includes the doc changes as well.
>
> Applied with some editorialization.

Thanks! Would it be possible to include this in 9.4 as well?

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
Thanks! On Mon, Jun 16, 2014 at 3:58 PM, Tom Lane wrote: > I wrote: >> Gurjeet Singh writes: >>> I tried to eliminate the 'pending' list, but I don't see a way around it. >>> We need temporary storage somewhere to store the branches encountered on >>> the right; in recursion case the call stack was serving that purpose. > >> I still think we should fix this in the grammar, rather than introducing >> complicated logic to try to get rid of the recursion later. For example, >> as attached. > > I went looking for (and found) some additional obsoleted comments, and > convinced myself that ruleutils.c is okay as-is, and pushed this. > > regards, tom lane -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Tue, Jun 10, 2014 at 12:05 PM, Tom Lane wrote: > If we're going to do this, I would say that we should also take the > opportunity to get out from under the question of which kernel API > we're dealing with. So perhaps a design like this: > > 1. If the environment variable PG_OOM_ADJUST_FILE is defined, it's > the name of a file to write something into after forking. The typical > value would be "/proc/self/oom_score_adj" or "/proc/self/oom_adj". > If not set, nothing happens. > > 2. If the environment variable PG_OOM_ADJUST_VALUE is defined, that's > the string we write, otherwise we write "0". Please find attached the patch. It includes the doc changes as well. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 9fadef5..4492a1d 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -1284,8 +1284,15 @@ echo -1000 > /proc/self/oom_score_adj in the postmaster's startup script just before invoking the postmaster. Note that this action must be done as root, or it will have no effect; so a root-owned startup script is the easiest place to do it. If you -do this, you may also wish to build PostgreSQL -with -DLINUX_OOM_SCORE_ADJ=0 added to CPPFLAGS. +do this, you may also wish to export these environment variables + +PG_OOM_ADJUST_FILE=oom_score_adj +PG_OOM_ADJUST_VALUE=0 + +export PG_OOM_ADJUST_FILE +export PG_OOM_ADJUST_VALUE + +in the startup script, before invoking the postmaster. That will cause postmaster child processes to run with the normal oom_score_adj value of zero, so that the OOM killer can still target them at need. @@ -1296,8 +1303,7 @@ echo -1000 > /proc/self/oom_score_adj but may have a previous version of the same functionality called /proc/self/oom_adj. This works the same except the disable value is -17 not -1000. The corresponding -build flag for PostgreSQL is --DLINUX_OOM_ADJ=0. 
+value for PG_OOM_ADJUST_FILE should be oom_adj. diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c index f6df2de..21559af 100644 --- a/src/backend/postmaster/fork_process.c +++ b/src/backend/postmaster/fork_process.c @@ -22,6 +22,13 @@ #endif #ifndef WIN32 + +#ifdef __linux__ +static bool oom_env_checked = false; +static char oom_adj_file[MAXPGPATH]; +static int oom_adj_value = 0; +#endif /* __linux__ */ + /* * Wrapper for fork(). Return values are the same as those for fork(): * -1 if the fork failed, 0 in the child process, and the PID of the @@ -78,13 +85,38 @@ fork_process(void) * LINUX_OOM_SCORE_ADJ #defined to 0, or to some other value that you * want child processes to adopt here. */ -#ifdef LINUX_OOM_SCORE_ADJ +#ifdef __linux__ + if (!oom_env_checked) + { + char *env; + + oom_env_checked = true; + + env = getenv("PG_OOM_ADJUST_FILE"); + + /* Don't allow a path separator in file name */ + if (env && !strchr(env, '/') && !strchr(env, '\\')) + { +snprintf(oom_adj_file, MAXPGPATH, "/proc/self/%s", env); + +env = getenv("PG_OOM_ADJUST_VALUE"); +if (env) +{ + oom_adj_value = atoi(env); +} +else + oom_adj_value = 0; + } + else +oom_adj_file[0] = '\0'; + } + { /* * Use open() not stdio, to ensure we control the open flags. Some * Linux security environments reject anything but O_WRONLY. */ - int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0); + int fd = open(oom_adj_file, O_WRONLY, 0); /* We ignore all errors */ if (fd >= 0) @@ -92,41 +124,14 @@ fork_process(void) char buf[16]; int rc; -snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_SCORE_ADJ); +snprintf(buf, sizeof(buf), "%d\n", oom_adj_value); rc = write(fd, buf, strlen(buf)); (void) rc; close(fd); } } -#endif /* LINUX_OOM_SCORE_ADJ */ - /* - * Older Linux kernels have oom_adj not oom_score_adj. This works - * similarly except with a different scale of adjustment values. 
If - * it's necessary to build Postgres to work with either API, you can - * define both LINUX_OOM_SCORE_ADJ and LINUX_OOM_ADJ. - */ -#ifdef LINUX_OOM_ADJ - { - /* - * Use open() not stdio, to ensure we control the open flags. Some - * Linux security environments reject anything but O_WRONLY. - */ - int fd = open("/proc/self/oom_adj", O_WRONLY, 0); - - /* We ignore all errors */ - if (fd >= 0) - { -char buf[16]; -int rc; - -snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ); -rc = write(fd, buf, strlen(buf)); -(void) rc; -close(fd); - } - } -#endif /* LINUX_OOM_ADJ */ +#endif
Re: [HACKERS] Proposing pg_hibernate
On Wed, Jun 11, 2014 at 10:56 AM, Robert Haas wrote:
> On Tue, Jun 10, 2014 at 10:03 PM, Gurjeet Singh wrote:
>> And it's probably accepted by now that such a behaviour is not
>> catastrophic, merely inconvenient.
>
> I think the whole argument for having pg_hibernator is that getting
> the block cache properly initialized is important. If it's not
> important, then we don't need pg_hibernator in the first place. But
> if it is important, then I think not loading unrelated blocks into
> shared_buffers is also important.

I was constructing a contrived scenario, something that would rarely happen in reality. I feel that the benefits of this feature greatly outweigh the minor performance loss caused in such an unlikely scenario.

Best regards,
-- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Re: [HACKERS] Proposing pg_hibernate
On Wed, Jun 11, 2014 at 12:25 AM, Amit Kapila wrote: > On Wed, Jun 11, 2014 at 7:59 AM, Gurjeet Singh wrote: >> On Sun, Jun 8, 2014 at 3:24 AM, Amit Kapila >> wrote: >> > Yeap, but if it crashes before writing checkpoint record, it will lead >> > to >> > recovery which is what we are considering. >> >> Good point. >> >> In case of such recovery, the recovery process will read in the blocks >> that were recently modified, and were possibly still in shared-buffers >> when Checkpointer crashed. So after recovery finishes, the >> BlockReaders will be invoked (because save-files were successfully >> written before the crash), and they would request the same blocks to >> be restored. Most likely, those blocks would already be in >> shared-buffers, hence no cause of concern regarding BlockReaders >> evicting buffers populated by recovery. > > Not necessarily because after crash, recovery has to start from > previous checkpoint, so it might not perform operations on same > pages as are saved by buffer saver. Granted, the recovery may not start that way (that is, reading in blocks that were in shared-buffers when shutdown was initiated), but it sure would end that way. Towards the end of recovery, the blocks it'd read back in are highly likely to be the ones that were present in shared-buffers at the time of shutdown. By the end of recovery, either (a) blocks read in at the beginning of recovery are evicted by later operations of recovery, or (b) they are still present in shared-buffers. So the blocks requested by the BlockReaders are highly likely to be already in shared-buffers at the end of recovery, because these are the same blocks that were dirty (and hence recorded in WAL) just before shutdown time. I guess what I am trying to say is that the blocks read in by the BlockReaders will be a superset of those read in by the reocvery process. At the time of shutdown/saving-buffers, the shared-buffers may have contained dirty and clean buffers. 
WAL contains the info of which blocks were dirtied. Recovery will read back the blocks that were dirty, to replay the WAL, and since the BlockReaders are started _after_ recovery finishes, the BlockReaders will effectively read in only those blocks that are not already read in by the recovery.

I am not yet convinced, at least in this case, that Postgres Hibernator would restore blocks that can cause eviction of buffers restored by recovery. I don't have intimate knowledge of recovery, but I think the above assessment of recovery's operations holds true. If you still think this is a concern, can you please provide a firmer example so that I can visualize the problem you're talking about?

> Also as the file saved by
> buffer saver can be a file which contains only partial list of
> buffers which were in shared buffer's, it becomes more likely that
> in such cases it can override the buffers populated by recovery.

I beg to differ. As described above, the blocks read in by the BlockReader will not evict the recovery-restored blocks. The save-files being written partially does not change that.

> Now as pg_hibernator doesn't give any preference to usage_count while
> saving buffer's, it can also evict the buffers populated by recovery
> with some lower used pages of previous run.

The case we're discussing (a checkpointer/BufferSaver/some-other-process crash during a smart/fast shutdown) should occur rarely in practice. Although Postgres Hibernator is not yet proven to do the wrong thing in this case, I hope you'd agree that BlockReaders evicting buffers populated by the recovery process is not catastrophic at all, merely inconvenient from a performance perspective. Also, the impact is only on the initial performance immediately after startup, since application queries will re-prime the shared-buffers with whatever buffers they need.

> I think Block Readers will remove file only after reading and populating
> buffers from it

Correct.
> and that's the reason I mentioned that it can lead to doing
> both recovery as well as load buffers based on file saved by buffer
> saver.

I am not sure I completely understand the implication here, but I think
the above description of why recovery followed by BlockReaders does not
cause a concern may cover it.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
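The superset argument made in this thread can be sketched with a toy model: the blocks dirtied before shutdown (hence present in WAL) are a subset of all buffers recorded in the save-file, so the BlockReaders only fault in what recovery did not already restore. This is illustrative Python, not Postgres code; the block names and the simplified buffer pool are assumptions for the sketch.

```python
# Toy model of the argument: at shutdown, the save-file records every
# buffer then resident in shared_buffers, while WAL replay touches only
# the blocks that were dirty at that time.

saved_blocks = {"A", "B", "C", "D"}   # all buffers recorded at shutdown
wal_dirty_blocks = {"B", "D"}         # dirty buffers, hence present in WAL

# Dirty buffers are a subset of all resident buffers, so recovery's reads
# are a subset of what the BlockReaders will request afterwards.
assert wal_dirty_blocks <= saved_blocks

# After crash recovery, the replayed blocks already sit in shared_buffers...
shared_buffers = set(wal_dirty_blocks)

# ...so the BlockReaders only read in the blocks recovery did not touch,
# never evicting a recovery-restored block in this model.
newly_read = saved_blocks - shared_buffers
shared_buffers |= newly_read
assert wal_dirty_blocks <= shared_buffers
```

The model obviously ignores eviction under memory pressure, which is exactly the point being argued: with the save-file a superset of WAL-replayed blocks, restoring it adds blocks rather than displacing recovery's work.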
Re: [HACKERS] Proposing pg_hibernate
On Sun, Jun 8, 2014 at 3:24 AM, Amit Kapila wrote: > On Fri, Jun 6, 2014 at 5:31 PM, Gurjeet Singh wrote: >> On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila >> wrote: > >> > Buffer saver process itself can crash while saving or restoring >> > buffers. >> >> True. That may lead to partial list of buffers being saved. And the >> code in Reader process tries hard to read only valid data, and punts >> at the first sight of data that doesn't make sense or on ERROR raised >> from Postgres API call. > > Inspite of Reader Process trying hard, I think we should ensure by > some other means that file saved by buffer saver is valid (may be > first write in tmp file and then rename it or something else). I see no harm in current approach, since even if the file is partially written on shutdown, or if it is corrupted due to hardware corruption, the worst that can happen is the BlockReaders will try to restore, and possibly succeed, a wrong block to shared-buffers. I am okay with your approach of first writing to a temp file, if others see an advantage of doing this and insist on it. >> > IIUC on shutdown request, postmaster will send signal to BG Saver >> > and BG Saver will save the buffers and then postmaster will send >> > signal to checkpointer to shutdown. So before writing Checkpoint >> > record, BG Saver can crash (it might have saved half the buffers) >> >> Case handled as described above. >> >> > or may BG saver saves buffers, but checkpointer crashes (due to >> > power outage or any such thing). >> >> Checkpointer process' crash seems to be irrelevant to Postgres >> Hibernator's workings. > > Yeap, but if it crashes before writing checkpoint record, it will lead to > recovery which is what we are considering. Good point. In case of such recovery, the recovery process will read in the blocks that were recently modified, and were possibly still in shared-buffers when Checkpointer crashed. 
So after recovery finishes, the BlockReaders will be invoked (because save-files were successfully written before the crash), and they would request the same blocks to be restored. Most likely, those blocks would already be in shared-buffers, hence no cause of concern regarding BlockReaders evicting buffers populated by recovery. >> I think you are trying to argue the wording in my claim "save-files >> are created only on clean shutdowons; not on a crash or immediate >> shutdown", by implying that a crash may occur at any time during and >> after the BufferSaver processing. I agree the wording can be improved. > > Not only wording, but in your above mail Case 2 and 1b would need to > load buffer's and perform recovery as well, so we need to decide which > one to give preference. In the cases you mention, 1b and 2, ideally there will be no save-files because the server either (1b) was still running, or (2) crashed. If there were any save-files present during the previous startup (the one that happened before (1b) hot-backup or (2) crash) of the server, they would have been removed by the BlockReaders soon after the startup. > So If you agree that we should have consideration for recovery data > along with saved files data, then I think we have below options to > consider: I don't think any of the options you mention need any consideration because recovery and buffer-restore process don't seem to be at conflict with each other; not enough to be a concern, IMHO. Thanks and best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposing pg_hibernate
On Tue, Jun 10, 2014 at 12:02 PM, Robert Haas wrote:
> If recovery has been running for a long time, then restoring
> buffers from some save file created before that is probably a bad
> idea, regardless of whether the buffers already loaded were read in by
> recovery itself or by queries running on the system. But if you're
> saying that doesn't happen, then there's no problem there.

Normally, it won't happen. There's one case I can think of, which has to
coincide with a small window of time for such a thing to happen.
Consider this:

.) A database is shut down, which creates the save-files in
$PGDATA/pg_hibernator/.
.) The database is restarted.
.) BlockReaders begin to read and restore the disk blocks into buffers.
.) Before the BlockReaders could finish*, a copy of the database is
taken (rsync/cp/FS-snapshot/etc.) This causes the save-files to be
present in the copy, because the BlockReaders haven't deleted them, yet.

* (The BlockReaders ideally finish their task in the first few minutes
after the first of them is started.)

.) The copy of the database is used to restore and erect a warm-standby.
.) The warm-standby starts replaying logs from WAL archive/stream.
.) Some time (hours/weeks/months) later, the warm-standby is promoted to
be a master.
.) It starts the Postgres Hibernator, which sees save-files in
$PGDATA/pg_hibernator/ and launches BlockReaders.

At this point, the BlockReaders will restore the blocks that were
present in the original DB's shared-buffers at the time of shutdown. So,
this would fetch blocks into shared-buffers that may be completely
unrelated to the blocks recently operated on by the recovery process.
And it's probably accepted by now that such a behaviour is not
catastrophic, merely inconvenient.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Tue, Jun 10, 2014 at 1:13 PM, David G Johnston wrote:
> Gurjeet Singh-4 wrote
>> So the argument that this GUC is a security concern, can be ignored.
>> Root user (one with control of start script) still controls the lowest
>> badness setting of all Postgres processes. If done at fork_process
>> time, the child process simply inherits parent's badness setting.
>
> The counter point here is that the postmaster can be set to "no kill" and

Only the root user can do that, since he/she has control over the start
script. All that the DBA could do with a GUC is to set backends' badness
worse than postmaster's, but not better.

> the >= condition allows for children to achieve the same while it is our
> explicit intent that the children be strictly > parent.

I don't think anyone argued for that behaviour.

> To that end, should the adjustment value be provided as an offset to the
> postmasters instead of an absolute value - and disallow <= zero offset
> values in the process?

Seems unnecessary, given current knowledge.

> I get and generally agree with the environment variable proposal and it's
> stated goal to restrict whom can makes changes. But how much less cost does
> an environment variable have than a GUC if one GUC argument is still its
> maintenance overhead?

Having it as a GUC would have meant that two entities are required to
get the configuration right: one who controls start scripts, and the
other who controls GUC settings. With the environment variable approach,
a root user alone can control the behaviour, like so in the start
script:

echo -200 > /proc/self/oom_score_adj
export PG_OOM_ADJUST_FILE=oom_score_adj
export PG_OOM_ADJUST_VALUE=-100

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Tue, Jun 10, 2014 at 11:52 AM, Tom Lane wrote: > and...@anarazel.de (Andres Freund) writes: >> On 2014-06-10 11:20:28 -0400, Tom Lane wrote: >>> Maybe I'm mistaken, but I thought once the fork_process code has reset our >>> process's setting to zero it's not possible to lower it again (without >>> privileges we'd not have). > >> No, doesn't look that way. It's possible to reset it to the value set at >> process start. So unless we introduce double forks for every backend >> start it can be reset by ordinary processes. > > That's kind of annoying --- I wonder why they went to the trouble of doing > that? But anyway, it's probably not worth the cost of double-forking to > prevent it. I still say though that this is not a reason to make it as > easy as change-a-GUC to break the intended behavior. > > Robert's idea of having the start script set an environment variable to > control the OOM adjustment reset seems like it would satisfy my concern. I'm fine with this solution. Should this be a constant 0, or be configurable based on env. variable's value? -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Tue, Jun 10, 2014 at 11:24 AM, Andres Freund wrote: > On 2014-06-10 11:14:43 -0400, Tom Lane wrote: >> Sure, but what's that have to do with this? Any Red Hat or PGDG RPM will >> come with this code already enabled in the build, so there is no need for >> anyone to have a GUC to play around with the behavior. > > That's imo a fair point. Unless I misunderstand things Gurjeet picked > the topic up again because he wants to increase the priority of the > children. Is that correct Gurjeet? Yes. A DBA would like to prevent the postmaster from being killed by OOM killer, but let the child processes be still subject to OOM killer's whim. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
's badness, but keep it same as postmaster's
$ echo -100 > /proc/1837/oom_score_adj
$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1835 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres -D /home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1837 -100 postgres: checkpointer process
...

# Lower checkpointer's badness, and try to lower it below postmaster's
$ echo -101 > /proc/1837/oom_score_adj
bash: echo: write error: Permission denied
$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1835 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres -D /home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1837 -100 postgres: checkpointer process
...

# As root, force child process' badness to be lower than postmaster's
$ sudo echo -101 > /proc/1837/oom_score_adj
[sudo] password for gurjeet:
$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1835 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres -D /home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1837 -101 postgres: checkpointer process
...

-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Tue, Jun 10, 2014 at 10:32 AM, Andres Freund wrote:
> On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote:
>> Providing it as GUC would have given end users both the pieces, but
>> with a compile-time option they have only one half of the solution;
>> except if they go compile their own binaries, which forces them into
>> being packagers.
>>
>> I am not alone in feeling that if Postgres wishes to provide a control
>> over child backend's oom_score_adj, it should be a GUC parameter
>> rather than a compile-time option. Yesterday a customer wanted to
>> leverage this and couldn't because they refuse to maintain their own
>> fork of Postgres code.
>
> Independent of the rest of the discussion, I think there's one more
> point: Trying to keep your system stable by *increasing* the priority of
> normal backends is a bad idea. If you system gets into OOM land you need
> to fix that, not whack who gets killed around.
> The reason it makes sense to increase the priority of the postmaster is
> that that *does* increase the stability by cleaning up resources and
> restarting everything.

As it stands right now, a user can decrease the likelihood of the
postmaster being killed by adjusting the start script, but that
decreases the likelihood for all the child processes too, making the
postmaster just as likely a kill-candidate as any backend, maybe even
more likely because it's the parent.

This patch gives the user a control to let the backends' likelihood of
being killed be different/higher than that of the postmaster. The users
were already capable of doing that, but were required to custom-compile
Postgres to get the benefits. This patch just removes that requirement.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane wrote:
> Gurjeet Singh writes:
>> Startup scripts are not solely in the domain of packagers. End users
>> can also be expected to develop/edit their own startup scripts.
>
>> Providing it as GUC would have given end users both the pieces, but
>> with a compile-time option they have only one half of the solution;
>> except if they go compile their own binaries, which forces them into
>> being packagers.
>
> I don't find that this argument holds any water at all. Anyone who's
> developing their own start script can be expected to manage recompiling
> Postgres.

I respectfully disagree. Writing and managing init/start scripts
requires a much different set of expertise than compiling and managing
builds of a critical piece of software like a database product. I would
consider adding `echo -1000 > /proc/self/oom_score_adj` to a start
script a deployment-specific tweak, and not 'developing own start
script'.

> Extra GUCs do not have zero cost, especially not ones that are as
> complicated-to-explain as this would be.
>
> I would also argue that there's a security issue with making it a GUC.
> A non-root DBA should not have the authority to decide whether or not
> postmaster child processes run with nonzero OOM adjustment; that decision
> properly belongs to whoever has authorized the root-owned startup script
> to change the adjustment in the first place. So seeing this as two
> independent pieces is not only wrong but dangerous.

From experiments last night, I see that a child process' oom_score_adj
is capped by the setting of the parent process that forks it. So I don't
think it's a security risk from that perspective. I'll retest and
provide the exact findings.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
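The capping behaviour described in the experiments in this thread (an unprivileged process may raise its badness, or lower it only back down to the value it had at process start, while root may go below that floor) can be modelled with a small sketch. This is a toy model of the rule as reported here, not kernel code; the function name and the `floor` parameter are invented for illustration.

```python
# Toy model of a write to /proc/<pid>/oom_score_adj, per the behaviour
# observed in the thread. oom_score_adj ranges over -1000..1000.

def write_oom_score_adj(floor, requested, is_root=False):
    """Return the new oom_score_adj, or raise PermissionError (like EACCES)."""
    if not -1000 <= requested <= 1000:
        raise ValueError("value out of range")
    if not is_root and requested < floor:
        raise PermissionError("write error: Permission denied")
    return requested

# The postmaster was started at -100 by a root-owned start script; a
# forked child inherits that value as its floor.
assert write_oom_score_adj(floor=-100, requested=0) == 0        # raising: ok
assert write_oom_score_adj(floor=-100, requested=-100) == -100  # back to floor: ok

denied = False
try:
    write_oom_score_adj(floor=-100, requested=-101)             # below floor
except PermissionError:
    denied = True
assert denied

# Only root may push a process below its inherited floor.
assert write_oom_score_adj(floor=-100, requested=-101, is_root=True) == -101
```

This mirrors the shell transcript elsewhere in the thread: the unprivileged `echo -101` fails with "Permission denied", while the root-driven write succeeds.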
Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels
On Mon, Sep 19, 2011 at 10:18 AM, Tom Lane wrote:
> Peter Eisentraut writes:
>> On sön, 2011-09-18 at 12:21 -0400, Tom Lane wrote:
>>> But having said that, it wouldn't be very hard to arrange things so that
>>> if you did have both symbols defined, the code would only attempt to
>>> write oom_adj if it had failed to write oom_score_adj; which is about as
>>> close as you're likely to get to a kernel version test for this.
>
>> Why is this feature not a run-time configuration variable or at least a
>> configure option? It's awfully well hidden now. I doubt a lot of
>> people are using this even though they might wish to.
>
> See the thread in which the feature was designed originally:
> http://archives.postgresql.org/pgsql-hackers/2010-01/msg00170.php
>
> The key point is that to get useful behavior, you need cooperation
> between both a root-privileged startup script and the PG executable.
> That tends to throw the problem into the domain of packagers, more
> than end users, and definitely puts a big crimp in the idea that
> run-time configuration of just half of the behavior could be helpful.
> So far, no Linux packagers have complained that the design is inadequate
> (a position that I also hold when wearing my red fedora) so I do not
> feel a need to complicate it further.

Startup scripts are not solely in the domain of packagers. End users can
also be expected to develop/edit their own startup scripts.

Providing it as a GUC would have given end users both the pieces, but
with a compile-time option they have only one half of the solution;
except if they go compile their own binaries, which forces them into
being packagers.

I am not alone in feeling that if Postgres wishes to provide a control
over child backends' oom_score_adj, it should be a GUC parameter rather
than a compile-time option. Yesterday a customer wanted to leverage this
and couldn't because they refuse to maintain their own fork of the
Postgres code.
Please find attached a patch to turn it into a GUC, #ifdef'd by
__linux__ macro.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com

diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c
index f6df2de..7b9bc38 100644
--- a/src/backend/postmaster/fork_process.c
+++ b/src/backend/postmaster/fork_process.c
@@ -22,6 +22,11 @@
 #endif
 
 #ifndef WIN32
+
+#ifdef __linux__
+int linux_oom_score_adj = 0;
+#endif
+
 /*
  * Wrapper for fork(). Return values are the same as those for fork():
  * -1 if the fork failed, 0 in the child process, and the PID of the
@@ -78,7 +83,7 @@ fork_process(void)
 	 * LINUX_OOM_SCORE_ADJ #defined to 0, or to some other value that you
 	 * want child processes to adopt here.
 	 */
-#ifdef LINUX_OOM_SCORE_ADJ
+#ifdef __linux__
 	{
 		/*
 		 * Use open() not stdio, to ensure we control the open flags. Some
@@ -92,13 +97,12 @@ fork_process(void)
 			char		buf[16];
 			int			rc;
 
-			snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_SCORE_ADJ);
+			snprintf(buf, sizeof(buf), "%d\n", linux_oom_score_adj);
 			rc = write(fd, buf, strlen(buf));
 			(void) rc;
 			close(fd);
 		}
 	}
-#endif /* LINUX_OOM_SCORE_ADJ */
 
 	/*
 	 * Older Linux kernels have oom_adj not oom_score_adj. This works
@@ -106,7 +110,6 @@
 	 * it's necessary to build Postgres to work with either API, you can
 	 * define both LINUX_OOM_SCORE_ADJ and LINUX_OOM_ADJ.
 	 */
-#ifdef LINUX_OOM_ADJ
 	{
 		/*
 		 * Use open() not stdio, to ensure we control the open flags. Some
@@ -120,13 +123,13 @@
 			char		buf[16];
 			int			rc;
 
-			snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
+			snprintf(buf, sizeof(buf), "%d\n", linux_oom_score_adj);
 			rc = write(fd, buf, strlen(buf));
 			(void) rc;
 			close(fd);
 		}
 	}
-#endif /* LINUX_OOM_ADJ */
+#endif /* __linux__ */
 
 	/*
 	 * Make sure processes do not share OpenSSL randomness state.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1d094f0..8713abb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -128,6 +128,7 @@ extern bool synchronize_seqscans;
 extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
 extern bool SSLPreferServerCiphers;
+extern int linux_oom_score_adj;
 
 #ifdef TRACE_SORT
 extern bool trace_sort;
@@ -2554,6 +2555,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+#ifdef __linux__
+	{
+		{"linux_oom_score_adj", PGC_POSTMASTER, RESOURCES_KERNEL,
+			gettext_noop("Sets the oom_score_adj parameter of postmaster's child processes."),
+			NULL,
+		},
+		&linux_oom_score_adj,
+		0, -1000, 1000,
+		NULL, NULL, NULL
+	},
+#endif /* __linux__ */
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
Re: [HACKERS] Using Index-only scans to speed up count(*)
On Sat, Jun 7, 2014 at 8:56 AM, Cédric Villemain wrote:
> On Saturday, 7 June 2014 08:35:27, Gurjeet Singh wrote:
>
>> PS: Please note that I am not proposing to add support for the
>> optimizer hint embedded in Mitsuru's query.
>
> :-)

Even though I (sometimes) favor hints, and developed the optimizer hints
feature in EDB (PPAS), I know how much Postgres **hates** [1] optimizer
hints :) So I'm just trying to ward off potential flamewar-ish comments.

[1]: http://wiki.postgresql.org/wiki/Todo#Features_We_Do_Not_Want

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
[HACKERS] Using Index-only scans to speed up count(*)
While reading [1] in the context of Postgres Hibernator, I see that
Mitsuru mentioned one of the ways another RDBMS allows count(*) to be
driven by an index.

> 'select /*+ INDEX(emp emp_pk) */ count(*) from emp;' to load index blocks

I am not sure if the Postgres planner already allows this, but it would
be great if the planner considered driving a count(*) query using a
non-partial index, in the hopes that it turns into an index-only scan,
and hence returns the count(*) result faster. The non-partial index may
not necessarily be the primary-key index; it can be chosen purely based
on size, favouring smaller indexes.

This may alleviate some of the concerns of people migrating applications
from other DBMSes that perform count(*) in the blink of an eye.

[1]: http://www.postgresql.org/message-id/20110507.08.83883502.iwas...@jp.freebsd.org

Best regards,

PS: Please note that I am not proposing to add support for the optimizer
hint embedded in Mitsuru's query.

-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
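The index-selection heuristic proposed above (any non-partial index can answer an unqualified count(*), so prefer the smallest one) can be sketched in a few lines. This is an illustrative model, not planner code; the `Index` record, the field names, and the example sizes are all assumptions for the sketch.

```python
# Sketch of the proposed heuristic: to drive count(*), pick the smallest
# non-partial index, hoping the scan turns into an index-only scan.

from dataclasses import dataclass

@dataclass
class Index:
    name: str
    pages: int          # relation size in blocks; smaller => fewer reads
    is_partial: bool    # a partial index cannot answer unqualified count(*)

def pick_count_star_index(indexes):
    """Return the smallest non-partial index, or None (fall back to seqscan)."""
    candidates = [ix for ix in indexes if not ix.is_partial]
    if not candidates:
        return None
    return min(candidates, key=lambda ix: ix.pages)

emp_indexes = [
    Index("emp_pk", pages=1200, is_partial=False),
    Index("emp_active_idx", pages=90, is_partial=True),   # partial: skipped
    Index("emp_dept_idx", pages=300, is_partial=False),
]
assert pick_count_star_index(emp_indexes).name == "emp_dept_idx"
```

Note how the heuristic passes over the smaller partial index and over the primary key, landing on the smallest index that still covers every row, exactly the preference described above.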
Re: [HACKERS] Proposing pg_hibernate
On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila wrote:
> On Thu, Jun 5, 2014 at 5:39 PM, Gurjeet Singh wrote:
>>
>> > On Tue, Jun 3, 2014 at 5:43 PM, Gurjeet Singh wrote:
>> Case 2 also won't cause any buffer restores because the save-files are
>> created only on clean shutdowns; not on a crash or immediate
>> shutdown.
>
> How do you ensure that buffers are saved only on clean shutdown?

Postmaster sends SIGTERM only in "smart" or "fast" shutdown requests.

> Buffer saver process itself can crash while saving or restoring
> buffers.

True. That may lead to a partial list of buffers being saved. And the
code in the Reader process tries hard to read only valid data, and punts
at the first sight of data that doesn't make sense, or on an ERROR
raised from a Postgres API call.

> IIUC on shutdown request, postmaster will send signal to BG Saver
> and BG Saver will save the buffers and then postmaster will send
> signal to checkpointer to shutdown. So before writing Checkpoint
> record, BG Saver can crash (it might have saved half the buffers)

Case handled as described above.

> or may BG saver saves buffers, but checkpointer crashes (due to
> power outage or any such thing).

Checkpointer process' crash seems to be irrelevant to Postgres
Hibernator's workings.

I think you are trying to argue the wording in my claim "save-files are
created only on clean shutdowns; not on a crash or immediate shutdown",
by implying that a crash may occur at any time during and after the
BufferSaver processing. I agree the wording can be improved. How about
... save-files are created only when Postgres is requested to shut down
in normal (smart or fast) modes. Note that I am leaving out the mention
of crash.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] Proposing pg_hibernate
On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila wrote: > Another thing is don't you want to handle SIGQUIT signal in bg saver? I think bgworker_quickdie registered in StartBackgroundWorker() serves the purpose just fine. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposing pg_hibernate
On Wed, Jun 4, 2014 at 2:50 PM, Robert Haas wrote: > The thing I was concerned about is that the system might have been in > recovery for months. What was hot at the time the base backup was > taken seems like a poor guide to what will be hot at the time of > promotion. Consider a history table, for example: the pages at the > end, which have just been written, are much more likely to be useful > than anything earlier. I think you are specifically talking about a warm-standby that runs recovery for months before being brought online. As described in my response to Amit, if the base backup used to create that standby was taken after the BlockReaders had restored the buffers (which should complete within few minutes of startup, even for large databases), then there's no concern since the base backup wouldn't contain the save-files. If it's a hot-standby, the restore process would start as soon as the database starts accepting connections, finish soon after, and get completely out of the way of the normal recovery process. At which point the buffers populated by the recovery would compete only with the buffers being requested by backends, which is the normal behaviour. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposing pg_hibernate
On Wed, Jun 4, 2014 at 2:52 PM, Andres Freund wrote: > On 2014-06-04 14:50:39 -0400, Robert Haas wrote: >> The thing I was concerned about is that the system might have been in >> recovery for months. What was hot at the time the base backup was >> taken seems like a poor guide to what will be hot at the time of >> promotion. Consider a history table, for example: the pages at the >> end, which have just been written, are much more likely to be useful >> than anything earlier. > > I'd assumed that the hibernation files would simply be excluded from the > basebackup... Yes, they will be excluded, provided the BlockReader processes have finished, because each BlockReader unlinks its save-file after it is done restoring buffers listed in it. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposing pg_hibernate
On Wed, Jun 4, 2014 at 12:54 AM, Amit Kapila wrote:
> On Tue, Jun 3, 2014 at 5:43 PM, Gurjeet Singh wrote:
>>
>> For sizeable shared_buffers size, the restoration of the shared
>> buffers can take several seconds.
>
> Incase of recovery, the shared buffers saved by this utility are
> from previous shutdown which doesn't seem to be of more use
> than buffers loaded by recovery.

I feel the need to enumerate the recovery scenarios we're talking about
so that we're all on the same page.

1) Hot backup (cp/rsync/pg_basebackup/.. while the master was running)
followed by
  1a) recovery using archives or streaming replication.
    1a.i) database in hot-standby mode
    1a.ii) database not in hot-standby mode, i.e. it's in warm-standby mode.
  1b) minimal recovery, that is, recover only the WAL available in
pg_xlog, then come online.
2) Cold backup of a crashed master, followed by startup of the copy
(causing crash recovery; IMHO same as case 1b above).
3) Cold backup of a clean-shutdown master, followed by startup of the
copy (no recovery).

In cases 1.x there won't be any save-files (*), because the BlockReader
processes remove their respective save-file when they are done restoring
the buffers. So the hot/warm-standby created thus will not inherit the
save-files, and hence post-recovery will not cause any buffer restores.

Case 2 also won't cause any buffer restores, because the save-files are
created only on clean shutdowns; not on a crash or immediate shutdown.

Case 3 is the sweet spot of pg_hibernator. It will save the buffer-list
on shutdown, and restore it when the backup-copy is started (provided
pg_hibernator is installed there).

(*) If a hot-backup is taken immediately after the database comes
online, since the BlockReaders may still be running and not have deleted
the save-files, the save-files may end up in the backup, and hence cause
the recovery-time conflicts we're talking about.
This should be rare in practice, and even when it does happen, at worst it will affect the initial performance of the cluster. >> I have a feeling the users wouldn't >> like their master database take up to a few minutes to start accepting >> connections. > > I think this is fair point and to address this we can have an option to > decide when to load buffer's and have default value as load before > recovery. Given the above description, I don't think crash/archive recovery is a concern anymore. But if that corner case is still a concern, I wouldn't favour making recovery slow by default, and make users of pg_hibernator pay for choosing to use the extension. I'd prefer the user explicitly ask for a behaviour that makes startups slow. > One other point: >> Note that the BuffersSaver process exists at all times, even when this >> parameter is set to `false`. This is to allow the DBA to enable/disable >> the >> extension without having to restart the server. The BufferSaver process >> checks this parameter during server startup and right before shutdown, and >> honors this parameter's value at that time. > > Why can't it be done when user register's the extension by using dynamic > background facility "RegisterDynamicBackgroundWorker"? There's no user interface to this extension except for the 3 GUC parameters; not even CREATE EXTENSION. The DBA is expected to append this extension's name in shared_preload_libraries. Since this extension declares one of its parameters PGC_POSTMASTER, it can't be loaded via the SQL 'LOAD ' command. postgres=# load 'pg_hibernator'; FATAL: cannot create PGC_POSTMASTER variables after startup FATAL: cannot create PGC_POSTMASTER variables after startup The connection to the server was lost. Attempting reset: Succeeded. Best regards, PS: I was out sick yesterday, so couldn't respond promptly. 
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] Proposing pg_hibernate
On Tue, Jun 3, 2014 at 8:13 AM, Gurjeet Singh wrote:
> On Tue, Jun 3, 2014 at 7:57 AM, Robert Haas wrote:
>> On Thu, May 29, 2014 at 12:12 AM, Amit Kapila wrote:
>>>> IMHO, all of these caveats, would affect a very small fraction of
>>>> use-cases and are eclipsed by the benefits this extension provides in
>>>> normal cases.
>>>
>>> I agree with you that there are only few corner cases where evicting
>>> shared buffers by this utility would harm, but was wondering if we could
>>> even save those, say if it would only use available free buffers. I think
>>> currently there is no such interface and inventing a new interface for this
>>> case doesn't seem to reasonable unless we see any other use case of
>>> such a interface.
>>
>> It seems like it would be best to try to do this at cluster startup
>> time, rather than once recovery has reached consistency. Of course,
>> that might mean doing it with a single process, which could have its
>> own share of problems. But I'm somewhat inclined to think that if

Currently pg_hibernator uses the ReadBufferExtended() API, and AIUI,
that API requires a database connection / shared-memory attachment, and
a backend process cannot switch between databases after the initial
connection.

>> own share of problems. But I'm somewhat inclined to think that if
>> recovery has already run for a significant period of time, the blocks
>> that recovery has brought into shared_buffers are more likely to be
>> useful than whatever pg_hibernate would load.

The applications that connect to a standby may have a different access
pattern than the applications that are operating on the master database.
So the buffers that are being restored by the startup process may not be
relevant to the workload on the standby.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] Proposing pg_hibernate
On Tue, Jun 3, 2014 at 7:57 AM, Robert Haas wrote:
> On Thu, May 29, 2014 at 12:12 AM, Amit Kapila wrote:
>>> IMHO, all of these caveats, would affect a very small fraction of
>>> use-cases and are eclipsed by the benefits this extension provides in
>>> normal cases.
>>
>> I agree with you that there are only a few corner cases where evicting
>> shared buffers by this utility would harm, but was wondering if we could
>> even save those, say if it would only use available free buffers. I think
>> currently there is no such interface, and inventing a new interface for this
>> case doesn't seem reasonable unless we see some other use case for
>> such an interface.
>
> It seems like it would be best to try to do this at cluster startup
> time, rather than once recovery has reached consistency. Of course,
> that might mean doing it with a single process, which could have its
> own share of problems. But I'm somewhat inclined to think that if
> recovery has already run for a significant period of time, the blocks
> that recovery has brought into shared_buffers are more likely to be
> useful than whatever pg_hibernate would load.

I am not absolutely sure of the order of execution between the recovery process and the BGWorker, but ...

For sizeable shared_buffers sizes, the restoration of the shared buffers can take several seconds. I have a feeling the users wouldn't like their master database taking up to a few minutes to start accepting connections.

From my tests [1], "In the 'App after Hibernator' [case] ... This took 70 seconds for reading the ~4 GB database."

[1]: http://gurjeet.singh.im/blog/2014/04/30/postgres-hibernator-reduce-planned-database-down-times/

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] Proposing pg_hibernate
On Fri, May 30, 2014 at 5:33 PM, Josh Kupershmidt wrote:
> On Tue, May 27, 2014 at 10:01 PM, Gurjeet Singh wrote:
>
>> When the Postgres server is being stopped/shut down, the `Buffer
>> Saver` scans the shared-buffers of Postgres, and stores the unique
>> block identifiers of each cached block to the disk. This information
>> is saved under the `$PGDATA/pg_hibernator/` directory. For each of
>> the databases whose blocks are resident in shared buffers, one file
>> is created; for eg.: `$PGDATA/pg_hibernator/2.postgres.save`.
>
> This file-naming convention seems a bit fragile. For example, on my
> filesystem (HFS) if I create a database named "foo / bar", I'll get a
> complaint like:
>
> ERROR: could not open "pg_hibernator/5.foo / bar.save": No such file
> or directory
>
> during shutdown.

Thanks for the report. I have reworked the file naming, and now the save-file name is simply '.save', so the name of a database does not affect the file name on disk. Instead, the null-terminated database name is now written in the file, and read back for use when restoring the buffers.

Attached is the new version of pg_hibernator, with updated code and README.

Just a heads up for anyone who might have read/reviewed the previous version's code: there are some unrelated trivial code and Makefile changes as well in this version, which can be easily spotted by a `diff -r`.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com

pg_hibernate.v2.tgz Description: GNU Zip compressed data
Re: [HACKERS] Proposing pg_hibernate
On May 29, 2014 12:12 AM, "Amit Kapila" wrote: > > I agree with you that there are only few corner cases where evicting > shared buffers by this utility would harm, but was wondering if we could > even save those, say if it would only use available free buffers. I think > currently there is no such interface and inventing a new interface for this > case doesn't seem to reasonable unless we see any other use case of > such a interface. +1
Re: [HACKERS] Proposing pg_hibernate
On Wed, May 28, 2014 at 2:15 AM, Amit Kapila wrote:
> On Wed, May 28, 2014 at 7:31 AM, Gurjeet Singh wrote:
>> Caveats
>> -------
>>
>> - Buffer list is saved only when Postgres is shutdown in "smart" and
>>   "fast" modes.
>>
>>   That is, buffer list is not saved when database crashes, or on
>>   "immediate" shutdown.
>>
>> - A reduction in `shared_buffers` is not detected.
>>
>>   If the `shared_buffers` is reduced across a restart, and if the combined
>>   saved buffer list is larger than the new shared_buffers, Postgres
>>   Hibernator continues to read and restore blocks even after
>>   `shared_buffers` worth of buffers have been restored.
>
> How about the cases when shared buffers already contain some
> data:
> a. Before Readers start filling shared buffers, if this cluster wishes
> to join replication as a slave and receive the data from master, then
> this utility might need to evict some buffers filled during startup
> phase.

A cluster that wishes to be a replication standby would do so while it's in the startup phase. The BlockReaders are launched immediately on the cluster reaching a consistent state, at which point, I presume, in most of the cases, most of the buffers would be unoccupied. Hence the BlockReaders might evict some occupied buffers, but those may be a small fraction of total shared_buffers.

> b. As soon as the server completes startup (reached consistent
> point), it allows new connections which can also use some shared
> buffers before Reader process could use shared buffers or are you
> planning to change the time when users can connect to database.

The BlockReaders are launched immediately after the cluster reaches a consistent state, that is, just about when it is ready to accept connections. So yes, there is a possibility that the I/O caused by the BlockReaders may affect the performance of queries executed right at cluster startup.

But given that the performance of those queries was anyway going to be low (because of empty shared buffers), and that BlockReaders tend to cause sequential reads, and that by default there's only one BlockReader active at a time, I think this won't be a concern in most of the cases. By the time the shared buffers start getting filled up, the buffer replacement strategy will evict any buffers populated by BlockReaders if they are not used by the normal queries.

In the 'Sample Runs' section of my blog [1], I compared the cases 'Hibernator w/ App' and 'Hibernator then App', which demonstrate that launching application load while the BlockReaders are active does cause the performance of both to be impacted by each other. But overall it's a net win for application performance.

> I am not sure if replacing shared buffer contents in such cases can
> always be considered useful.

IMHO, all of these caveats would affect a very small fraction of use-cases and are eclipsed by the benefits this extension provides in normal cases.

[1]: http://gurjeet.singh.im/blog/2014/04/30/postgres-hibernator-reduce-planned-database-down-times/

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] Proposing pg_hibernate
ring server startup and right before shutdown, and honors this parameter's value at that time. To enable/disable Postgres Hibernator at runtime, change the value in `postgresql.conf` and use `pg_ctl reload` to make Postgres re-read the new parameter values from `postgresql.conf`.

  Default value: `true`.

- `pg_hibernator.parallel`

  This parameter controls whether Postgres Hibernator launches the BlockReader processes in parallel, or sequentially, waiting for the current BlockReader to exit before launching the next one.

  When enabled, all the BlockReaders, one for each database, will be launched simultaneously, and this may cause a huge random-read flood on the disks if there are many databases in the cluster. This may also cause some BlockReaders to fail to launch successfully because of the `max_worker_processes` limit.

  Default value: `false`.

- `pg_hibernator.default_database`

  The BufferSaver process needs to connect to a database in order to perform the database-name lookups, etc. This parameter controls which database the BufferSaver process connects to for performing these operations.

  Default value: `postgres`.

Caveats
-------

- Buffer list is saved only when Postgres is shutdown in "smart" and "fast" modes.

  That is, buffer list is not saved when database crashes, or on "immediate" shutdown.

- A reduction in `shared_buffers` is not detected.

  If the `shared_buffers` is reduced across a restart, and if the combined saved buffer list is larger than the new shared_buffers, Postgres Hibernator continues to read and restore blocks even after `shared_buffers` worth of buffers have been restored.

FAQ
---

- What is the relationship between `pg_buffercache`, `pg_prewarm`, and `pg_hibernator`?

  They all allow you to do different things with Postgres' shared buffers.

  + pg_buffercache: Inspect and show contents of shared buffers.
  + pg_prewarm: Load some table/index/fork blocks into shared buffers. User needs to tell it which blocks to load.
  + pg_hibernator: Upon shutdown, saves the list of blocks stored in shared buffers. Upon startup, loads those blocks back into shared buffers.

  The goal of Postgres Hibernator is to be invisible to the user/DBA. Whereas with `pg_prewarm` the user needs to know a lot of stuff about what they really want to do, most likely information gathered via `pg_buffercache`.

- Does `pg_hibernate` use either `pg_buffercache` or `pg_prewarm`?

  No, Postgres Hibernator works all on its own. If the concern is, "Do I have to install pg_buffercache and pg_prewarm to use pg_hibernator?", the answer is no. pg_hibernator is a stand-alone extension, although influenced by pg_buffercache and pg_prewarm.

  With `pg_prewarm` you can load blocks of **only** the database you're connected to. So if you have `N` databases in your cluster, to restore blocks of all databases, the DBA will have to connect to each database and invoke `pg_prewarm` functions. With `pg_hibernator`, the DBA isn't required to do anything, let alone connect to the database!

- Where can I learn more about it?

  There are a couple of blog posts and the initial proposal to the Postgres hackers' mailing list. They may provide a better understanding of Postgres Hibernator.

  [Proposal](http://www.postgresql.org/message-id/cabwtf4ui_anag+ybsefunah5z6de9aw2npdy4hryk+m5odx...@mail.gmail.com)
  [Introducing Postgres Hibernator](http://gurjeet.singh.im/blog/2014/02/03/introducing-postgres-hibernator/)
  [Demonstrating Performance Benefits](http://gurjeet.singh.im/blog/2014/04/30/postgres-hibernator-reduce-planned-database-down-times/)

On Mon, Feb 3, 2014 at 7:18 PM, Gurjeet Singh wrote:
> Please find attached the pg_hibernate extension. It is a
> set-it-and-forget-it solution to enable hibernation of Postgres
> shared-buffers. It can be thought of as an amalgam of pg_buffercache and
> pg_prewarm.
>
> It uses the background worker infrastructure. It registers one worker
> process (BufferSaver) to save the shared-buffer metadata when server is
> shutting down, and one worker per database (BlockReader) when restoring the
> shared buffers.
>
> It stores the buffer metadata under $PGDATA/pg_database/, one file per
> database, and one separate file for global objects. It sorts the list of
> buffers before storage, so that when it encounters a range of consecutive
> blocks of a relation's fork, it stores that range as just one entry, hence
> reducing the storage and I/O overhead.
>
> On-disk binary format, which is used to create the per-database save-files,
> is defined as:
> 1. One or more relation filenodes; stored as r.
> 2. Each relation is followed by one or more fork numbers; stored as &
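Putting the named parameters above together, a minimal `postgresql.conf` fragment might look like the following. This is an illustrative sketch, not taken from the extension's distribution; the values shown are the defaults documented in the README:

```ini
# Load the extension at server start (change requires restart).
shared_preload_libraries = 'pg_hibernate'

# Launch BlockReaders sequentially to avoid a random-read flood
# when the cluster has many databases.
pg_hibernator.parallel = false

# Database the BufferSaver connects to for database-name lookups.
pg_hibernator.default_database = 'postgres'
```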
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Thu, Jul 18, 2013 at 1:54 PM, Gurjeet Singh wrote:
> On Thu, Jul 18, 2013 at 10:19 AM, Tom Lane wrote:
>
>> > Because simpler code is less likely to have bugs and is easier to
>> > maintain.
>>
>> I agree with that point, but one should also remember Polya's Inventor's
>> Paradox: the more general problem may be easier to solve. That is, if
>> done right, code that fully flattens an AND tree might actually be
>> simpler than code that does just a subset of the transformation. The
>> current patch fails to meet this expectation,
>
> The current patch does completely flatten any type of tree
> (left/right-deep or bushy) without recursing, and right-deep and bushy tree
> processing is what Robert is recommending to defer to recursive processing.
> Maybe I haven't considered a case where it doesn't flatten the tree; do you
> have an example in mind?
>
>> but maybe you just haven't
>> thought about it the right way.

I tried to eliminate the 'pending' list, but I don't see a way around it. We need temporary storage somewhere to store the branches encountered on the right; in the recursion case the call stack was serving that purpose.

>> My concerns about this patch have little to do with that, though, and
>> much more to do with the likelihood that it breaks some other piece of
>> code that is expecting AND/OR to be strictly binary operators, which
>> is what they've always been in parsetrees that haven't reached the
>> planner. It doesn't appear to me that you've done any research on that
>> point whatsoever
>
> No, I haven't, and I might not be able to research it for a few more weeks.

There are about 30 files (including optimizer and executor) that match a case-insensitive search for BoolExpr, and I scanned those for the usage of the member 'args'. All the instances where BoolExpr.args is being accessed, it's being treated as a null-terminated list. There's one exception that I could find, and it was in the context of a NOT expression: not_clause() in clauses.c.

>> you have not even updated the comment for BoolExpr
>> (in primnodes.h) that this patch falsifies.
>
> I will fix that.

I think this line in that comment already covers the fact that in some "special" cases a BoolExpr may have more than 2 arguments: "There are also a few special cases where more arguments can appear before optimization."

I have updated the comment nevertheless, and removed another comment in parse_expr.c that claimed to be the only place where a BoolExpr with more than 2 args is generated.

I have isolated the code for right-deep and bushy tree processing via the macro PROCESS_BUSHY_TREES. Also, I have shortened some variable names while retaining their meaning.

Please find the updated patch attached (based on master).

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 81c9338..eb35d70 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -41,8 +41,7 @@ bool Transform_null_equals = false;
 static Node *transformExprRecurse(ParseState *pstate, Node *expr);
 static Node *transformParamRef(ParseState *pstate, ParamRef *pref);
 static Node *transformAExprOp(ParseState *pstate, A_Expr *a);
-static Node *transformAExprAnd(ParseState *pstate, A_Expr *a);
-static Node *transformAExprOr(ParseState *pstate, A_Expr *a);
+static Node *transformAExprAndOr(ParseState *pstate, A_Expr *a);
 static Node *transformAExprNot(ParseState *pstate, A_Expr *a);
 static Node *transformAExprOpAny(ParseState *pstate, A_Expr *a);
 static Node *transformAExprOpAll(ParseState *pstate, A_Expr *a);
@@ -224,10 +223,10 @@ transformExprRecurse(ParseState *pstate, Node *expr)
 				result = transformAExprOp(pstate, a);
 				break;
 			case AEXPR_AND:
-				result = transformAExprAnd(pstate, a);
+				result = transformAExprAndOr(pstate, a);
 				break;
 			case AEXPR_OR:
-				result = transformAExprOr(pstate, a);
+				result = transformAExprAndOr(pstate, a);
 				break;
 			case AEXPR_NOT:
 				result = transformAExprNot(pstate, a);
@@ -918,32 +917,102 @@ transformAExprOp(ParseState *pstate, A_Expr *a)
 	return result;
 }

+/*
+ * Transform the AND/OR trees non-recursively.
+ *
+ * The parser turns a list of consecutive AND expressions into a left-deep tree.
+ *
+ * a AND b AND c
+ *
+ *      AND
+ *     /   \
+ *   AND    c
+ *  /   \
+ * a     b
+ *
+ * For very long lists, it gets deep enough that processing it recursively causes
+ * check_stack_depth() to raise error and abort the query. Hence, it is necessary
+ * that we proc
Re: [HACKERS] Patch to send transaction commit/rollback stats to the stats collector unconditionally.
On Wed, Mar 19, 2014 at 7:27 PM, Tom Lane wrote:
> Alvaro Herrera writes:
> > I'm not sure I understand the point of this whole thing. Realistically,
> > how many transactions are there that do not access any database tables?
>
> I think that something like "select * from pg_stat_activity" might not
> bump any table-access counters, once the relevant syscache entries had
> gotten loaded. You could imagine that a monitoring app would do a long
> series of those and nothing else (whether any actually do or not is a
> different question).

+1. This is exactly what I was doing; querying pg_stat_database from a psql session, to track transaction counters.

> But still, it's a bit hard to credit that this patch is solving any real
> problem. Where's the user complaints about the existing behavior?

Consider my original email a user complaint, albeit with a patch attached. I was trying to ascertain TPS on a customer's instance when I noticed this behaviour; it violated POLA. Had I not decided to dig through the code to find the source of this behaviour, as a regular user I would've reported this anomaly as a bug, or maybe turned a blind eye to it, labeling it as quirky behaviour.

> That is, even granting that anybody has a workload that acts like this,
> why would they care ...

To avoid surprises! This behaviour, at least in my case, causes Postgres to misreport the very thing I am trying to calculate.

> and are they prepared to take a performance hit
> to avoid the counter jump after the monitoring app exits?

It doesn't look like we know how much of a performance hit this will cause. I don't see any reasons cited in the code, either. Could this be a case of premature optimization?

The commit log shows that, during the overhaul (commit 77947c5), this check was put in place to emulate what the previous code did; code before that commit reported stats, including transaction stats, only if there were any regular or shared table stats to report.
Besides, there's already a throttle built in, using the PGSTAT_STAT_INTERVAL limit.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
Re: [HACKERS] Patch to send transaction commit/rollback stats to the stats collector unconditionally.
On Wed, Mar 19, 2014 at 4:22 PM, Tom Lane wrote:
> Gurjeet Singh writes:
> > Please find attached the patch to send transaction commit/rollback stats
> > to stats collector unconditionally.
>
> That's intentional to reduce stats traffic. What kind of performance
> penalty does this patch impose? If the number of such transactions is
> large enough to create a noticeable jump in the counters, I would think
> that this would be a pretty darn expensive "fix".

I can't speak to the performance impact of this patch, except that it would depend on the fraction of transactions that behave this way. Perhaps the people who develop and/or aggressively use monitoring can pitch in.

Presumably, on heavily used systems these transactions would form a small fraction. On relatively idle systems these transactions may be a larger fraction, but that wouldn't affect the users since the database is not under stress anyway.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
[HACKERS] Patch to send transaction commit/rollback stats to the stats collector unconditionally.
Please find attached the patch to send transaction commit/rollback stats to the stats collector unconditionally.

Without this patch, the transactions that do not read from/write to a database table do not get reported to the stats collector until the client disconnects. Hence the transaction counts in pg_stat_database do not increment gradually, as one would expect them to. But when such a backend disconnects, the counts jump dramatically, giving the impression that the database processed a lot of transactions (potentially thousands) in an instant.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com

commit 500d11f96b21552872bad4e9655bd45db28efabd
Author: Gurjeet Singh
Date:   Wed Mar 19 13:41:55 2014 -0400

    Send transaction stats to the collector even if no table stats to report.

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3dc280a..47008ed 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -825,11 +825,11 @@ pgstat_report_stat(bool force)
 	}

 	/*
-	 * Send partial messages. If force is true, make sure that any pending
-	 * xact commit/abort gets counted, even if no table stats to send.
+	 * Send partial messages. Make sure that any pending xact commit/abort gets
+	 * counted, even if no table stats to send.
 	 */
 	if (regular_msg.m_nentries > 0 ||
-		(force && (pgStatXactCommit > 0 || pgStatXactRollback > 0)))
+		force || (pgStatXactCommit > 0 || pgStatXactRollback > 0))
 		pgstat_send_tabstat(&regular_msg);
 	if (shared_msg.m_nentries > 0)
 		pgstat_send_tabstat(&shared_msg);
Re: [HACKERS] Cleaner build output when not much has changed
On Mon, Mar 10, 2014 at 8:12 PM, Alvaro Herrera wrote:
> Gurjeet Singh wrote:
> > On Tue, Nov 26, 2013 at 12:37 PM, Tom Lane wrote:
> > > Gurjeet Singh writes:
> > > > I was looking for ways to reduce the noise in Postgres make output,
> > > > specifically, I wanted to eliminate the "Nothing to be done for `all' "
> > > > messages, since they don't add much value, and just add to the clutter.
> > >
> > > Why don't you just use "make -s" if you don't want to see that?
> > > The example output you show is not much less verbose than before.
> >
> > I have a shell function that now adds --no-print-directory to my make
> > command. This patch combined with that switch makes the output really clean
> > (at least from my perspective). Since the use of a command-line switch can
> > easily be left to personal choice, I am not proposing to add that or its
> > makefile-equivalent. But modifying the makefiles to suppress noise is not
> > something everyone can be expected to do, and do right.
>
> FWIW you can add a src/Makefile.custom file with this:
>
> all:
> 	@true
>
> and it will get rid of most noise.

As I noted in the first email in this thread, this causes a warning:

GNUmakefile:14: warning: overriding commands for target `all'
/home/gurjeet/dev/POSTGRES/src/Makefile.custom:2: warning: ignoring old commands for target `all'

I have since settled for `make -s`. On slow builds it keeps me guessing for a long time, without any indication of progress, but I've learned to live with that.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
[HACKERS] Proposing pg_hibernate
Please find attached the pg_hibernate extension. It is a set-it-and-forget-it solution to enable hibernation of Postgres shared-buffers. It can be thought of as an amalgam of pg_buffercache and pg_prewarm.

It uses the background worker infrastructure. It registers one worker process (BufferSaver) to save the shared-buffer metadata when server is shutting down, and one worker per database (BlockReader) when restoring the shared buffers.

It stores the buffer metadata under $PGDATA/pg_database/, one file per database, and one separate file for global objects. It sorts the list of buffers before storage, so that when it encounters a range of consecutive blocks of a relation's fork, it stores that range as just one entry, hence reducing the storage and I/O overhead.

On-disk binary format, which is used to create the per-database save-files, is defined as:
1. One or more relation filenodes; stored as r.
2. Each relation is followed by one or more fork numbers; stored as f.
3. Each fork number is followed by one or more block numbers; stored as b.
4. Each block number is followed by zero or more range numbers; stored as N.

{r {f {b N* }+ }+ }+

Possible enhancements:
- Ability to save/restore only specific databases.
- Control how many BlockReaders are active at a time; to avoid I/O storms.
- Be smart about lowered shared_buffers across the restart.
- Different modes of reading like pg_prewarm does.
- Include PgFincore functionality, at least for Linux platforms.

The extension currently works with PG 9.3, and may work on 9.4 without any changes; I haven't tested, though. If not, I think it'd be easy to port to HEAD/PG 9.4. I see that 9.4 has put a cap on maximum background workers via a GUC, and since my aim is to provide a non-intrusive no-tuning-required extension, I'd like to use the new dynamic-background-worker infrastructure in 9.4, which doesn't seem to have any preset limits (I think it's limited by max_connections, but I may be wrong).
I can work on a 9.4 port, if there's interest in including this as a contrib/ module.

To see the extension in action:
.) Compile it.
.) Install it.
.) Add it to shared_preload_libraries.
.) Start/restart Postgres.
.) Install pg_buffercache extension, to inspect the shared buffers.
.) Note the result of pg_buffercache view.
.) Work on your database to fill up the shared buffers.
.) Note the result of pg_buffercache view, again; there should be more blocks than last time we checked.
.) Stop and start the Postgres server.
.) Note the output of pg_buffercache view; it should contain the blocks seen just before the shutdown.
.) Future server restarts will automatically save and restore the blocks in shared-buffers.

The code is also available as a Git repository at https://github.com/gurjeet/pg_hibernate/

Demo:

$ make -C contrib/pg_hibernate/
$ make -C contrib/pg_hibernate/ install

$ vi $B/db/data/postgresql.conf
$ grep shared_preload_libraries $PGDATA/postgresql.conf
shared_preload_libraries = 'pg_hibernate'	# (change requires restart)

$ pgstart
waiting for server to start done
server started

$ pgsql -c 'create extension pg_buffercache;'
CREATE EXTENSION

$ pgsql -c 'select count(*) from pg_buffercache where relblocknumber is not null group by reldatabase;'
 count
-------
   163
    14
(2 rows)

$ pgsql -c 'create table test_hibernate as select s as a, s::char(1000) as b from generate_series(1, 10) as s;'
SELECT 10

$ pgsql -c 'create index on test_hibernate(a);'
CREATE INDEX

$ pgsql -c 'select count(*) from pg_buffercache where relblocknumber is not null group by reldatabase;'
 count
-------
  2254
    14
(2 rows)

$ pgstop
waiting for server to shut down... done
server stopped

$ pgstart
waiting for server to start done
server started

$ pgsql -c 'select count(*) from pg_buffercache where relblocknumber is not null group by reldatabase;'
 count
-------
  2264
    17
(2 rows)

There are a few more blocks than at the time they were saved, but all the blocks from before the restart are present in shared buffers after the restart.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com

pg_hibernate.tgz Description: GNU Zip compressed data
[HACKERS] Minor improvements to sslinfo contrib module
Please find attached the patch that fixes a couple of comments, and adds the 'static' qualifier to functions that are not used anywhere else in the code base.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com

diff --git a/contrib/sslinfo/sslinfo.c b/contrib/sslinfo/sslinfo.c
index 7a58470..18405f6 100644
--- a/contrib/sslinfo/sslinfo.c
+++ b/contrib/sslinfo/sslinfo.c
@@ -31,9 +31,10 @@ Datum ssl_client_dn_field(PG_FUNCTION_ARGS);
 Datum ssl_issuer_field(PG_FUNCTION_ARGS);
 Datum ssl_client_dn(PG_FUNCTION_ARGS);
 Datum ssl_issuer_dn(PG_FUNCTION_ARGS);
-Datum X509_NAME_field_to_text(X509_NAME *name, text *fieldName);
-Datum X509_NAME_to_text(X509_NAME *name);
-Datum ASN1_STRING_to_text(ASN1_STRING *str);
+
+static Datum X509_NAME_field_to_text(X509_NAME *name, text *fieldName);
+static Datum X509_NAME_to_text(X509_NAME *name);
+static Datum ASN1_STRING_to_text(ASN1_STRING *str);

 /*
@@ -51,7 +52,7 @@ ssl_is_used(PG_FUNCTION_ARGS)

 /*
- * Returns SSL cipher currently in use.
+ * Returns SSL version currently in use.
  */
 PG_FUNCTION_INFO_V1(ssl_version);
 Datum
@@ -77,7 +78,7 @@ ssl_cipher(PG_FUNCTION_ARGS)

 /*
- * Indicates whether current client have provided a certificate
+ * Indicates whether current client provided a certificate
  *
  * Function has no arguments. Returns bool. True if current session
  * is SSL session and client certificate is verified, otherwise false.
Re: [HACKERS] Shave a few instructions from child-process startup sequence
On Wed, Nov 27, 2013 at 9:12 AM, Robert Haas wrote:
> This is a performance patch, so it should come with benchmark results
> demonstrating that it accomplishes its intended purpose. I don't see
> any.

Yes, this is a performance patch, but as the subject says, it saves a few instructions. I don't know how to write a test case that can measure the savings of skipping a few instructions in a startup sequence that potentially takes thousands, or even millions, of instructions.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB Corp. www.EnterpriseDB.com
Re: [HACKERS] Shave a few instructions from child-process startup sequence
On Tue, Nov 26, 2013 at 9:42 PM, Peter Eisentraut wrote:
> src/backend/postmaster/postmaster.c:2255: indent with spaces.
> +else
> src/backend/postmaster/postmaster.c:2267: indent with spaces.
> +break

Not sure how that happened! Attached is the updated patch.

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EnterpriseDB Inc. www.enterprisedb.com

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccb8b86..cdae6e5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2236,6 +2236,20 @@ ClosePostmasterPorts(bool am_syslogger)
 			StreamClose(ListenSocket[i]);
 			ListenSocket[i] = PGINVALID_SOCKET;
 		}
+		else
+		{
+			/*
+			 * Do not process the rest of the array elements since we expect
+			 * the presence of an invalid socket id to mark the end of valid
+			 * elements.
+			 */
+#ifdef USE_ASSERT_CHECKING
+			int			j;
+
+			for (j = i; j < MAXLISTEN; j++)
+				Assert(ListenSocket[j] == PGINVALID_SOCKET);
+#endif
+			break;
+		}
 	}

 	/* If using syslogger, close the read side of the pipe */
Re: [HACKERS] Cleaner build output when not much has changed
On Tue, Nov 26, 2013 at 12:37 PM, Tom Lane wrote:
> Gurjeet Singh writes:
> > I was looking for ways to reduce the noise in Postgres make output,
> > specifically, I wanted to eliminate the "Nothing to be done for `all' "
> > messages, since they don't add much value, and just add to the clutter.
>
> Why don't you just use "make -s" if you don't want to see that?
> The example output you show is not much less verbose than before.

I have a shell function that now adds --no-print-directory to my make command. This patch combined with that switch makes the output really clean (at least from my perspective). Since the use of a command-line switch can easily be left to personal choice, I am not proposing to add that or its makefile-equivalent. But modifying the makefiles to suppress noise is not something everyone can be expected to do, and do right.

> I'm pretty suspicious of cute changes like this to the makefiles.
> They too often have unexpected side-effects. (I'm still pissed off
> about having to manually remove objfiles.txt to get it to rebuild a .o
> file, for instance.)

You mean the --enable-depend switch to ./configure is not sufficient to force a rebuild on changed source code! With this switch, I have always seen my builds do the right thing, whether I modify a .c file or a .h. I personally have never been forced to remove objfiles.txt to solve a build/make issue. (I primarily use VPATH builds, BTW.)

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/
EDB www.EnterpriseDB.com
[HACKERS] Cleaner build output when not much has changed
I was looking for ways to reduce the noise in Postgres make output, specifically, I wanted to eliminate the "Nothing to be done for `all' " messages, since they don't add much value, and just add to the clutter. Most of the solutions I have seen propose grepping out the noisy parts. But one of them proposed adding a .PHONY rule and then adding a no-op command to a target's recipe. Attached is a small patch to a few makefiles which helps remove the above-mentioned message. Following is a sample output: ... make[3]: Entering directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/bin/initdb' make -C ../../../src/port all make[4]: Entering directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/port' make -C ../backend submake-errcodes make[5]: Entering directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/backend' make[5]: Leaving directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/backend' make[4]: Leaving directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/port' make -C ../../../src/common all make[4]: Entering directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/common' make -C ../backend submake-errcodes make[5]: Entering directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/backend' make[5]: Leaving directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/backend' make[4]: Leaving directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/common' make[3]: Leaving directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/bin/initdb' make[2]: Leaving directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/bin' make -C pl all make[2]: Entering directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/pl' ... The noise can be further reduced by adding the --no-print-directory switch, which yields the following output: ...
make -C ../backend submake-errcodes make -C psql all make -C ../../../src/interfaces/libpq all make -C ../../../src/port all make -C ../backend submake-errcodes make -C ../../../src/common all make -C ../backend submake-errcodes make -C scripts all make -C ../../../src/interfaces/libpq all make -C ../../../src/port all ... The only downside I see to this patch is that it emits this warning in the beginning: GNUmakefile:14: warning: overriding commands for target `all' src/Makefile.global:29: warning: ignoring old commands for target `all' This is from the recipe that emits the message "All of PostgreSQL successfully made. Ready to install." For really quiet builds one can use the -s switch, but for someone who wishes to see some kind of progress and also wants a cleaner terminal output, the --no-print-directory switch alone is not enough. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com diff --git a/src/Makefile.global.in b/src/Makefile.global.in index 8bfb77d..d0928d5 100644 --- a/src/Makefile.global.in +++ b/src/Makefile.global.in @@ -26,6 +26,7 @@ standard_always_targets = distprep clean distclean maintainer-clean # make `all' the default target all: + @true # Delete target files if the command fails after it has # started to update the file.
diff --git a/src/backend/Makefile b/src/backend/Makefile index 318cdbc..cbc6d55 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -124,6 +124,7 @@ submake-schemapg: # src/port needs a convenient way to force errcodes.h to get built submake-errcodes: $(top_builddir)/src/include/utils/errcodes.h + @true .PHONY: submake-schemapg submake-errcodes diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile index c4d3f3c..ee8fbcc 100644 --- a/src/backend/catalog/Makefile +++ b/src/backend/catalog/Makefile @@ -55,7 +55,9 @@ postgres.description: postgres.bki ; postgres.shdescription: postgres.bki ; +.PHONY: schemapg.h schemapg.h: postgres.bki ; + @true # Technically, this should depend on Makefile.global, but then # postgres.bki would need to be rebuilt after every configure run,
Re: [HACKERS] Shave a few instructions from child-process startup sequence
On Wed, Nov 20, 2013 at 10:21 AM, Peter Eisentraut wrote: > On 11/5/13, 2:47 AM, Gurjeet Singh wrote: > > On Mon, Nov 4, 2013 at 12:20 AM, Tom Lane > <mailto:t...@sss.pgh.pa.us>> wrote: > > > > But we're not buying much. A few instructions during postmaster > > shutdown > > is entirely negligible. > > > > > > The patch is for ClosePostmasterPorts(), which is called from every > > child process startup sequence (as $subject also implies), not in > > postmaster shutdown. I hope that adds some weight to the argument. > > If there is a concern about future maintenance, you could add assertions > (in appropriate compile mode) that the rest of the array is indeed > PGINVALID_SOCKET. I think that could be a win for both performance and > maintainability. > Makes sense! Does the attached patch look like what you expected? I also added a comment to explain the expectation. Thanks and best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB Inc. www.EnterpriseDB.com diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index ccb8b86..9efc9fa 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2236,6 +2236,20 @@ ClosePostmasterPorts(bool am_syslogger) StreamClose(ListenSocket[i]); ListenSocket[i] = PGINVALID_SOCKET; } +else + { + /* + * Do not process the rest of the array elements since we expect + * the presence of an invalid socket id to mark the end of valid + * elements. + */ +#ifdef USE_ASSERT_CHECKING + int j; + for(j = i; j < MAXLISTEN; j++) +Assert(ListenSocket[j] == PGINVALID_SOCKET); +#endif +break; + } } /* If using syslogger, close the read side of the pipe */
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On Wed, Nov 20, 2013 at 3:44 PM, Tom Lane wrote: > > To my mind, the "create a socket and hope nobody else can get to it" > approach is exactly one of the main things we're trying to avoid here. > If you'll recall, awhile back we had a big discussion about how pg_upgrade > could positively guarantee that nobody messed with the source database > while it was working, and we still don't have a bulletproof guarantee > there. I would like to fix that by making pg_upgrade use only standalone > backends to talk to the source database, never starting a real postmaster > at all. But if the standalone-pg_dump mode goes through a socket, we're > back to square one on that concern. > (I couldn't find the pg_upgrade-related thread mentioned above). I am not sure of the mechanics of this, but can we not launch the postmaster with a random magic-cookie, and use that cookie while initiating the connection from libpq? The postmaster will then reject any connections that don't provide the cookie. We do something similar to enable applications to send cancellation signals (postmaster.c:Backend.cancel_key), just that it's establishing trust in the opposite direction. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. www.enterprisedb.com
Re: [HACKERS] Shave a few instructions from child-process startup sequence
On Mon, Nov 4, 2013 at 12:20 AM, Tom Lane wrote: > But we're not buying much. A few instructions during postmaster shutdown > is entirely negligible. > The patch is for ClosePostmasterPorts(), which is called from every child process startup sequence (as $subject also implies), not in postmaster shutdown. I hope that adds some weight to the argument. Best regards, -- Gurjeet Singh gurjeet.singh.im EnterpriseDB Inc. www.enterprisedb.com
Re: [HACKERS] Shave a few instructions from child-process startup sequence
On Thu, Oct 31, 2013 at 11:20 PM, Tom Lane wrote: > Amit Kapila writes: > > On Thu, Oct 31, 2013 at 2:41 AM, Gurjeet Singh wrote: > >> Just a small patch; hopefully useful. > > > This is valid saving as we are filling array ListenSocket[] in > > StreamServerPort() serially, so during ClosePostmasterPorts() once if > > it encountered PGINVALID_SOCKET, it is valid to break the loop. > > Although savings are small considering this doesn't occur in any > > performance path, still I think this is right thing to do in code. > > > It is better to register this patch in CF app list, unless someone > > feels this is not right. > > I think this is adding fragility for absolutely no meaningful savings. > The existing code does not depend on the assumption that the array > is filled consecutively and no entries are closed early. Why should > we add such an assumption here? > IMHO, typical configurations have one, or maybe two of these array elements populated; one for TCP 5432 (for localhost or *), and the other for Unix Domain Sockets. Saving 62 iterations and comparisons in the startup sequence may not be much, but IMHO it does match logic elsewhere. In fact, this was inspired by the following code in ServerLoop(): ServerLoop() { ... /* * New connection pending on any of our sockets? If so, fork a child * process to deal with it. */ if (selres > 0) { int i; for (i = 0; i < MAXLISTEN; i++) { if (ListenSocket[i] == PGINVALID_SOCKET) break; if (FD_ISSET(ListenSocket[i], &rmask)) { And looking for other precedents, I found that the initMasks() function also relies on the array being filled consecutively, and the first PGINVALID_SOCKET marks the end of valid array members. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
[HACKERS] Shave a few instructions from child-process startup sequence
Just a small patch; hopefully useful. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index ccb8b86..48dc7af 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2236,6 +2236,8 @@ ClosePostmasterPorts(bool am_syslogger) StreamClose(ListenSocket[i]); ListenSocket[i] = PGINVALID_SOCKET; } +else +break; } /* If using syslogger, close the read side of the pipe */
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Thu, Jul 18, 2013 at 10:19 AM, Tom Lane wrote: > > > Because simpler code is less likely to have bugs and is easier to > > maintain. > > I agree with that point, but one should also remember Polya's Inventor's > Paradox: the more general problem may be easier to solve. That is, if > done right, code that fully flattens an AND tree might actually be > simpler than code that does just a subset of the transformation. The > current patch fails to meet this expectation, The current patch does completely flatten any type of tree (left/right-deep or bushy) without recursing, and right-deep and bushy tree processing is what Robert is recommending to defer to recursive processing. Maybe I haven't considered a case where it doesn't flatten the tree; do you have an example in mind? > but maybe you just haven't > thought about it the right way. > > My concerns about this patch have little to do with that, though, and > much more to do with the likelihood that it breaks some other piece of > code that is expecting AND/OR to be strictly binary operators, which > is what they've always been in parsetrees that haven't reached the > planner. It doesn't appear to me that you've done any research on that > point whatsoever No, I haven't, and I might not be able to research it for a few more weeks. > you have not even updated the comment for BoolExpr > (in primnodes.h) that this patch falsifies. > I will fix that. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Wed, Jul 17, 2013 at 1:25 PM, Robert Haas wrote: > On Mon, Jul 15, 2013 at 12:45 AM, Gurjeet Singh wrote: > > Agreed that there's overhead in allocating list items, but is it more > > overhead than pushing functions on the call stack? Not sure, so I leave > it > > to others who understand such things better than I do. > > If you think that a palloc can ever be cheaper than pushing a frame on > the callstack, you're wrong. palloc is not some kind of an atomic > primitive. It's implemented by the AllocSetAlloc function, and you're > going to have to push that function on the call stack, too, in order > to run it. > Agreed. I take my objection back. Even if AllocSetAlloc() reuses memory that was pfree'd earlier, it'll still be at least as expensive as recursing. > > My main point here is that if the user writes a = 1 and b = 1 and c = > 1 and d = 1, they're not going to end up with a bushy tree. They're > going to end up with a tree that's only deep in one direction (left, I > guess) and that's the case we might want to consider optimizing. To > obtain a bushy tree, they're going to have to write a = 1 and (b = 1 > and c = 1) and d = 1, or something like that, and I don't see why we > should stress out about that case. It will be rare in practice. > In v6 of the patch, I have deferred the 'pending' list initialization until we actually hit a candidate right-branch. So in the common case the pending list will never be populated, and if we find a bushy or right-deep tree (for some reason an ORM/tool may choose to build AND/OR lists that end up being right-deep in Postgres), then the pending list will be used to process them iteratively. Does that alleviate your concern about 'pending' list management causing an overhead?
Agreed that bushy/right-deep trees are a remote corner case, but we are addressing a remote corner case in the first place (insanely long AND lists), so why not handle another remote corner case right now if it doesn't cause an overhead for the common case? Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
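For readers following the thread, the lazily-created 'pending' list described above can be sketched outside the parser as follows. This is a toy Python illustration of the idea, not the actual transformExpr code: nodes are tuples standing in for the parser's A_Expr, left branches are consumed in a loop, and a right branch is pushed onto the pending list only when it is itself the same boolean operator, so the common left-deep case never allocates the list at all.

```python
# Toy sketch of iterative AND/OR flattening with a lazily-created pending
# list (illustration only, not PostgreSQL parser code).  A node is either a
# leaf (any non-tuple value) or a tuple (kind, left, right).

def flatten_bool(node, kind="AND"):
    """Collect the leaf operands of a possibly bushy binary boolean tree."""
    leaves = []
    pending = None                      # created lazily, only for bushy trees
    while True:
        # Walk down the chain of left branches iteratively.
        while isinstance(node, tuple) and node[0] == kind:
            _, left, right = node
            if isinstance(right, tuple) and right[0] == kind:
                if pending is None:
                    pending = []
                pending.append(right)   # defer a same-kind right subtree
            else:
                leaves.append(right)
            node = left
        leaves.append(node)             # leftmost leaf of this chain
        if not pending:
            break
        node = pending.pop()
    return leaves

# Left-deep chain, as produced by "a AND b AND c AND d":
left_deep = ("AND", ("AND", ("AND", "a", "b"), "c"), "d")
# Bushy tree, as produced by "(a AND b) AND (c AND d)":
bushy = ("AND", ("AND", "a", "b"), ("AND", "c", "d"))

print(sorted(flatten_bool(left_deep)))  # ['a', 'b', 'c', 'd']
print(sorted(flatten_bool(bushy)))      # ['a', 'b', 'c', 'd']
```

Since the SQL standard does not mandate an evaluation order for AND/OR operands, collecting the leaves in a different order than depth-first recursion would is harmless to the result.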
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Wed, Jul 17, 2013 at 8:21 AM, Gurjeet Singh wrote: > > What's the procedure of moving a patch to the next commitfest? > Never mind, I see an email from Josh B. regarding this on my corporate account. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Tue, Jul 16, 2013 at 4:04 PM, Pavel Stehule wrote: > I did a some performance tests of v5 and v6 version and there v5 is > little bit faster than v6, and v6 has significantly higher stddev > Thanks Pavel. The difference in average seems negligible, but stddev is interesting because v6 does less work than v5 in common cases and in the test that I had shared. The current commitfest (2013-06) is marked as 'In Progress', so is it okay to just mark the patch as 'Ready for Committer', or should I move it to the next commitfest (2013-09)? What's the procedure of moving a patch to the next commitfest? Do I make a fresh submission there with a link to the current submission, or is the move doable somehow in the application itself? Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Sun, Jul 14, 2013 at 8:27 PM, Robert Haas wrote: > On Wed, Jul 10, 2013 at 9:02 PM, Josh Berkus wrote: > >> I think it's a waste of code to try to handle bushy trees. A list is > >> not a particularly efficient representation of the pending list; this > >> will probably be slower than recursing in the common case. I'd suggest > >> keeping the logic to handle left-deep trees, which I find rather > >> elegant, but ditching the pending list. > Somehow I find it hard to believe that recursing would be more efficient than processing the items right there. The recursion is not direct either; transformExprRecurse() is going to call this function again, but after a few more switch-case comparisons. Agreed that there's overhead in allocating list items, but is it more overhead than pushing functions on the call stack? Not sure, so I leave it to others who understand such things better than I do. If by common case you mean a list of just one logical AND/OR operator, then I agree that creating and destroying a list may incur overhead that is relatively very expensive. To that end, I have altered the patch, attached, to not build a pending list until we encounter a node with root_expr_kind in a right branch. We're getting bushy-tree processing with very little extra code, but if you deem it not worthwhile or adding complexity, please feel free to rip it out. > > > > Is there going to be further discussion of this patch, or do I return it? > > Considering it's not been updated, nor my comments responded to, in > almost two weeks, I think we return it at this point. > Sorry, I didn't notice that this patch was put back in 'Waiting on Author' state. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. non_recursive_and_or_transformation_v6.patch Description: Binary data
[HACKERS] Fwd: review: Non-recursive processing of AND/OR lists
On Sun, Jun 30, 2013 at 11:46 AM, Pavel Stehule wrote: > 2013/6/30 Gurjeet Singh : > > On Sun, Jun 30, 2013 at 11:13 AM, Pavel Stehule > > > wrote: > > > > How about naming those 3 variables as follows: > > > > root_expr_kind > > root_expr_name > > root_bool_expr_type > > +1 Thanks. Attached is the patch with that change. I'll update the commitfest entry with a link to this email. -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. non_recursive_and_or_transformation_v5.patch Description: Binary data
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Sun, Jun 30, 2013 at 11:46 AM, Pavel Stehule wrote: > 2013/6/30 Gurjeet Singh : > > On Sun, Jun 30, 2013 at 11:13 AM, Pavel Stehule > > > wrote: > > > > How about naming those 3 variables as follows: > > > > root_expr_kind > > root_expr_name > > root_bool_expr_type > > +1 Thanks. Attached is the patch with that change. I'll update the commitfest entry with a link to this email. -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Sun, Jun 30, 2013 at 11:13 AM, Pavel Stehule wrote: > Hello > > just one small notices > > I dislike a name "root_bool_expr", because, there is not a expression, > but expression type. Can you use "root_bool_expr_type" instead? It is > little bit longer, but more correct. Same not best name is > "root_char", maybe "root_bool_op_name" > > or root_expr_type and root_op_name ??? > How about naming those 3 variables as follows: root_expr_kind root_expr_name root_bool_expr_type -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Tue, Jun 18, 2013 at 3:01 PM, Pavel Stehule wrote: > > related to > > https://commitfest.postgresql.org/action/patch_view?id=1130 > > http://www.postgresql.org/message-id/cabwtf4v9rsjibwe+87pk83mmm7acdrg7sz08rq-4qyme8jv...@mail.gmail.com > > > * motivation: remove recursive procession of AND/OR list (hangs with > 10062 and more subexpressions) > > * patch is short, clean and respect postgresql source code requirements > * patch was applied cleanly without warnings > * all regression tests was passed > * I successfully evaluated expression with 10 subexpressions > * there is no significant slowdown > > possible improvements > > a = (A_Expr*) list_nth(pending, 0); > > a = (A_Expr*) linitial(pending); > I made that change, hesitantly. The comments above the definition of the linitial() macro describe the confusion that API causes. I wanted to avoid that confusion for new code, so I used the newer API which makes the intention quite clear. But looking at that code closely, list_nth() causes at least 2 function calls, and that's pretty heavy compared to the linitial() macro. > > not well comment > > should be -- "If the right branch is also an SAME condition, append it to > the" > I moved that comment above the outer block, so that the intention of the whole do-while code block is described in one place. > I don't see any other issues, so after fixing comments this patch is > ready for commit > Thanks for the review Pavel. Attached is the updated patch, v4. It has the above edits, and a few code improvements, like not repeating the (root_kind == AEXPR_AND ? .. : ..) ternary expression. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. non_recursive_and_or_transformation_v4.patch Description: Binary data
Re: [HACKERS] Config reload/restart preview
On Thu, Jun 20, 2013 at 9:33 AM, Magnus Hagander wrote: > On Thu, Jun 20, 2013 at 2:54 PM, Dimitri Fontaine > wrote: > > Magnus Hagander writes: > >>> Should we have a way of previewing changes that would be applied if we > >>> reloaded/restarted the server? > >> > >> Yes, we should. > > > > +1 > > > >> This would go well with something I started working on some time ago > >> (but haven't actually gotten far on at all), which is the ability for > >> pg_ctl to be able to give feedback at all. Meaning a "pg_ctl reload" > >> should also be able to tell you which parameters were changed, without > >> having to go to the log. Obviously that's almost exactly the same > >> feature. > > > > It could probably connect to the server and issue the SQL command to > > reload, and that one could probably get enhanced to return modified > > variable as NOTICE, or be run with the right client_min_message: > > > > SELECT pg_reload_conf(); > > > > The pg_ctl client would then have to know to display the messages sent > > back by the server. > > The problem with that is that now you must *always* have your system > set up to allow the postgres user to log in in pg_hba.conf or it > fails. > > But yes, one option would be to use SQL instead of opening a socket. > Maybe that's a better idea - have pg_ctl try to use that if available, > and if not send a regular signal and not try to collect the output. > I started working on it yesterday after Thom proposed this idea internally at EDB. The discussion until now doesn't seem to be hostile to a SQL interface, so attached is a hack attempt, which hopefully will serve at least as a POC. A sample session is shown below, while changing a few values in postgresql.conf files. The central idea is to use the SIGHUP processing function to do the work for us and report potential changes via DEBUG2, instead of having to write a new parsing engine. 
The (GUC-nesting + PGC_S_TEST) is nice to have since it prevents the current session from adopting the values that are different in the conf file. This approach is susceptible to the fact that the connected superuser may have its GUC values picked up from user/database/session level settings (ALTER USER/DATABASE .. SET ; or SET param TO val;). $ pgsql Expanded display is used automatically. psql (9.4devel) Type "help" for help. postgres=# show work_mem; work_mem -- 1MB (1 row) postgres=# set client_min_messages = debug2; SET postgres=# select pg_test_reload_conf(); DEBUG: parameter "work_mem" changed to "70MB" pg_test_reload_conf - t (1 row) postgres=# show work_mem; work_mem -- 1MB (1 row) postgres=# select pg_test_reload_conf(); DEBUG: parameter "shared_buffers" cannot be changed without restarting the server DEBUG: configuration file "/home/gurjeet/dev/pgdbuilds/report_guc_chanege_pre_reload/db/data/postgresql.conf" contains errors; unaffected changes were applied pg_test_reload_conf - t (1 row) postgres=# select pg_test_reload_conf(); DEBUG: parameter "log_min_messages" removed from configuration file, reset to default pg_test_reload_conf - t (1 row) Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. report_conf_changes_without_applying.patch Description: Binary data
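Conceptually, the preview boils down to diffing a candidate configuration against the live settings and emitting notices without applying anything. The following is a toy Python model of that behavior, mirroring the DEBUG messages in the session above; the function name, the dict-based representation, and the restart-only classification are illustrative stand-ins, not the real GUC machinery:

```python
# Toy model of a "reload preview": report what a reload WOULD change,
# leaving the live settings untouched (illustration only).

def preview_reload(live, candidate, restart_only=()):
    """Return notice strings describing the effect of a would-be reload."""
    notices = []
    for name, value in candidate.items():
        if live.get(name) == value:
            continue                    # unchanged, nothing to report
        if name in restart_only:
            notices.append('parameter "%s" cannot be changed without '
                           'restarting the server' % name)
        else:
            notices.append('parameter "%s" changed to "%s"' % (name, value))
    for name in live:
        if name not in candidate:
            notices.append('parameter "%s" removed from configuration file, '
                           'reset to default' % name)
    return notices

live = {"work_mem": "1MB", "log_min_messages": "debug2"}
candidate = {"work_mem": "70MB"}        # log_min_messages was removed
for msg in preview_reload(live, candidate):
    print("DEBUG:", msg)
```

The key design point the POC exploits is that the live settings are never written: the diff is computed against a copy (via GUC nesting and PGC_S_TEST in the real patch), so previewing is side-effect free.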
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
Thanks for the review Pavel. On Tue, Jun 18, 2013 at 3:01 PM, Pavel Stehule wrote: > Hello > > related to > > https://commitfest.postgresql.org/action/patch_view?id=1130 > > http://www.postgresql.org/message-id/cabwtf4v9rsjibwe+87pk83mmm7acdrg7sz08rq-4qyme8jv...@mail.gmail.com > > > * motivation: remove recursive procession of AND/OR list (hangs with > 10062 and more subexpressions) > > * patch is short, clean and respect postgresql source code requirements > * patch was applied cleanly without warnings > * all regression tests was passed > * I successfully evaluated expression with 10 subexpressions > * there is no significant slowdown > > possible improvements > > a = (A_Expr*) list_nth(pending, 0); > > a = (A_Expr*) linitial(pending); > > not well comment > > should be -- "If the right branch is also an SAME condition, append it to > the" > > + /* > +* If the right branch is also an AND condition, > append it to the > +* pending list, to be processed later. This > allows us to walk even > +* bushy trees, not just left-deep trees. > +*/ > + if (IsA(a->rexpr, A_Expr) && > ((A_Expr*)a->rexpr)->kind == root_kind) > + { > + pending = lappend(pending, a->rexpr); > + } > + else > + { > + expr = transformExprRecurse(pstate, > a->rexpr); > + expr = coerce_to_boolean(pstate, expr, > root_kind == AEXPR_AND ? > "AND" : "OR"); > + exprs = lcons(expr, exprs); > + } > > I don't see any other issues, so after fixing comments this patch is > ready for commit > > Regards > > Pavel Stehule > -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] Processing long AND/OR lists
On Mon, May 27, 2013 at 10:32 AM, Christopher Browne wrote: > On Mon, May 27, 2013 at 1:42 AM, Gurjeet Singh wrote: > >> >> >>> Joking about "640K" aside, it doesn't seem reasonable to expect a truly >>> enormous query as is generated by the broken forms of this logic to turn >>> out happily. I'd rather fix Slony (as done in the above patch). >>> >> >> Yes, by all means, fix the application, but that doesn't preclude the >> argument that the database should be a bit more smarter and efficient, >> especially if it is easy to do. > > > Agreed, it seems like a fine idea to have the database support such > queries, as this eases coping with applications that might be more > difficult to get fixed. > Seeing no more objections to it, I am going to add this patch to the commitfest. Attached is an updated patch against the latest master; it's the same as the previous version, except that the patch now includes a fix for the failing test case as well. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. non_recursive_and_or_transformation_v3.patch Description: Binary data
[HACKERS] pgbench: introduce a new automatic variable 'client_number'
Please find attached a patch for pgbench that introduces a new auto-variable 'client_number'. Following in the footsteps of the 'scale' auto-variable, this is not declared if the user has specified this variable using the -D switch. Since 'clientid' is a very common name a user can use for their own script's variable, I chose to call this auto-variable client_number, just to avoid conflicts. This variable can come in handy when you want to use a different expression in a query depending on which client is executing it. An example custom transaction is attached, where the UPDATE statement from any given client always updates the same logical row. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. pgbench_add_cleint_number_variable.patch Description: Binary data test_update.sql Description: Binary data
Re: [HACKERS] Processing long AND/OR lists
My last email was written before reading this. A few episodes of 24 occurred between writing and sending that email. Added slony1-hackers, but didn't remove pgsql-hackers. Feel free to exclude pgsql lists, as this branch of conversation seems to be more Slony related than Postgres. On Sun, May 26, 2013 at 10:59 PM, Christopher Browne wrote: > > In Slony 2.1, the issue re-emerged because the ordering of the "action id" > values was lost; the query had previously been implicitly forcing them into > order; we had to add an "ORDER BY" clause, to make the "compressor" work > again. > > http://git.postgresql.org/gitweb/?p=slony1-engine.git;a=blobdiff;f=src/slon/remote_worker.c;h=b1f48043f8e25b4a74a392b0dbceeae8f3e18c27;hp=7fbf67c16f97cb7c3f209cf3be903ea52c4490a9;hb=c4ac435308a78a2db63bf267d401d842c169e87d;hpb=d4612aab78bac5a9836e3e2425c403878f7091c8 > > Commit log says it was fixed between 2.1.2, but from the Slony logs at the time, the version in use was 2.1.2. So > Joking about "640K" aside, it doesn't seem reasonable to expect a truly > enormous query as is generated by the broken forms of this logic to turn > out happily. I'd rather fix Slony (as done in the above patch). > Yes, by all means, fix the application, but that doesn't preclude the argument that the database should be a bit more smarter and efficient, especially if it is easy to do. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] Processing long AND/OR lists
On Sun, May 26, 2013 at 11:46 AM, Tom Lane wrote: > Josh Berkus writes: > > ***15,000***? I'd say that someone has an application design issue. > > Fixing the stack overflow is a good thing, but that query is never going > > to return ... > Just for the record, it does finish in 5 sec on my laptop. But yes, point taken. > > Yeah, the parser's stack consumption seems like only the tip of the > iceberg here. > True. After getting past the parser, the bigger lists consume exponentially longer times around process_equivalence(). And BTW, a backend stuck in that stack does not respond to pg_cancel/terminate_backend()! Should there be a CHECK_FOR_INTERRUPTS() in process_equivalence(), perhaps invoked only every N calls of that function? > > I find it hard to visualize a use-case for so many AND'ed conditions > anyway. I could believe a lot of OR'd conditions, but very likely it > would be a list that could and should be converted to an IN condition. > As noted earlier, this was seen in real-world, at a customer setup, generated by Slony, a highly regarded community project. Also, the patch addresses the stack limit for long OR clauses as well. Anytime such conditions are generated programmatically, there's always a possibility that the programmer underestimated the amount of data they are generating the query for; just like it happened in Slony's case I quoted. Slony has a compress_actionseq() function to scan a list of numbers, and generate a smallest possible WHERE clause. If it sees consecutive numbers that form a range, it turns that range into a BETWEEN clause, and the numbers that don't seem to be in a range are added to the where clause as 'AND col <> n1 AND col <> n2 ...'. It's quite likely that it generates such long lists because it wrongly assumes that the input list is ordered, or at least mostly ordered. Agreed that that function can/should be fixed to do a better job of sorting these numbers before scanning, and also using the NOT IN construct.
But the point remains that Postgres should be capable of handling this simple construct efficiently. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
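Slony's actual compress_actionseq() lives in Slony's C sources and is not shown here. Purely as an illustration of the fix suggested above (sort the ids first so runs become detectable, emit BETWEEN-style ranges for the runs, and collect the stragglers into one non-recursive NOT IN), a hypothetical sketch could look like:

```python
def compress_actionseq(ids, col="log_actionseq"):
    """Illustrative sketch (not Slony's actual code): build a compact WHERE
    clause excluding the given ids.  Sorting first makes runs detectable;
    runs of 3+ become NOT BETWEEN ranges, and leftover singletons go into
    a single NOT IN list, which Postgres processes non-recursively
    (unlike a chain of 'AND col <> n' conditions)."""
    ids = sorted(set(ids))
    ranges, singles = [], []
    i = 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1                          # extend the consecutive run
        if j - i >= 2:                      # a run of 3+ is worth a range
            ranges.append((ids[i], ids[j]))
        else:
            singles.extend(ids[i:j + 1])
        i = j + 1
    parts = [f"{col} NOT BETWEEN {lo} AND {hi}" for lo, hi in ranges]
    if singles:
        parts.append(f"{col} NOT IN ({', '.join(map(str, singles))})")
    return " AND ".join(parts) or "TRUE"

print(compress_actionseq([5, 1, 2, 3, 9]))
# → log_actionseq NOT BETWEEN 1 AND 3 AND log_actionseq NOT IN (5, 9)
```

Note how an unordered input still collapses into one range plus one NOT IN list, instead of thousands of AND'ed `<>` conditions.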
Re: [HACKERS] Processing long AND/OR lists
On Sat, May 25, 2013 at 9:56 AM, Gurjeet Singh wrote: > When Postgres encounters a long list of AND/OR chains, it errors out at > check_stack_depth() after a limit of few thousand. At around 10,000 > elements, the recursion at assign_expr_collations() causes the error. But > at a little higher element count, around 15,000, the recursion check errors > out a little earlier, in the stack around transformAExprAnd(). The test > queries were generated using the attached test.sh script. > > This is not a synthetic test. Recently I saw a report where Slony > generated a huge list of 'log_actionseq <> nnn and log_actionseq <> nnn and > ...' for a SYNC event, and that SYNC event could not complete due to this > error. Granted that Slony can be fixed by making it generate the condition > as 'log_actionseq not in (nnn, nnn)' which is processed non-recursively, > but IMHO Postgres should be capable of handling this trivial construct, > however long it may be (of course, limited by memory). > > To that end, I wrote the attached patch, and although I seem to be playing > by the rules, `make check` fails. Some diffs look benign, but others not so > much. > > AIUI, a BoolExpr is perfectly capable of holding multiple expressions as a > list. And since SQL standard does not say anything about short-circuit > evaluation, the evaluation order of these expressions should not affect the > results. So clearly turning a tree of booleans expressions into a list is > not so subtle a change. I suspect that this patch is messing up the join > conditions. > > I see two ways to fix it: > 1) Fix the order of expressions in BoolExpr such that the order of > evaluation remains unchanged compared to before the patch. > I expect this to take care of the benign diffs. But I doubt this will > address the other diffs. > > 2) Construct a tree same as done before the patch, but do it > non-recursively. 
> This is I guess the right thing to do, but then we'll have to make
> similar tree-traversal changes in other places, for eg. in BoolExpr walking
> in expression_tree_walker() or maybe just the stack below
> assign_expr_collations().

As suspected, the problem was that a JOIN's USING clause processing cooks up its own join conditions, and those were the ones that couldn't cope with the changed order of output. Fixing the order of arguments of the BoolExpr also fixed all the JOIN-related diffs. Now `make check` passes except for this benign diff:

| WHERE (((rsl.sl_color = rsh.slcolor) AND (rsl.sl_len_cm >= rsh.slminlen_cm)) AND (rsl.sl_len_cm <= rsh.slmaxlen_cm));
| WHERE ((rsl.sl_color = rsh.slcolor) AND (rsl.sl_len_cm >= rsh.slminlen_cm) AND (rsl.sl_len_cm <= rsh.slmaxlen_cm));

That's expected, since for cases like these, the patch converts what used to be a tree of 2-argument BoolExprs into a single BoolExpr with many arguments.

-- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.

non_recursive_and_or_transformation_v2.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
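The actual patch is C against the parser; as a language-neutral sketch of the idea — flattening a nested tree of 2-argument AND nodes into one N-ary node iteratively (so no stack-depth limit) while preserving the left-to-right argument order that the JOIN USING conditions depend on — one might write (plain tuples stand in for BoolExpr nodes here):

```python
def flatten_and(expr):
    """Flatten a nested tree of 2-argument ('AND', left, right) nodes into a
    single N-ary ('AND', a, b, c, ...) node.  Done iteratively, so a chain
    of 10,000+ nodes cannot blow the recursion limit, and the original
    left-to-right argument order is preserved -- the ordering detail that
    broke the JOIN USING diffs in the first cut of the patch.
    (Illustrative only; tuples stand in for the C BoolExpr structs.)"""
    args, stack = [], [expr]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple) and node and node[0] == 'AND':
            # push children right-to-left so the left child is emitted first
            stack.extend(reversed(node[1:]))
        else:
            args.append(node)
    return ('AND', *args)
```

For example, `flatten_and(('AND', ('AND', 'a', 'b'), 'c'))` yields `('AND', 'a', 'b', 'c')`, matching the benign diff above where nested 2-argument BoolExprs become one many-argument BoolExpr.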
[HACKERS] Processing long AND/OR lists
When Postgres encounters a long list of AND/OR chains, it errors out at check_stack_depth() after a limit of a few thousand elements. At around 10,000 elements, the recursion at assign_expr_collations() causes the error. But at a little higher element count, around 15,000, the recursion check errors out a little earlier, in the stack around transformAExprAnd(). The test queries were generated using the attached test.sh script.

This is not a synthetic test. Recently I saw a report where Slony generated a huge list of 'log_actionseq <> nnn and log_actionseq <> nnn and ...' for a SYNC event, and that SYNC event could not complete due to this error. Granted, Slony can be fixed by making it generate the condition as 'log_actionseq not in (nnn, nnn)', which is processed non-recursively, but IMHO Postgres should be capable of handling this trivial construct, however long it may be (limited only by memory, of course).

To that end, I wrote the attached patch, and although I seem to be playing by the rules, `make check` fails. Some diffs look benign, but others not so much. AIUI, a BoolExpr is perfectly capable of holding multiple expressions as a list. And since the SQL standard does not say anything about short-circuit evaluation, the evaluation order of these expressions should not affect the results. So clearly turning a tree of boolean expressions into a list is not so subtle a change. I suspect that this patch is messing up the join conditions.

I see two ways to fix it:

1) Fix the order of expressions in BoolExpr such that the order of evaluation remains unchanged compared to before the patch. I expect this to take care of the benign diffs, but I doubt it will address the other diffs.

2) Construct the same tree as before the patch, but do it non-recursively. This is, I guess, the right thing to do, but then we'll have to make similar tree-traversal changes in other places, e.g., in the BoolExpr walking in expression_tree_walker(), or maybe just the stack below assign_expr_collations().

Best regards,

-- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.

non_recursive_and_or_transformation.patch Description: Binary data

test.sh Description: Bourne shell script
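The attached test.sh itself is not reproduced in the archive; as a hedged stand-in (in Python rather than shell, with illustrative names), a generator for queries of the shape that trips check_stack_depth() might look like:

```python
def long_and_query(n, col="log_actionseq"):
    """Build a SELECT with n AND'ed '<>' conditions -- the query shape
    reported to hit the stack-depth limit.  This is a stand-in for the
    attached test.sh, not the original script."""
    conds = " AND ".join(f"{col} <> {i}" for i in range(1, n + 1))
    return f"SELECT 1 WHERE {conds};"

# around 15,000 conditions was enough to hit the stack-depth error
query = long_and_query(15000)
```

Feeding the generated statement to psql on an unpatched server should reproduce the "stack depth limit exceeded" error described above.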
Re: [HACKERS] Patch to make pgindent work cleanly
On Fri, Apr 12, 2013 at 3:26 PM, Bruce Momjian wrote: > On Fri, Apr 12, 2013 at 01:34:49PM -0400, Gurjeet Singh wrote: > > Can you also improve the output when it dies upon failure to fetch > something? > > Currently the only error message it emits is "fetching xyz", and leaves > the > > user confused as to what really the problem was. The only indication of a > > problem might be the exit code, but I'm not sure of that, and that > doesn't > > help the interactive user running it on terminal. > > Good point. I have reviewed all the error messages and improved them > with the attached, applied patch, e.g.: > > cannot locate typedefs file "xyz" at /usr/local/bin/pgindent line > 121. > Thanks! -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] Patch to make pgindent work cleanly
On Fri, Apr 12, 2013 at 11:44 AM, Bruce Momjian wrote:
> On Tue, Feb 19, 2013 at 04:50:45PM -0500, Gurjeet Singh wrote:
> > Please find attached the patch for some cleanup and fix bit rot in
> > pgindent script.
> >
> > There were a few problems with the script.
> >
> > .) It failed to use the $ENV{PGENTAB} even if it was set.
> > .) The file it tries to download from Postgres' ftp site is no longer present.
> >    ftp://ftp.postgresql.org/pub/dev/indent.netbsd.patched.tgz
> > .) The tar command extracted the above-mentioned file to a child directory, but
> >    did not descend into it before running make on it.
> >    I think it expected a tarbomb, but clearly the current .tgz file on ftp
> >    site is not a tarbomb.
> >
> > I don't like the fact that it dies with a message "fetching xyz" rather than
> > saying "Could not fetch xyz", but this patch does not address that since it
> > doesn't really affect the output when script does work.
> >
> > With this patch in place I could very easily run it on any arbitrary file,
> > which seemed like a black-magic before the patch.
> >
> > src/tools/pgindent/pgindent --build
>
> I have applied this patch. However, I modified the tarball name to
> reference $INDENT_VERSION, rather than hard-coding "1.2".

Thanks! Can you also improve the output when it dies upon failure to fetch something? Currently the only error message it emits is "fetching xyz", which leaves the user confused as to what the problem really was. The only indication of a problem might be the exit code, but I'm not sure of that, and that doesn't help an interactive user running it in a terminal.

-- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading
On Wed, Apr 10, 2013 at 11:56 PM, Tom Lane wrote:
> Gurjeet Singh writes:
> > So, again, it is not guaranteed that all the scans on a relation will
> > synchronize with each other. Hence my proposal to include the term
> > 'probability' in the definition.
>
> Yeah, it's definitely not "guaranteed" in any sense. But I don't really
> think your proposed wording is an improvement. The existing wording
> isn't promising guaranteed sync either, to my eyes.

Given Postgres' track record of delivering what it promises, I expect casual readers to take that phrase as a definitive guide to what is happening internally.

> Perhaps we could compromise on, say, changing "so that concurrent scans
> read the same block at about the same time" to "so that concurrent scans
> tend to read the same block at about the same time",

Given that, on first read, the word "about" did not deter me from assuming the best, I don't think adding "tend" would make much difference to a reader's (mis)understanding. Perhaps we can spare a few more words to make this clearer.

> or something like
> that. I don't mind making it sound a bit more uncertain, but I don't
> think that we need to emphasize the probability of failure.

I agree we don't want to stress the failure case too much, especially when it makes the performance no worse than the absence of the feature. But we don't want the reader to get the wrong idea either. In addition to the slight doc improvement being suggested, perhaps a wiki.postgresql.org entry would allow us to explain the behaviour in more detail.

-- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading
On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane wrote:
> Gurjeet Singh writes:
> > If I'm reading the code right [1], this GUC does not actually *synchronize*
> > the scans, but instead just makes sure that a new scan starts from a block
> > that was reported by some other backend performing a scan on the same
> > relation.
>
> Well, that's the only *direct* effect, but ...
>
> > Since the backends scanning the relation may be processing the relation at
> > different speeds, even though each one took the hint when starting the
> > scan, they may end up being out of sync with each other.
>
> The point you're missing is that the synchronization is self-enforcing:
> whichever backend gets ahead of the others will be the one forced to
> request (and wait for) the next physical I/O. This will naturally slow
> down the lower-CPU-cost-per-page scans. The other ones tend to catch up
> during the I/O operation.

Got it. So far, so good. But consider a pathological case: if a scan is driven by a user-controlled cursor, whose speed depends on how fast the user presses the "Next" button, then that scan is quickly going to fall out of sync with the other scans. Moreover, if a new scan happens to pick up the block reported by this slow scan, then that new scan may have to read blocks off the disk afresh. So, again, it is not guaranteed that all the scans on a relation will synchronize with each other. Hence my proposal to include the term 'probability' in the definition.

> The feature is not terribly useful unless I/O costs are high compared to
> the CPU cost-per-page. But when that is true, it's actually rather
> robust. Backends don't have to have exactly the same per-page
> processing cost, because pages stay in shared buffers for a while after
> the current scan leader reads them.

Agreed. Even if the buffer has been evicted from shared_buffers, there's a high likelihood that a scan that's close on the heels of the others will fetch it from the FS cache.

> > Imagining that all scans on a table are always synchronized, may make some
> > wrongly believe that adding more backends scanning the same table will not
> > incur any extra I/O; that is, only one stream of blocks will be read from
> > disk no matter how many backends you add to the mix. I noticed this when I
> > was creating partition tables, and each of those was a CREATE TABLE AS
> > SELECT FROM original_table (to avoid WAL generation), and running more than
> > 3 such transactions caused the disk read throughput to behave unpredictably,
> > sometimes even dipping below 1 MB/s for a few seconds at a stretch.
>
> It's not really the scans that's causing that to be unpredictable, it's
> the write I/O from the output side, which is forcing highly
> nonsequential behavior (or at least I suspect so ... how many disk units
> were involved in this test?)

You may be right. I don't have access to the system anymore, and I don't remember the disk layout, but it's quite possible that write operations were causing the read throughput to drop. I did try to reproduce the behaviour on my laptop with up to 6 backends doing pure reads on a table several times the size of system RAM, but I could not get them to fall out of sync.

-- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
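Tom's "self-enforcing" argument can be illustrated with a toy simulation. This is not Postgres's actual machinery — the costs, cache model, and scheduling here are all made up — but it shows why the leader, who pays every physical I/O, cannot pull far ahead of backends that read the just-cached pages cheaply:

```python
import random

def simulate(n_backends=4, n_pages=2000, io_cost=10.0, seed=42):
    """Toy model (illustrative only, not Postgres internals): each backend
    pays a small, per-backend CPU cost per page; the first backend to touch
    an uncached page also pays a large I/O cost, after which the page is
    'in shared buffers' for everyone.  Whoever leads keeps paying the I/O
    and is slowed down, so the pack stays together."""
    rng = random.Random(seed)
    cpu = [1.0 + rng.random() for _ in range(n_backends)]  # per-page CPU cost
    clock = [0.0] * n_backends  # each backend's elapsed virtual time
    pos = [0] * n_backends      # next page each backend will read
    cached = set()
    max_spread = 0
    while min(pos) < n_pages:
        # the backend with the smallest clock acts next (crude concurrency)
        b = min((i for i in range(n_backends) if pos[i] < n_pages),
                key=lambda i: clock[i])
        page = pos[b]
        clock[b] += cpu[b] + (0.0 if page in cached else io_cost)
        cached.add(page)
        pos[b] += 1
        if all(p < n_pages for p in pos):  # measure only while all are scanning
            max_spread = max(max_spread, max(pos) - min(pos))
    return max_spread  # widest gap, in pages, observed between the scans

print(simulate())
```

In this model the spread between scans stays a small handful of pages out of thousands, while a backend throttled by an external actor (the cursor case above) would fall arbitrarily far behind, since nothing forces it to keep up.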
[HACKERS] synchronize_seqscans' description is a bit misleading
If I'm reading the code right [1], this GUC does not actually *synchronize* the scans, but instead just makes sure that a new scan starts from a block that was reported by some other backend performing a scan on the same relation. Since the backends scanning the relation may be processing it at different speeds, even though each one took the hint when starting the scan, they may end up being out of sync with each other. Even in a single query, there may be different scan nodes scanning different parts of the same relation, and even they don't synchronize with each other (and for good reason).

Imagining that all scans on a table are always synchronized may make some wrongly believe that adding more backends scanning the same table will not incur any extra I/O; that is, that only one stream of blocks will be read from disk no matter how many backends you add to the mix. I noticed this when I was creating partition tables, each populated with a CREATE TABLE AS SELECT FROM original_table (to avoid WAL generation); running more than 3 such transactions caused the disk read throughput to behave unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a stretch.

Please note that I am not complaining about the implementation, which I think is the best we can do without making backends wait for each other. It's just that the documentation [2] implies that the scans stay synchronized through the entire run, which is clearly not the case. So I'd like the docs to be improved to reflect that. How about something like:

synchronize_seqscans (boolean)
This allows sequential scans of large tables to start from a point in the table that is already being read by another backend. This increases the probability that concurrent scans read the same block at about the same time and hence share the I/O workload. Note that, due to the difference in speeds of processing the table, the backends may eventually get out of sync, and hence stop sharing the I/O workload.
When this is enabled, ... The default is on. Best regards, [1] src/backend/access/heap/heapam.c [2] http://www.postgresql.org/docs/9.2/static/runtime-config-compatible.html#GUC-SYNCHRONIZE-SEQSCANS -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.
[HACKERS] One-line comment to improve understanding of VARSIZE_ANY_EXHDR macro
Hopefully I am not wrong.

+/* Size of a varlena data, excluding header */
 #define VARSIZE_ANY_EXHDR(PTR) \

-- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.

exhdr_comment.patch Description: Binary data
[HACKERS] Patch to make pgindent work cleanly
Please find attached the patch for some cleanup and to fix bit rot in the pgindent script.

There were a few problems with the script.

.) It failed to use the $ENV{PGENTAB} even if it was set.
.) The file it tries to download from Postgres' ftp site is no longer present.
   ftp://ftp.postgresql.org/pub/dev/indent.netbsd.patched.tgz
.) The tar command extracted the above-mentioned file to a child directory, but did not descend into it before running make on it. I think it expected a tarbomb, but clearly the current .tgz file on the ftp site is not a tarbomb.

I don't like the fact that it dies with the message "fetching xyz" rather than saying "Could not fetch xyz", but this patch does not address that since it doesn't really affect the output when the script does work.

With this patch in place I could very easily run it on any arbitrary file, which seemed like black magic before the patch.

src/tools/pgindent/pgindent --build

-- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc.

pgindent.patch Description: Binary data
Re: [HACKERS] Fwd: Successful post to pgsql-hackers
On Sat, Feb 9, 2013 at 5:09 PM, Euler Taveira wrote: > On 09-02-2013 13:45, Gurjeet Singh wrote: > > BTW, I hope I understand what selfcopy is: send a copy to yourself. Why > would > > that be turned on by default? > > > If you want to reply to yourself... Wouldn't I be using the copy in my "sent" folder to do that? I guess it is a non-issue for me since my mail provider (GMail) does not show me the mail I sent to myself as a duplicate; it at least seems that way. -- Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] Fwd: Successful post to pgsql-hackers
I tried the lists page and disabled 'selfcopy', since the 'ackpost' option was kinda hidden under the mailing list names. I have updated the settings again to disable ackpost and enable selfcopy. This email will test whether that worked. BTW, I hope I understand what selfcopy is: send a copy to yourself. Why would that be turned on by default? On Sat, Feb 9, 2013 at 10:24 AM, Magnus Hagander wrote: > It's in your personal majordomo settings. I don't think it's related > to that at all, but it's a flag you set on your subscription, so > perhaps you set it by mistake. > > You shold be able to set it from https://mail.postgresql.org. The > setting you're looking for is "ackpost", and you'll want to turn it > off. > > //Magnus > > > On Sat, Feb 9, 2013 at 4:17 PM, Gurjeet Singh wrote: > > Recently I have started getting these confirmations for every email I > send > > to the mailing lists. I think it's related to the fact that I recently > > switched to using a new email address. > > > > How can I turn these notifications off? > > > > -- Forwarded message -- > > From: > > Date: Sat, Feb 9, 2013 at 10:13 AM > > Subject: Successful post to pgsql-hackers > > To: Gurjeet Singh > > > > > > Your message to the pgsql-hackers list, posted on > > Sat, 9 Feb 2013 10:11:05 -0500 > > > > with subject > > Re: pg_prewarm > > > > is currently being delivered. > > > > > > > > -- > > Gurjeet Singh > > > > http://gurjeet.singh.im/ > > > > -- > Magnus Hagander > Me: http://www.hagander.net/ > Work: http://www.redpill-linpro.com/ > -- Gurjeet Singh http://gurjeet.singh.im/
[HACKERS] Fwd: Successful post to pgsql-hackers
Recently I have started getting these confirmations for every email I send to the mailing lists. I think it's related to the fact that I recently switched to using a new email address. How can I turn these notifications off? -- Forwarded message -- From: Date: Sat, Feb 9, 2013 at 10:13 AM Subject: Successful post to pgsql-hackers To: Gurjeet Singh Your message to the pgsql-hackers list, posted on Sat, 9 Feb 2013 10:11:05 -0500 with subject Re: pg_prewarm is currently being delivered. -- Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] pg_prewarm
On Sun, Jun 24, 2012 at 1:36 PM, Cédric Villemain wrote:
> On Saturday, June 23, 2012 02:47:15, Josh Berkus wrote:
> > > The biggest problem with pgfincore from my point of view is that it
> > > only works under Linux, whereas I use a MacOS X machine for my
> > > development, and there is also Windows to think about. Even if that
> > > were fixed, though, I feel we ought to have something in the core
> > > distribution. This patch got more +1s than 95% of what gets proposed
> > > on hackers.
> >
> > Fincore is only a blocker to this patch if we think pgfincore is ready
> > to be proposed for the core distribution. Do we?
>
> I'll make it ready for. (not a huge task).

Hi Cédric,

Can you please post the progress on this, if any? I am planning on polishing up pg_prewarm based on the reviews. As others have said, I don't see a reason why both can't coexist, maybe on PGXN. I am all ears if you think otherwise.

Best regards,

-- Gurjeet Singh http://gurjeet.singh.im/