Re: [HACKERS] REINDEX CONCURRENTLY 2.0
On Tue, Dec 23, 2014 at 5:54 PM, Oskari Saarenmaa o...@ohmu.fi wrote: If the short-lived lock is the only blocker for this feature at the moment, could we just require an additional qualifier for CONCURRENTLY (FORCE?) until the lock can be removed, something like: =# [blah] FWIW, I'd just keep only CONCURRENTLY with no fancy additional keywords even if we cheat on it, as long as the documentation makes clear that an exclusive lock is taken for a very short time, far shorter than what a normal REINDEX would hold. It's not optimal, but currently there's no way to reindex a primary key anywhere close to concurrently, and a short lock would be a huge improvement over the current situation. Yep. -- Michael
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
Sorry for my very late answer. It's been a tough month. 2014-11-27 0:00 GMT+01:00 Bruce Momjian br...@momjian.us: On Mon, Nov 3, 2014 at 12:39:26PM -0800, Jeff Janes wrote: It looked to me that the formula, when descending from a previously stressed state, would be: greatest((1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 1 + 2 * checkpoint_segments + 1 I don't think we can assume checkpoint_completion_target is at all reliable enough to base a maximum calculation on, assuming anything above the maximum is cause of concern and something to inform the admins about. Assuming checkpoint_completion_target is 1 for maximum purposes, how about: max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2 Seems something I could agree on. At least, it makes sense, and it works for my customers. Although I'm wondering why + 2, and not + 1. It seems Jeff and you agree on this, so I may have misunderstood something. -- Guillaume. http://blog.guillaume.lelarge.info http://www.dalibo.com
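To make the proposed ceiling concrete, here is a worked example using the second formula above (the settings chosen are illustrative only, not taken from the thread):

    checkpoint_segments = 10, wal_keep_segments = 0:
        max(2 * 10, 0) + 2 * 10 + 2  =  20 + 20 + 2  =  42 segment files
        (about 672 MB of pg_xlog at the default 16 MB segment size)

    checkpoint_segments = 10, wal_keep_segments = 50:
        max(2 * 10, 50) + 2 * 10 + 2  =  50 + 20 + 2  =  72 segment files
        (about 1.1 GB)

Anything in pg_xlog beyond a bound computed this way would be the "cause of concern" worth reporting to the admins.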
Re: [HACKERS] Compression of full-page-writes
On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote: Speeding up the CRC calculation obviously won't help with the WAL volume per se, ie. you still generate the same amount of WAL that needs to be shipped in replication. But then again, if all you want to do is to reduce the volume, you could just compress the whole WAL stream. Was this point addressed? How much benefit is there to compressing the data before it goes into the WAL stream versus after? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Coverity and pgbench
Anybody looks into problems in pgbench pointed out by Coverity? If no, I would like to work on fixing them because I need to write patches for -f option related issues anyway. Done. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Coverity and pgbench
On Tue, Dec 30, 2014 at 8:39 PM, Tatsuo Ishii is...@postgresql.org wrote: Anybody looks into problems in pgbench pointed out by Coverity? If no, I would like to work on fixing them because I need to write patches for -f option related issues anyway. Done. Just wondering, where are those reports located? It would have been good to point out where the leaks actually were.. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Coverity and pgbench
Just wondering, where are those reports located? It would have been good to point out where the leaks actually were. The report is a closed one. You need to be a member of the PostgreSQL project on Coverity to look into the report. If you want to join the project, you need to contact one of the admins (Stephen, Peter or Magnus). ...Or I could email you the pgbench report part if you wish. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
[HACKERS] Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Hello. Thanks, I understand, what look in another part of code. Hope right now I did right changes. To not modify current pg_usleep calculation, I changed restore_command_retry_interval value to seconds (not milliseconds). In this case, min value - 1 second. Mon, 29 Dec 2014 00:15:03 +0900 от Michael Paquier michael.paqu...@gmail.com: On Sat, Dec 27, 2014 at 3:42 AM, Alexey Vasiliev leopard...@inbox.ru wrote: Thanks for suggestions. Patch updated. Cool, thanks. I just had an extra look at it. +This is useful, if I using for restore of wal logs some +external storage (like AWS S3) and no matter what the slave database +will lag behind the master. The problem, what for each request to +AWS S3 need to pay, what is why for N nodes, which try to get next +wal log each 5 seconds will be bigger price, than for example each +30 seconds. I reworked this portion of the docs, it is rather incorrect as the documentation should not use first-person subjects, and I don't believe that referencing any commercial products is a good thing in this context. +# specifies an optional timeout after nonzero code of restore_command. +# This can be useful to increase/decrease number of a restore_command calls. This is still referring to a timeout. That's not good. And the name of the parameter at the top of this comment block is missing. +static int restore_command_retry_interval = 5000L; I think that it would be more adapted to set that to 5000, and multiply by 1L. I am also wondering about having a better lower bound, like 100ms to avoid some abuse with this feature in the retries? + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg(\%s\ must be bigger zero, + restore_command_retry_interval))); I'd rather rewrite that to must have a strictly positive value. -* Wait for more WAL to arrive. Time out after 5 seconds, +* Wait for more WAL to arrive. Time out after +* restore_command_retry_interval (5 seconds by default), * like when polling the archive, to react to a trigger * file promptly. */ WaitLatch(XLogCtl-recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT, - 5000L); + restore_command_retry_interval); I should have noticed earlier, but in its current state your patch actually does not work. What you are doing here is tuning the time process waits for WAL from stream. In your case what you want to control is the retry time for a restore_command in archive recovery, no? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Alexey Vasiliev diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml index ef78bc0..38420a5 100644 --- a/doc/src/sgml/recovery-config.sgml +++ b/doc/src/sgml/recovery-config.sgml @@ -145,6 +145,26 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p' # Windows /listitem /varlistentry + varlistentry id=restore-command-retry-interval xreflabel=restore_command_retry_interval + termvarnamerestore_command_retry_interval/varname (typeinteger/type) + indexterm +primaryvarnamerestore_command_retry_interval/ recovery parameter/primary + /indexterm + /term + listitem + para +If varnamerestore_command/ returns nonzero exit status code, retry +command after the interval of time specified by this parameter. +Default value is literal5s/. + /para + para +This is useful, if I using for restore of wal logs some +external storage and no matter what the slave database +will lag behind the master. 
+ /para + /listitem + /varlistentry + /variablelist /sect1 diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample index b777400..5b63f60 100644 --- a/src/backend/access/transam/recovery.conf.sample +++ b/src/backend/access/transam/recovery.conf.sample @@ -58,6 +58,11 @@ # #recovery_end_command = '' # +# specifies an optional retry interval of restore_command command, if previous return nonzero exit status code. +# This can be useful to increase/decrease number of a restore_command calls. +# +#restore_command_retry_interval = 5s +# #--- # RECOVERY TARGET PARAMETERS
Re: [HACKERS] Compression of full-page-writes
On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis pg...@j-davis.com wrote: On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote: Speeding up the CRC calculation obviously won't help with the WAL volume per se, ie. you still generate the same amount of WAL that needs to be shipped in replication. But then again, if all you want to do is to reduce the volume, you could just compress the whole WAL stream. Was this point addressed? Compressing the whole record is interesting for multi-insert records, but as we need to keep the compressed data in a pre-allocated buffer until WAL is written, we can only compress things within a given size range. The point is, even if we define a lower bound, compression is going to perform badly with an application that generates for example many small records that are just higher than the lower bound... Unsurprisingly for small records this was bad: http://www.postgresql.org/message-id/cab7npqsc97o-ue5paxfmukwcxe_jioyxo1m4a0pmnmyqane...@mail.gmail.com Now are there still people interested in seeing the amount of time spent in the CRC calculation depending on the record length? Isn't that worth speaking on the CRC thread btw? I'd imagine that it would be simple to evaluate the effect of the CRC calculation within a single process using a bit getrusage. How much benefit is there to compressing the data before it goes into the WAL stream versus after? Here is a good list: http://www.postgresql.org/message-id/20141212145330.gk31...@awork2.anarazel.de Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
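As a rough illustration of the getrusage() idea mentioned above, here is a self-contained sketch (not code from any posted patch; the bit-by-bit CRC-32C loop is only a stand-in for whatever CRC implementation one actually wants to time):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    /* Reflected CRC-32C (Castagnoli), one bit at a time: slow, but a
     * convenient stand-in for the code whose cost we want to measure. */
    static uint32_t
    crc32c_bytewise(uint32_t crc, const unsigned char *p, size_t len)
    {
        while (len--)
        {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
        }
        return crc;
    }

    /* User CPU time consumed by this process, in seconds. */
    static double
    user_seconds(void)
    {
        struct rusage ru;

        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    }

    int
    main(void)
    {
        size_t          len = 8192;     /* pretend this is one WAL record */
        unsigned char  *buf = malloc(len);
        uint32_t        crc = 0xFFFFFFFF;
        double          start;

        memset(buf, 'x', len);
        start = user_seconds();
        for (int i = 0; i < 10000; i++)
            crc = crc32c_bytewise(crc, buf, len);
        printf("crc = %08x, user CPU = %.3f s\n",
               (unsigned) (crc ^ 0xFFFFFFFF), user_seconds() - start);
        free(buf);
        return 0;
    }

Swapping crc32c_bytewise() for a slice-by-8 or SSE4.2 implementation and comparing the reported user time gives a quick per-process answer of the kind suggested, without instrumenting the server itself.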
Re: [HACKERS] Compression of full-page-writes
On 2014-12-30 21:23:38 +0900, Michael Paquier wrote: On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis pg...@j-davis.com wrote: On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote: Speeding up the CRC calculation obviously won't help with the WAL volume per se, ie. you still generate the same amount of WAL that needs to be shipped in replication. But then again, if all you want to do is to reduce the volume, you could just compress the whole WAL stream. Was this point addressed? Compressing the whole record is interesting for multi-insert records, but as we need to keep the compressed data in a pre-allocated buffer until WAL is written, we can only compress things within a given size range. The point is, even if we define a lower bound, compression is going to perform badly with an application that generates for example many small records that are just higher than the lower bound... Unsurprisingly for small records this was bad: So why are you bringing it up? That's not an argument for anything, except not doing it in such a simplistic way. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK
--On 29. Dezember 2014 12:55:11 -0500 Tom Lane t...@sss.pgh.pa.us wrote: Given the lack of previous complaints, this probably isn't backpatching material, but it sure seems like a bit of attention to consistency would be warranted here. Now that i read it i remember a client complaining about this some time ago. I forgot about it, but i think there's value in it to backpatch. -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Tue, Dec 30, 2014 at 9:33 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Dec 30, 2014 at 9:10 PM, Alexey Vasiliev leopard...@inbox.ru wrote: To not modify current pg_usleep calculation, I changed restore_command_retry_interval value to seconds (not milliseconds). In this case, min value - 1 second. Er, what's the problem with not changing 100L to 1000L? The unit of your parameter is ms AFAIK. Of course I meant in the previous version of the patch, not the current one. Wouldn't it be useful to use it with, for example, retry intervals of the order of 100ms~300ms for some cases? -- Michael
Re: [HACKERS] Additional role attributes superuser review
* Amit Kapila (amit.kapil...@gmail.com) wrote: On Tue, Dec 30, 2014 at 6:52 AM, Stephen Frost sfr...@snowman.net wrote: There is one issue that occurs to me, however. We're talking about pg_dump, but what about pg_dumpall? In particular, I don't think the DUMP privilege should provide access to pg_authid, as that would allow the user to bypass the privilege system in some environments by using the hash to log in as a superuser. Now, I don't encourage using password based authentication, especially for superuser accounts, but lots of people do. The idea with these privileges is to allow certain operations to be performed by a non-superuser while preventing trivial access to superuser. Perhaps it's pie-in-the-sky, but my hope is to achieve that. Ugh, hadn't thought about that. :( I don't think it kills the whole idea, but we do need to work out how to address it. One way to address this might be to allow this only for superusers, Huh? The point here is to have a role account which isn't a superuser who can perform these actions. another could be have a separate privilege (DBA) or a role which is not a superuser, however can be used to perform such tasks. I'm pretty sure we've agreed that having a catch-all role attribute like 'DBA' is a bad idea because we'd likely be adding privileges to it later which would expand the rights of users with that attribute, possibly beyond what is intended. I don't really like 'xlog' as a permission. BINARY BACKUP is interesting but if we're going to go multi-word, I would think we would do PITR BACKUP or something FILESYSTEM BACKUP or similar. I'm not really a big fan of the two-word options though. How about PHYSICAL BACKUP (for basebackup) and LOGICAL BACKUP for pg_dump? Again, this makes it look like the read-all right would be specific to users executing pg_dump, which is incorrect and misleading. PHYSICAL might also imply the ability to do pg_basebackup. This is normally the terminology I have seen people using for these backup's. I do like the idea of trying to find a correlation in the documentation for these rights, though I'm not sure we'll be able to. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] What exactly is our CRC algorithm?
On 12/30/2014 09:40 AM, Abhijit Menon-Sen wrote: Hi. I'm re-attaching the two patches as produced by format-patch. I haven't listed any reviewers. It's either just Andres, or maybe a lot of people. Is anyone in a position to try out the patches on MSVC and see if they build and work sensibly, please? (Otherwise it may be better to remove those bits from the patch for now.) A couple of quick comments: bswap32 is unused on little-endian systems. That will give a compiler warning. pg_comp_crc32c_sse processes one byte at a time, until the pointer is 4-byte aligned. Then it processes 8 bytes at a time. So it fetches the 8-byte chunks from only 4-byte aligned addresses. Is that intentional? If unaligned access performs well, why bother with the initial byte-at-a-time processing at all? - Heikki
Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK
Bernd Helmle maili...@oopsware.de writes: --On 29. Dezember 2014 12:55:11 -0500 Tom Lane t...@sss.pgh.pa.us wrote: Given the lack of previous complaints, this probably isn't backpatching material, but it sure seems like a bit of attention to consistency would be warranted here. Now that i read it i remember a client complaining about this some time ago. I forgot about it, but i think there's value in it to backpatch. Hm. Last night I wrote the attached draft patch, which I was intending to apply to HEAD only. The argument against back-patching is basically that this might change the interpretation of scripts that had been accepted silently before. For example \set ECHO_HIDDEN NoExec will now select noexec mode whereas before you silently got on mode. In one light this is certainly a bug fix, but in another it's just definitional instability. If we'd gotten a field bug report we might well have chosen to back-patch, though, and perhaps your client's complaint counts as that. Opinions anyone? regards, tom lane diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index d29dfa2..bdfb67c 100644 *** a/doc/src/sgml/ref/psql-ref.sgml --- b/doc/src/sgml/ref/psql-ref.sgml *** EOF *** 173,180 Echo the actual queries generated by command\d/command and other backslash commands. You can use this to study applicationpsql/application's internal operations. This is equivalent to ! setting the variable varnameECHO_HIDDEN/varname from within ! applicationpsql/application. /para /listitem /varlistentry --- 173,179 Echo the actual queries generated by command\d/command and other backslash commands. You can use this to study applicationpsql/application's internal operations. This is equivalent to ! setting the variable varnameECHO_HIDDEN/varname to literalon/. /para /listitem /varlistentry *** EOF *** 333,340 quietly. By default, it prints welcome messages and various informational output. If this option is used, none of this happens. This is useful with the option-c/option option. ! Within applicationpsql/application you can also set the ! varnameQUIET/varname variable to achieve the same effect. /para /listitem /varlistentry --- 332,339 quietly. By default, it prints welcome messages and various informational output. If this option is used, none of this happens. This is useful with the option-c/option option. ! This is equivalent to setting the variable varnameQUIET/varname ! to literalon/. /para /listitem /varlistentry *** bar *** 2884,2891 termvarnameECHO_HIDDEN/varname/term listitem para ! When this variable is set and a backslash command queries the ! database, the query is first shown. This way you can study the productnamePostgreSQL/productname internals and provide similar functionality in your own programs. (To select this behavior on program start-up, use the switch option-E/option.) If you set --- 2883,2891 termvarnameECHO_HIDDEN/varname/term listitem para ! When this variable is set to literalon/ and a backslash command ! queries the database, the query is first shown. ! This feature helps you to study productnamePostgreSQL/productname internals and provide similar functionality in your own programs. (To select this behavior on program start-up, use the switch option-E/option.) If you set *** bar *** 3046,3061 /term listitem para ! When literalon/, if a statement in a transaction block generates an error, the error is ignored and the transaction ! continues. When literalinteractive/, such errors are only ignored in interactive sessions, and not when reading script ! files. 
When literaloff/ (the default), a statement in a transaction block that generates an error aborts the entire ! transaction. The on_error_rollback-on mode works by issuing an implicit commandSAVEPOINT/ for you, just before each command ! that is in a transaction block, and rolls back to the savepoint ! on error. /para /listitem /varlistentry --- 3046,3061 /term listitem para ! When set to literalon/, if a statement in a transaction block generates an error, the error is ignored and the transaction ! continues. When set to literalinteractive/, such errors are only ignored in interactive sessions, and not when reading script ! files. When unset or set to literaloff/, a statement in a
Re: [HACKERS] Additional role attributes superuser review
On Mon, Dec 29, 2014 at 11:01 PM, Stephen Frost sfr...@snowman.net wrote: * Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote: I'd suggest it's called DUMP if that's what it allows, to keep it separate from the backup parts. Makes sense to me. I'm fine calling it 'DUMP', but for different reasons. We have no (verifiable) idea what client program is being used to connect and therefore we shouldn't try to tie the name of the client program to the permission. That said, a 'DUMP' privilege which allows the user to dump the contents of the entire database is entirely reasonable. We need to be clear in the documentation though- such a 'DUMP' privilege is essentially granting USAGE on all schemas and table-level SELECT on all tables and sequences (anything else..?). To be clear, a user with this privilege can utilize that access without using pg_dump. Well, it would not give you full USAGE - granted USAGE on a schema, you can execute functions in there for example (provided permissions). This privilege would not do that, it would only give you COPY access. (And also COPY != SELECT in the way that the rule system applies, I think? And this one could be for COPY only) But other than that I agree - for *all* these privileges, it needs to be clearly documented that they are not limited to a specific client side application, even if their name happens to be similar to one. One other point is that this shouldn't imply any other privileges, imv. I'm specifically thinking of BYPASSRLS- that's independently grantable and therefore should be explicitly set, if it's intended. Things I think BYPASSRLS would have to be implicitly granted by the DUMP privilege. Without that, the DUMP privilege is more or less meaningless (for anybody who uses RLS - but if they don't use RLS, then having it include BYPASSRLS makes no difference). Worse than that, you may end up with a dump that you cannot restore. Similar concerns would exist for the existing REPLICATION role for example - that one clearly lets you bypass RLS as well, just not with a SQL statement. should work 'sanely' with any combination of the two options. Similairly, DUMP shouldn't imply BACKUP or visa-versa. In particular, I'd like to see roles which have only the BACKUP privilege be unable to directly read any data (modulo things granted to PUBLIC). This would allow an unprivileged user to manage backups, kick off ad-hoc ones, etc, without being able to actually access any of the data (this would require the actual backup system to have a similar control, but that's entirely possible with more advanced SANs and enterprise backup solutions). So you're saying a privilege that would allow you to do pg_start_backup()/pg_stop_backup() but *not* actually use pg_basebackup? That would be EXCLUSIVEBACKUP or something like that, to be consistent with existing terminology though. That seems really bad names, IMHO. Why? Because we use WAL and XLOG throughout documentation and parameters and code to mean *the same thing*. And here they'd suddenly mean different things. If we need them as separate privileges, I think we need much better names. (And a better description - what is xlog operations really?) Fair enough, ultimately what I was trying to address is the following concern raised by Alvaro: To me, what this repeated discussion on this particular BACKUP point says, is that the ability to run pg_start/stop_backend and the xlog related functions should be a different privilege, i.e. 
something other than BACKUP; because later we will want the ability to grant someone the ability to run pg_dump on the whole database without being superuser, and we will want to use the name BACKUP for that. So I'm inclined to propose something more specific for this like WAL_CONTROL or XLOG_OPERATOR, say. Note that the BACKUP role attribute was never intended to cover the pg_dump use-case. Simply the name of it caused confusion though. I'm not sure if adding a DUMP role attribute is sufficient enough to address that confusion, but I haven't got a better idea either. We need to separate the logical backups (pg_dump) from the physical ones (start/stop+filesystem and pg_basebackup). We might also need to separate the two different ways of doing physical backups. Personalyl I think using the DUMP name makes that a lot more clear. Maybe we need to avoid using BACKUP alone as well, to make sure it doesn't go the other way - using BASEBACKUP and EXCLUSIVEBACKUP for those two different ones perhaps? When indeed, what it meant was to have the following separate (effectively merging #2 and #3): 1) ability to pg_dump 2) ability to start/stop backups *and* ability to execute xlog related functions. We probably also need to define what those xlog related functions actually arse. pg_current_xlog_location() is definitely an xlog related function, but does it need the
Re: [HACKERS] Serialization exception : Who else was involved?
Indeed, NOTICE is wrong because it would doom the transaction that sets the flag if it should be later PREPARED. I think that reporting the PIDs and the current activity of each process would be nice. DeadLockReport() is using pgstat_get_backend_current_activity() to get the process activity. I'll see what I can come up with. Thanks, om -----Original Message----- From: Noah Misch [mailto:n...@leadboat.com] Sent: Saturday, December 27, 2014 10:51 To: Olivier MATROT Cc: pgsql-hackers@postgresql.org Subject: Re: Serialization exception : Who else was involved? On Tue, Dec 02, 2014 at 11:17:43AM +0100, Olivier MATROT wrote: Serialization conflict detection is done in src/backend/storage/lmgr/predicate.c, where transactions that are doomed to fail are marked as such with the SXACT_FLAG_DOOMED flag. I simply added elog(...) calls with the NOTIFY level, each time the flag is set, compiled the code and gave it a try. I would like to see this useful and simple addition in a future version of PostgreSQL. Is it in the spirit of what is done when it comes to easing the work of the developer? Maybe the level I've chosen is not appropriate? I would value extra logging designed to help users understand the genesis of serialization failures. A patch the community would adopt will probably have more complexity than your quick elog(NOTICE, ...) addition. I don't have a clear picture of what the final patch should be, but the following are some thoughts to outline the problem space. See [1] for an earlier discussion. The logging done in DeadLockReport() is a good baseline; it would be best to consolidate a similar level of detail and report it all as part of the main serialization failure report. That may prove impractical. If transaction TA marks transaction TB's doom, TA can be long gone by the time TB reports its serialization failure. TA could persist the details needed for that future error report, but that may impose costs out of proportion with the benefit. If so, we can fall back on your original idea of emitting a message in the command that performs the flag flip. That would need a DEBUG error level, though. Sending a NOTICE to a client when its transaction dooms some other transaction would add noise in the wrong place. Thanks, nm [1] http://www.postgresql.org/message-id/flat/AANLkTikF-CR-52nWAo2VG_348aTsK_+0i=chbpnqd...@mail.gmail.com
Re: [HACKERS] Additional role attributes superuser review
* Magnus Hagander (mag...@hagander.net) wrote: On Mon, Dec 29, 2014 at 11:01 PM, Stephen Frost sfr...@snowman.net wrote: That said, a 'DUMP' privilege which allows the user to dump the contents of the entire database is entirely reasonable. We need to be clear in the documentation though- such a 'DUMP' privilege is essentially granting USAGE on all schemas and table-level SELECT on all tables and sequences (anything else..?). To be clear, a user with this privilege can utilize that access without using pg_dump. Well, it would not give you full USAGE - granted USAGE on a schema, you can execute functions in there for example (provided permissions). This privilege would not do that, The approach I was thinking was to document and implement this as impliciting granting exactly USAGE and SELECT rights, no more (not BYPASSRLS) and no less (yes, the role could execute functions). I agree that doing so would be strictly more than what pg_dump actually requires but it's also what we actually have support for in our privilege system. it would only give you COPY access. (And also COPY != SELECT in the way that the rule system applies, I think? And this one could be for COPY only) COPY certainly does equal SELECT rights.. We haven't got an independent COPY privilege and I don't think it makes sense to invent one. It sounds like you're suggesting that we add a hack directly into COPY for this privilege, but my thinking is that the right place to change for this is in the privilege system and not hack it into individual commands.. I'm also a bit nervous that we'll end up painting ourselves into a corner if we hack this to mean exactly what pg_dump needs today. Lastly, I've been considering other use-cases for this privilege beyond the pg_dump one (thinking about auditing, for example). But other than that I agree - for *all* these privileges, it needs to be clearly documented that they are not limited to a specific client side application, even if their name happens to be similar to one. Right. One other point is that this shouldn't imply any other privileges, imv. I'm specifically thinking of BYPASSRLS- that's independently grantable and therefore should be explicitly set, if it's intended. Things I think BYPASSRLS would have to be implicitly granted by the DUMP privilege. Without that, the DUMP privilege is more or less meaningless (for anybody who uses RLS - but if they don't use RLS, then having it include BYPASSRLS makes no difference). Worse than that, you may end up with a dump that you cannot restore. The dump would fail, as I mentioned before. I don't see why BYPASSRLS has to be implicitly granted by the DUMP privilege and can absolutely see use-cases for it to *not* be. Those use-cases would require passing pg_dump the --enable-row-security option, but that's why it's there. Implementations which don't use RLS are not impacted either way, so we don't need to consider them. Implementations which *do* use RLS, in my experience, would certainly understand and expect BYPASSRLS to be required if they want this role to be able to dump all data out regardless of the RLS policies. What does implicitly including BYPASSRLS in this privilege gain us? A slightly less irritated DBA who forgot to include BYPASSRLS and ended up getting a pg_dump error because of it. On the other hand, it walls off any possibility of using this privilege for roles who shouldn't be allowed to bypass RLS or for pg_dumps to be done across the entire system for specific RLS policies. 
Similar concerns would exist for the existing REPLICATION role for example - that one clearly lets you bypass RLS as well, just not with a SQL statement. I'm not sure that I see the point you're making here. Yes, REPLICATION allows you to do a filesystem copy of the entire database and that clearly bypasses RLS and *all* of our privilege system. I'm suggesting that this role attribute work as an implicit grant of privileges we already have. That strikes me as easy to document and very clear for users. should work 'sanely' with any combination of the two options. Similairly, DUMP shouldn't imply BACKUP or visa-versa. In particular, I'd like to see roles which have only the BACKUP privilege be unable to directly read any data (modulo things granted to PUBLIC). This would allow an unprivileged user to manage backups, kick off ad-hoc ones, etc, without being able to actually access any of the data (this would require the actual backup system to have a similar control, but that's entirely possible with more advanced SANs and enterprise backup solutions). So you're saying a privilege that would allow you to do pg_start_backup()/pg_stop_backup() but *not* actually use pg_basebackup? Yes. That would be EXCLUSIVEBACKUP or something like that, to be consistent with existing terminology though. Ok. I agree that working out the specific naming is difficult and would like to focus
[HACKERS] Re[2]: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Tue, 30 Dec 2014 21:39:33 +0900 от Michael Paquier michael.paqu...@gmail.com: On Tue, Dec 30, 2014 at 9:33 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Dec 30, 2014 at 9:10 PM, Alexey Vasiliev leopard...@inbox.ru wrote: To not modify current pg_usleep calculation, I changed restore_command_retry_interval value to seconds (not milliseconds). In this case, min value - 1 second. Er, what the problem with not changing 100L to 1000L? The unit of your parameter is ms AFAIK. Of course I meant in the previous version of the patch not the current one. Wouldn't it be useful to use it with for example retry intervals of the order of 100ms~300ms for some cases? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers Thanks, patch changed. As I understand now = (pg_time_t) time(NULL); return time in seconds, what is why I multiply value to 1000 to compare with restore_command_retry_interval in milliseconds. I am not sure about small retry interval of time, in my cases I need interval bigger 5 seconds (20-40 seconds). Right now I limiting this value be bigger 100 milliseconds. -- Alexey Vasilievdiff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml index ef78bc0..52cb47d 100644 --- a/doc/src/sgml/recovery-config.sgml +++ b/doc/src/sgml/recovery-config.sgml @@ -145,6 +145,29 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p' # Windows /listitem /varlistentry + varlistentry id=restore-command-retry-interval xreflabel=restore_command_retry_interval + termvarnamerestore_command_retry_interval/varname (typeinteger/type) + indexterm +primaryvarnamerestore_command_retry_interval/ recovery parameter/primary + /indexterm + /term + listitem + para +If varnamerestore_command/ returns nonzero exit status code, retry +command after the interval of time specified by this parameter. +Default value is literal5s/. + /para + para +This is useful, if I using for restore of wal logs some +external storage and request each 5 seconds for wal logs +can be useless and cost additional money. +Increasing this parameter allow to increase retry interval of time, +if not found new wal logs inside external storage and no matter +what the slave database will lag behind the master. + /para + /listitem + /varlistentry + /variablelist /sect1 diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample index b777400..5b63f60 100644 --- a/src/backend/access/transam/recovery.conf.sample +++ b/src/backend/access/transam/recovery.conf.sample @@ -58,6 +58,11 @@ # #recovery_end_command = '' # +# specifies an optional retry interval of restore_command command, if previous return nonzero exit status code. +# This can be useful to increase/decrease number of a restore_command calls. 
+# +#restore_command_retry_interval = 5s +# #--- # RECOVERY TARGET PARAMETERS #--- diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e54..67a873c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime; static char *recoveryTargetName; static int recovery_min_apply_delay = 0; static TimestampTz recoveryDelayUntilTime; +static int restore_command_retry_interval = 5000; /* options taken from recovery.conf for XLOG streaming */ static bool StandbyModeRequested = false; @@ -4881,6 +4882,28 @@ readRecoveryCommandFile(void) (errmsg_internal(trigger_file = '%s', TriggerFile))); } + else if (strcmp(item-name, restore_command_retry_interval) == 0) + { + const char *hintmsg; + + if (!parse_int(item-value, restore_command_retry_interval, GUC_UNIT_MS, + hintmsg)) +ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg(parameter \%s\ requires a temporal value, +restore_command_retry_interval), + hintmsg ? errhint(%s, _(hintmsg)) : 0)); + ereport(DEBUG2, + (errmsg_internal(restore_command_retry_interval = '%s', item-value))); + + if (restore_command_retry_interval 100) + { +ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg(\%s\ must have a bigger 100 milliseconds value, + restore_command_retry_interval))); + } + } else if (strcmp(item-name, recovery_min_apply_delay) == 0) { const char *hintmsg; @@ -10495,13 +10518,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * machine, so we've exhausted all the options for *
Re: [HACKERS] Detecting backend failures via libpq / DBD::Pg
Andrew Gierth asked: this is to send a simple SELECT via PQexec Why not PQexec(conn, "") ? Because I want to leave a good clue for debugging, so DBAs are better able to figure out where a mystery slew of queries is coming from. The query is: SELECT 'DBD::Pg ping test' Which also means the inverse is true: simple blank queries are guaranteed to *not* be coming from DBD::Pg. -- Greg Sabino Mullane g...@turnstep.com End Point Corporation http://www.endpoint.com/ PGP Key: 0x14964AC8
Re: [HACKERS] What exactly is our CRC algorithm?
At 2014-12-30 16:05:50 +0200, hlinnakan...@vmware.com wrote: A couple of quick comments: bswap32 is unused on little-endian systems. That will give a compiler warning. Huh. I don't get a warning, even when I add -Wunused to the build flags. But since you mention it, it would be better to write the function thus:

    static inline uint32
    cpu_to_le32(uint32 x)
    {
    #ifndef WORDS_BIGENDIAN
        return x;
    #elif defined(__GNUC__) || defined(__clang__)
        return __builtin_bswap32(x);
    #else
        return ((x << 24) & 0xff000000) |
               ((x << 8) & 0x00ff0000) |
               ((x >> 8) & 0x0000ff00) |
               ((x >> 24) & 0x000000ff);
    #endif
    }

pg_comp_crc32c_sse […] fetches the 8-byte chunks from only 4-byte aligned addresses. Is that intentional? Thanks for spotting that. I had meant to change the test to 7. But again, now that you mention it, I'm not sure it's necessary. The CRC32* instructions don't have the usual SSE alignment requirements, and I see that the Linux kernel (among other implementations) processes eight bytes at a time from the start of the buffer and then processes the remaining bytes one at a time. I'll do a bit more research and post an update. Thanks for having a look. -- Abhijit
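For readers following the alignment discussion, the shape being debated looks roughly like this (an illustrative sketch only, not the patch's actual code; the function name and structure are made up, and it needs a compiler flag such as -msse4.2): consume bytes one at a time until the pointer is 8-byte aligned, use the 8-byte CRC32 instruction for the bulk of the buffer, then finish the tail byte by byte.

    #include <stddef.h>
    #include <stdint.h>
    #include <nmmintrin.h>          /* SSE4.2 CRC32 intrinsics */

    static uint32_t
    crc32c_sse42(uint32_t crc, const void *data, size_t len)
    {
        const unsigned char *p = data;

        /* byte-at-a-time until the pointer is 8-byte aligned */
        while (len > 0 && ((uintptr_t) p & 7) != 0)
        {
            crc = _mm_crc32_u8(crc, *p++);
            len--;
        }

        /* main loop: eight bytes per instruction */
        while (len >= 8)
        {
            crc = (uint32_t) _mm_crc32_u64(crc, *(const uint64_t *) p);
            p += 8;
            len -= 8;
        }

        /* remaining tail bytes */
        while (len > 0)
        {
            crc = _mm_crc32_u8(crc, *p++);
            len--;
        }

        return crc;
    }

Dropping the initial alignment loop and starting the 8-byte loop at the first byte, as the Linux kernel reportedly does, is exactly the simplification being weighed above.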
Re: [HACKERS] BUG #12330: ACID is broken for unique constraints
On Mon, Dec 29, 2014 at 3:53 PM, Kevin Grittner kgri...@ymail.com wrote: I tend to build out applications on top of functions and the inability to set isolation mode inside a function confounds me from using anything but 'read committed'. Hey, no problem -- just set default_transaction_isolation = 'serializable' in your postgresql.conf file and never override isolation level and you'll never have to use an explicit table lock or SELECT FOR UPDATE again. While you're at it, set default_transaction_read_only = on and only override it for transactions that (might) need to update and you'll have dodged the worst of the performance problems from serializable transactions. ok...wow. That's pretty convincing, honestly. Your objective is sensible and clear. So basically there are/were two concrete concerns: compatibility break and loss of reported detail. You've convinced me that principled serialization code shouldn't be impacted negatively by changing the returned sqlstate (although, perhaps, we're being a bit too cavalier in terms of the amount of unprincipled code out there -- particularly looping pl/pgsql exception handlers). Ideally, you'd sneak more detail about the specifics of the error into errdetail or someplace as Jim is suggesting. That'd address the other point. So, given those things, I'm mollified, FWIW. As an aside, the main reason I don't run with serializable isn't performance paranoia or OCD management of locking semantics in the 90%+ of code that doesn't need that treatment; it's because I build code (well, chaining DML and such) in the server, always, and have no way of managing isolation level inside of functions because by the time the function body is invoked the snapshot is already generated. I can make read committed transactions work with appropriate locking but there is no way to downgrade serializable transactions in flight. IOW, lack of proper stored procedures is what's holding me back. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK
On Tue, Dec 30, 2014 at 8:54 AM, Adrian Klaver adrian.kla...@aklaver.com wrote: On 12/30/2014 07:43 AM, David G Johnston wrote: Tom Lane-2 wrote Bernd Helmle lt; mailings@ gt; writes: --On 29. Dezember 2014 12:55:11 -0500 Tom Lane lt; tgl@.pa gt; wrote: Given the lack of previous complaints, this probably isn't backpatching material, but it sure seems like a bit of attention to consistency would be warranted here. Now that i read it i remember a client complaining about this some time ago. I forgot about it, but i think there's value in it to backpatch. Hm. Last night I wrote the attached draft patch, which I was intending to apply to HEAD only. The argument against back-patching is basically that this might change the interpretation of scripts that had been accepted silently before. For example \set ECHO_HIDDEN NoExec will now select noexec mode whereas before you silently got on mode. In one light this is certainly a bug fix, but in another it's just definitional instability. If we'd gotten a field bug report we might well have chosen to back-patch, though, and perhaps your client's complaint counts as that. Opinions anyone? -0.5 for back patching The one thing supporting this is that we'd potentially be fixing scripts that are broken but don't know it yet. But the downside of changing active settings for working scripts - even if they are only accidentally working - is enough to counter that for me. Being more liberal in our acceptance of input is more feature than bug fix even if we document that we accept more items. It is more about being consistent then liberal. Personally I think a situation where for one variable 0 = off but for another 0 = on, is a bug I can sorta buy the consistency angle but what will seal it for me is script portability - the ability to write a script and instructions using the most current release and have it run on previous versions without having to worry about this kind of incompatibility. So, +1 for back patching from me. David J.
[HACKERS] pg_ctl {start,restart,reload} bad handling of stdout file descriptor
Hi, I've been trying to run some pg_ctl commands inside a Python script, and saw that some of them were deadlocking. It seems that the commands that start postgres handle stdout in a way that blocks the caller. Redirecting stdout to /dev/null or to a file using the -l option allows me to work around the problem, but fixing it upstream would be nice. Here's a simple Python program that reproduces the problem, and should deadlock. Customize your data path first, of course. Regards, -- Luis MENINA / Software Engineer Web: www.anevia.com Anevia: 1 rue René Anjolvy, 94250 Gentilly, France

    #! /usr/bin/env python

    import subprocess

    PGDATADIR = '/mnt/streams1/pgsql/localdb/data'

    # WORKS !!! (because of stdout redirection)
    #cmd = 'sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl restart -w -s -m fast -D ' + PGDATADIR + ' > /dev/null'
    #cmd = 'sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl restart -w -s -m fast -D ' + PGDATADIR + ' -l /tmp/postgres.log'

    # BROKEN !!!
    #cmd = 'sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl restart -w -s -m fast -D ' + PGDATADIR + ' 2>&1'
    cmd = 'sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl restart -w -s -m fast -D ' + PGDATADIR

    print cmd
    print "Running subprocess.check_output"
    subprocess.check_output(cmd, shell=True)
Re: [HACKERS] pg_ctl {start,restart,reload} bad handling of stdout file descriptor
I knew I'd forget something... I'm running Postgres 9.3.5 on Debian/Linux. Regards, -- Luis MENINA / Software Engineer Web: www.anevia.com Anevia: 1 rue René Anjolvy, 94250 Gentilly, France -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Sun, Dec 28, 2014 at 12:45 PM, Peter Geoghegan p...@heroku.com wrote: On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis pg...@j-davis.com wrote: Do others have similar numbers? I'm quite surprised at how little work_mem seems to matter for these plans (HashJoin might be a different story though). I feel like I made a mistake -- can someone please do a sanity check on my numbers? I have seen external sorts that were quicker than internal sorts before. With my abbreviated key patch, under certain circumstances external sorts are faster, while presumably the same thing is true of int4 attribute sorts today. Actually, I saw a 10MB work_mem setting that was marginally faster than a multi-gigabyte one that fit the entire sort in memory. It probably has something to do with caching effects dominating over the expense of more comparisons, since higher work_mem settings that still resulted in an external sort were slower than the 10MB setting. I was surprised by this too, but it has been independently reported by Jeff Janes. I don't recall (at the moment) seeing our external sort actually faster than quick-sort, but I've very reliably seen external sorts get faster with less memory than with more. It is almost certainly a CPU caching issue. Very large simple binary heaps are horrible on the CPU cache. And for sort-by-reference values, quick sort is also pretty bad. With a slow enough data bus between the CPU and main memory, I don't doubt that a 'tapesort' with small work_mem could actually be faster than quicksort with large work_mem. But I don't recall seeing it myself. But I'd be surprised that a tapesort as currently implemented would be faster than a quicksort if the tapesort is using just one byte less memory than the quicksort is. But to Jeff Davis's question, yes, tapesort is not very sensitive to work_mem, and to the extent it is sensitive it is in the other direction of more memory being bad. Once work_mem is so small that it takes multiple passes over the data to do the merge, then small memory would really be a problem. But on modern hardware you have to get pretty absurd settings before that happens. Cheers, Jeff
Re: [HACKERS] Documentation of bt_page_items()'s ctid field
On 12/30/2014 04:08 AM, Peter Geoghegan wrote: Attached documentation patch describes the purpose of bt_page_items()'s ctid field. This has come up enough times in disaster recovery or testing scenarios that I feel it's worth drawing particular attention to. How much detail on the b-tree internals do we want to put in the pageinspect documentation? I can see that being useful, but should we also explain e.g. that the first item on each (non-rightmost) page is the high key? + Note that structfieldctid/ addresses a heap tuple if the + page under consideration is a B-Tree leaf page. Otherwise, for + internal B-Tree pages structfieldctid/ addresses a page in + the B-Tree itself (excluding the root page if and only if there + has not yet been a root page split, as in the example above). + These internally referenced pages are child pages, and may + themselves be leaf pages or internal pages. I had a hard time understanding the remark about the root page. But in any case, if you look at the flags set e.g. with bt_page_stats(), the root page is flagged as also being a leaf page, when it is the only page in the index. So the root page is considered also a leaf page in that case. I'd suggest saying the same thing (or more) with fewer words: In a B-tree leaf page, structfieldctid/ points to a heap tuple. In an internal page, it points to another page in the index itself, and the offset number part (the second number) of the ctid field is ignored. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BUG #12330: ACID is broken for unique constraints
On Mon, Dec 29, 2014 at 4:17 PM, Merlin Moncure mmonc...@gmail.com wrote: Serialization errors only exist as a concession to concurrency and performance. Again, they should be returned as sparsely as possible I think this is fuzzy thinking. Serialization *errors* themselves are a concession to concurrency and performance but the actual signaling of the errors is just a consequence that the serialization failure occurred. If you can find a way to avoid the serialization failure then yay but you can't just not report it and call that a win. The point of returning a serialization failure is to report when a transaction sees something that can't be possible in a serialized execution. It's not that retrying might make the error go away it's that the error shouldn't have been possible in a serialized execution so the code shouldn't have to be prepared for it. So if you get a unique violation or an RI violation and the transaction didn't previously verify that the constraint wouldn't be violated then it's perfectly ok to return a constraint violation error. It doesn't matter whether retrying might avoid the error since there was some serialized order in which the constraint was violated. The case at issue is things like upsert where the code already verified that no violation should occur. Upsert is a typical case but it would be equally valid if it's an RI constraint and the transaction verified that the RI constraint is satisified before inserting or updating. If a concurrent transaction breaks the constraint for this insert and the insert gets a constraint violation then it ought to get a serialization failure since the constraint failure can't occur in a serialized execution order. But shouldn't the Read-Write dependency graph already be picking this up? IF you've previously read a row and someone updates it and then you try to lock it for updates it should be considering this a serialization failure. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
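To make the practical consequence concrete: client code that is prepared for serialization failures typically keys its retry decision off SQLSTATE 40001, which is why it matters whether the server reports 40001 or a constraint violation in these cases. A rough libpq sketch (illustrative only, not from the thread; run_with_retry and the assumption that the passed SQL is a complete, self-contained transaction are made up for this example):

    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    /*
     * Run `sql` and retry it when the server reports SQLSTATE 40001
     * (serialization_failure).  Any other error is treated as a hard
     * failure and NOT retried.
     */
    static int
    run_with_retry(PGconn *conn, const char *sql, int max_retries)
    {
        for (int attempt = 0; attempt <= max_retries; attempt++)
        {
            PGresult   *res = PQexec(conn, sql);

            if (PQresultStatus(res) == PGRES_COMMAND_OK ||
                PQresultStatus(res) == PGRES_TUPLES_OK)
            {
                PQclear(res);
                return 0;                   /* success */
            }

            const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

            if (sqlstate != NULL && strcmp(sqlstate, "40001") == 0)
            {
                /* serialization failure: clean up and try again */
                PQclear(res);
                PQclear(PQexec(conn, "ROLLBACK"));
                continue;
            }

            fprintf(stderr, "giving up: %s", PQerrorMessage(conn));
            PQclear(res);
            return -1;                      /* not retryable */
        }
        return -1;                          /* retries exhausted */
    }

If a unique violation caused purely by a concurrent transaction were reported as 40001, as proposed, a loop like this would retry it; reported as 23505, the loop gives up, which is the behavioral difference being debated.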
Re: [HACKERS] What exactly is our CRC algorithm?
On 2014-12-30 21:36:19 +0530, Abhijit Menon-Sen wrote: Thanks for spotting that. I had meant to change the test to 7. But again, now that you mention it, I'm not sure it's necessary. The CRC32* instructions don't have the usual SSE alignment requirements, and I see that the Linux kernel (among other implementations) process eight bytes at a time from the start of the buffer and then process the remaining bytes one at a time. I'll do a bit more research and post an update. I'd done some quick and dirty benchmarking when writing my initial crc32c POC and I found that four byte aligned accesses were faster than ones not. But it really just was running it a couple times, nothing more. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On Tue, Dec 30, 2014 at 12:35 AM, Guillaume Lelarge guilla...@lelarge.info wrote: Sorry for my very late answer. It's been a tough month. 2014-11-27 0:00 GMT+01:00 Bruce Momjian br...@momjian.us: On Mon, Nov 3, 2014 at 12:39:26PM -0800, Jeff Janes wrote: It looked to me that the formula, when descending from a previously stressed state, would be: greatest((1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 1 + 2 * checkpoint_segments + 1 I don't think we can assume checkpoint_completion_target is at all reliable enough to base a maximum calculation on, assuming anything above the maximum is cause of concern and something to inform the admins about. Assuming checkpoint_completion_target is 1 for maximum purposes, how about: max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2 Seems something I could agree on. At least, it makes sense, and it works for my customers. Although I'm wondering why + 2, and not + 1. It seems Jeff and you agree on this, so I may have misunderstood something. From hazy memory, one +1 comes from the currently active WAL file, which exists but is not counted towards either wal_keep_segments or recycled files. And the other +1 comes from the formula for how many recycled files to retain, which explicitly has a +1 in it. Cheers, Jeff
Re: [HACKERS] Documentation of bt_page_items()'s ctid field
On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: How much detail on the b-tree internals do we want to put in the pageinspect documentation? I can see that being useful, but should we also explain e.g. that the first item on each (non-rightmost) page is the high key? Maybe we should. I see no reason not to, and I think that it makes sense to explain things at that level without going into flags and so on. But don't forget that that isn't quite the full story if we're going to talk about high keys at all; we must also explain minus infinity keys, alongside any explanation of the high key: * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be * minus infinity: this routine will always claim it is less than the * scankey. The actual key value stored (if any, which there probably isn't) * does not matter. This convention allows us to implement the Lehman and * Yao convention that the first down-link pointer is before the first key. * See backend/access/nbtree/README for details. In particular, this means that the key data is garbage, which is something I've also seen causing confusion [1]. I would like to make it easier for competent non-experts on the B-Tree code to eyeball a B-Tree with pageinspect, and be reasonably confident that things add up. In order for such people to know that something is wrong, we should explain what right looks like in moderate detail. So, as I said, I feel an exact explanation of flags is unnecessary, but tend to agree that a brief reference to both page highkeys and minus infinity keys is appropriate, since users of the function will see them all the time. I had a hard time understanding the remark about the root page. But in any case, if you look at the flags set e.g. with bt_page_stats(), the root page is flagged as also being a leaf page, when it is the only page in the index. So the root page is considered also a leaf page in that case. I think that a better way of handling that originally would have been to make root-ness a separate property from leaf-ness/internal-ness. Too late for that now, I suppose. I'd suggest saying the same thing (or more) with fewer words: In a B-tree leaf page, structfieldctid/ points to a heap tuple. In an internal page, it points to another page in the index itself, and the offset number part (the second number) of the ctid field is ignored. That seems good. What do you think of the attached revision? [1] http://www.postgresql.org/message-id/20140828.110824.1195843073079055852.t-is...@sraoss.co.jp -- Peter Geoghegan diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml index 6f51dc6..ffc620a 100644 --- a/doc/src/sgml/pageinspect.sgml +++ b/doc/src/sgml/pageinspect.sgml @@ -192,6 +192,16 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1); 7 | (0,7) | 12 | f | f| 29 27 00 00 8 | (0,8) | 12 | f | f| 2a 27 00 00 /screen + In a B-tree leaf page, structfieldctid/ points to a heap + tuple. In an internal page, it points to another page in the + index itself, and the offset number part (the second number) of + the structfieldctid/ field is ignored. Note that the first + item on any non-rightmost page is the page quotehigh + key/quote, meaning its structfielddata/ serves as an upper + bound on all items appearing on the page. Also, on non-leaf + pages, the first data item (the first item that is not a high + key) is a quoteminus infinity/quote item, implying that the + value of structfielddata/ is garbage. 
</para> </listitem> </varlistentry> -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Documentation of bt_page_items()'s ctid field
On 12/30/2014 10:07 PM, Peter Geoghegan wrote: On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: How much detail on the b-tree internals do we want to put in the pageinspect documentation? I can see that being useful, but should we also explain e.g. that the first item on each (non-rightmost) page is the high key? Maybe we should. I see no reason not to, and I think that it makes sense to explain things at that level without going into flags and so on. But don't forget that that isn't quite the full story if we're going to talk about high keys at all; we must also explain minus infinity keys, alongside any explanation of the high key: Yeah, good point. * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be * minus infinity: this routine will always claim it is less than the * scankey. The actual key value stored (if any, which there probably isn't) * does not matter. This convention allows us to implement the Lehman and * Yao convention that the first down-link pointer is before the first key. * See backend/access/nbtree/README for details. In particular, this means that the key data is garbage, which is something I've also seen causing confusion [1]. In practice, we never store any actual key value for the minus infinity key. I guess the code would ignore it if it was there, but it would make more sense to explain that the first data key on an internal page does not have a key value. If there is a value there, it's a sign that something's wrong. I would like to make it easier for competent non-experts on the B-Tree code to eyeball a B-Tree with pageinspect, and be reasonably confident that things add up. In order for such people to know that something is wrong, we should explain what right looks like in moderate detail. Makes sense. I had a hard time understanding the remark about the root page. But in any case, if you look at the flags set e.g. with bt_page_stats(), the root page is flagged as also being a leaf page, when it is the only page in the index. So the root page is considered also a leaf page in that case. I think that a better way of handling that originally would have been to make root-ness a separate property from leaf-ness/internal-ness. Hmm, yeah, bt_page_stats() currently returns 'l' in the type column when (BTP_ROOT | BTP_LEAF). - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Documentation of bt_page_items()'s ctid field
On Tue, Dec 30, 2014 at 12:21 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: In practice, we never store any actual key value for the minus infinity key. I guess the code would ignore it if it was there, but it would make more sense to explain that the first data key on an internal page does not have a key value. If there is a value there, it's a sign that something's wrong. Well, it depends on what your exact definition of a key is, I suppose. Here is what I see for a B-Tree's root page (the index is a few hundred KiB in size):

postgres=# select * from bt_page_items('ix_order_custid', 3);
 itemoffset | ctid  | itemlen | nulls | vars |           data
------------+-------+---------+-------+------+--------------------------
          1 | (1,1) |       8 | f     | f    |
          2 | (2,1) |      16 | f     | f    | 65 02 00 00 00 00 00 00
          3 | (4,1) |      16 | f     | f    | d2 04 00 00 00 00 00 00
          4 | (5,1) |      16 | f     | f    | 28 07 00 00 00 00 00 00
          5 | (6,1) |      16 | f     | f    | b9 09 00 00 00 00 00 00
*** SNIP ***

It's storing an index tuple, but not a key value itself. I guess that the equivocating comments (that I pasted here before, that appear over _bt_compare()) about what is stored made me use the term garbage, but offhand I'm not aware of how that could actually happen either. Perhaps it's an obsolete comment, and the data field will always be empty for minus infinity items? I'll leave the exact description of this state of the data field to you. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl {start,restart,reload} bad handling of stdout file descriptor
Luis Menina lmen...@anevia.com writes: I've been trying to run some pg_ctl commands inside a python script, and saw that some of them were deadlocking. It seems that the commands that start postgres handle stdout in a way that blocks the caller. Redirecting stdout to /dev/null or to a file using the -l option allows me to work around the problem, but fixing it upstream would be nice. I think this is just pilot error, not a bug. When you launch the postmaster via pg_ctl and don't supply a -l option, the postmaster's stdout remains connected to wherever pg_ctl's stdout went to --- which, in this instance, is a pipe leading to the python script's process. So even after pg_ctl exits, subprocess.check_output() sees the pipe as still having live writers, and it keeps waiting for more input (which may indeed be forthcoming, if the postmaster prints log data to stdout). You can verify this by issuing pg_ctl stop from another terminal and noting that the python script exits when the postmaster shuts down. This can actually be useful behavior; for instance you might want to collect the postmaster's stdout via a python script. So we are certainly not going to call it a bug and break it. What's perhaps more debatable is whether it should be the default ... but it's been that way for a decade or two, so changing the default would probably annoy far more people than it would help. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
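To make the failure mode and the workarounds concrete, here is a minimal Python 3 sketch of what the above looks like from the subprocess module. The data directory and log file paths are hypothetical; only the pg_ctl options (-D, -w, -l) and the subprocess calls themselves are standard.

import subprocess

DATADIR = "/tmp/pgdata"             # hypothetical data directory
LOGFILE = "/tmp/pgdata/server.log"  # hypothetical log file

def start_hangs():
    # Hangs: check_output() creates a pipe, the postmaster inherits it as
    # its stdout, and the pipe never reaches EOF even after pg_ctl exits.
    subprocess.check_output(["pg_ctl", "-D", DATADIR, "-w", "start"])

def start_with_logfile():
    # Workaround 1: -l sends the postmaster's output to a log file instead
    # of our pipe, so the pipe closes as soon as pg_ctl itself exits.
    subprocess.check_output(
        ["pg_ctl", "-D", DATADIR, "-w", "-l", LOGFILE, "start"])

def start_discarding_output():
    # Workaround 2: don't capture the output at all; discard it.
    subprocess.check_call(["pg_ctl", "-D", DATADIR, "-w", "start"],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.STDOUT)

Only one of the two workaround variants is needed in practice; they are shown side by side for comparison.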
[HACKERS] Re: [COMMITTERS] pgsql: pg_event_trigger_dropped_objects: Add name/args output columns
Alvaro Herrera wrote: pg_event_trigger_dropped_objects: Add name/args output columns These columns can be passed to pg_get_object_address() and used to reconstruct the dropped objects identities in a remote server containing similar objects, so that the drop can be replicated. This is causing buildfarm member dunlin to fail: == pgsql.build/src/test/regress/regression.diffs === *** /buildfarm/root/HEAD/pgsql.build/src/test/regress/expected/privileges.out Tue Dec 30 13:05:24 2014 --- /buildfarm/root/HEAD/pgsql.build/src/test/regress/results/privileges.out Tue Dec 30 13:10:49 2014 *** *** 1323,1328 --- 1323,1329 DROP SCHEMA testns CASCADE; NOTICE: drop cascades to table testns.acltest1 + ERROR: requested object address for object type that cannot support it SELECT d.* -- check that entries went away FROM pg_default_acl d LEFT JOIN pg_namespace n ON defaclnamespace = n.oid WHERE nspname IS NULL AND defaclnamespace != 0; No other member so far has reported a problem here. Not sure if that's the strangest bit, or the fact that dunlin is reporting anything in the first place. I can reproduce the problem quite simply by creating an event trigger on sql_drop and then try to drop an object not supported by getObjectIdentityParts -- in this case it's a default ACL for tables in the schema being dropped. My hope was that this check would help us detect new object types that did not have a working get_object_address representation, but since there are already cases that don't have it, it's currently bogus. We could add the default ACL support without too much effort, I think, but it would take even less effort to just remove the check. Another possible option is to mark ALTER DEFAULT PRIVILEGES as an unsupported command in event triggers. I'm not sure why we have it, particularly since we don't have GRANT and REVOKE ... but then my DDL deparsing patch is supposed to add support for GRANT and REVOKE, so I'm not really proposing this. Any strong opinions? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK
On 12/30/2014 09:20 AM, Tom Lane wrote: Bernd Helmle maili...@oopsware.de writes: --On 29. Dezember 2014 12:55:11 -0500 Tom Lane t...@sss.pgh.pa.us wrote: Given the lack of previous complaints, this probably isn't backpatching material, but it sure seems like a bit of attention to consistency would be warranted here. Now that i read it i remember a client complaining about this some time ago. I forgot about it, but i think there's value in it to backpatch. Hm. Last night I wrote the attached draft patch, which I was intending to apply to HEAD only. The argument against back-patching is basically that this might change the interpretation of scripts that had been accepted silently before. For example \set ECHO_HIDDEN NoExec will now select noexec mode whereas before you silently got on mode. In one light this is certainly a bug fix, but in another it's just definitional instability. If we'd gotten a field bug report we might well have chosen to back-patch, though, and perhaps your client's complaint counts as that. Opinions anyone? r I got caught by this with ON_ERROR_ROLLBACK on 9.3 just this afternoon before remembering this thread. So there's a field report :-) +0.75 for backpatching (It's hard to imagine someone relying on the bad behaviour, but you never know). cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: pg_event_trigger_dropped_objects: Add name/args output columns
Alvaro Herrera alvhe...@2ndquadrant.com writes: Alvaro Herrera wrote: pg_event_trigger_dropped_objects: Add name/args output columns This is causing buildfarm member dunlin to fail: ... No other member so far has reported a problem here. Not sure if that's the strangest bit, or the fact that dunlin is reporting anything in the first place. I can reproduce the problem quite simply by creating an event trigger on sql_drop and then try to drop an object not supported by getObjectIdentityParts -- in this case it's a default ACL for tables in the schema being dropped. But is there any reason to think the failure on dunlin has anything to do with default ACLs? I think you'd better work on understanding why there is a platform dependency here, before you consider either removing the regression test case or adding support for object types that it shouldn't be hitting. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Failure on markhor with CLOBBER_CACHE_ALWAYS for test brin
Hi all. markhor has run for the first time in 8 days, and something in the range e703261..72dd233 is making the brin regression test crash. See here: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-30%2020%3A58%3A49 Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] POLA violation with \c service=
On Wed, Dec 17, 2014 at 08:14:04AM -0500, Andrew Dunstan wrote: On 12/17/2014 04:11 AM, Heikki Linnakangas wrote: On 12/17/2014 10:03 AM, Albe Laurenz wrote: David Fetter wrote: I've noticed that psql's \c function handles service= requests in a way that I can only characterize as broken. This came up in the context of connecting to a cloud hosting service named after warriors or a river or something, whose default hostnames are long, confusing, and easy to typo, so I suspect that service= may come up more often going forward than it has until now. For example, when I try to use \c service=foo It will correctly figure out which database I'm trying to connect to, but fail to notice that it's on a different host, port, etc., and hence fail to connect with a somewhat unhelpful error message. I can think of a few approaches for fixing this: 0. Leave it broken. 1. Disable service= requests entirely in \c context, and error out if attempted. 2. Ensure that \c actually uses all of the available information. Is there another one I missed? If not, which of the approaches seems reasonable? #2 is the correct solution, #1 a band aid. It would be handy, if \c service=foo actually worked. We should do #3. If the database name is actually a connection string, or a service specification, it should not re-use the hostname and port from previous connection, but use the values from the connection string or service file. Yeah, that's the correct solution. It should not be terribly difficult to create a test for a conninfo string in the dbname parameter. That's what libpq does after all. We certainly don't want psql to have to try to interpret the service file. psql just needs to let libpq do its work in this situation. This took a little longer to get time to polish than I'd hoped, but please find attached a patch which: - Correctly connects to service= and postgres(ql)?:// with \c - Disallows tab completion in the above cases I'd like to see about having tab completion actually work correctly in at least the service= case, but that's a matter for a follow-on patch. Thanks to Andrew Dunstan for the original patch, and to Andrew Gierth for his help getting it into shape. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index c7a17d7..4ca50b3 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -1610,6 +1610,8 @@ do_connect(char *dbname, char *user, char *host, char *port) PGconn *o_conn = pset.db, *n_conn; char *password = NULL; + boolkeep_password = true; + boolhas_connection_string = false; if (!o_conn (!dbname || !user || !host || !port)) { @@ -1623,14 +1625,32 @@ do_connect(char *dbname, char *user, char *host, char *port) return false; } - if (!dbname) - dbname = PQdb(o_conn); if (!user) user = PQuser(o_conn); + else if (strcmp(user, PQuser(o_conn)) != 0) + keep_password = false; + if (!host) host = PQhost(o_conn); + else if (strcmp(host, PQhost(o_conn)) != 0) + keep_password = false; + if (!port) port = PQport(o_conn); + else if (strcmp(port, PQport(o_conn)) != 0) + keep_password = false; + + has_connection_string = recognized_connection_string(dbname); + + if (has_connection_string) + keep_password = false; + + /* +* Unlike the previous stanzas, changing only the dbname shouldn't +* trigger throwing away the password. 
+*/ + if (!dbname) + dbname = PQdb(o_conn); /* * If the user asked to be prompted for a password, ask for one now. If @@ -1646,9 +1666,13 @@ do_connect(char *dbname, char *user, char *host, char *port) { password = prompt_for_password(user); } - else if (o_conn user strcmp(PQuser(o_conn), user) == 0) + else if (o_conn keep_password) { - password = pg_strdup(PQpass(o_conn)); + password = PQpass(o_conn); + if (password *password) + password = pg_strdup(password); + else + password = NULL; } while (true) @@ -1656,23 +1680,28 @@ do_connect(char *dbname, char *user, char *host, char *port) #define PARAMS_ARRAY_SIZE 8 const char **keywords = pg_malloc(PARAMS_ARRAY_SIZE * sizeof(*keywords)); const char **values = pg_malloc(PARAMS_ARRAY_SIZE * sizeof(*values)); +
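As a rough illustration of the test being discussed above (deciding whether the dbname argument to \c should be handed to libpq as a conninfo string rather than a plain database name), here is a small Python sketch. It mirrors the rule libpq applies when it expands the dbname parameter, but the function name and the example strings are made up; the actual patch does this in C via recognized_connection_string().

def looks_like_connection_string(dbname):
    # Treat the argument as a conninfo string if it uses URI syntax or
    # contains a keyword=value pair.
    return (dbname.startswith("postgresql://")
            or dbname.startswith("postgres://")
            or "=" in dbname)

# All of these should bypass the "reuse previous host/port/password" path:
for s in ("service=foo",
          "postgresql://db.example.com/mydb",
          "host=10.0.0.1 dbname=mydb"):
    assert looks_like_connection_string(s)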
Re: [HACKERS] Failure on markhor with CLOBBER_CACHE_ALWAYS for test brin
Michael Paquier wrote: Hi all. markhor has run for the first time in 8 days, and something in the range e703261..72dd233 is making the brin regression test crash. See here: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-30%2020%3A58%3A49 This shows that the crash was in the object_address test, not brin. Will research. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Dec 29, 2014 at 11:52 PM, Jeff Janes jeff.ja...@gmail.com wrote: Correct, I haven't seen any problems with approach #1 That helps with debugging #2, then. That's very helpful. Generally the problem will occur early on in the process, and if not then it will not occur at all. I think that is because the table starts out empty, and so a lot of insertions collide with each other. Once the table is more thoroughly populated, most query takes the CONFLICT branch and therefore two insertion-branches are unlikely to collide. At its simplest, I just use the count_upsert.pl script and your patch and forget all the rest of the stuff from my test platform. I can reproduce this on my laptop now. I think that building at -O2 and without assertions helps. I'm starting to work through debugging it. I threw together a quick script for getting pg_xlogdump into a Postgres table (a nice use of the new pg_lsn type). It's here: https://github.com/petergeoghegan/jjanes_upsert/blob/master/pg_xlogdump2csv.py It tells a story. Looking at the last segment before shutdown when the problem occurred, I see: postgres=# select count(*), tx from my_xlogdump group by tx having count(*) 4 order by 1; count | tx ---+- 5 | 1917836 5 | 1902576 5 | 1909746 5 | 1901586 5 | 1916971 6 | 1870077 39 | 1918004 119 | 1918003 2246 | 0 (9 rows) postgres=# select max(tx::text::int4) from my_xlogdump ; max - 1918004 (1 row) So the last two transactions (1918003 and 1918004) get into some kind of live-lock situation, it looks like. Or at least something that causes them to produce significant more WAL records than other xacts due to some kind of persistent problem with conflicts. Here is where the earlier of the two problematic transactions has its first record: postgres=# select * from my_xlogdump where tx = '1918003' order by r_lsn asc limit 1; rmgr | len_rec | len_tot | tx| r_lsn| prev_lsn | descr --+-+-+-+++ Heap | 3 | 203 | 1918003 | 0/1783BB70 | 0/1783BB48 | INSERT off 33 blkref #0: rel 1663/16471/12502 blk (1 row) After and including that record, until the trouble spot up to and including shutdown, here is the rmgr breakdown: postgres=# select count(*), rmgr from my_xlogdump where r_lsn = '0/1783BB70' group by rmgr order by 1; count |rmgr ---+- 1 | XLOG -- 1 CHECKPOINT_SHUTDOWN record 2 | Transaction -- commit records for the two xacts 20 | Heap2-- all are CLEAN remxid records, tx is 0 76 | Heap -- All from our two xacts... 80 | Btree -- All from XID 1918003 only (5 rows) So looks like a bad interaction with VACUUM. Maybe it's a problem with VACUUM interlocking. That was my first suspicion, FWIW. I'll need to do more investigating, but I can provide a custom format dump of the table, in case anyone wants to look at what I have here in detail. I've uploaded it to: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/files/my_xlogdump.custom.dump.table -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Additional role attributes superuser review
On Tue, Dec 30, 2014 at 6:35 PM, Stephen Frost sfr...@snowman.net wrote: * Amit Kapila (amit.kapil...@gmail.com) wrote: On Tue, Dec 30, 2014 at 6:52 AM, Stephen Frost sfr...@snowman.net wrote: another could be have a separate privilege (DBA) or a role which is not a superuser, however can be used to perform such tasks. I'm pretty sure we've agreed that having a catch-all role attribute like 'DBA' is a bad idea because we'd likely be adding privileges to it later which would expand the rights of users with that attribute, possibly beyond what is intended. Yes, that could happen but do you want to say that is the only reason to consider server level roles (such as DBA) a bad idea. Can't we make users aware of such things via documentation and then there will be some onus on user's to give such a privilege with care. On looking around, it seems many of the databases provides such roles https://dbbulletin.wordpress.com/2013/05/29/backup-privileges-needed-to-backup-databases/ In the link, though they are talking about physical (file level) backup, there is mention about such roles in other databases. The other way as discussed on this thread is to use something like DUMPALL (or DUMPFULL) privilege which also sounds to be a good way, apart from the fact that the same privilege can be used for non-dump purposes to extract the information from database/cluster. I don't really like 'xlog' as a permission. BINARY BACKUP is interesting but if we're going to go multi-word, I would think we would do PITR BACKUP or something FILESYSTEM BACKUP or similar. I'm not really a big fan of the two-word options though. How about PHYSICAL BACKUP (for basebackup) and LOGICAL BACKUP for pg_dump? Again, this makes it look like the read-all right would be specific to users executing pg_dump, which is incorrect and misleading. PHYSICAL might also imply the ability to do pg_basebackup. That's right, however having unrelated privileges for doing physical backup via pg_basebackup and pg_start*/pg_stop* method also doesn't sound to be the best way (can be slightly difficult for users to correlate after reading docs). Don't you think in this case we should have some form of hierarchy for privileges (like user having privilege to do pg_basebackup can also perform the backup via pg_start*/pg_stop* method or some other way to relate both forms of physical backup)? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On Fri, Dec 26, 2014 at 7:57 AM, David Rowley dgrowle...@gmail.com wrote: 1. Do we need to keep the 128 byte aggregate state size for machines without 128 bit ints? This has been reduced to 48 bytes in the patch, which is in favour code being compiled with a compiler which has 128 bit ints. I kind of think that we need to keep the 128 byte estimates for compilers that don't support int128, but I'd like to hear any counter arguments. I think you're referring to the estimated state size in pg_aggregate here, and I'd say it's probably not a big deal one way or the other. Presumably, at some point, 128-bit arithmetic will become common enough that we'll want to change that estimate, but I don't know whether we've reached that point or not. 2. References to int16 meaning 16 bytes. I'm really in two minds about this, it's quite nice to keep the natural flow, int4, int8, int16, but I can't help think that this will confuse someone one day. I think it'll be a long time before it confused anyone if we called it int128 instead, but I'm not that excited about seeing it renamed either. I'd like to hear what others have to say... Is there a chance that some sql standard in the distant future will have HUGEINT and we might regret not getting the internal names nailed down? Yeah, I think using int16 to mean 16-bytes will be confusing to someone almost immediately. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] orangutan seizes up during isolation-check
On Sun, Dec 28, 2014 at 4:58 PM, Noah Misch n...@leadboat.com wrote: I wondered whether to downgrade FATAL to LOG in back branches. Introducing a new reason to block startup is disruptive for a minor release, but having the postmaster deadlock at an unpredictable later time is even more disruptive. I am inclined to halt startup that way in all branches. Jeepers. I'd rather not do that. From your report, this problem has been around for years. Yet, as far as I know, it's bothering very few real users, some of whom might be far more bothered by the postmaster suddenly failing to start. I'm fine with a FATAL in master, but I'd vote against doing anything that might prevent startup in the back-branches without more compelling justification. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
2014-12-12 14:58 GMT+01:00 Heikki Linnakangas hlinnakan...@vmware.com: On 12/10/2014 04:32 PM, Dennis Kögel wrote: Hi, Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais j...@dalibo.com: Since few months, we occasionally see .ready files appearing on some slave instances from various context. The two I have in mind are under 9.2.x. […] So it seems for some reasons, these old WALs were forgotten by the restartpoint mechanism when they should have been recylced/deleted. Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas hlinnakan...@vmware.com: 1. Where do the FF files come from? In 9.2, FF-segments are not supposed to created, ever. […] 2. Why are the .done files sometimes not being created? We’ve encountered behaviour which seems to match what has been described here: On Streaming Replication slaves, there is an odd piling up of old WALs and .ready files in pg_xlog, going back several months. The fine people on IRC have pointed me to this thread, and have encouraged me to revive it with our observations, so here we go: Environment: Master, 9.2.9 |- Slave S1, 9.2.9, on the same network as the master '- Slave S2, 9.2.9, some 100 km away (occassional network hickups; *not* a cascading replication) wal_keep_segments M=100 S1=100 S2=30 checkpoint_segments M=100 S1=30 S2=30 wal_level hot_standby (all) archive_mode on (all) archive_command on both slaves: /bin/true archive_timeout 600s (all) - On both slaves, we have „ghost“ WALs and corresponding .ready files (currently 600 of each on S2, slowly becoming a disk space problem) - There’s always gaps in the ghost WAL names, often roughly 0x20, but not always - The slave with the „bad“ network link has significantly more of these files, which suggests that disturbances of the Streaming Replication increase chances of triggering this bug; OTOH, the presence of a name gap pattern suggests the opposite - We observe files named *FF as well As you can see in the directory listings below, this setup is *very* low traffic, which may explain the pattern in WAL name gaps (?). I’ve listed the entries by time, expecting to easily match WALs to their .ready files. There sometimes is an interesting delay between the WAL’s mtime and the .ready file — especially for *FF, where there’s several days between the WAL and the .ready file. - Master: http://pgsql.privatepaste.com/52ad612dfb - Slave S1: http://pgsql.privatepaste.com/58b4f3bb10 - Slave S2: http://pgsql.privatepaste.com/a693a8d7f4 I’ve only skimmed through the thread; my understanding is that there were several patches floating around, but nothing was committed. If there’s any way I can help, please let me know. Yeah. It wasn't totally clear how all this should work, so I got distracted with other stuff an dropped the ball; sorry. I'm thinking that we should change the behaviour on master so that the standby never archives any files from older timelines, only the new one that it generates itself. That will solve the immediate problem of old WAL files accumulating, and bogus .ready files appearing in the standby. However, it will not solve the bigger problem of how do you ensure that all WAL files are archived, when you promote a standby server. There is no guarantee on that today anyway, but this will make it even less reliable, because it will increase the chances that you miss a file on the old timeline in the archive, after promoting. I'd argue that that's a good thing; it makes the issue more obvious, so you are more likely to encounter it in testing, and you won't be surprised in an emergency. 
But I've started a new thread on that bigger issue, hopefully we'll come up with a solution ( http://www.postgresql.org/message-id/548af1cb.80...@vmware.com). Now, what do we do with the back-branches? I'm not sure. Changing the behaviour in back-branches could cause nasty surprises. Perhaps it's best to just leave it as it is, even though it's buggy. As long as master is fixed, I don't actually care. But I agree with Dennis that it's hard to see what's been commited with all the different issues found, and if any commits were done, in which branch. I'd like to be able to tell my customers: update to this minor release to see if it's fixed, but I can't even do that. -- Guillaume. http://blog.guillaume.lelarge.info http://www.dalibo.com
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
2014-12-30 18:45 GMT+01:00 Jeff Janes jeff.ja...@gmail.com: On Tue, Dec 30, 2014 at 12:35 AM, Guillaume Lelarge guilla...@lelarge.info wrote: Sorry for my very late answer. It's been a tough month. 2014-11-27 0:00 GMT+01:00 Bruce Momjian br...@momjian.us: On Mon, Nov 3, 2014 at 12:39:26PM -0800, Jeff Janes wrote: It looked to me that the formula, when descending from a previously stressed state, would be: greatest(1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 1 + 2 * checkpoint_segments + 1 I don't think we can assume checkpoint_completion_target is at all reliable enough to base a maximum calculation on, assuming anything above the maximum is cause of concern and something to inform the admins about. Assuming checkpoint_completion_target is 1 for maximum purposes, how about: max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2 Seems something I could agree on. At least, it makes sense, and it works for my customers. Although I'm wondering why + 2, and not + 1. It seems Jeff and you agree on this, so I may have misunderstood something. From hazy memory, one +1 comes from the currently active WAL file, which exists but is not counted towards either wal_keep_segments nor towards recycled files. And the other +1 comes from the formula for how many recycled files to retain, which explicitly has a +1 in it. OK, that seems much better. Thanks, Jeff. -- Guillaume. http://blog.guillaume.lelarge.info http://www.dalibo.com
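To make the agreed bound concrete, a minimal Python sketch of the formula discussed above; the sample settings are hypothetical.

def max_pg_xlog_files(checkpoint_segments, wal_keep_segments):
    # Upper bound on WAL files in pg_xlog: treat checkpoint_completion_target
    # as 1 for maximum purposes, then add one file for the currently active
    # segment and one for the explicit +1 in the recycled-file formula.
    return (max(2 * checkpoint_segments, wal_keep_segments)
            + 2 * checkpoint_segments + 2)

# Hypothetical settings: checkpoint_segments = 100, wal_keep_segments = 100
print(max_pg_xlog_files(100, 100))   # max(200, 100) + 200 + 2 = 402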