[GENERAL] Feature request: separate logging
A glaring weakness in PostgreSQL for production systems is that the administrator has no way of controlling which types of logs go where. There are at least two types of logs: errors and statement logs. (I could also add: connections, syntax errors, query durations, audit.) It has become increasingly important in the Linux world, with its over-reliance on systemd, that we admins be able to distinguish between these log outputs.

Systemd wants to control all the daemons and services run on a Linux host. To do this effectively, it needs to capture the stdout and stderr, or possibly parse the external logfile, of the daemons it spawns. The benefits of systemd's journald subsystem include being able to automatically identify daemons that are failing to start properly. (The merits of systemd are beyond the scope of this discussion; it is now the all-but-ubiquitous standard of Linux distributions, and has become nearly unavoidable.)

The PostgreSQL startup process could greatly benefit from systemd. As it is now, PGDG distributes PostgreSQL with init scripts which really do very little to check whether PostgreSQL bombed shortly after startup. I have improved upon that script to do some smart detection of a failed startup, but the result is having two different files to monitor. On the one hand, I want to use PostgreSQL's built-in file or CSV logging. On the other, I need to check the output of the startup process. These logs have different formats, especially for the smart DBA who wants to customize the log prefix with a more usable timestamp and other fields.

Logging to external files is nice because PostgreSQL rotates the logs for us automatically. (External log rotation is problematic and risky because it requires a configuration reload, which may lead to undesirable side-effects. Right?) One alternative is to capture everything into the local logging system, or to send it all to stdout/stderr and capture this with journald (or runit's svlogd or something).
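For reference, the built-in file/CSV logging and automatic rotation mentioned above is configured with settings along these lines (a sketch, not a complete postgresql.conf; all of these parameters exist in the 9.x releases):

```ini
# postgresql.conf -- built-in logging with automatic rotation
logging_collector = on
log_destination = 'csvlog'           # CSV server log; stderr is also captured by the collector
log_directory = 'pg_log'
log_filename = 'postgresql-%a.log'   # one file per weekday
log_rotation_age = 1d
log_truncate_on_rotation = on        # reuse last week's file instead of appending
log_line_prefix = '%m [%p] %u@%d '   # timestamp, pid, user@db -- still fewer fields than the CSV log carries
```

Note that log_line_prefix applies only to the stderr/file format; the csvlog column set is fixed, which is part of the complaint here.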
But then you have the following problem: if I enable statement-level logging or set log_min_duration_statement=0, the in-memory journald will quickly be overrun with statement logs. The journald subsystem will then become useless to the other daemons and subsystems. One solution is proposed in these forums: https://www.postgresql.org/message-id/flat/etPan.53397e77.643c9869.1a0%40palos#etPan.53397e77.643c9869.1a0@palos While pgbadger has some nice features, it doesn't really solve the problem of allowing PostgreSQL to be used with systemd+journald.

What I do today is to configure PostgreSQL to write csvlogs. Stdout/stderr are captured by journald. A custom Perl script with the Text::CSV module and tail -F semantics continuously processes the csvlog file, ignores query, DML, and detail log lines, and sends the rest via syslog() (which journald then handles). It's not the right way.

I would like to see PostgreSQL have the ability to:

1. Write to a csvlog with one set of selectors.
2. Write to stdout/stderr with a different set of selectors (no statements, no autovacuum, etc.) using a purely line-oriented output that
   2.1. has the kind of detail contained in the CSV. Currently, the log-prefix option does not offer some of the information provided in the CSV logs. Really, the CSV log should simply be an implementation of the log prefix.
   2.2. collapses multi-line queries into one line (newlines and tabs escaped with backslashes or the \x1B character).

Finally, if these changes can be implemented, is it impossible to backport them to prior versions, say 9.1 and up? If I wrote a patch, under what conditions would the patch be accepted for inclusion in official releases of older versions?

-- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at
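The filtering step of the workaround described above (drop statement/duration/detail noise from the csvlog before forwarding the rest to syslog) can be sketched in a few lines. This is an illustration, not the actual Perl script; the column indexes assume the documented 9.x csvlog layout (error_severity is column 12, message is column 14, zero-indexed 11 and 13) and should be checked against your server version:

```python
import csv

# Zero-based columns in the 9.x csvlog format (assumption -- verify per version)
SEVERITY_COL = 11
MESSAGE_COL = 13

# Message classes to drop before forwarding to syslog/journald
NOISY_PREFIXES = ("statement: ", "duration: ", "automatic vacuum", "automatic analyze")

def filter_csvlog(lines):
    """Yield (severity, message) for csvlog rows that are not statement noise."""
    for row in csv.reader(lines):
        if len(row) <= MESSAGE_COL:
            continue  # partial line from tail -F; skip it
        severity, message = row[SEVERITY_COL], row[MESSAGE_COL]
        if severity == "DETAIL":
            continue
        if any(message.startswith(p) for p in NOISY_PREFIXES):
            continue
        yield severity, message
```

In the real setup, the surviving lines would then be handed to syslog(), which journald picks up.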
Re: [GENERAL] Unexpected trouble from pg_basebackup
After a 3-to-4-minute delay, pg_basebackup started doing its thing and finished within a few minutes. So now the question is: why the startup delay?
[GENERAL] Unexpected trouble from pg_basebackup
I recently updated my systems from PG 9.1.8 to 9.5.3. A pg_dumpall was used to migrate the data. Now I'm trying to re-establish replication between master and slave, and I'm getting stuck.

When I run pg_basebackup (via a script which worked flawlessly on 9.1.8, AND via the command line, a la "manual mode"), it creates the PGDATA directory structure and some files, but soon gets stuck. It writes fewer than 40 MB of a 20 GB database. Further, using tcpdump to watch packets on the wire, I confirm no data is going across. However, it's clear a connection has been made and the process was started. On the master, I see in the process table:

postgres: wal sender process pgsync A.B.C.D(42821) sending backup

Meanwhile, on the client, I see no output updates. Here's the command:

pg_basebackup -x -P -v -D $PGDATA -w

PGUSER, PGPASSWORD, and PGHOST are set accordingly. The user has the REPLICATION attribute. Just in case, I dropped and re-created the user. So that's question A.

Question B is related. In attempting to verify that the permissions and HBA were set correctly, I attempted a pg_dump using the same PGUSER and PGHOST. What I got surprised me:

$ pg_dump -s onyxeditor
pg_dump: [archiver (db)] query failed: ERROR: permission denied for relation licence
pg_dump: [archiver (db)] query was: LOCK TABLE public.licence IN ACCESS SHARE MODE

This is the replication user. Other schemas worked just fine, but not this one. Is this expected behavior? Kind regards, Otheus -- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at
Re: [GENERAL] How to Qualifying or quantify risk of loss in asynchronous replication
On Wednesday, March 16, 2016, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:

> In asynchronous replication, the primary writes to the WAL and flushes the disk. Then, for any standbys that happen to be connected, a WAL sender process trundles along behind feeding new WAL down the socket as soon as it can, but it can be running arbitrarily far behind or not running at all (the network could be down or saturated, the standby could be temporarily down or up but not reading the stream fast enough, etc etc).

Thanks for your help on finding the code. To be more precise, in the 9.1.8 code, I see this:

1. [backend] WAL is flushed to disk
2. [backend] WAL senders are sent SIGUSR1 to wake up
3. [backend] waits for responses from any SyncRep receivers, effectively skipped if none; [wal-sender] wakes up
4. [backend] end-of-xact cycle; [wal-sender] reads WAL (XLogRead) up to MAX_SEND_SIZE (or fewer) bytes
5. [backend] ? is there an ACK sent to the client? [wal-sender] sends the chunk to the WAL receiver using the pq_putmessage_noblock call
6. [wal-sender] repeats the reading-sending loop

So if the WAL record is bigger than whatever MAX_SEND_SIZE is (in my source, I see 8 kB * 16 = 128 kB, roughly 1 Mbit), the wal-sender may end up sleeping between iterations of 5 and 6.

On Wed, Mar 16, 2016 at 10:21 AM, otheus uibk <otheus.u...@gmail.com> wrote:

> Section 25.2.5: "The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled."
> Section 25.2.6: "If the primary server crashes then some transactions that were committed may not have been replicated to the standby server, causing data loss. The amount of data loss is proportional to the replication delay at the time of failover."

Both these statements, then, from the documentation perspective, are incorrect, at least to a pedant.
For 25.2.5: the primary streams WAL records to the standby after they've been flushed to disk, but without waiting for the WAL file to be filled. For 25.2.6 it's not clear: it is possible that some transactions were *written* to the local WAL and reported as committed but not yet *sent* to the standby server. Somehow, the documentation misled (me) into believing the async replication algorithm at least guarantees WAL records are *sent* before responding "committed" to the client. I now know this is not the case. *grumble*. How can I help make the documentation clearer on this point? -- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at
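The reading-sending loop in steps 4-6 above can be modeled roughly as follows. This is a simplified sketch for intuition, not the actual walsender code; the chunk size matches the 9.1-era constant (XLOG_BLCKSZ * 16), and the function and parameter names are illustrative:

```python
XLOG_BLCKSZ = 8192
MAX_SEND_SIZE = XLOG_BLCKSZ * 16  # 128 kB per iteration in the 9.1 source

def send_wal(sent_upto, flush_ptr, send_chunk):
    """Model of the walsender loop: push WAL in MAX_SEND_SIZE chunks
    until the send position catches up with the flushed WAL position."""
    iterations = 0
    while sent_upto < flush_ptr:
        n = min(MAX_SEND_SIZE, flush_ptr - sent_upto)
        send_chunk(sent_upto, n)  # pq_putmessage_noblock in the real code
        sent_upto += n
        iterations += 1
    return sent_upto, iterations
```

So a 300 kB WAL record takes three iterations of the loop, with sleeps possible between them -- which is the gap the thread is asking about.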
Re: [GENERAL] How to Qualifying or quantify risk of loss in asynchronous replication
Apologies for the double-reply... This is to point out the ambiguity between the example you gave and the stated documentation. On Wednesday, March 16, 2016, Thomas Munro wrote:

> Waiting for the transaction to be durably stored (flushed to disk) on two servers before COMMIT returns means that you can avoid this situation:
> 1. You commit a transaction, and COMMIT returns as soon as the WAL is flushed to disk on the primary.
> 2. You communicate a fact based on that transaction to a third party ("Thank you Dr Bowman, you are booked in seat A4, your reservation number is JUPITER123").
> 3. Your primary computer is destroyed by a meteor, and its WAL sender hadn't yet got around to sending that transaction to the standby.

Section 25.2.5: "The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled." This suggests that the record is on the network stack possibly before a flush to disk.

Section 25.2.6: "If the primary server crashes then some transactions that were committed may not have been replicated to the standby server, causing data loss. The amount of data loss is proportional to the replication delay at the time of failover." Whence this replication delay? If the standby server is caught up and streaming asynchronously, what delays *in receiving* might there be other than network delays?

Note: I am totally unconcerned with the possibility that both primary and standby go down at the same time. -- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at
Re: [GENERAL] How to Qualifying or quantify risk of loss in asynchronous replication
Thomas, thanks for your input... but I'm not quite getting the answer I need.

>> But what precisely is the algorithm and timing involved with streaming WALs?
>> Is it:
>> * client issues COMMIT
>> * master receives commit
>> * master processes transaction internally
>> * master creates WAL record
>> | master appends WAL to local WAL file, flushes disk
>> | master sends WAL record to all streaming clients
>> * master sends OK to client
>> * master applies WAL
>>
>> So is this correct? Is it correct to say: PG async guarantees that the WAL is *sent* to the receivers, but not that it is received, before the client receives acknowledgement?
>
> Async replication doesn't guarantee anything at all about receivers, or even that there is one right at this moment. Did you mean to write "synchronous" instead of "asynchronous"?

I'm only concerned with async (for this thread).

> In asynchronous replication, the primary writes to the WAL and flushes the disk. Then, for any standbys that happen to be connected, a WAL sender process trundles along behind feeding new WAL down the socket as soon as it can, but it can be running arbitrarily far behind or not running at all (the network could be down or saturated, the standby could be temporarily down or up but not reading the stream fast enough, etc etc).

This is the *process* I want more detail about. The question is the same as above: (is it true that) PG async guarantees that the WAL is *sent* to the receivers, but not that it is received, before the client receives acknowledgement? But I will refine what I mean by "sent": does PostgreSQL write the WAL to the socket and flush the socket before acknowledging the transaction to the client? Does it *always* do this? Or does it make a best effort? Or do the write to the socket and the return to the client happen asynchronously?
I realize that the data might not be *seen* at the client, I realize network buffers may take time to reach the network, and I realize various levels of synchronous replication provide higher guarantees. But for the purposes of this topic, I'm interested to know what PG actually does. I can't tell that from the documentation (because it is not clearly stated and because it is self-contradictory). -- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at
Re: [GENERAL] Replaying xlogs from beginning
> You're assuming that the only significant aspect of initdb's output that can vary from run to run is the database system ID.

I prefer to call it "optimistic prediction". But yes. :)

> If you're lucky this technique will work, but it's not reliable and not supported. You really need to take an initial base backup after running initdb.

Thanks, but in retrospect it's too late for that. Since that is the case, to whom can I make a feature request that initdb have an implicitly enabled option to make this base backup?

OK, back to the question. Provided the system is the same, the PostgreSQL executable and libraries are identical, and the locales are unchanged, we have:

1. Bitwise variances in the system catalog -- can be solved with David's suggestion of a full dump/restore after the steps I listed. Confer?
2. Other things you couldn't think of yet -- anything come to mind yet?

-- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at
Re: [GENERAL] Replaying xlogs from beginning
I came up with an answer to the _second_ question (how do I do this from a new instance?). In the new instance directory:

1. Hack the system ID in the global/pg_control file to that of the original instance.
   1a. Use pg_controldata to get the hex version of the system ID:
   $ pg_controldata | perl -l -n -e 'if (/^Database system identifier:\s+(\d+)$/) { $sysid=sprintf("%x",$1); print join(" ", reverse $sysid=~m/../g); exit; }'
   fb fb 59 56 b9 31 58 53
   1b. Use a hex editor (vim with xxd / xxd -r will do) to replace the first bytes with these values.
2. The new control file will be unusable because its CRC won't match. Fix that with pg_resetxlog:
   $ pg_resetxlog -f
3. Copy the very first pg_xlog file into place. The rest can be copied by hand or found using whatever you use for the restore_command.
4. Create a recovery.conf file. Start the instance.

Again, this worked for me. What I want to be sure of is: does this really work? And I still want to know: can I do this without creating a new instance?
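The byte reversal in step 1a reflects the fact that the 64-bit system identifier is stored little-endian at the start of pg_control on x86. The same conversion the Perl one-liner performs can be sketched without regexes (an illustration; the function name is mine):

```python
def sysid_to_le_bytes(sysid: int) -> str:
    """Render the decimal database system identifier as the little-endian
    byte sequence that appears at the start of global/pg_control on x86."""
    return " ".join(f"{b:02x}" for b in sysid.to_bytes(8, "little"))
```

Feed it the decimal "Database system identifier" from pg_controldata, and the output is the byte string to patch in with the hex editor.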
[GENERAL] Replaying xlogs from beginning
I'm looking for answers to this question, but so far haven't turned up a usable answer. Perhaps I'm asking it the wrong way. I want to replay the xlogs from the beginning of time up until a particular time. The problem is, the time is before the first base backup, but I have all the xlogs since the database was initialized. To test this, I initialized a new DB, put a few xlogs in its pg_xlog directory, and created a simple recovery.conf ("restore_command=false"). When I start it up, I get something like: "DETAIL: WAL file database system identifier is 6221786353392811102, pg_control database system identifier is 6252279422905597461." The most important question is: how do I recover to a point in time, starting from initialization, without a base backup, provided I have all the xlog files? The secondary question is: how can I test this with a new instance? -- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at
[GENERAL] pg_start_backup does not actually allow for consistent, file-level backup
In the manual and on this mailing list, the claim is made that consistent, file-level backups may be made by bracketing the file-copy operation with the pg_start_backup and pg_stop_backup operations. Many people, including myself, have found that in some circumstances, using tar to copy these files will result in an error if one of the data files changes during the tar operation. The responses to those queries on this mailing list are unsatisfactory (everything is fine, trust us). For example:

http://www.postgresql.org/message-id/flat/aanlktikg70ip91tpuwvuf9gb0xts4fukakm4qxltw...@mail.gmail.com#aanlktikg70ip91tpuwvuf9gb0xts4fukakm4qxltw...@mail.gmail.com

and

http://www.postgresql.org/message-id/flat/201004201350.o3kdokr02...@momjian.us#201004201350.o3kdokr02...@momjian.us

(quoted below)

On Fri, Apr 16, 2010 at 11:55 AM, raghavendra t raagavendra(dot)rao(at)gmail(dot)com wrote:

> Hi All, for some setup reasons I started taking a hot backup. In this course I first issued pg_start_backup('backup') and went to the data directory to back it up in OS format using the command tar -cf backup.tar /data. When I issued this command, tar generated some errors as shown below:
>
> bash-3.00# tar -cf 16aprilstandby.tar /db-data/
> tar: Removing leading `/' from member names
> tar: /db-data/base/24643/151667: file changed as we read it
> tar: /db-data/base/24643/151764.2: file changed as we read it
> tar: /db-data/base/24643/151766: file changed as we read it
> tar: /db-data/base/24643/151768: file changed as we read it
> tar: /db-data/base/66412/151969: file changed as we read it
>
> After some time tar ended, I issued pg_stop_backup(), and I continued the process. My question here is: are these errors generated by tar worrisome, or what is happening in the backend? Is the tar file safe to use? Could you please tell me.

Those are not errors, they are warnings.
As long as you use pg_start_backup() and pg_stop_backup() before and after the tar, they are perfectly harmless, and can be ignored.

The above scenario is exactly what I saw, albeit with less frequency and severity. I decided to test the claim that these messages are perfectly harmless and can be ignored:

1. I executed pg_start_backup() on the server.
2. Ran md5sum recursively through PG's data directories.
3. Waited a split second.
4. Ran md5sum recursively through PG's data directories, as in step 2.
5. Compared the output from #2 and #4.

As you can see below, there were non-zero changes made to these files:

a1a571bfd1e4a98b20245edbdfce6d9a /var/lib/pgsql/data/base/41514/809275
---
21de5b864019c96c55e81a38fa1c9ccf /var/lib/pgsql/data/base/41514/809275
1783c1783
8eb4a578ecb56667e1698174f89c462c /var/lib/pgsql/data/base/41514/809280
---
b4c7b4ef30dda9543181465f53a85d72 /var/lib/pgsql/data/base/41514/809280

Such changes occurred EVEN WHEN TAR DID NOT WARN of changed files. Further, when step 3 involved an actual backup, taking minutes, not milliseconds, dozens of differences to files in data/base/... were reported. To be clear, I excluded from consideration all files in pg_xlog, pg_clog, pg_subtrans, and pg_stat_tmp.

If these files are changing between pg_start_backup() and pg_stop_backup(), then exactly what is the purpose of those calls? Might the files be changing during the tar, as tar thinks? How may an operator be assured the snapshot is consistent (unless one stops the databases)? Will the redo logs restore the files to a consistent state, no matter when these files are changed? I find it hard to believe that would be the case.

This test was performed using PostgreSQL 9.1.8. A scan of the CHANGELOG since then indicates that if this is a bug, it has not been reported as fixed. -- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at
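The snapshot-and-compare procedure in steps 2-5 above can be reproduced with a short script instead of recursive md5sum plus diff. A sketch (my own illustration, not the script actually used; it applies the same exclusions listed in the post):

```python
import hashlib
import os

# Directories the post excludes from consideration
EXCLUDE = {"pg_xlog", "pg_clog", "pg_subtrans", "pg_stat_tmp"}

def snapshot_md5(datadir):
    """Map relative file path -> md5 of contents, skipping excluded dirs."""
    sums = {}
    for root, dirs, files in os.walk(datadir):
        dirs[:] = [d for d in dirs if d not in EXCLUDE]  # prune excluded dirs
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            sums[os.path.relpath(path, datadir)] = digest
    return sums

def changed_files(before, after):
    """Files present in both snapshots whose contents differ."""
    return sorted(p for p in before if p in after and before[p] != after[p])
```

Taking one snapshot after pg_start_backup(), a second one a moment later, and printing changed_files(before, after) shows the same kind of differences reported above.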
Re: [GENERAL] pg_start_backup does not actually allow for consistent, file-level backup
Thank you, all. The manual for 9.4 is indeed clearer on this point than the 9.1 version.
Re: [GENERAL] pg_start_backup does not actually allow for consistent, file-level backup
On Mon, Jun 8, 2015 at 3:13 PM, otheus uibk otheus.u...@gmail.com wrote:

> Thank you, all. The manual for 9.4 is indeed clearer on this point than the 9.1 version.

Just to nit-pick, I see nowhere in either version of the manual any indication that it is normal for PostgreSQL to continue to update files in its data directory between pg_start_backup and pg_stop_backup. The closest I see comes in this paragraph:

** Some file system backup tools emit warnings or errors if the files they are trying to copy change while the copy proceeds. When taking a base backup of an active database, this situation is normal and not an error.

Does "this situation" refer to the tools emitting warnings, or to the fact that PostgreSQL is updating the files? It might be the case, for instance, that timestamps are updated but not the contents of the files (this is what I had assumed prior to today). -- Otheus otheus.u...@gmail.com otheus.shell...@uibk.ac.at