Re: POC: Cleaning up orphaned files using undo logs
Hi Thomas,

I have started reviewing 0003-Add-undo-log-manager. I haven't reviewed the whole patch yet, but in some places I noticed that instead of UndoRecPtr you are directly using UndoLogOffset, which looks like a bug to me.

1.
+UndoRecPtr
+UndoLogAllocateInRecovery(UndoLogAllocContext *context,
+                          TransactionId xid,
+                          uint16 size,
+                          bool *need_xact_header,
+                          UndoRecPtr *last_xact_start,

+	*need_xact_header =
+		context->try_location == InvalidUndoRecPtr &&
+		slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
+	*last_xact_start = slot->meta.unlogged.last_xact_start;

The output parameter last_xact_start is of type UndoRecPtr, whereas
slot->meta.unlogged.last_xact_start is of type UndoLogOffset.
Shouldn't we use MakeUndoRecPtr(logno, offset) here?

2.
+	slot = find_undo_log_slot(logno, false);
+	if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
+	{
+		*need_xact_header = false;
+		return try_offset;
+	}

Here also you are returning try_offset directly instead of an UndoRecPtr.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
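For reference, the distinction between the two types can be sketched like this. This is a hypothetical illustration, not the patch's actual definitions: I'm assuming a 64-bit UndoRecPtr whose high bits carry the log number and whose low bits carry the offset (the exact bit split in the real patch may differ):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;
typedef uint64_t UndoLogOffset;
typedef uint32_t UndoLogNumber;

/* Hypothetical bit split: high bits = log number, low 40 bits = offset. */
#define UNDO_OFFSET_BITS 40

/* Pack a (logno, offset) pair into a single undo record pointer. */
#define MakeUndoRecPtr(logno, offset) \
	(((UndoRecPtr) (logno) << UNDO_OFFSET_BITS) | (UndoLogOffset) (offset))

/* Extract the two halves back out again. */
#define UndoRecPtrGetLogNo(ptr) \
	((UndoLogNumber) ((ptr) >> UNDO_OFFSET_BITS))
#define UndoRecPtrGetOffset(ptr) \
	((UndoLogOffset) ((ptr) & ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1)))
```

Under this model, returning a bare offset where an UndoRecPtr is expected silently drops the log number, which is why the two call sites above look suspicious.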
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila wrote:
> I have done some more review of undolog patch series and here are my comments:

Hi Amit,

Thanks! There are a number of actionable changes in your review. I'll be posting a new patch set soon that addresses most of your complaints individually. In this message I want to respond to one topic area, because the answer is long enough already:

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }
>
> +void
> +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> +{
> + char path[MAXPGPATH];
> + FileTag tag;
> +
> + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> +
> + /* Try to send to the checkpointer, but if out of space, do it here. */
> + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
>
> The comment in allocate_empty_undo_segment indicates that the code
> wants to flush before checkpoint, but the actual function tries to
> register the request with checkpointer.  Shouldn't this be similar to
> XLogFileInit where we use pg_fsync to flush the contents immediately?
> I guess that will avoid what you have written in comments in the same
> function (we just want to make sure that the filesystem has allocated
> physical blocks for it so that non-COW filesystems will report ENOSPC
> now rather than later when space is needed).  OTOH, I think it is
> performance-wise better to postpone the work to checkpointer.  If we
> want to push this work to checkpointer, then we might need to change
> comments or alternatively, we might want to use bigger segment sizes
> to mitigate the performance effect.

In an early version I was doing the fsync() immediately.
While testing zheap, Mithun CY reported that whenever segments couldn't be recycled in the background, such as during a long-running transaction, he could measure ~6% of the time being spent waiting for fsync(), and that throughput increased with bigger segments (and thus fewer files to fsync()). Passing the work off to the checkpointer is better not only because it's done in the background but also because there is a chance that the work can be consolidated with other sync requests, and perhaps even avoided completely if the file is discarded and unlinked before the next checkpoint. I'll update the comment to make that clearer.

> If my above understanding is correct and the reason to fsync
> immediately is to reserve space now, then we also need to think
> whether we are always safe in postponing the work?  Basically, if this
> means that it can fail when we are actually trying to write undo, then
> it could be risky because we could be in the critical section at that
> time.  I am not sure about this point, rather it is just to discuss if
> there are any impacts of postponing the fsync work.

Here is my theory for why this arrangement is safe, and why it differs from what we're doing with WAL segments and regular relation files. First, let's review why those things work the way they do (as I understand it):

1.  WAL's use of fdatasync(): The reason we fill and then fsync() newly created WAL files up front is that we want to make sure the blocks are definitely on disk. The comment doesn't spell out exactly why the author considered later fdatasync() calls to be insufficient, but they were: it was many years after commit 33cc5d8a4d0d that Linux ext3/4 filesystems began flushing file size changes to disk in fdatasync()[1][2]. I don't know if the original behaviour was intentional or not. So, if you didn't use the bigger fsync() hammer on that OS, you might lose the end of a recently extended file in a power failure even though fdatasync() had returned success.
By my reading of POSIX, that shouldn't be necessary on a conforming implementation of fdatasync(), and it was fixed years ago in Linux. I'm not proposing any changes there, and I'm not proposing to take advantage of that in the new code. I'm pointing out that we don't have to worry about that for these undo segments, because they are already flushed with fsync(), not fdatasync(). (To understand POSIX's descriptions of fsync() and fdatasync() you have to find the meanings of "Synchronized I/O Data Integrity Completion" and "Synchronized I/O File Integrity Completion" elsewhere in the spec. TL;DR: fdatasync() is only allowed to skip flushing attributes like the modified time; it's not allowed to skip flushing a file size change, since that would interfere with retrieving the data.)

2.  Time of reservation: Although they don't call fsync(), regular relations and these new undo files still write zeroes up front (respectively, for a new block and for a new segment). One reason for that is that most popular filesystems reserve space at write time, so you'll get ENOSPC when trying to allocate undo space, and that's a non-fatal ERROR. If we deferred until
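The reserve-at-write-time pattern described in point 2 can be sketched like this. This is a standalone illustration under my own assumptions, not the patch's allocate_empty_undo_segment() (names and error handling are hypothetical):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a "segment" file of the given size by writing zeroes, then
 * fsync() it.  Writing zeroes up front makes most filesystems reserve
 * the physical blocks now, so ENOSPC surfaces here (where it can be a
 * plain ERROR) rather than later when the space is actually needed.
 * fsync(), unlike fdatasync(), also guarantees the file size change
 * itself reaches disk.  Returns 0 on success, -1 on failure.
 */
static int
allocate_zeroed_segment(const char *path, size_t size)
{
	char		block[8192];
	size_t		written = 0;
	int			fd;

	memset(block, 0, sizeof(block));
	fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -1;
	while (written < size)
	{
		size_t		chunk = size - written;

		if (chunk > sizeof(block))
			chunk = sizeof(block);
		if (write(fd, block, chunk) != (ssize_t) chunk)
		{
			/* e.g. ENOSPC is reported now, not in a critical section */
			close(fd);
			unlink(path);
			return -1;
		}
		written += chunk;
	}
	if (fsync(fd) != 0)
	{
		close(fd);
		unlink(path);
		return -1;
	}
	return close(fd);
}
```

The point under discussion is only about *when* the fsync() happens, not whether the zero-filling happens; handing the sync off to the checkpointer keeps the reservation behaviour while moving the flush out of the foreground path.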
Re: POC: Cleaning up orphaned files using undo logs
Hello Thomas,

Here are some review comments on 0003-Add-undo-log-manager.patch. I've tried to avoid duplicating earlier comments as much as possible.

1. In UndoLogAllocate,
+ * time this backend as needed to write to an undo log at all or because
s/as/has

+ * Maintain our tracking of the and the previous transaction start
Do you mean the current log's transaction start as well?

2. In UndoLogAllocateInRecovery, we try to find the current log from the first undo buffer. So, after a log switch, we always have to register at least one buffer from the current undo log first. If we're updating something in the previous log, the respective buffer should be registered after that. I think we should document this in the comments.

3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full), it seems the 'full' parameter is not used anywhere. Do we still need it?

+ /* It's been recycled. SO it must have been entirely discarded. */
s/SO/So

4. In CleanUpUndoCheckPointFiles, we could emit a DEBUG2 message, something like 'removed unreachable undo metadata files'.

+ if (unlink(path) != 0)
+ elog(ERROR, "could not unlink file \"%s\": %m", path);

According to my observation, whenever we deal with a file operation, we usually emit an ereport message with errcode_for_file_access(). Should we change this to ereport? There are other file operations as well, including read(), OpenTransientFile(), etc.

5. In CheckPointUndoLogs,
+ /* Capture snapshot while holding each mutex. */
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ serialized[num_logs++] = slot->meta;
+ LWLockRelease(&slot->mutex);

Why do we need an exclusive lock to read something from the slot? A share lock seems sufficient.

pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called after pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE) without calling pgstat_report_wait_end(). I think you've done that to avoid an extra function call, but it differs from other places in the PG code.
Perhaps we should follow this approach everywhere.

6. In StartupUndoLogs,
+ if (fd < 0)
+ elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);

Assuming you agree to change the above elog to ereport, the message should be more user-friendly. Maybe something like 'cannot open pg_undo file'.

+ if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != sizeof(slot->meta))

The use of sizeof doesn't look like a problem. But we could save some padding bytes at the end if we used the (offsetof + sizeof) approach, similar to other places in PG.

7. In free_undo_log_slot,
+ /*
+ * When removing an undo log from a slot in shared memory, we acquire
+ * UndoLogLock, log->mutex and log->discard_lock, so that other code can
+ * hold any one of those locks to prevent the slot from being recycled.
+ */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ Assert(slot->logno != InvalidUndoLogNumber);
+ slot->logno = InvalidUndoLogNumber;
+ memset(&slot->meta, 0, sizeof(slot->meta));
+ LWLockRelease(&slot->mutex);
+ LWLockRelease(UndoLogLock);

You've not taken the discard_lock mentioned in the comment.

8. In find_undo_log_slot,
+ * 1. If the calling code knows that it is attached to this lock or is the
s/lock/slot

+ * 2. All other code should acquire log->mutex before accessing any members,
+ * and after doing so, check that the logno hasn't moved. If it is not, the
+ * entire undo log must be assumed to be discarded (as if this function
+ * returned NULL) and the caller must behave accordingly.

Perhaps you meant '...check that the logno remains the same. If it does not...'.

+ /*
+ * If we didn't find it, then it must already have been entirely
+ * discarded. We create a negative cache entry so that we can answer
+ * this question quickly next time.
+ *
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the negative cache entry bloat.
+ */

This is an interesting thought.
But I'm wondering how we are going to search for a discarded logno in the simple hash. I guess that's why it's in the TODO list.

9. In attach_undo_log,
+ * For now we have a simple linked list of unattached undo logs for each
+ * persistence level. We'll grovel though it to find something for the
+ * tablespace you asked for. If you're not using multiple tablespaces
s/though/through

+ if (slot == NULL)
+ {
+ if (UndoLogShared->next_logno > MaxUndoLogNumber)
+ {
+ /*
+ * You've used up all 16 exabytes of undo log addressing space.
+ * This is a difficult state to reach using only 16 exabytes of
+ * WAL.
+ */
+ elog(ERROR, "undo log address space exhausted");
+ }

This looks like a potential unlikely() condition.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
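The (offsetof + sizeof) idiom mentioned in point 6 of the review above can be illustrated like this. This is a standalone sketch with a made-up struct, not the patch's actual UndoLogMetaData:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A struct whose last member is smaller than the struct's alignment
 * requirement usually carries trailing padding.  sizeof() includes that
 * padding, while offsetof(last member) + sizeof(last member) does not,
 * so the latter lets on-disk reads and writes skip the padding bytes.
 */
typedef struct DemoMeta
{
	uint64_t	insert;			/* forces 8-byte struct alignment */
	uint32_t	logno;
	uint16_t	flags;			/* last member: only 2 bytes */
} DemoMeta;

/* Bytes actually carrying data, excluding trailing padding. */
#define DEMO_META_USED_BYTES \
	(offsetof(DemoMeta, flags) + sizeof(((DemoMeta *) 0)->flags))
```

On a typical 64-bit ABI, sizeof(DemoMeta) is 16 while DEMO_META_USED_BYTES is 14, so a read() or write() sized with the offsetof form transfers two fewer bytes per struct.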
Initdb failure
Hi, Initdb fails when following path is provided as input: datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/ Also the cleanup also tends to fail in the cleanup path. Could be something to do with path handling. I'm not sure if this is already known. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
RE: [Patch] PQconnectPoll() is blocked if target_session_attrs is read-write
From: Matsumura, Ryo [mailto:matsumura@jp.fujitsu.com]
> Detail:
> If target_session_attrs is set to read-write, PQconnectPoll() calls
> PQsendQuery("SHOW transaction_read_only") although the previous return value
> was PGRES_POLLING_READING, not WRITING.

The current code probably assumes that PQsendQuery() sending "SHOW transaction_read_only" shouldn't block, because the message is small enough to fit in the socket send buffer. Likewise, the code in the CONNECTION_AWAITING_RESPONSE case sends authentication data using pg_fe_sendauth() without checking for write-ready status. OTOH, the code in the CONNECTION_MADE case waits for write-ready status in advance before sending the startup packet. That's because the startup packet could get large enough to cause pqPacketSend() to block. So, I don't think the fix is necessary.

> In result, PQsendQuery() may be blocked in pqsecure_raw_write().

FWIW, if PQsendQuery() blocked during connection establishment, I think it would block in poll() called from pqWait(), because libpq's socket is set non-blocking.

Regards
Takayuki Tsunakawa
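The last point — that on a non-blocking socket a "blocked" write actually surfaces as a wait in poll() — can be sketched like this. This is a standalone illustration of the mechanism, not libpq code (the function names here are hypothetical):

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Put a socket into non-blocking mode, so write() returns -1/EAGAIN
 * instead of blocking when the send buffer is full.
 */
static int
set_nonblocking(int fd)
{
	int			flags = fcntl(fd, F_GETFL, 0);

	return (flags < 0) ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Wait until fd is ready for writing, conceptually what pqWait() does
 * for libpq's socket.  Returns 1 when writable, -1 on error/timeout.
 */
static int
wait_for_write_ready(int fd, int timeout_ms)
{
	struct pollfd pfd;

	pfd.fd = fd;
	pfd.events = POLLOUT;
	pfd.revents = 0;
	if (poll(&pfd, 1, timeout_ms) <= 0)
		return -1;
	return (pfd.revents & POLLOUT) ? 1 : -1;
}
```

With this arrangement, a caller that gets EAGAIN from write() loops through the poll() wait and retries, which is why the blocking shows up in poll() rather than inside the write itself.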
Re: double free in ExecHashJoin, 9.6.12
On Thu, Jul 25, 2019 at 2:39 AM Merlin Moncure wrote: > Server is generally running pretty well, and is high volume. This > query is not new and is also medium volume. Database rebooted in > about 4 seconds with no damage; fast enough we didn't even trip alarms > (I noticed this troubleshooting another issue). We are a couple of > bug fixes releases behind but I didn't see anything obvious in the > release notes suggesting a resolved issue. Anyone have any ideas? > thanks in advance. > postgres: rms ysconfig 10.33.190.21(36788) > SELECT(ExecHashJoin+0x5a2)[0x5e2d32] Hi Merlin, Where's the binary from (exact package name, if installed with a package)? Not sure if this is going to help, but is there any chance you could disassemble that function so we can try to see what it's doing at that offset? For example on Debian if you have postgresql-9.6 and postgresql-9.6-dbg installed you could run "gdb /usr/lib/postgresql/9.6/bin/postgres" and then "disassemble ExecHashJoin". The code at "<+1442>" (0x5a2) is presumably calling free or some other libc thing (though I'm surprised not to see an intervening palloc thing). -- Thomas Munro https://enterprisedb.com
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jul 25, 2019 at 7:48 AM Amit Kapila wrote:
>
> On Wed, Jul 24, 2019 at 11:04 PM vignesh C wrote:
> >
> > 1) As undo log is being created in tablespace,
> > if the tablespace is dropped later, will it have any impact?
>
> Yes, it drops the undo logs present in tablespace being dropped. See
> DropUndoLogsInTablespace() in the same patch.

Thanks Amit, that clarifies the problem I was thinking about. I have another question regarding drop tablespace failure, but I don't have a better solution for that problem. Let me think more about it and discuss.

> > 4) Should we add a readme file for undolog as it does a fair amount of work
> > and is core part of the undo system?
>
> The Readme is already present in the patch series posted by Thomas.
> See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in
> email [1].

Thanks Amit, I found the details in the readme.

> [1] - https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com

-- 
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan wrote: > There seems to be a kind of "synergy" between the nbtsplitloc.c > handling of pages that have lots of duplicates and posting list > compression. It seems as if the former mechanism "sets up the bowling > pins", while the latter mechanism "knocks them down", which is really > cool. We should try to gain a better understanding of how that works, > because it's possible that it could be even more effective in some > cases. I found another important way in which this synergy can fail to take place, which I can fix. By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes from my test suite become much smaller, while most are not affected. These indexes were not helped too much by the patch before. For example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full of duplicates of a single value (that's how it appears after an initial TPC-E bulk load), as are a couple of other TPC-E indexes. TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its idx_lineitem_orderkey index becomes ~15% smaller. I believe that this happened because rightmost page splits were an inefficient case for compression. But rightmost page split heavy indexes with lots of duplicates are not that uncommon. Think of any index with many NULL values, for example. I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm not sure what the idea is behind it. My sense is that we're likely to benefit by delaying page splits, no matter what. Though I am still looking at it purely from a space utilization point of view, at least for now. -- Peter Geoghegan
Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding
On Wed, Jul 24, 2019 at 05:27:18PM +0900, Kyotaro Horiguchi wrote: > Sorry in advance for link-breaking message forced by gmail.. Using the archives page "Resend email" link avoids that. > https://www.postgresql.org/message-id/flat/20190202083822.gc32...@gust.leadboat.com > > > 1. The result of the test is valid only until we release the SLRU > > ControlLock, > >which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to > > evaluate > >segments for deletion. Once we release that lock, latest_page_number can > >advance. This creates a TOCTOU race condition, allowing excess deletion: > > > > > >[local] test=# table trunc_clog_concurrency ; > >ERROR: could not access status of transaction 2149484247 > >DETAIL: Could not open file "pg_xact/0801": No such file or directory. > > It seems like some other vacuum process saw larger cutoff page? No, just one VACUUM suffices. > If I'm > not missing something, the missing page is no longer the > "recently-populated" page at the time (As I understand it as the last > page that holds valid data). Couldn't we just ignore ENOENT there? The server reported this error while attempting to read CLOG to determine whether a tuple's xmin committed or aborted. That ENOENT means the relevant CLOG page is not available. To ignore that ENOENT, the server would need to guess whether to consider the xmin committed or consider it aborted. So, no, we can't just ignore the ENOENT.
Re: dropdb --force
Pavel Stehule writes: > [ drop-database-force-20190708.patch ] I took a brief look at this, but I don't think it's really close to being committable. * The documentation claims FORCE will fail if you don't have privileges to terminate the other session(s) in the target DB. This is a lie; the code issues kill() without any regard for such niceties. You could perhaps make that better by going through pg_signal_backend(). * You've hacked CountOtherDBBackends to do something that's completely outside the charter one would expect from its name, and not even bothered to update its header comment. This requires more attention to not confusing future hackers; I'd say you can't even use that function name anymore. * You've also ignored the existing code in CountOtherDBBackends that is careful *not* to issue a kill() while holding the critical ProcArrayLock. That problem would get enormously worse if you tried to sub in pg_signal_backend() there, since that function may do catalog accesses --- it's pretty likely that you could get actual deadlocks, never mind just trashing performance. * I really dislike the addition of more hard-wired delays and timeouts to dropdb(). It's bad enough that CountOtherDBBackends has got that. Two layers of timeout are a rickety mess, and who's to say that a 1-minute timeout is appropriate for anything? * I'm concerned that the proposed syntax is not future-proof. FORCE is not a reserved word, and we surely don't want to make it one; but just appending it to the end of the command without any decoration seems like a recipe for problems if anybody wants to add other options later. (Possible examples: RESTRICT/CASCADE, or a user-defined timeout.) Maybe some parentheses would help? Or possibly I'm being overly paranoid, but ... I hadn't been paying any attention to this thread before now, but I'd assumed from the thread title that the idea was to implement any attempted kills in the dropdb app, not on the backend side. 
(As indeed it looks like the first version did.) Maybe it would be better to go back to that, instead of putting dubious behaviors into the core server. regards, tom lane
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jul 24, 2019 at 11:04 PM vignesh C wrote: > > Hi, > > I have done some review of undolog patch series > and here are my comments: > 0003-Add-undo-log-manager.patch > > 1) As undo log is being created in tablespace, > if the tablespace is dropped later, will it have any impact? > Yes, it drops the undo logs present in tablespace being dropped. See DropUndoLogsInTablespace() in the same patch. > > 4) Should we add a readme file for undolog as it does a fair amount of work > and is core part of the undo system? > The Readme is already present in the patch series posted by Thomas. See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in email [1]. [1] - https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: benchmarking Flex practices
Chapman Flack writes: > On 07/24/19 03:45, John Naylor wrote: >> On Sun, Jul 21, 2019 at 3:14 AM Tom Lane wrote: >>> However, my second reaction was that maybe you were on to something >>> upthread when you speculated about postponing de-escaping of >>> Unicode literals into the grammar. If we did it like that then > With the de-escaping postponed, I think we'd be able to move beyond the > current odd situation where Unicode escapes can't describe non-ascii > characters, in exactly and only the cases where you need them to. How so? The grammar doesn't really have any more context information than the lexer does. (In both cases, it would be ugly but not really invalid for the transformation to depend on the database encoding, I think.) regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
I found the CF-bot complaining about this. Seems that some comment fixes by the recent 21039555cd are the cause. No substantial changes have been made by this rebasing.

regards.

On Fri, Jul 12, 2019 at 5:37 PM Kyotaro Horiguchi wrote:
>
> At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in
> <20190712.173041.236938840.horikyota@gmail.com>
> > The v16 no longer seems to work, so I'll send a further rebased version.
>
> It's just by renaming of TestLib::real_dir to perl2host.
> This is rebased version v17.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

v18-0001-TAP-test-for-copy-truncation-optimization.patch
Description: Binary data

v18-0002-Fix-WAL-skipping-feature.patch
Description: Binary data

v18-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch
Description: Binary data
[PATCH] Race condition in logical walsender causes long postgresql shutdown delay
Hi folks I recently tracked down a race in shutdown of logical walsenders that can cause PostgreSQL shutdown to hang for wal_sender_timeout/2 before it continues to a normal shutdown. With a long timeout that can be quite disruptive. TL;DR: The logical walsender may be signalled to stop, then read the last WAL record before the shutdown checkpoint is due to be written and go to sleep. The checkpointer will wait for it to acknowledge the shutdown and the walsender will wait for new WAL. The deadlock is eventually broken by the walsender timeout keepalive timer. Patch attached. The issue arises from the difference between logical and physical walsender shutdown as introduced by commit c6c3334364 "Prevent possibility of panics during shutdown checkpoint". It's probably fairly hard to trigger. I ran into a case where it happened regularly only because of an unrelated patch that caused some WAL to be written just before the checkpointer issued walsender shutdown requests. But it's still a legit bug. If you hit the issue you'll see that walsender(s) can be seen to be sleeping in WaitLatchOrSocket in WalSndLoop. They'll keep sleeping until woken by the keepalive timeout. The checkpointer will be waiting in WalSndWaitStopping() for the walsenders to enter WALSNDSTATE_STOPPING or exit, whichever happens first. The postmaster will be waiting in ServerLoop for the checkpointer to finish the shutdown checkpoint. The checkpointer waits in WalSndWaitStopping() for all walsenders to either exit or enter WALSNDSTATE_STOPPING state. Logical walsenders never enter WALSNDSTATE_STOPPING, they go straight to exiting, so the checkpointer can't finish WalSndWaitStopping() and write the shutdown checkpoint. A logical walsender usually notices the shutdown request and exits as soon as it has flushed all WAL up to the server's flushpoint, while physical walsenders enter WALSNDSTATE_STOPPING. 
But there's a race where a logical walsender may read the final available record and notice it has caught up - but not notice that it has reached end-of-WAL and check whether it should exit. This happens on the following (simplified) code path in XLogSendLogical:

    if (record != NULL)
    {
        XLogRecPtr  flushPtr = GetFlushRecPtr();

        LogicalDecodingProcessRecord(...);
        sentPtr = ...;
        if (sentPtr >= flushPtr)
            WalSndCaughtUp = true;      /* <-- HERE */
    }

because the test for got_STOPPING that sets got_SIGUSR2 is only on the other branch, where reading a record returns NULL; this branch can sleep before checking whether shutdown was requested. So if the walsender read the last WAL record available, when it's >= the flush pointer, and it already handled the SIGUSR1 latch wakeup for the WAL write, it might go back to sleep and not wake up until the timeout. The checkpointer already sent PROCSIG_WALSND_INIT_STOPPING to the walsenders in the prior WalSndInitStopping() call, so the walsender won't be woken by a signal from the checkpointer. No new WAL will be written because the walsender just consumed the final record written before the checkpointer went to sleep, and the checkpointer won't write anything more until the walsender exits. The client might not be due a keepalive for some time. The only reason this doesn't turn into a total deadlock is that keepalive wakeup.

An alternative fix would be to have the logical walsender set WALSNDSTATE_STOPPING instead of faking got_SIGUSR2, then go to sleep waiting for more WAL. Logical decoding would need to check if it was running during shutdown and Assert(...) then ERROR if it saw any WAL records that result in output plugin calls or snapshot management calls. I avoided this approach as it's more intrusive and I'm not confident I can concoct a reliable test to trigger it.
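A minimal model of the race and the shape of the fix being described (hypothetical names and a deliberately simplified state machine — the real patch manipulates walsender globals and latches):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the walsender's per-iteration shutdown state. */
typedef struct WalSndModel
{
	bool	got_STOPPING;	/* checkpointer asked us to stop */
	bool	got_SIGUSR2;	/* "flush remaining data and exit" flag */
	bool	caught_up;		/* sent everything up to the flush pointer */
} WalSndModel;

/*
 * One iteration of a modelled XLogSendLogical.  record_available says
 * whether a WAL record could be read this time; at_flush_ptr says
 * whether sending it brought us up to the flush pointer.  The essence
 * of the fix is that the stop request is promoted to got_SIGUSR2
 * whenever we are caught up, not only on the NULL-record branch, so
 * the walsender cannot go to sleep holding an unacknowledged stop.
 */
static void
model_send_logical(WalSndModel *ws, bool record_available, bool at_flush_ptr)
{
	if (record_available)
	{
		/* ... LogicalDecodingProcessRecord() would run here ... */
		if (at_flush_ptr)
			ws->caught_up = true;
	}
	else
		ws->caught_up = true;	/* end of WAL reached */

	/* The fix: check the stop request on both branches once caught up. */
	if (ws->caught_up && ws->got_STOPPING)
		ws->got_SIGUSR2 = true;
}
```

In the buggy ordering, the iteration that reads the final record and sets caught_up never re-checks got_STOPPING, so the walsender sleeps until the keepalive timeout; with the check hoisted out of the NULL-record branch, that iteration exits promptly.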
-- 
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise

From 559dda09b35870d3630a65cbca682e50343c6f0f Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Thu, 25 Jul 2019 09:14:58 +0800
Subject: [PATCH] Fix a delay in PostgreSQL shutdown caused by logical replication

Due to a race with WAL writing during shutdown, if logical walsenders were running then PostgreSQL's shutdown could be delayed by up to wal_sender_timeout/2 while it waits for the walsenders to shut down. The walsenders wait for new WAL or end-of-WAL, which won't come until shutdown, so there's a deadlock. The walsender timeout eventually breaks the deadlock.

The issue was introduced by PostgreSQL 10 commit c6c3334364 "Prevent possibility of panics during shutdown checkpoint". A logical walsender never enters WALSNDSTATE_STOPPING and allows the checkpointer to continue shutdown. Instead it exits when it reads end-of-WAL. But if it reads the last WAL record written before shutdown and that record doesn't generate a client network write, it can mark itself caught up and go to sleep without checking to see if it's been asked to shut
Re: Should we add xid_current() or a int8->xid cast?
On Thu, Jul 25, 2019 at 12:42 PM Tom Lane wrote: > Andres Freund writes: > > On 2019-07-24 20:34:39 -0400, Tom Lane wrote: > >> Yeah, I would absolutely NOT recommend that you open that can of worms > >> right now. We have looked at adding unsigned integer types in the past > >> and it looked like a mess. > > > I assume Thomas was thinking more of another bespoke type like xid, just > > wider. There's some notational advantage in not being able to > > immediately do math etc on xids. > > Well, we could invent an xid8 type if we want, just don't try to make > it part of the numeric hierarchy (as indeed xid isn't). Yeah, I meant an xid64/xid8/fxid/pg_something/... type that isn't a kind of number. -- Thomas Munro https://enterprisedb.com
Re: On the stability of TAP tests for LDAP
On Wed, Jul 24, 2019 at 09:01:47PM +1200, Thomas Munro wrote: > Huh, yeah, I don't know why slapd requires credentials on Debian, when > the version that ships with FreeBSD is OK with an anonymous > connection. Rather than worrying about that, I just adjusted it to > supply the credentials. It works on both for me. Thanks for the updated patch, this looks good. I have done a series of tests keeping my laptop busy and I haven't seen a failure where I usually see problems 10%~20% of the time. So that seems to help, thanks! -- Michael signature.asc Description: PGP signature
Re: Should we add xid_current() or a int8->xid cast?
Andres Freund writes: > On 2019-07-24 20:34:39 -0400, Tom Lane wrote: >> Yeah, I would absolutely NOT recommend that you open that can of worms >> right now. We have looked at adding unsigned integer types in the past >> and it looked like a mess. > I assume Thomas was thinking more of another bespoke type like xid, just > wider. There's some notational advantage in not being able to > immediately do math etc on xids. Well, we could invent an xid8 type if we want, just don't try to make it part of the numeric hierarchy (as indeed xid isn't). regards, tom lane
Re: Should we add xid_current() or a int8->xid cast?
Hi, On 2019-07-24 20:34:39 -0400, Tom Lane wrote: > Yeah, I would absolutely NOT recommend that you open that can of worms > right now. We have looked at adding unsigned integer types in the past > and it looked like a mess. I assume Thomas was thinking more of another bespoke type like xid, just wider. There's some notational advantage in not being able to immediately do math etc on xids. - Andres
Re: Should we add xid_current() or a int8->xid cast?
Andres Freund writes: > On 2019-07-25 12:20:58 +1200, Thomas Munro wrote: >> On Thu, Jul 25, 2019 at 12:06 PM Andres Freund wrote: >>> Seems easiest to just add xid_current(), or add a cast from int8 to xid >>> (probably explicit?) that handles the wraparound logic correctly? >> Yeah, I was wondering about that. int8 isn't really the right type, >> since FullTransactionId is unsigned. > For now that doesn't seem that big an impediment... Yeah, I would absolutely NOT recommend that you open that can of worms right now. We have looked at adding unsigned integer types in the past and it looked like a mess. I think an explicit cast is a reasonable thing to add, though. regards, tom lane
Re: Should we add xid_current() or a int8->xid cast?
Hi, On 2019-07-25 12:20:58 +1200, Thomas Munro wrote: > On Thu, Jul 25, 2019 at 12:06 PM Andres Freund wrote: > > we have txid_current(), which returns an int8. But there's no convenient > > way to convert that to type 'xid'. Which is fairly inconvenient, given > > that we expose xids in various places. > > > > My current need for this was just a regression test to make sure that > > system columns (xmin/xmax in particular) don't get broken again for ON > > CONFLICT. But I've needed this before in other scenarios - e.g. age(xid) > > can be useful to figure out how old a transaction is, but age() doesn't > > work with txid_current()'s return value. > > > > Seems easiest to just add xid_current(), or add a cast from int8 to xid > > (probably explicit?) that handles the wraparound logic correctly? > > Yeah, I was wondering about that. int8 isn't really the right type, > since FullTransactionId is unsigned. For now that doesn't seem that big an impediment... > If we had a SQL type for 64 bit xids, it should be convertible to xid, > and the reverse conversion should require a more complicated dance. > Of course we can't casually change txid_current() without annoying > people who are using it, so perhaps if we invent a new SQL type we > should also make a new function that returns it. Possibly we could add a fullxid or xid8, xid64, pg_xid64, ... type, and have an implicit cast to int8? Greetings, Andres Freund
Re: Should we add xid_current() or a int8->xid cast?
On Thu, Jul 25, 2019 at 12:06 PM Andres Freund wrote: > we have txid_current(), which returns an int8. But there's no convenient > way to convert that to type 'xid'. Which is fairly inconvenient, given > that we expose xids in various places. > > My current need for this was just a regression test to make sure that > system columns (xmin/xmax in particular) don't get broken again for ON > CONFLICT. But I've needed this before in other scenarios - e.g. age(xid) > can be useful to figure out how old a transaction is, but age() doesn't > work with txid_current()'s return value. > > Seems easiest to just add xid_current(), or add a cast from int8 to xid > (probably explicit?) that handles the wraparound logic correctly? Yeah, I was wondering about that. int8 isn't really the right type, since FullTransactionId is unsigned. If we had a SQL type for 64 bit xids, it should be convertible to xid, and the reverse conversion should require a more complicated dance. Of course we can't casually change txid_current() without annoying people who are using it, so perhaps if we invent a new SQL type we should also make a new function that returns it. -- Thomas Munro https://enterprisedb.com
Re: ON CONFLICT (and manual row locks) cause xmax of updated tuple to unnecessarily be set
On Wed, Jul 24, 2019 at 4:24 PM Andres Freund wrote: > as you can see the same xmax is set for both row version, with the new > infomask being HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_LOCK_ONLY | HEAP_UPDATED. Meta remark about your test case: I am a big fan of microbenchmarks like this, which execute simple DML queries using a single connection, and then consider if the on-disk state looks as good as expected, for some value of "good". I had a lot of success with this approach while developing the v12 work on nbtree, where I went to the trouble of automating everything. The same test suite also helped with the nbtree compression/deduplication patch just today. I like to call these tests "wind tunnel tests". It's far from obvious that you can take a totally synthetic, serial test, and use it to measure something that is important to real workloads. It seems to work well when there is a narrow, specific thing that you're interested in. This is especially true when there is a real risk of regressing performance in some way. > but we really don't need to do any of that in this case - the only > locker is the current backend, after all. > > I think this isn't great, because it'll later will cause unnecessary > hint bit writes (although ones somewhat likely combined with setting > XMIN_COMMITTED), and even extra work for freezing. > > Based on a quick look this wasn't the case before the finer grained > tuple locking - which makes sense, there was no cases where locks would > need to be carried forward. I agree that this is unfortunate. Are you planning on working on it? -- Peter Geoghegan
Should we add xid_current() or a int8->xid cast?
Hi, we have txid_current(), which returns an int8. But there's no convenient way to convert that to type 'xid'. Which is fairly inconvenient, given that we expose xids in various places. My current need for this was just a regression test to make sure that system columns (xmin/xmax in particular) don't get broken again for ON CONFLICT. But I've needed this before in other scenarios - e.g. age(xid) can be useful to figure out how old a transaction is, but age() doesn't work with txid_current()'s return value. Seems easiest to just add xid_current(), or add a cast from int8 to xid (probably explicit?) that handles the wraparound logic correctly? Greetings, Andres Freund
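As an aside, the wraparound logic such a cast would need is small. Here is a sketch of the epoch arithmetic in Python (illustrative only - the function names are made up, and the real thing would of course live in C on the server side): dropping down to a 32-bit xid just discards the epoch, while going back up has to pick the epoch so the result does not land in the future.

```python
XID_MASK = 0xFFFFFFFF

def full_to_xid32(full_xid):
    """txid-style 64-bit value -> 32-bit xid: drop the epoch."""
    return full_xid & XID_MASK

def xid32_to_full(xid32, current_full_xid):
    """32-bit xid -> 64-bit value, picking the epoch so the result is not
    in the future: an xid numerically above the current one must belong
    to the previous epoch."""
    epoch = current_full_xid >> 32
    if xid32 > (current_full_xid & XID_MASK):
        epoch -= 1
    return (epoch << 32) | xid32

cur = (5 << 32) | 100          # epoch 5, xid 100
assert full_to_xid32(cur) == 100
assert xid32_to_full(100, cur) == cur
# an xid numerically above the current one comes from the previous epoch
assert xid32_to_full(0xFFFFFF00, cur) == (4 << 32) | 0xFFFFFF00
```

The backward direction is only unambiguous for xids within one epoch of the current one, which is presumably why an explicit (rather than implicit) cast was suggested.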
Re: Compile from source using latest Microsoft Windows SDK
On Wed, Jul 24, 2019 at 10:38:47AM -0400, Andrew Dunstan wrote: > Yeah, on consideration I think Peifeng's patch upthread looks OK. > (Incidentally, this variable is not set in the very old version of VC > running on currawong). Interesting. I am not actually sure in which version of VS this has been introduced. But it would be fine enough to do nothing if the variable is not defined and rely on the default. Except for the formatting and indentation, the patch looks right. Andrew, perhaps you would prefer doing the final touch on it and commit it? -- Michael signature.asc Description: PGP signature
RE: seems like a bug in pgbench -R
On Wed, July 24, 2019 at 7:02 PM, Fabien COELHO wrote: > > but I have one question. Is it better adding any check like if(maxsock > > != -1) before the select? > > > > else/* no explicit delay, select without timeout */ > > { > >nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); } > > I think that it is not necessary because this case cannot happen: If some > clients are still running (remains > 0), they are either sleeping, in > which case there would be a timeout, or they are waiting for something > from the server, otherwise the script could be advanced further so there > would be something else to do for the thread. Ah, I understand. > We could check this by adding "Assert(maxsock != -1);" before this select, > but I would not do that for a released version. Yeah, I also thought we could use an Assert, but right, it's a released version, so that's off the table. Understood, thanks for the explanation. So I'll mark this ready for committer. -- Yoshikazu Imai
Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )
On Wed, Jul 24, 2019 at 11:23:30AM -0400, Alvaro Herrera wrote: > Heh, yesterday I revised the original patch as attached and was about to > push when the bell rang. I like this one because it keeps the comment > to one line and it mentions the function name in charge of the > validation (useful for grepping later on). It's a bit laconic because > of the long function name and the desire to keep it to one line, but it > seems sufficient to me. Looks fine to me. One nit: please add braces around the if block. Even though it's a one-liner, there is a comment, so I think the braces make the code more readable. -- Michael signature.asc Description: PGP signature
Re: Statistical aggregate functions are not working with PARTIAL aggregation
On Thu, 25 Jul 2019 at 11:33, Andres Freund wrote: > > On 2019-07-25 10:36:26 +1200, David Rowley wrote: > > 2) Planner trying to give nodeAgg.c a sorted path to work with on > > DISTINCT / ORDER BY aggs > > That'll have to be a best effort thing though, i.e. there'll always be > cases where we'll have to retain the current logic (or just regress > performance really badly)? It's something we already do for windowing functions. We just don't do it for aggregates. It's slightly different since windowing functions just chain nodes together to evaluate multiple window definitions. Aggregates can't/don't do that since aggregates... well, aggregate (i.e. the input to the 2nd one can't be aggregated by the 1st one), but there's likely not much of a reason why standard_qp_callback couldn't choose some pathkeys for the first Aggref with an ORDER BY / DISTINCT clause. nodeAgg.c would still need to know how to change the sort order in order to evaluate other Aggrefs in some different order. I'm not quite sure where the regression would be. nodeAgg.c must perform the sort, or if we give the planner some pathkeys, then worst case the planner adds a Sort node. That seems equivalent to me. However, in the best case, there's a suitable index and no sorting is required anywhere. Probably then we can add combine function support for the remaining built-in aggregates. There was trouble doing that in [1] due to some concerns about messing up results for people who rely on the order of an aggregate without actually writing an ORDER BY. [1] https://www.postgresql.org/message-id/cakjs1f9sx_6gtcvd6tmuznntch0vhbzhx6fzqw17tgvfh-g...@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Statistical aggregate functions are not working with PARTIAL aggregation
Hi, On 2019-07-25 10:36:26 +1200, David Rowley wrote: > I'd like to do > much more in nodeAgg.c, TBH. It would be good to remove some > code from > nodeAgg.c and put it in the planner. Indeed! > I'd like to see: > > 1) Planner doing the Aggref merging for aggregates with the same > transfn etc. Makes sense. I assume this would entail associating T_Aggref expressions with the corresponding Agg at an earlier state? The whole business of having to prepare expression evaluation, just so ExecInitAgg() can figure out which aggregates it has to compute always has struck me as architecturally bad. > 2) Planner trying to give nodeAgg.c a sorted path to work with on > DISTINCT / ORDER BY aggs That'll have to be a best effort thing though, i.e. there'll always be cases where we'll have to retain the current logic (or just regress performance really badly)? Greetings, Andres Freund
ON CONFLICT (and manual row locks) cause xmax of updated tuple to unnecessarily be set
Hi,

Scenario is a very plain upsert:

CREATE TABLE upsert(key int primary key);
INSERT INTO upsert VALUES(1) ON CONFLICT (key) DO UPDATE SET key = excluded.key;
INSERT INTO upsert VALUES(1) ON CONFLICT (key) DO UPDATE SET key = excluded.key;
INSERT 0 1
INSERT 0 1

postgres[8755][1]=# SELECT page_items.* FROM generate_series(0, pg_relation_size('upsert'::regclass::text) / 8192 - 1) blkno, get_raw_page('upsert'::regclass::text, blkno::int4) AS raw_page, heap_page_items(raw_page) as page_items;

 lp | lp_off | lp_flags | lp_len |  t_xmin  |  t_xmax  | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+----------+----------+----------+--------+-------------+------------+--------+--------+--------+----------
  1 |   8160 |        1 |     28 | 19742591 | 19742592 |        0 | (0,2)  |       24577 |        256 |     24 | (null) | (null) | \x0100
  2 |   8128 |        1 |     28 | 19742592 | 19742592 |        0 | (0,2)  |       32769 |       8336 |     24 | (null) | (null) | \x0100
(2 rows)

as you can see the same xmax is set for both row versions, with the new infomask being HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_LOCK_ONLY | HEAP_UPDATED.

The reason that happens is that ExecOnConflictUpdate() needs to lock the row to be able to reliably compute a new tuple version based on that row. heap_update() then decides it needs to carry that xmax forward, as it's a valid lock:

	/*
	 * If the tuple we're updating is locked, we need to preserve the locking
	 * info in the old tuple's Xmax.  Prepare a new Xmax value for this.
	 */
	compute_new_xmax_infomask(HeapTupleHeaderGetRawXmax(oldtup.t_data),
							  oldtup.t_data->t_infomask,
							  oldtup.t_data->t_infomask2,
							  xid, *lockmode, true,
							  &xmax_old_tuple, &infomask_old_tuple,
							  &infomask2_old_tuple);

	/*
	 * And also prepare an Xmax value for the new copy of the tuple.  If there
	 * was no xmax previously, or there was one but all lockers are now gone,
	 * then use InvalidXid; otherwise, get the xmax from the old tuple.  (In
	 * rare cases that might also be InvalidXid and yet not have the
	 * HEAP_XMAX_INVALID bit set; that's fine.)
	 */
	if ((oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
		HEAP_LOCKED_UPGRADED(oldtup.t_data->t_infomask) ||
		(checked_lockers && !locker_remains))
		xmax_new_tuple = InvalidTransactionId;
	else
		xmax_new_tuple = HeapTupleHeaderGetRawXmax(oldtup.t_data);

but we really don't need to do any of that in this case - the only locker is the current backend, after all.

I think this isn't great, because it'll later cause unnecessary hint bit writes (although ones somewhat likely combined with setting XMIN_COMMITTED), and even extra work for freezing.

Based on a quick look this wasn't the case before the finer grained tuple locking - which makes sense, there were no cases where locks would need to be carried forward.

It's worthwhile to note that this *nearly* already works, because of the following codepath in compute_new_xmax_infomask():

	 * If the lock to be acquired is for the same TransactionId as the
	 * existing lock, there's an optimization possible: consider only the
	 * strongest of both locks as the only one present, and restart.
	 */
	if (xmax == add_to_xmax)
	{
		/*
		 * Note that it's not possible for the original tuple to be updated:
		 * we wouldn't be here because the tuple would have been invisible and
		 * we wouldn't try to update it.  As a subtlety, this code can also
		 * run when traversing an update chain to lock future versions of a
		 * tuple.  But we wouldn't be here either, because the add_to_xmax
		 * would be different from the original updater.
		 */
		Assert(HEAP_XMAX_IS_LOCKED_ONLY(old_infomask));

		/* acquire the strongest of both */
		if (mode < old_mode)
			mode = old_mode;
		/* mustn't touch is_update */
Re: Statistical aggregate functions are not working with PARTIAL aggregation
On Thu, 25 Jul 2019 at 06:52, Andres Freund wrote: > Now that master is open for development, and you have a commit bit, are > you planning to go forward with this on your own? I plan to, but it's not a high priority at the moment. I'd like to do much more in nodeAgg.c, TBH. It would be good to remove some code from nodeAgg.c and put it in the planner. I'd like to see: 1) Planner doing the Aggref merging for aggregates with the same transfn etc. 2) Planner trying to give nodeAgg.c a sorted path to work with on DISTINCT / ORDER BY aggs 3) Planner providing nodeAgg.c with the order that the aggregates should be evaluated in order to minimise sorting for DISTINCT / ORDER BY aggs. I'd take all those up on a separate thread though. If you're in a rush to see the cleanup proposed a few months ago then please feel free to take it up. It might be a while before I can get a chance to look at it again. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pgbench - allow to create partitioned tables
# and look at latency: # no parts = 0.071 ms # 1 hash = 0.071 ms (did someone optimize this case?!) # 2 hash ~ 0.126 ms (+ 0.055 ms) # 50 hash ~ 0.155 ms # 100 hash ~ 0.178 ms # 150 hash ~ 0.232 ms # 200 hash ~ 0.279 ms # overhead ~ (0.050 + [0.0005-0.0008] * nparts) ms It is linear? Good question. I would have hoped affine, but this is not very clear on these data, which are the median of about five runs, hence the bracket on the slope factor. At least it is increasing with the number of partitions. Maybe it would be clearer on the minimum of five runs. Here is a follow-up. On the minimum of all available runs the query time on hash partitions is about: 0.64375 nparts + 118.30979 (in µs). So the overhead is about 47.30979 + 0.64375 nparts, and it is indeed pretty convincingly linear as suggested by the attached figure. -- Fabien.
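For the curious, the fit is a plain ordinary least-squares regression, easy to reproduce by hand. This sketch reruns it on the median latencies quoted at the top of this message (the 0.64/118 coefficients came from the minimums of all runs, so the numbers differ somewhat):

```python
# Median per-query latencies quoted above, converted to microseconds.
nparts = [1, 2, 50, 100, 150, 200]
lat_us = [71, 126, 155, 178, 232, 279]

n = len(nparts)
mean_x = sum(nparts) / n
mean_y = sum(lat_us) / n
# Ordinary least squares: slope = Sxy / Sxx, intercept from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(nparts, lat_us))
sxx = sum((x - mean_x) ** 2 for x in nparts)
slope = sxy / sxx                    # per-partition cost, microseconds
intercept = mean_y - slope * mean_x  # fixed per-query cost, microseconds

# On the medians the slope is well under a microsecond per partition and
# the fixed cost dominates at low partition counts, consistent with the
# fit on the minimums.
assert 0.5 < slope < 1.2
assert 80 < intercept < 140
```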
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan wrote: > Attached is a revised version of your v2 that fixes this issue -- I'll > call this v3. Remember that index that I said was 5.5x smaller with the patch applied, following retail insertions (a single big INSERT ... SELECT ...)? Well, it's 6.5x faster with this small additional patch applied on top of the v3 I posted yesterday. Many of the indexes in my test suite are about ~20% smaller __in addition to__ very big size reductions. Some are even ~30% smaller than they were with v3 of the patch. For example, the fair use implementation of TPC-H that my test data comes from has an index on the "orders" o_orderdate column, named idx_orders_orderdate, which is made ~30% smaller by the addition of this simple patch (once again, this is following a single big INSERT ... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller than it is with master/Postgres 12, in case you were wondering. This new patch teaches nbtsplitloc.c to subtract posting list overhead when sizing the new high key for the left half of a candidate split point, since we know for sure that _bt_truncate() will at least manage to truncate away that much from the new high key, even in the worst case. Since posting lists are often very large, this can make a big difference. This is actually just a bugfix, not a new idea -- I merely made nbtsplitloc.c understand how truncation works with posting lists. There seems to be a kind of "synergy" between the nbtsplitloc.c handling of pages that have lots of duplicates and posting list compression. It seems as if the former mechanism "sets up the bowling pins", while the latter mechanism "knocks them down", which is really cool. We should try to gain a better understanding of how that works, because it's possible that it could be even more effective in some cases. -- Peter Geoghegan 0003-Account-for-posting-list-overhead-during-splits.patch Description: Binary data
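To illustrate the sizing change with made-up byte counts (this is a toy model, not the actual nbtsplitloc.c code): charging the prospective left-page high key at its guaranteed-truncated size, rather than at the full tuple width, can make a split point admissible that full-width accounting would reject.

```python
def highkey_size_for_split(key_size, posting_size, account_for_truncation=True):
    """Projected size of the new left-page high key for a candidate split.
    With the fix described above, we may assume truncation removes at
    least the posting list, so only the key portion is charged up front."""
    if account_for_truncation:
        return key_size
    return key_size + posting_size

# Made-up numbers: 200 bytes free on the left page, 150 already used, and
# the last left tuple is a 40-byte key carrying a 30-byte posting list.
PAGE_FREE, LEFT_USED = 200, 150
# Charged at full width, the high key doesn't fit and the split is rejected:
assert LEFT_USED + highkey_size_for_split(40, 30, False) > PAGE_FREE
# Charged at the guaranteed-truncated size, the same split is admissible:
assert LEFT_USED + highkey_size_for_split(40, 30, True) <= PAGE_FREE
```

Since posting lists can be large relative to the key, the difference between the two sizings is often substantial, which matches the page-size reductions reported above.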
Re: Built-in connection pooler
Hello Konstantin, > Concerning testing: I do not think that connection pooler needs some kind of special tests. > The idea of built-in connection pooler is that it should be able to handle all requests normal postgres can do. > I have added to regression tests extra path with enabled connection proxies. > Unfortunately, pg_regress is altering some session variables, so it backend becomes tainted and so > pooling is not actually used (but communication through proxy is tested). > Thank you for your work on this patch, I took some good time to really explore the configuration and do some testing with pgbench. This round of testing was done against patch 10 [1] and master branch commit a0555ddab9 from 7/22. Thank you for explaining, I wasn't sure. make installcheck-world: tested, passed Implements feature: tested, passed Documentation: I need to review again, I saw typos when testing but didn't make note of the details. Applying the patch [1] has improved from v9, still getting these: git apply -p1 < builtin_connection_proxy-10.patch :1536: indent with spaces. /* Has pending clients: serve one of them */ :1936: indent with spaces. /* If we attach new client to the existed backend, then we need to send handshake response to the client */ :2208: indent with spaces. if (port->sock == PGINVALID_SOCKET) :2416: indent with spaces. FuncCallContext* srf_ctx; :2429: indent with spaces. ps_ctx = (PoolerStateContext*)palloc(sizeof(PoolerStateContext)); warning: squelched 5 whitespace errors warning: 10 lines add whitespace errors. I used a DigitalOcean droplet with 2 CPU and 2 GB RAM and SSD for this testing, Ubuntu 18.04. I chose the smaller server size based on the availability of similar and recent results around connection pooling [2] that used AWS EC2 m4.large instance (2 cores, 8 GB RAM) and pgbouncer. Your prior pgbench tests [3] also focused on larger servers so I wanted to see how this works on smaller hardware. 
Considering this from connpool.sgml: "connection_proxies specifies number of connection proxy processes which will be spawned. Default value is zero, so connection pooling is disabled by default." That hints to me that connection_proxies is the main configuration to start with so that was the only configuration I changed from the default for this feature. I adjusted shared_buffers to 500MB (25% of total) and max_connections to 1000. Only having one proxy gives subpar performance across the board, so did setting this value to 10. My hunch is this value should roughly follow the # of cpus available, but that's just a hunch. I tested with 25, 75, 150, 300 and 600 connections. Initialized with a scale of 1000 and ran read-only tests. Basic pgbench commands look like the following, I have full commands and results from 18 tests included in the attached MD file. Postgres was restarted between each test. pgbench -i -s 1000 bench_test pgbench -p 6543 -c 300 -j 2 -T 600 -P 60 -S bench_test Tests were all ran from the same server. I intend to do further testing with external connections using SSL. General observations: For each value of connection_proxies, the TPS observed at 25 connections held up reliably through 600 connections. For this server using connection_proxies = 2 was the fastest, but setting to 1 or 10 still provided **predictable** throughput. That seems to be a good attribute for this feature. Also predictable was the latency increase, doubling the connections roughly doubles the latency. This was true with or without connection pooling. Focusing on disabled connection pooling vs the feature with two proxies, the results are promising, the breakpoint seems to be around 150 connections. Low connections (25): -15% TPS; +45% latency Medium connections (300): +21% TPS; +38% latency High connections (600): Couldn't run without pooling... aka: Win for pooling! The two attached charts show the TPS and average latency of these two scenarios. 
This feature does a great job of maintaining a consistent TPS as connection numbers increase. This comes with tradeoffs of lower throughput with < 150 connections, and higher latency across the board. The increase in latency seems reasonable to me based on the testing I have done so far. Compared to the results from [2] it seems latency is affecting this feature a bit more than it does pgbouncer, yet not unreasonably so given the benefit of having the feature built in and the reduced complexity. I don't understand yet how max_sessions ties in. Also, having both session_pool_size and connection_proxies seemed confusing at first. I still haven't figured out exactly how they relate together in the overall operation and their impact on performance. The new view helped, I get the concept of **what** it is doing (connection_proxies = more rows, session_pool_size = n_backends for each row), it's more a lack of understanding the **why** regarding how it will
Re: Memory Accounting
On Tue, Jul 23, 2019 at 06:18:26PM -0700, Melanie Plageman wrote: On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis wrote: Previous discussion: https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop This patch introduces a way to ask a memory context how much memory it currently has allocated. Each time a new block (not an individual chunk, but a new malloc'd block) is allocated, it increments a struct member; and each time a block is free'd, it decrements the struct member. So it tracks blocks allocated by malloc, not what is actually used for chunks allocated by palloc. Cool! I like how straight-forward this approach is. It seems easy to build on, as well. Are there cases where we are likely to palloc a lot without needing to malloc in a certain memory context? For example, do we have a pattern where, for some kind of memory intensive operator, we might palloc in a per tuple context and consistently get chunks without having to malloc and then later, where we to try and check the bytes allocated for one of these per tuple contexts to decide on some behavior, the number would not be representative? I think there's plenty of places where we quickly get into a state with enough chunks in the freelist - the per-tuple contexts are a good example of that, I think. I think that is basically what Heikki is asking about with HashAgg, but I wondered if there were other cases that you had already thought through where this might happen. I think Heikki was asking about places with a lot of sub-contexts, which is a completely different issue. It used to be the case that some aggregates created a separate context for each group - like array_agg. That would make Jeff's approach to accounting rather inefficient, because checking how much memory is used would be very expensive (having to loop over a large number of contexts). The purpose is for Memory Bounded Hash Aggregation, but it may be useful in more cases as we try to better adapt to and control memory usage at execution time.
This approach seems like it would be good for memory intensive operators which use a large, representative context. I think the HashTableContext for HashJoin might be one? Yes, that might be a good candidate (and it would be much simpler than the manual accounting we use now). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
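To make the block-versus-chunk distinction concrete, here is a toy model of the accounting scheme under discussion (illustrative only - real memory contexts also have chunk freelists, block-size doubling, and chunk headers): the counter moves only when a block is malloc'd, never on a palloc served out of an existing block.

```python
class Block:
    def __init__(self, size):
        self.size = size
        self.free = size

class MemoryContext:
    """Toy model: mem_allocated tracks malloc'd blocks only, so pallocs
    satisfied from space already inside a block don't touch the counter."""
    BLOCK_SIZE = 8192

    def __init__(self):
        self.blocks = []
        self.mem_allocated = 0   # bytes obtained from malloc

    def palloc(self, size):
        # Serve from the newest block if it has room (toy simplification)...
        if self.blocks and self.blocks[-1].free >= size:
            self.blocks[-1].free -= size
            return
        # ...otherwise "malloc" a new block and account for it.
        block = Block(max(size, self.BLOCK_SIZE))
        self.blocks.append(block)
        self.mem_allocated += block.size
        block.free -= size

ctx = MemoryContext()
for _ in range(100):
    ctx.palloc(64)           # 100 small chunks all fit in one 8 kB block
assert ctx.mem_allocated == 8192
ctx.palloc(32768)            # an oversized request forces a new block
assert ctx.mem_allocated == 8192 + 32768
```

This is also why the per-tuple-context question above matters: once a context has its blocks, arbitrarily many palloc/pfree cycles leave `mem_allocated` unchanged, so the counter reflects footprint rather than live chunk usage.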
Re: Index Skip Scan
> On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen > wrote: > > On 7/22/19 1:44 AM, David Rowley wrote: > > Here are the comments I noted down during the review: > > > > cost_index: > > > > I know you've not finished here, but I think it'll need to adjust > > tuples_fetched somehow to account for estimate_num_groups() on the > > Path's unique keys. Any Eclass with an ec_has_const = true does not > > need to be part of the estimate there as there can only be at most one > > value for these. > > > > For example, in a query such as: > > > > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y; > > > > you only need to perform estimate_num_groups() on "y". > > > > I'm really not quite sure on what exactly will be required from > > amcostestimate() here. The cost of the skip scan is not the same as > > the normal scan. So other that API needs adjusted to allow the caller > > to mention that we want skip scans estimated, or there needs to be > > another callback. > > > > I think this part will become more clear once the index skip scan patch > is rebased, as we got the uniquekeys field on the Path, and the > indexskipprefixy info on the IndexPath node. Here is what I came up with to address the problems mentioned above in this thread. It passes tests, but I haven't tested it more thoroughly yet (e.g. it occurred to me that `_bt_read_closest` probably wouldn't work if the next key that passes an index condition is a few pages away - I'll try to tackle that soon). Just another small step forward, but I hope it's enough to rebase the planner changes on top of it. Also I've added a few tags, mostly to mention the reviewers' contributions. v22-0001-Index-skip-scan.patch Description: Binary data
Re: initdb recommendations
On 2019-07-24 22:18, Tom Lane wrote: >> I think we could just define that if geteuid == getpeereid, then >> authentication succeeds. Possibly make that a setting if someone wants >> to turn it off. > > We would still need to make the proposed buildfarm changes, though, > because Windows. (And HPUX, though if it were the only holdout > maybe we could consider blowing it off.) > > I'm not that excited about weakening our authentication rules > just to make things easier for the buildfarm. Yes, this idea is separate from those buildfarm changes. > It's possible that what you suggest is a good idea anyway to reduce > the user impact of switching from trust to peer as default auth. > However, I'm a little worried that we'll start getting a lot of "it > works in psql but I can't connect via JDBC-or-whatever" complaints. Well, the existence of "local" vs. "host" already has that effect anyway. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_upgrade version checking questions
On 2019-07-23 17:30, Daniel Gustafsson wrote: > The reason for moving is that we print default values in usage(), and that > requires the value to be computed before calling usage(). We already do this > for resolving environment values in parseCommandLine(). If we do it in setup, > then we’d have to split out resolving the new_cluster.bindir into it’s own > function exposed to option.c, or do you have any other suggestions there? I think doing nontrivial work in order to print default values in usage() is bad practice, because in unfortunate cases it would even prevent you from calling --help. Also, in this case, it would probably very often exceed the typical line length of --help output and create some general ugliness. Writing something like "(default: same as this pg_upgrade)" would probably achieve just about the same. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Support for jsonpath .datetime() method
On 2019-07-24 00:48, Nikita Glukhov wrote: > It seems that our YY works like RR should: > > SELECT to_date('69', 'YY'); > to_date > > 2069-01-01 > (1 row) > > SELECT to_date('70', 'YY'); > to_date > > 1970-01-01 > (1 row) > > But by the standard first two digits of current year should be used in YY. Is this behavior even documented anywhere in our documentation? I couldn't find it. What's the exact specification of what it does in these cases? > So it's unclear what we should do: > - implement YY and RR strictly following the standard only in .datetime() > - fix YY implementation in to_date()/to_timestamp() and implement RR > - use our non-standard templates in .datetime() I think we definitely should try to use the same template system in both the general functions and in .datetime(). This might involve some compromises between existing behavior, Oracle behavior, SQL standard. So far I'm not worried: If you're using two-digit years like above, you're playing with fire anyway. Also some of the other cases like dealing with trailing spaces are probably acceptable as slight incompatibilities or extensions. We should collect a list of test cases that illustrate the differences and then work out how to deal with them. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
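For test-case collection purposes, the observed two-digit-year behavior can be pinned down from the examples quoted above. This sketch (Python, illustrative only - the real logic is in formatting.c) contrasts the observed pivot with the standard's stated YY rule:

```python
def to_date_yy(two_digits):
    """Two-digit year handling consistent with the examples quoted above:
    values below 70 land in the 2000s, 70 and up in the 1900s (an RR-like
    pivot, not the standard's 'current century' rule)."""
    assert 0 <= two_digits <= 99
    return 2000 + two_digits if two_digits < 70 else 1900 + two_digits

def standard_yy(two_digits, current_year=2019):
    """What the thread says the standard requires for YY: prefix the
    first two digits of the current year."""
    return (current_year // 100) * 100 + two_digits

# The behavior reported above:
assert to_date_yy(69) == 2069
assert to_date_yy(70) == 1970
# The standard's YY rule agrees for '69' but not for '70' (in 2019):
assert standard_yy(69) == 2069
assert standard_yy(70) == 2070
```

Cases like `'70'`, where the two rules diverge, seem like exactly the kind of entries the proposed list of differing test cases should include.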
Re: initdb recommendations
Peter Eisentraut writes: > If I'm logged in as the OS user that owns the data directory, I should > be able to log in to the database system via local socket as any user. > Because why stop me? I can just change pg_hba.conf to let me in. Hmm ... there's probably some minor loss of safety there, but not much, as you say. > I think we could just define that if geteuid == getpeereid, then > authentication succeeds. Possibly make that a setting if someone wants > to turn it off. We would still need to make the proposed buildfarm changes, though, because Windows. (And HPUX, though if it were the only holdout maybe we could consider blowing it off.) I'm not that excited about weakening our authentication rules just to make things easier for the buildfarm. It's possible that what you suggest is a good idea anyway to reduce the user impact of switching from trust to peer as default auth. However, I'm a little worried that we'll start getting a lot of "it works in psql but I can't connect via JDBC-or-whatever" complaints. So I dunno if it will really make things easier for users. regards, tom lane
Re: initdb recommendations
On 2019-07-22 19:40, Andres Freund wrote: > On 2019-07-22 13:02:13 -0400, Andrew Dunstan wrote: >> There are a few things we could do. We could force trust auth, or we >> could add an ident map that allowed $USER to login as buildfarm. Finding >> all the places we would need to fix that could be a fun project ... > > Perhaps we could actually do so automatically when the initdb invoking > user isn't the same as the OS user? Imo that'd be generally quite > useful, and not just for the regression tets. It seems to me that there is something missing in our client authentication system here. If I'm logged in as the OS user that owns the data directory, I should be able to log in to the database system via local socket as any user. Because why stop me? I can just change pg_hba.conf to let me in. That would also address this problem that when you use the initdb -U option, the proposed default "peer" setting doesn't help you much. Making a pg_ident.conf map automatically helps for that particular user combination, but then not for other users. (There is no "sameuser plus these additional mappings".) I think we could just define that if geteuid == getpeereid, then authentication succeeds. Possibly make that a setting if someone wants to turn it off. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
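For what it's worth, the check itself is cheap to express. On Linux the peer's uid is available via SO_PEERCRED (the BSDs have getpeereid()), so a user-space sketch of "succeed when geteuid() == getpeereid()" looks like this (illustrative only, Linux-specific, and not the shape of the actual auth code):

```python
import os
import socket
import struct

def peer_uid(sock):
    """uid of the process on the other end of an AF_UNIX socket.
    SO_PEERCRED is Linux-specific; BSDs would use getpeereid() instead."""
    creds = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                            struct.calcsize('3i'))
    pid, uid, gid = struct.unpack('3i', creds)
    return uid

def peer_is_self(sock):
    """The proposed rule: authentication succeeds outright when the
    connecting OS user is the same user the server runs as."""
    return peer_uid(sock) == os.geteuid()

# Both ends of a socketpair belong to this process, so the check passes.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
assert peer_is_self(a)
a.close()
b.close()
```

The interesting policy question is the one raised above: whether this should be unconditional or gated behind a setting for people who deliberately separate the data-directory owner from database superusers.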
Re: initdb recommendations
On 7/24/19 10:00 AM, Andrew Dunstan wrote: > On 7/23/19 2:12 AM, Peter Eisentraut wrote: >> On 2019-07-22 21:16, Andrew Dunstan wrote: >>> Modulo this issue, experimentation shows that adding '-A trust' to the >>> line in run_build.pl where initdb is called fixes the issue. If we're >>> going to rely on a buildfarm client fix that one seems simplest. >> Yes, that is the right fix. It's what the in-tree test drivers >> (pg_regress, PostgresNode.pm) do. >> > > I have done that, I will put out a new release probably right after the > CF closes. > > > I think we also need to change vcregress.pl to use trust explicitly for > upgrade checks, just like the Unix upgrade test script does. That should > help to future-proof us a bit. > > Here's a patch along those lines that pretty much syncs up vcregress.pl's initdb with pg_upgrade's test.sh. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl index 5495066b4d..05446c3f59 100644 --- a/src/tools/msvc/vcregress.pl +++ b/src/tools/msvc/vcregress.pl @@ -496,12 +496,28 @@ sub recoverycheck } # Run "initdb", then reconfigure authentication. 
+# mimics what is done in src/bin/pg_upgrade/test.sh:standard_initdb() sub standard_initdb { - return ( - system('initdb', '-N') == 0 and system( - "$topdir/$Config/pg_regress/pg_regress", '--config-auth', - $ENV{PGDATA}) == 0); + my @opts = qw(-N --wal-segsize 1 -g -A trust); + my $status = system('initdb',@opts); + if ($status == 0) + { + if (defined($ENV{TEMP_CONFIG}) && -r $ENV{TEMP_CONFIG}) + { + open(my $handle, '<', $ENV{TEMP_CONFIG}) + || die "file $ENV{TEMP_CONFIG} not found"; + my $contents = <$handle>; + close($handle); + open($handle, '>>', "$ENV{PGDATA}/postgresql.conf") + || die "file $ENV{PGDATA}/postgresql.conf not found"; + print $handle $contents; + close($handle); + } + $status = system("$topdir/$Config/pg_regress/pg_regress", + '--config-auth', $ENV{PGDATA}) + } + return $status; } # This is similar to appendShellString(). Perl system(@args) bypasses
Re: initdb recommendations
On 2019-07-24 16:00, Andrew Dunstan wrote: > I think we also need to change vcregress.pl to use trust explicitly for > upgrade checks, just like the Unix upgrade test script does. That should > help to future-proof us a bit. Right, I'll add that to my patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: psql - add SHOW_ALL_RESULTS option
Bonjour Daniel, I kind of agree as well, but I was pretty sure that someone would complain if the current behavior was changed. If queries in a compound statement must be kept silent, they can be converted to CTEs or DO-blocks to produce the same behavior without having to configure anything in psql. That cost on users doesn't seem too bad, compared to introducing a knob in psql, and presumably maintaining it forever. Ok. Attached a "do it always version", which does the necessary refactoring. There is seldom new code, it is rather moved around, some functions are created for clarity. -- Fabien.diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 7789fc6177..5e4f653f88 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -3355,10 +3355,8 @@ select 1\; select 2\; select 3; commands included in the string to divide it into multiple transactions. (See for more details about how the server handles multi-query strings.) -psql prints only the last query result -it receives for each request; in this example, although all -three SELECTs are indeed executed, psql -only prints the 3. +psql prints all results it receives, one +after the other. diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c index 44a782478d..4534c45854 100644 --- a/src/bin/psql/common.c +++ b/src/bin/psql/common.c @@ -472,6 +472,16 @@ ResetCancelConn(void) #endif } +static void +ShowErrorMessage(const PGresult *result) +{ + const char *error = PQerrorMessage(pset.db); + + if (strlen(error)) + pg_log_info("%s", error); + + CheckConnection(); +} /* * AcceptResult @@ -482,7 +492,7 @@ ResetCancelConn(void) * Returns true for valid result, false for error state. 
*/ static bool -AcceptResult(const PGresult *result) +AcceptResult(const PGresult *result, bool show_error) { bool OK; @@ -513,15 +523,8 @@ AcceptResult(const PGresult *result) break; } - if (!OK) - { - const char *error = PQerrorMessage(pset.db); - - if (strlen(error)) - pg_log_info("%s", error); - - CheckConnection(); - } + if (!OK && show_error) + ShowErrorMessage(result); return OK; } @@ -701,7 +704,7 @@ PSQLexec(const char *query) ResetCancelConn(); - if (!AcceptResult(res)) + if (!AcceptResult(res, true)) { ClearOrSaveResult(res); res = NULL; @@ -743,7 +746,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt) ResetCancelConn(); - if (!AcceptResult(res)) + if (!AcceptResult(res, true)) { ClearOrSaveResult(res); return 0; @@ -999,199 +1002,113 @@ loop_exit: return success; } - /* - * ProcessResult: utility function for use by SendQuery() only - * - * When our command string contained a COPY FROM STDIN or COPY TO STDOUT, - * PQexec() has stopped at the PGresult associated with the first such - * command. In that event, we'll marshal data for the COPY and then cycle - * through any subsequent PGresult objects. - * - * When the command string contained no such COPY command, this function - * degenerates to an AcceptResult() call. - * - * Changes its argument to point to the last PGresult of the command string, - * or NULL if that result was for a COPY TO STDOUT. (Returning NULL prevents - * the command status from being printed, which we want in that case so that - * the status line doesn't get taken as part of the COPY data.) - * - * Returns true on complete success, false otherwise. Possible failure modes - * include purely client-side problems; check the transaction status for the - * server-side opinion. + * Marshal the COPY data. Either subroutine will get the + * connection out of its COPY state, then call PQresultStatus() + * once and report any error. 
+ * + * For COPY OUT, direct the output to pset.copyStream if it's set, + * otherwise to pset.gfname if it's set, otherwise to queryFout. + * For COPY IN, use pset.copyStream as data source if it's set, + * otherwise cur_cmd_source. + * + * Update result if further processing is necessary. (Returning NULL + * prevents the command status from being printed, which we want in that + * case so that the status line doesn't get taken as part of the COPY data.) */ static bool -ProcessResult(PGresult **results) +HandleCopyResult(PGresult **result) { bool success = true; - bool first_cycle = true; + FILE *copystream; + PGresult *copy_result; + ExecStatusType result_status = PQresultStatus(*result); - for (;;) + Assert(result_status == PGRES_COPY_OUT || + result_status == PGRES_COPY_IN); + + SetCancelConn(); + if (result_status == PGRES_COPY_OUT) { - ExecStatusType result_status; - bool is_copy; - PGresult *next_result; + bool need_close = false; + bool is_pipe = false; - if (!AcceptResult(*results)) + if (pset.copyStream) { - /* - * Failure at this point is always a server-side failure or a - * failure to submit the command
RE: seems like a bug in pgbench -R
Hello Yoshikazu, I could not reproduce this issue on head, but I confirm on 11.2. I could reproduce the hang on 11.4. Attached is a fix to apply on pg11. I confirm the hang doesn't happen after applying your patch. Ok, thanks for the feedback. + /* under throttling we may have finished the last client above */ + if (remains == 0) + break; If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients, a thread needs to wait for results or sleep. In that logic, there is a case where a thread tries to wait for results when no clients are actually waiting for results, and this causes the issue. This happens when there are only CSTATE_THROTTLE clients and the pgbench timeout occurs. Those clients will be finished and "remains" will be 0. I confirmed the above code prevents such a case. Yep. I almost think this is ready for committer. Almost great, then. But I have one question. Would it be better to add a check like if (maxsock != -1) before the select? else /* no explicit delay, select without timeout */ { nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); } I think that it is not necessary because this case cannot happen: If some clients are still running (remains > 0), they are either sleeping, in which case there would be a timeout, or they are waiting for something from the server, otherwise the script could be advanced further so there would be something else to do for the thread. We could check this by adding "Assert(maxsock != -1);" before this select, but I would not do that for a released version. -- Fabien.
add a MAC check for TRUNCATE
Hackers, Since all DAC checks should have corresponding MAC checks, this patch adds a hook to allow extensions to implement a MAC check on TRUNCATE. I have also implemented this access check in the sepgsql extension. One important thing to note is that refpolicy [1] and Redhat-based distributions do not have the SELinux permission for db_table {truncate} implemented. This patch is the first step to add this permission to the upstream SELinux policy. If this permission does not exist in the policy, sepgsql is being used, and `deny_unknown` is set to 1, then TRUNCATE will be denied. As a workaround for this behavior, the SELinux-aware system would need to have `/sys/fs/selinux/deny_unknown` set to 0 until the permission has been added to the refpolicy/Redhat SELinux policy. The deny_unknown behavior can be set using CIL [2] by extracting the base SELinux module and setting how the kernel handles unknown permissions. The dependencies for overriding handle_unknown are policycoreutils, selinux-policy-targeted, and a libsemanage version that supports CIL (CentOS 7+). $ sudo semodule -cE base $ sed -Ei 's/(handleunknown )deny/\1allow/g' base.cil $ sudo semodule -i base.cil Thanks, Yuli [1] https://github.com/SELinuxProject/refpolicy/blob/master/policy/flask/access_vectors#L794 [2] https://github.com/SELinuxProject/selinux/blob/master/secilc/docs/cil_policy_config_statements.md#handleunknown 0001-Use-MAC-in-addition-to-DAC-for-TRUNCATE.patch
Re: Statistical aggregate functions are not working with PARTIAL aggregation
Hi, On 2019-05-20 17:27:10 +1200, David Rowley wrote: > On Mon, 20 May 2019 at 13:20, Andres Freund wrote: > > How > > about we have something roughly like: > > > > int numTransFnArgs = -1; > > int numCombineFnArgs = -1; > > Oid transFnInputTypes[FUNC_MAX_ARGS]; > > Oid combineFnInputTypes[2]; > > > > if (DO_AGGSPLIT_COMBINE(...) > >numCombineFnArgs = 1; > >combineFnInputTypes = list_make2(aggtranstype, aggtranstype); > > else > >numTransFnArgs = get_aggregate_argtypes(aggref, transFnInputTypes); > > > > ... > > > > if (DO_AGGSPLIT_COMBINE(...)) > > build_pertrans_for_aggref(pertrans, aggstate, estate, > > aggref, combinefn_oid, aggtranstype, > > serialfn_oid, deserialfn_oid, > > initValue, initValueIsNull, > > combineFnInputTypes, numCombineFnArgs); > > else > > build_pertrans_for_aggref(pertrans, aggstate, estate, > > aggref, transfn_oid, aggtranstype, > > serialfn_oid, deserialfn_oid, > > initValue, initValueIsNull, > > transFnInputTypes, numTransFnArgs); > > > > seems like that'd make the code clearer? > > I think that might be a good idea... I mean apart from trying to > assign a List to an array :) We still must call > get_aggregate_argtypes() in order to determine the final function, so > the code can't look exactly like you've written. > > > I wonder if we shouldn't > > strive to have *no* DO_AGGSPLIT_COMBINE specific logic in > > build_pertrans_for_aggref (except perhaps for an error check or two). > > Just so we have a hard copy to review and discuss, I think this would > look something like the attached. > > We do miss out on a few very small optimisations, but I don't think > they'll be anything we could measure. Namely > build_aggregate_combinefn_expr() called make_agg_arg() once and used > it twice instead of calling it once for each arg. I don't think > that's anything we could measure, especially in a situation where > two-stage aggregation is being used. > > I ended up also renaming aggtransfn to transfn_oid in > build_pertrans_for_aggref(). 
Having it called aggtransfn seems a bit > too close to the pg_aggregate.aggtransfn column, which is confusing given that we might pass it the value of the aggcombinefn column. Now that master is open for development, and you have a commit bit, are you planning to go forward with this on your own? Greetings, Andres Freund
Re: Adding a test for speculative insert abort case
Hi, On 2019-06-05 15:49:47 -0700, Melanie Plageman wrote: > On Thu, May 16, 2019 at 8:46 PM Melanie Plageman > wrote: > > > > > Good idea. > > I squashed the changes I suggested in previous emails, Ashwin's patch, my > > suggested updates to that patch, and the index order check all into one > > updated > > patch attached. > > > > > I've updated this patch to make it apply on master cleanly. Thanks to > Alvaro for format-patch suggestion. Planning to push this, now that v12 is branched off. But only to master, I don't think it's worth backpatching at the moment. > The second patch in the set is another suggestion I have. I noticed > that the insert-conflict-toast test mentions that it is "not > guaranteed to lead to a failed speculative insertion" and, since it > seems to be testing the speculative abort but with TOAST tables, I > thought it might work to kill that spec file and move that test case > into insert-conflict-specconflict so the test can utilize the existing > advisory locks being used for the other tests in that file to make it > deterministic which session succeeds in inserting the tuple. Seems like a good plan. 
> diff --git a/src/test/isolation/specs/insert-conflict-specconflict.spec > b/src/test/isolation/specs/insert-conflict-specconflict.spec > index 3a70484fc2..7f29fb9d02 100644 > --- a/src/test/isolation/specs/insert-conflict-specconflict.spec > +++ b/src/test/isolation/specs/insert-conflict-specconflict.spec > @@ -10,7 +10,7 @@ setup > { > CREATE OR REPLACE FUNCTION blurt_and_lock(text) RETURNS text IMMUTABLE > LANGUAGE plpgsql AS $$ > BEGIN > -RAISE NOTICE 'called for %', $1; > +RAISE NOTICE 'blurt_and_lock() called for %', $1; > > -- depending on lock state, wait for lock 2 or 3 > IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, > 1) THEN > @@ -23,9 +23,16 @@ setup > RETURN $1; > END;$$; > > +CREATE OR REPLACE FUNCTION blurt_and_lock2(text) RETURNS text IMMUTABLE > LANGUAGE plpgsql AS $$ > +BEGIN > +RAISE NOTICE 'blurt_and_lock2() called for %', $1; > +PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, > 4); > +RETURN $1; > +END;$$; > + Any chance for a bit more descriptive naming than *2? I can live with it, but ... > +step "controller_print_speculative_locks" { SELECT > locktype,classid,objid,mode,granted FROM pg_locks WHERE locktype='speculative > +token' ORDER BY granted; } I think showing the speculative locks is possibly going to be unreliable - the release time of speculative locks is IIRC not that reliable. I think it could e.g. happen that speculative locks are held longer because autovacuum spawned an analyze in the background. > + # Should report s1 is waiting on speculative lock > + "controller_print_speculative_locks" Hm, I might be missing something, but I don't think it currently does. 
Looking at the expected file: +step controller_print_speculative_locks: SELECT locktype,classid,objid,mode,granted FROM pg_locks WHERE locktype='speculative +token' ORDER BY granted; +locktype classid objid mode granted + And if it showed something, it'd make the test not work, because classid/objid aren't necessarily going to be the same from test to test. Greetings, Andres Freund
Re: GiST VACUUM
On Wed, Jul 24, 2019 at 11:33 AM Heikki Linnakangas wrote: > That's probably how it's going to go, but hey, doesn't hurt to ask :-). I think that it would be fine to be conservative with nbtree, and only target the master branch. The problem is annoying, certainly, but it's not likely to make a huge difference for most real world workloads. OTOH, perhaps the risk is so low that we might as well target backbranches. How do you feel about it? -- Peter Geoghegan
Re: GiST VACUUM
On 24/07/2019 21:02, Peter Geoghegan wrote: On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas wrote: Pushed this now, to master and REL_12_STABLE. Now, B-tree indexes still have the same problem, in all versions. Any volunteers to write a similar fix for B-trees? I was hoping that you'd work on it. :-) That's probably how it's going to go, but hey, doesn't hurt to ask :-). Any reason to think that it's much different to what you've done with GiST? No, it should be very similar. - Heikki
Re: GiST VACUUM
On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas wrote: > Pushed this now, to master and REL_12_STABLE. > > Now, B-tree indexes still have the same problem, in all versions. Any > volunteers to write a similar fix for B-trees? I was hoping that you'd work on it. :-) Any reason to think that it's much different to what you've done with GiST? -- Peter Geoghegan
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > Here's a more polished version with the debug code removed, complete > with benchmarks. A few gripes: You're measuring the number of locks held at completion of the transaction, which fails to account for locks transiently taken and released, so that the actual peak usage will be more. I think we mostly only do that for system catalog accesses, so *maybe* the number of extra locks involved isn't very large, but that's a very shaky assumption. I don't especially like the fact that this seems to have a hard-wired (and undocumented) assumption that buckets == entries, ie that the fillfactor of the table is set at 1.0. lock.c has no business knowing that. Perhaps instead of returning the raw bucket count, you could have dynahash.c return bucket count times fillfactor, so that the number is in the same units as the entry count. This: running_avg_locks -= running_avg_locks / 10.0; running_avg_locks += numLocksHeld / 10.0; seems like a weird way to do the calculation. Personally I'd write running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0; which is more the way I'm used to seeing exponential moving averages computed --- at least, it seems clearer to me why this will converge towards the average value of numLocksHeld over time. It also makes it clear that it wouldn't be sane to use two different divisors, whereas your formulation makes it look like maybe they could be set independently. Your compiler might not complain that LockReleaseAll's numLocksHeld is potentially uninitialized, but other people's compilers will. On the whole, I don't especially like this approach, because of the confusion between peak lock count and end-of-xact lock count. That seems way too likely to cause problems. regards, tom lane
Re: POC: Cleaning up orphaned files using undo logs
Hi, I have done some review of the undolog patch series and here are my comments: 0003-Add-undo-log-manager.patch 1) As the undo log is created in a tablespace, if the tablespace is dropped later, will it have any impact? +void +UndoLogDirectory(Oid tablespace, char *dir) +{ + if (tablespace == DEFAULTTABLESPACE_OID || + tablespace == InvalidOid) + snprintf(dir, MAXPGPATH, "base/undo"); + else + snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo", + tablespace, TABLESPACE_VERSION_DIRECTORY); +} 2) Header file exclusion a) The following headers can be excluded in undolog.c +#include "access/transam.h" +#include "access/undolog.h" +#include "access/xlogreader.h" +#include "catalog/catalog.h" +#include "nodes/execnodes.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "storage/fd.h" +#include "storage/lwlock.h" +#include "storage/shmem.h" +#include "storage/standby.h" +#include "storage/sync.h" +#include "utils/memutils.h" b) The following headers can be excluded from undofile.c +#include "access/undolog.h" +#include "catalog/database_internal.h" +#include "miscadmin.h" +#include "postmaster/bgwriter.h" +#include "storage/fd.h" +#include "storage/smgr.h" +#include "utils/memutils.h" 3) Some macro replacement. a) Session.h +++ b/src/include/access/session.h @@ -17,6 +17,9 @@ /* Avoid including typcache.h */ struct SharedRecordTypmodRegistry; +/* Avoid including undolog.h */ +struct UndoLogSlot; + /* * A struct encapsulating some elements of a user's session. For now this * manages state that applies to parallel query, but in principle it could @@ -27,6 +30,10 @@ typedef struct Session dsm_segment *segment; /* The session-scoped DSM segment. */ dsa_area *area; /* The session-scoped DSA area. */ + /* State managed by undolog.c. */ + struct UndoLogSlot *attached_undo_slots[4]; /* UndoLogCategories */ + bool need_to_choose_undo_tablespace; + Should we change 4 to UndoLogCategories or a suitable macro?
b) +static inline size_t +UndoLogNumSlots(void) +{ + return MaxBackends * 4; +} Should we change 4 to UndoLogCategories or a suitable macro? c) +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, + UndoLogOffset end) +{ + struct stat stat_buffer; + off_t size; + char path[MAXPGPATH]; + void *zeroes; + size_t nzeroes = 8192; + int fd; Should we use BLCKSZ instead of 8192? 4) Should we add a readme file for undolog as it does a fair amount of work and is a core part of the undo system? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com On Wed, Jul 24, 2019 at 5:44 PM Amit Kapila wrote: > > On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila wrote: > > > > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila wrote: > > > > 7. > > +attach_undo_log(UndoLogCategory category, Oid tablespace) > > { > > .. > > if (candidate->meta.tablespace == tablespace) > > + { > > + logno = *place; > > + slot = candidate; > > + *place = candidate->next_free; > > + break; > > + } > > > > Here, the code is breaking from the loop, so why do we need to set > > *place? Am I missing something obvious? > > > > I think I know what I was missing. It seems here you are removing an > element from the freelist. > > One point related to detach_current_undo_log. > > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->pid = InvalidPid; > + slot->meta.unlogged.xid = InvalidTransactionId; > + if (full) > + slot->meta.status = UNDO_LOG_STATUS_FULL; > + LWLockRelease(&slot->mutex); > > If I read the comments in structure UndoLogMetaData, it is mentioned > that 'status' is changed by explicit WAL record whereas there is no > WAL record in code to change the status. I see the problem as well if > we don't WAL log this change. Suppose after changing the status of > this log, we allocate a new log and insert some records in that log as > well for the same transaction for which we have inserted records in > the log which we just marked as FULL.
Now, here we form the link > between two logs as the same transaction has overflowed into a new > log. Say, we crash after this. Now, after recovery the log won't be > marked as FULL which means there is a chance that it can be used for > some other transaction, if that happens, then our link for a > transaction spanning to different log will break and we won't be able > to access the data in another log. In short, I think it is important > to WAL log this status change unless I am missing something. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > > -- Regards, vignesh Have a nice day
Re: Seek failure at end of FSM file during WAL replay (in 11)
Michael Paquier writes: > Recently, one of the test beds we use has blown up once when doing > streaming replication like that: > FATAL: could not seek to end of file "base/16386/19817_fsm": No such >file or directory > CONTEXT: WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197 > LOG: startup process (PID 44886) exited with exit code 1 > All the WAL records have been wiped out since, so I don't know exactly > what happened, but I could track down that this FSM file got removed > a couple of hours before as I got my hands on some FS-level logs which > showed a deletion. Hm. AFAICS the immediate issuer of the error must have been _mdnblocks(); there are other matches to that error string but they are in places where we can tell which file the seek must have been applied to, and it wasn't a FSM file. > Before blaming a lower level of > the application stack, I am wondering if we have some issues with > mdfd_vfd meaning that the file has been removed but that it is still > tracked as opened. lseek() per se presumably would never return ENOENT. A more likely theory is that the file wasn't actually open but only had a leftover VFD entry, and when FileSize() -> FileAccess() tried to open it, the open failed with ENOENT --- but _mdnblocks() would still call it a seek failure. So I'd opine that this is a pretty high-level failure --- what are we doing trying to replay WAL against a table that's been dropped? Or if it wasn't dropped, why was the FSM removed? regards, tom lane
Re: GiST VACUUM
On 22/07/2019 16:09, Heikki Linnakangas wrote: Unless something comes up, I'll commit this tomorrow. Pushed this now, to master and REL_12_STABLE. Now, B-tree indexes still have the same problem, in all versions. Any volunteers to write a similar fix for B-trees? - Heikki
Re: "localtime" value in TimeZone
Shay Rojansky writes: > In (certain) out-of-the-box PostgreSQL installations, the timezone GUC is > set to "localtime", which seems to mean to query the OS for the value. > Unless I'm mistaken, the issue with this is that it doesn't allow clients > inspecting the TimeZone GUC to actually know what timezone the server is > in, making the GUC largely useless (and creates friction as the GUC can't > be expected to always contain valid IANA/Olson values). It would be more > useful if PostgreSQL exposed the actual timezone provided by the OS. > Does this make sense? Yeah, this is something that some tzdb packagers do --- they put a "localtime" file into /usr/share/zoneinfo that is a symlink or hard link to the active zone file, and then initdb tends to seize on that as being the shortest available spelling of the active zone. I opined in https://www.postgresql.org/message-id/27991.1560984...@sss.pgh.pa.us that we should avoid choosing "localtime", but that thread seems stalled on larger disagreements about how complicated we want that mechanism to be. > As a side note, there doesn't seem to be any specific documentation on the > special "localtime" value of this GUC That's because it's nonstandard and platform-specific. It's also not special from our standpoint --- it's just another zone file. regards, tom lane
Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )
On 2019-Jul-24, Ian Barwick wrote: > It'd be better if such a hypothetical option validated the provided > slot name anyway, to prevent later surprises. Hmm, but what would we do if the validation failed? > Revised patch attached, which as Alvaro suggests removes the escaping > and adds a comment explaining why the raw value can be passed as-is. Heh, yesterday I revised the original patch as attached and was about to push when the bell rang. I like this one because it keeps the comment to one line and it mentions the function name in charge of the validation (useful for grepping later on). It's a bit laconic because of the long function name and the desire to keep it to one line, but it seems sufficient to me. BTW upper case letters are not allowed :-) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From e258d242fc0ef653aa7654af6fb065c7133ba430 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 23 Jul 2019 17:48:18 -0400 Subject: [PATCH v3] Don't uselessly escape a string that doesn't need escaping Per gripe from Ian Barwick Discussion: https://postgr.es/m/cabvvfjwnnnkb8chstlhktsvl1+g6bvcv+57+w1jz61p8ygp...@mail.gmail.com --- src/bin/pg_basebackup/pg_basebackup.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index 77a7c148ba..e7755e807d 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -1715,11 +1715,9 @@ GenerateRecoveryConf(PGconn *conn) free(escaped); if (replication_slot) - { - escaped = escape_quotes(replication_slot); - appendPQExpBuffer(recoveryconfcontents, "primary_slot_name = '%s'\n", replication_slot); - free(escaped); - } + /* unescaped: ReplicationSlotValidateName allows [a-z0-9_] only */ + appendPQExpBuffer(recoveryconfcontents, "primary_slot_name = '%s'\n", + replication_slot); if (PQExpBufferBroken(recoveryconfcontents) ||
PQExpBufferDataBroken(conninfo_buf)) -- 2.17.1
"localtime" value in TimeZone
Greetings everyone. In (certain) out-of-the-box PostgreSQL installations, the timezone GUC is set to "localtime", which seems to mean to query the OS for the value. Unless I'm mistaken, the issue with this is that it doesn't allow clients inspecting the TimeZone GUC to actually know what timezone the server is in, making the GUC largely useless (and creates friction as the GUC can't be expected to always contain valid IANA/Olson values). It would be more useful if PostgreSQL exposed the actual timezone provided by the OS. Does this make sense? As a side note, there doesn't seem to be any specific documentation on the special "localtime" value of this GUC (e.g. https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES ). Shay
Re: [bug fix] Produce a crash dump before main() on Windows
On 2019-Jul-24, Kyotaro Horiguchi wrote: > Hello. > > On Wed, Jul 24, 2019 at 2:31 AM Alvaro Herrera > wrote: > > > > On 2019-Jul-23, Tom Lane wrote: > > > > > Kyotaro Horiguchi writes: > > > > > > My investigation convinced me that there is no way for a process > > > > to detect whether it is running as a service (except the process > > > > directly called from system (aka entry function)). > > > > Maybe we can try calling pgwin32_is_service()? > > Ah, it might be viable. (I'm not sure though.) I didn't think that > some process property we are enforcing was available. Well, ereport() relies on this pretty extensively, so it seems okay to rely on it for this too. > The difference of internal behavior is added to unify the external > (apparent) behavior. It prevents the WER dialog from showing while > running on a console. A service doesn't show a dialog even if WER is > enabled. Yeah, that seems to be what we want (namely, not to get the server process blocked waiting for the user to click on the dialog box). > The remaining issue is we cannot obtain a dump when running > on a console without being blocked by a dialog, but Windows just doesn't > allow it. (Might be possible if we ignore Windows 7? 8? and earlier) I don't know what a "dump" is in this context, but in the errbacktrace thread I linked to a Stackoverflow question where they mentioned the API to obtain a stack backtrace for a process in Windows. Maybe logging that much info is sufficient for most cases. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Compile from source using latest Microsoft Windows SDK
On 7/22/19 4:23 AM, Michael Paquier wrote: > On Mon, Jul 22, 2019 at 04:01:46PM +0800, Peifeng Qiu wrote: >>> but it's really only a major issue for VS2019 >> VS2019 will use the latest v10 SDK by default. So no need to install 8.1 >> for VS2019. > Yes, FWIW, I have tested with VS2019 when committing 2b1394f, and in > this case only the v10 SDK got installed, with no actual issues > related to the dependency of the SDK reported. In this case I have > installed VS using the community installer provided by Microsoft. > >>> I guess we might need a test for what SDK is available? >> We can just use the WindowsSDKVersion environment variable to >> determine the SDK for current cmd session. It's set when you start >> the Visual Studio Prompt or call one bat script. Developers can >> choose the right version best suit their need. Detecting all >> installed SDK version can be done with some registry magic but I >> think that's not necessary in this case. > This looks more sensible to do if the environment variable is > available. Looking around this variable is available when using the > command prompt for native tools. So using it sounds like a good idea > to me if it exists. Yeah, on consideration I think Peifeng's patch upthread looks OK. (Incidentally, this variable is not set in the very old version of VC running on currawong). cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
double free in ExecHashJoin, 9.6.12
Server is generally running pretty well, and is high volume. This query is not new and is also medium volume. Database rebooted in about 4 seconds with no damage; fast enough we didn't even trip alarms (I noticed this troubleshooting another issue). We are a couple of bug-fix releases behind but I didn't see anything obvious in the release notes suggesting a resolved issue. Anyone have any ideas? thanks in advance.

merlin

*** glibc detected *** postgres: rms ysconfig 10.33.190.21(36788) SELECT: double free or corruption (!prev): 0x01fb2140 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75dee)[0x7f4fde053dee]
/lib64/libc.so.6(+0x78c80)[0x7f4fde056c80]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(ExecHashJoin+0x5a2)[0x5e2d32]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(ExecProcNode+0x208)[0x5cf728]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(standard_ExecutorRun+0x18a)[0x5cd1ca]
postgres: rms ysconfig 10.33.190.21(36788) SELECT[0x6e5607]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(PortalRun+0x188)[0x6e67d8]
postgres: rms ysconfig 10.33.190.21(36788) SELECT[0x6e2af3]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(PostgresMain+0x75a)[0x6e456a]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(PostmasterMain+0x1875)[0x6840b5]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(main+0x7a8)[0x60b528]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f4fddffcd1d]
postgres: rms ysconfig 10.33.190.21(36788) SELECT[0x46c589]

2019-07-23 09:41:41 CDT[:@]:LOG: server process (PID 18057) was terminated by signal 6: Aborted
2019-07-23 09:41:41 CDT[:@]:DETAIL: Failed process was running: SELECT JR.job_id as jobId,JR.job_execution_id as jobResultId,JR.created as lastRunDate, JR.status as status, JR.status_message as statusMessage, JR.output_format as outputFormat, JS.schedule_name as scheduleName, JS.job_name as reportName, JS.created_by as scheduledBy, JS.product as source FROM (SELECT JR.job_id, MAX(JR.created) AS MaxCreated FROM job_schedule JS JOIN job_result JR ON
JR.job_id=JS.job_id WHERE (lower(JS.recepients) like lower($1) OR lower(JS.created_by) = lower($2)) GROUP BY JR.job_id) TMP JOIN job_result JR ON JR.job_id = TMP.job_id AND JR.created = TMP.MaxCreated JOIN job_schedule JS ON JS.job_id = JR.job_id AND JS.job_type='CRON' merlin
Re: initdb recommendations
On 7/23/19 2:12 AM, Peter Eisentraut wrote:
> On 2019-07-22 21:16, Andrew Dunstan wrote:
>> Modulo this issue, experimentation shows that adding '-A trust' to the
>> line in run_build.pl where initdb is called fixes the issue. If we're
>> going to rely on a buildfarm client fix that one seems simplest.
> Yes, that is the right fix. It's what the in-tree test drivers
> (pg_regress, PostgresNode.pm) do.

I have done that, and I will put out a new release probably right after the CF closes. I think we also need to change vcregress.pl to use trust explicitly for upgrade checks, just like the Unix upgrade test script does. That should help to future-proof us a bit.

cheers

andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Psql patch to show access methods info
On Wed, Jul 24, 2019 at 9:01 AM Andres Freund wrote:
> Based on a quick skim of the thread - which means I most definitely
> missed things - there's not been discussion of why we actually want to
> add this. Who's the prospective user of this facility? And why wouldn't
> they just query pg_am[proc]? None of this information seems like it's
> going to be even remotely targeted towards even advanced users. For
> developers it's not clear what these add?

I see your point regarding pg_am details. Probably nobody except developers needs this. And probably even developers don't need it, because it's easier to look at IndexAmRoutine directly with more details. So, +1 for removing this.

pg_amproc for gin/gist/sp-gist/brin is probably for developers. But I think pg_amproc for btree/hash could be useful for advanced users. btree/hash opclasses can be written by advanced users using pl/something; I've faced that several times.

> Adding stuff to psql isn't free. It adds clutter to psql's help output,
> the commands need to be maintained (including cross-version code).

Sure.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: initdb recommendations
On 7/22/19 1:40 PM, Andres Freund wrote:
> Hi,
>
> On 2019-07-22 13:02:13 -0400, Andrew Dunstan wrote:
>> There are a few things we could do. We could force trust auth, or we
>> could add an ident map that allowed $USER to login as buildfarm. Finding
>> all the places we would need to fix that could be a fun project ...
> Perhaps we could actually do so automatically when the initdb invoking
> user isn't the same as the OS user? Imo that'd be generally quite
> useful, and not just for the regression tests.

yeah, although I think that's a separate exercise. So we'd have something like this in pg_hba.conf:

local all all peer map=datadir_owner

and in pg_ident.conf:

datadir_owner $USER $superuser

cheers

andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Psql patch to show access methods info
Hi!

On Wed, Jul 24, 2019 at 9:00 AM Andres Freund wrote:
> On 2019-07-23 01:57:29 +0300, Alexander Korotkov wrote:
> > It was always scary there is no way in psql to see am/opclass/opfamily
> > information rather than query the catalog directly.
>
> What does make that scary?

For me it's unclear why we have backslash commands for observing almost every part of the system catalog, while this quite large part is missing.

> > I'm going to push it. Probably, someone will find that the command syntax and
> > output formats are not well discussed yet. But we're pretty early
> > in the 13 release cycle. So, we will have time to work out any criticism.
>
> Please don't before we've had some discussion as to why we want this
> additional code, and who'd be helped by it.

OK. Given that few senior developers participated in the discussion of details, I thought we kind of agreed that we need this. Now that you've explicitly expressed another opinion, let's discuss.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Support for jsonpath .datetime() method
On Wed, Jul 24, 2019 at 1:50 AM Nikita Glukhov wrote:
> So it's unclear what we should do:
> - implement YY and RR strictly following the standard only in .datetime()
> - fix YY implementation in to_date()/to_timestamp() and implement RR
> - use our non-standard templates in .datetime()

Also, it appears that according to the standard, .datetime() should treat spaces and delimiters differently than our to_date()/to_timestamp(). It requires strict matching of spaces and delimiters between the input and format strings. We don't have such behavior in either non-FX or FX mode. Also, the standard doesn't define FX mode at all.

These rules cover the jsonpath .datetime() method and CAST(... FORMAT ...), the new cast clause defined by the standard. So, I think for compatibility reasons it isn't worth trying to make the behavior of our to_date()/to_timestamp() fit the requirements for jsonpath .datetime() and CAST(... FORMAT ...). I propose to leave these functions as they are (maybe adding new patterns), but to introduce another datetime parsing mode that fits the standard.

Opinions?

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: pgbench tests vs Windows
On 7/24/19 3:56 AM, Fabien COELHO wrote:
>
> Hello Andrew,
>
>> Unfortunately, this isn't portable, as I've just discovered at the cost
>> of quite a bit of time. In particular, you can't assume expr is present
>> and in the path on Windows. The Windows equivalent would be something
>> like:
>>
>> \setshell two\
>>   @set /a c = 1 + :one && echo %c%
>
> Hmmm... Can we assume that echo is really always there on Windows? If
> so, the attached patch does something only with "echo".

Yes, it's built into the cmd processor (as is "set /a", to answer Tom's earlier question about portability - I tested the above back to XP.) echo is much more universal, and I can confirm that the suggested fix works on the Windows test box I'm using. I'll apply and backpatch that.

Thanks

cheers

andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_receivewal documentation
Hi,

On Wed, 24 Jul 2019 11:29:28 +0900 Michael Paquier wrote:
> On Tue, Jul 23, 2019 at 08:00:41AM -0400, Jesper Pedersen wrote:
> > Sure.
>
> Thanks. Applied down to 9.6 where remote_apply has been introduced,
> with tweaks for 9.6 as the tool is named pg_receivexlog there.

Sorry to step in so late. Unless I am missing something, another solution might be to use a dedicated role for pg_receive{xlog|wal} with synchronous_commit lower than remote_apply. Not sure we want to add such detail, but if you consider it useful, you'll find a patch in attachment.

Regards,

>From 01a7de92dbaae5a61d5ec7bd04bef1553467f29d Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Wed, 24 Jul 2019 14:58:41 +0200
Subject: [PATCH] Add doc details for pg_receivewal with remote_apply

---
 doc/src/sgml/ref/pg_receivewal.sgml | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 177e9211c0..abf55ce713 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -62,7 +62,12 @@ PostgreSQL documentation
    application_name for
    pg_receivewal that does not match it, or
    change the value of synchronous_commit to
-   something other than remote_apply.
+   something other than remote_apply either cluster wide or
+   just for a dedicated role to pg_receivewal:
+
+CREATE ROLE receivewal WITH LOGIN REPLICATION;
+ALTER ROLE receivewal SET synchronous_commit TO on;
+
-- 
2.20.1
Re: psql - add SHOW_ALL_RESULTS option
Fabien COELHO wrote: > >> I'd go further and suggest that there shouldn't be a variable > >> controlling this. All results that come in should be processed, period. > > > > I agree with that. > > I kind of agree as well, but I was pretty sure that someone would complain > if the current behavior was changed. If queries in a compound statement must be kept silent, they can be converted to CTEs or DO-blocks to produce the same behavior without having to configure anything in psql. That cost on users doesn't seem too bad, compared to introducing a knob in psql, and presumably maintaining it forever. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Re: Issue in to_timestamp/to_date while handling the quoted literal string
Hi Suraj,

I think the documentation is reasonably clear about this behaviour, quote: "In to_date, to_number, and to_timestamp, literal text and double-quoted strings result in skipping the number of characters contained in the string; for example "XX" skips two input characters (whether or not they are XX)."

I can appreciate that this isn't the behaviour you intuitively expected from to_timestamp, and I don't think you'd be the first or the last. The purpose of these functions was never to validate that your input string precisely matches the non-coding parts of your format pattern. For that, I think you'd be better served by using regular expressions.

Just as an aside, in the example you gave, the string '2019-05-24T23:12:45' will cast directly to timestamp just fine, so it isn't the kind of situation to_timestamp was really intended for. It's more for when your input string is in an obscure (or ambiguous) format that is known to you in advance.

I hope that helps.

Cheers

Brendan

On Wed, 24 Jul 2019 at 21:38, Suraj Kharage wrote:
> Hi,
>
> I noticed an issue in to_timestamp()/to_date() while handling double-quoted
> literal strings. If any double-quoted literal characters are found in the
> format, we generate NODE_TYPE_CHAR in the parse format and store the actual
> character in FormatNode->character. In DCH_from_char, we just increment the
> input string by the length of the character for NODE_TYPE_CHAR. We are
> actually not matching these characters in the input string, and because of
> this, date values get changed if the quoted literal string is not identical
> in the input and format strings.
>
> e.g.:
>
> postgres@78619=#select to_timestamp('2019-05-24T23:12:45',
> 'yyyy-mm-dd"TT"hh24:mi:ss');
>        to_timestamp
> ---------------------------
>  2019-05-24 03:12:45+05:30
> (1 row)
>
> In the above example, the quoted string is 'TT', so it just increments the
> input string by 2 while handling these characters, and returns the wrong
> hour value.
> My suggestion is to match the exact characters from the quoted literal string
> in the input string, and if they don't match, throw an error.
>
> Attached is a POC patch which works for almost all scenarios except for
> whitespace as a quote character.
>
> Suggestions?
> --
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.
Re: Fetching timeline during recovery
Hello Michael,

On Wed, 24 Jul 2019 09:49:05 +0900 Michael Paquier wrote:
> On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais wrote:
> > Please find attached a first trivial patch to support
> > pg_walfile_name() and pg_walfile_name_offset() on a standby.
> > The previous restriction on these functions seems related to ThisTimeLineID not
> > being safe on a standby. This patch fetches the timeline from
> > WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
> > this is updated each time some data are flushed to the WAL.
[...]
> Your patch does not count for the case of archive recovery, where
> there is no WAL receiver, and as the shared memory state of the WAL
> receiver is not set 0 would be set.

Indeed. I tested this topic with the following query and was fine with the NULL result:

  select pg_walfile_name(pg_last_wal_receive_lsn());

I was fine with this result because my use case requires replication anyway. A NULL result would mean that the node never streamed from the old primary since its last startup, so a failover should ignore it anyway.

However, NULL just comes from pg_last_wal_receive_lsn() here. The following query result is wrong:

  select pg_walfile_name('0/1')

I fixed that. See patch 0001-v2-* in attachment.

> The replay timeline is something we could use here instead via
> GetXLogReplayRecPtr(). CreateRestartPoint actually takes the latest WAL
> receiver or replayed point for its end LSN position, whichever is newer.

I did consider GetXLogReplayRecPtr() or even XLogCtl->replayEndTLI (which is updated right before the replay). However, both depend on read activity on the standby. That's why I picked WalRcv->receivedTLI, which is updated whatever the reading activity on the standby.

> > Last, I plan to produce an extension to support this on older releases. Is
> > it something that could be integrated in the official source tree during a minor
> > release or should I publish it on eg. pgxn?
>
> Unfortunately no.
> This is a behavior change so it cannot find its way
> into back branches.

Yes, my patch is a behavior change. But here, I was talking about an extension, not the core itself, to support this feature in older releases.

> The WAL receiver state is in shared memory and published, so that's easy
> enough to get. We don't do that for XLogCtl unfortunately.

Both are in shared memory, but WalRcv has a public function to get its receivedTLI member. XLogCtl has nothing public to expose its ThisTimeLineID member. However, from a module, I'm able to fetch it using:

  XLogCtl = ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &found);
  SpinLockAcquire(&XLogCtl->info_lck);
  tl = XLogCtl->ThisTimeLineID;
  SpinLockRelease(&XLogCtl->info_lck);

As the "XLOG Ctl" index entry already exists in shmem, ShmemInitStruct returns the correct structure from there. Not sure this was supposed to be used this way though... Adding a public function might be cleaner, but it will not help for older releases.

> I think that there are arguments for being more flexible with it, and perhaps
> have a system-level view to be able to look at some of its fields.

Great idea. I'll give it a try to keep the discussion going.

> There is also a downside with get_controlfile(), which is that it
> fetches directly the data from the on-disk pg_control, and
> post-recovery this only gets updated at the first checkpoint.

Indeed, that's why I started this patch and thread.

Thanks,

>From fdf133645b8cc2728cca3677e71bdd5cb69cdbd4 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Tue, 23 Jul 2019 17:28:44 +0200
Subject: [PATCH] Support pg_walfile_name on standby

Support executing both SQL functions pg_walfile_name() and
pg_walfile_name_offset() on a standby.
---
 src/backend/access/transam/xlogfuncs.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..86c4d8382b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -460,13 +460,7 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
 	TupleDesc	resultTupleDesc;
 	HeapTuple	resultHeapTuple;
 	Datum		result;
-
-	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("recovery is in progress"),
-				 errhint("%s cannot be executed during recovery.",
-						 "pg_walfile_name_offset()")));
+	TimeLineID	tl;

 	/*
 	 * Construct a tuple descriptor for the result row. This must match this
@@ -480,11 +474,24 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)

 	resultTupleDesc = BlessTupleDesc(resultTupleDesc);

+	if (RecoveryInProgress())
+	{
+		GetWalRcvWriteRecPtr(NULL, &tl);
+
+		if (!tl)
+		{
+			isnull[0] = isnull[1] = true;
+			goto result;
+		}
+	}
+	else
+		tl = ThisTimeLineID;
+
 	/*
 	 * xlogfilename
 	 */
 	XLByteToPrevSeg(locationpoint, xlogsegno,
Re: benchmarking Flex practices
On 07/24/19 03:45, John Naylor wrote: > On Sun, Jul 21, 2019 at 3:14 AM Tom Lane wrote: >> However, my second reaction was that maybe you were on to something >> upthread when you speculated about postponing de-escaping of >> Unicode literals into the grammar. If we did it like that then Wow, yay. I hadn't been following this thread, but I had just recently looked over my own earlier musings [1] and started thinking "no, it would be outlandish to ask the lexer to return utf-8 always ... but what about postponing the de-escaping of Unicode literals into the grammar?" and had started to think about when I might have a chance to try making a patch. With the de-escaping postponed, I think we'd be able to move beyond the current odd situation where Unicode escapes can't describe non-ascii characters, in exactly and only the cases where you need them to. -Chap [1] https://www.postgresql.org/message-id/6688474e-7c28-b352-bcec-ea0ef59d7a1a%40anastigmatix.net
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila wrote:
>
> 7.
> +attach_undo_log(UndoLogCategory category, Oid tablespace)
> {
> ..
> if (candidate->meta.tablespace == tablespace)
> + {
> + logno = *place;
> + slot = candidate;
> + *place = candidate->next_free;
> + break;
> + }
>
> Here, the code is breaking from the loop, so why do we need to set
> *place? Am I missing something obvious?

I think I know what I was missing. It seems here you are removing an element from the freelist.

One point related to detach_current_undo_log:

+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ slot->pid = InvalidPid;
+ slot->meta.unlogged.xid = InvalidTransactionId;
+ if (full)
+ slot->meta.status = UNDO_LOG_STATUS_FULL;
+ LWLockRelease(&slot->mutex);

If I read the comments in structure UndoLogMetaData, it is mentioned that 'status' is changed by an explicit WAL record, whereas there is no WAL record in the code to change the status. I see the problem as well if we don't WAL-log this change. Suppose after changing the status of this log, we allocate a new log and insert some records into that log as well for the same transaction for which we have inserted records in the log we just marked as FULL. Now, here we form the link between the two logs as the same transaction has overflowed into a new log. Say we crash after this. Now, after recovery, the log won't be marked as FULL, which means there is a chance that it can be used for some other transaction. If that happens, then our link for a transaction spanning different logs will break and we won't be able to access the data in the other log. In short, I think it is important to WAL-log this status change unless I am missing something.

-- 
With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
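The recovery hazard described above is the standard WAL rule in miniature: a durable state change must be captured in a WAL record before crash recovery can be expected to reproduce it. Here is a toy model of the two orderings (hypothetical names and types, not the undo-log patch's actual code):

```c
#include <assert.h>

typedef enum { UNDO_ACTIVE, UNDO_FULL } UndoStatus;

static UndoStatus in_memory = UNDO_ACTIVE;   /* shared-memory state */
static UndoStatus wal_logged = UNDO_ACTIVE;  /* what WAL replay will rebuild */

/* Unsafe: update the shared state without emitting a WAL record. */
static void mark_full_unlogged(void)
{
    in_memory = UNDO_FULL;
}

/* Safe: emit the WAL record (here, just recording it) before or together
 * with the shared-memory update. */
static void mark_full_logged(void)
{
    wal_logged = UNDO_FULL;   /* stands in for XLogInsert() of a status change */
    in_memory = UNDO_FULL;
}

/* Crash recovery: in-memory state is rebuilt only from what was logged. */
static UndoStatus recover(void)
{
    in_memory = wal_logged;
    return in_memory;
}
```

With the unlogged variant, the FULL mark is lost at recovery and the log could be handed to another transaction, which is exactly the broken-link scenario above.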
Re: pg_receivewal documentation
Hi,

On 7/23/19 10:29 PM, Michael Paquier wrote:
> Thanks. Applied down to 9.6 where remote_apply has been introduced,
> with tweaks for 9.6 as the tool is named pg_receivexlog there.

Thanks to everybody involved!

Best regards,
Jesper
Issue in to_timestamp/to_date while handling the quoted literal string
Hi,

I noticed an issue in to_timestamp()/to_date() while handling double-quoted literal strings. If any double-quoted literal characters are found in the format, we generate NODE_TYPE_CHAR in the parse format and store the actual character in FormatNode->character. In DCH_from_char, we just increment the input string by the length of the character for NODE_TYPE_CHAR. We are actually not matching these characters in the input string, and because of this, date values get changed if the quoted literal string is not identical in the input and format strings.

e.g.:

postgres@78619=#select to_timestamp('2019-05-24T23:12:45', 'yyyy-mm-dd"TT"hh24:mi:ss');
       to_timestamp
---------------------------
 2019-05-24 03:12:45+05:30
(1 row)

In the above example, the quoted string is 'TT', so it just increments the input string by 2 while handling these characters, and returns the wrong hour value.

My suggestion is to match the exact characters from the quoted literal string in the input string, and if they don't match, throw an error.

Attached is a POC patch which works for almost all scenarios except for whitespace as a quote character.

Suggestions?

--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.

to_timestamp_quoted_string_POC.patch
Description: Binary data
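For illustration, the skip-versus-match behaviors under discussion can be sketched outside formatting.c as follows (hypothetical helper names; PostgreSQL's real code walks FormatNode entries rather than plain strings):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Current behavior: a quoted literal of length n just advances the input
 * pointer by n characters, without comparing them to anything. */
static const char *skip_literal(const char *input, const char *lit)
{
    return input + strlen(lit);
}

/* Proposed behavior: require the input to actually contain the literal
 * characters; return NULL (where the real code would ereport) otherwise. */
static const char *match_literal(const char *input, const char *lit)
{
    size_t n = strlen(lit);

    if (strncmp(input, lit, n) != 0)
        return NULL;
    return input + n;
}
```

With the 'TT' example above, skip_literal() silently consumes the 'T' and the '2' of '23', which is how the hour ends up as 3 instead of 23, while match_literal() would reject the input.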
RE: seems like a bug in pgbench -R
Hi Fabien,

On Fri, Mar 15, 2019 at 4:17 PM, Fabien COELHO wrote:
> >> echo 'select 1' > select.sql
> >>
> >> while /bin/true; do
> >> pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
> >> date;
> >> done;
> >
> > Indeed. I'll look at it over the weekend.
> >
> >> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
> >> one of the other commits touching this part of the code.
>
> I could not reproduce this issue on head, but I confirm on 11.2.

I could reproduce the hang on 11.4.

On Sat, Mar 16, 2019 at 10:14 AM, Fabien COELHO wrote:
> Attached is a fix to apply on pg11.

I confirm the hang doesn't happen after applying your patch. It passes make check-world. This change seems not to affect performance, so I didn't do any performance test.

> + /* under throttling we may have finished the last client above */
> + if (remains == 0)
> + break;

If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients, a thread needs to wait for the results or sleep. In that logic, there is a case where a thread tries to wait for results when no clients are waiting for results, and this causes the issue. This happens when there are only CSTATE_THROTTLE clients and the pgbench timeout occurs. Those clients will be finished and "remains" will be 0. I confirmed the above code prevents such a case.

I almost think this is ready for committer, but I have one question. Would it be better to add a check like if (maxsock != -1) before the select?

else /* no explicit delay, select without timeout */
{
    nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
}

--
Yoshikazu Imai
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila wrote:
> >
> > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro wrote:
> > > Yep, that was completely wrong. Here's a new version.
>
> 10.
> I think UndoLogAllocate can leak allocation of slots. It first
> allocates the slot for a new log from the free pool if there is no
> existing slot/log, writes a WAL record, and then at a later point of
> time actually creates the required physical space in the log via
> extend_undo_log, which also writes a separate WAL record. Now, if there is an
> error between these two operations, then we will have a redundant slot
> allocated. What if there are repeated errors for a similar thing from
> multiple backends, after which the system crashes? Now, after restart, we
> will allocate multiple slots for different lognos which don't have any
> actual (physical) logs. This might not be a big problem in practice
> because the chances of an error between the two operations are low, but
> can't we delay the WAL logging for the allocation of a slot for a new log?

After sending this email, I was browsing the previous comments raised by me for this patch, and it seems this same point was raised previously [1] as well, and there were a few additional questions related to it (see point 1 in email [1]).

[1] - https://www.postgresql.org/message-id/CAA4eK1LDctrYeZ8ev1N1v-8KwiigAmNMx%3Dt-UTs9qgEFt%2BP0XQ%40mail.gmail.com

-- 
With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila wrote:
>
> On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro wrote:
> >
> > On Fri, Jun 28, 2019 at 6:09 AM Robert Haas wrote:
> > > I happened to open up 0001 from this series, which is from Thomas, and
> > > I do not think that the pg_buffercache changes are correct. The idea
> > > here is that the customer might install version 1.3 or any prior
> > > version on an old release, then upgrade to PostgreSQL 13. When they
> > > do, they will be running with the old SQL definitions and the new
> > > binaries. At that point, it sure looks to me like the code in
> > > pg_buffercache_pages.c is going to do the Wrong Thing. [...]
> >
> > Yep, that was completely wrong. Here's a new version.
>
> One comment/question related to
> 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch.

I have done some more review of the undolog patch series and here are my comments:

0003-Add-undo-log-manager.patch

1.
allocate_empty_undo_segment()
{
..
..
if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+ {
+ char *parentdir;
+
+ if (errno != ENOENT || !InRecovery)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ undo_path)));
+
+ /*
+ * In recovery, it's possible that the tablespace directory
+ * doesn't exist because a later WAL record removed the whole
+ * tablespace. In that case we create a regular directory to
+ * stand in for it. This is similar to the logic in
+ * TablespaceCreateDbspace().
+ */
+
+ /* create two parents up if not exist */
+ parentdir = pstrdup(undo_path);
+ get_parent_directory(parentdir);
+ get_parent_directory(parentdir);
+ /* Can't create parent and it doesn't already exist?
*/
+ if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)

All of this code is almost the same as the code we have in TablespaceCreateDbspace, yet we have small differences, like here you are using mkdir instead of MakePGDirectory, which as far as I can see uses similar permissions for creating the directory. Also, it checks whether the directory exists before trying to create it. Is there a reason why we need to do a few things differently here? If not, can both places use one common function?

2.
allocate_empty_undo_segment()
{
..
..
/* Flush the contents of the file to disk before the next checkpoint. */
+ undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
..
}

+void
+undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
+{
+ char path[MAXPGPATH];
+ FileTag tag;
+
+ INIT_UNDOFILETAG(tag, logno, tablespace, segno);
+
+ /* Try to send to the checkpointer, but if out of space, do it here. */
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))

The comment in allocate_empty_undo_segment indicates that the code wants to flush before the checkpoint, but the actual function tries to register the request with the checkpointer. Shouldn't this be similar to XLogFileInit, where we use pg_fsync to flush the contents immediately? I guess that will avoid what you have written in the comments in the same function (we just want to make sure that the filesystem has allocated physical blocks for it so that non-COW filesystems will report ENOSPC now rather than later when space is needed). OTOH, I think it is performance-wise better to postpone the work to the checkpointer. If we want to push this work to the checkpointer, then we might need to change the comments, or alternatively we might want to use bigger segment sizes to mitigate the performance effect.

If my above understanding is correct and the reason to fsync immediately is to reserve space now, then we also need to think about whether we are always safe in postponing the work.
Basically, if this means that it can fail when we are actually trying to write undo, then it could be risky because we could be in the critical section at that time. I am not sure about this point; rather, it is just to discuss whether there are any impacts of postponing the fsync work.

Another thing is that recently, in commit 475861b261 (commit by you), we have introduced a mechanism to not fill the files with zeros for certain filesystems like ZFS. Do we want similar behavior for undo files?

3.
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
+{
+ UndoLogSlot *slot;
+ size_t end;
+
+ slot = find_undo_log_slot(logno, false);
+
+ /* TODO review interlocking */
+
+ Assert(slot != NULL);
+ Assert(slot->meta.end % UndoLogSegmentSize == 0);
+ Assert(new_end % UndoLogSegmentSize == 0);
+ Assert(InRecovery ||
+        CurrentSession->attached_undo_slots[slot->meta.category] == slot);

Can you write some comments explaining the above Asserts? Also, can you explain what interlocking issues you are worried about here?

4.
while (end < new_end)
+ {
+ allocate_empty_undo_segment(logno, slot->meta.tablespace, end);
+ end += UndoLogSegmentSize;
+ }
+
+ /* Flush the directory entries before next checkpoint. */
+ undofile_request_sync_dir(slot->meta.tablespace);

I see that at two places after
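The handoff pattern being discussed, namely try to queue the fsync with the checkpointer and fall back to doing it synchronously in the backend when the request queue is full, can be sketched like this (a simplified stand-in for RegisterSyncRequest with a tiny fixed-size queue, not the real sync machinery):

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAP 2

typedef struct
{
    int tag;                    /* stands in for the FileTag */
} SyncRequest;

static SyncRequest queue[QUEUE_CAP];
static int queue_len = 0;       /* requests waiting for the "checkpointer" */
static int fsync_here_calls = 0;    /* times the backend synced itself */

/* Model of RegisterSyncRequest(): returns false when the shared request
 * queue is full, in which case the caller must do the sync itself. */
static bool register_sync_request(int tag)
{
    if (queue_len >= QUEUE_CAP)
        return false;
    queue[queue_len++].tag = tag;
    return true;
}

/* Model of the caller: try to hand off, else fall back. */
static void request_sync(int tag)
{
    if (!register_sync_request(tag))
        fsync_here_calls++;     /* would pg_fsync() the file directly */
}
```

The design point Amit raises is orthogonal to the fallback itself: deferring the sync moves the possible ENOSPC/fsync failure from segment-creation time to checkpoint (or critical-section) time.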
Re: On the stability of TAP tests for LDAP
On Wed, Jul 24, 2019 at 7:50 PM Michael Paquier wrote:
> Perhaps this worked on freebsd? Now that I test it, the test gets
> stuck on my Debian box:
> # waiting for slapd to accept requests...
> # Running: ldapsearch -h localhost -p 49534 -s base -b
>   dc=example,dc=net -n 'objectclass=*'
> SASL/DIGEST-MD5 authentication started
> Please enter your password:
> ldap_sasl_interactive_bind_s: Invalid credentials (49)
>   additional info: SASL(-13): user not found: no secret in database

Huh, yeah, I don't know why slapd requires credentials on Debian, when the version that ships with FreeBSD is OK with an anonymous connection. Rather than worrying about that, I just adjusted it to supply the credentials. It works on both for me.

> pgperltidy complains about the patch indentation using perltidy
> v20170521 (version mentioned in tools/pgindent/README).

Fixed.

-- 
Thomas Munro
https://enterprisedb.com

wait-for-slapd-v3.patch
Description: Binary data
Re: [bug fix] Produce a crash dump before main() on Windows
Hello.

On Wed, Jul 24, 2019 at 2:31 AM Alvaro Herrera wrote:
>
> On 2019-Jul-23, Tom Lane wrote:
>
> > Kyotaro Horiguchi writes:
>
> > > My investigation convinced me that there is no way for a process
> > > to detect whether it is running as a service (except the process
> > > directly called from system (aka entry function)).
>
> Maybe we can try calling pgwin32_is_service()?

Ah, it might be viable. (I'm not sure though.) I hadn't thought that some process property we are enforcing is available.

> > But I will say that in my experience, behavioral differences between
> > Postgres started manually and Postgres started as a daemon are bad.
> > So I think going out of our way to make the cases behave differently
> > on Windows is probably not a good plan.
>
> We already have such a difference in Windows -- elog.c uses it
> extensively to use the eventlog to print if a service, console print if
> not.
>
> Given this thread's OP, ISTM failing to do likewise for this case makes
> debugging problems difficult for no good reason.

The difference of internal behavior is added to unify the external (apparent) behavior. It prevents the WER dialog from showing while running on console. A service doesn't show a dialog even if WER is enabled. The remaining issue is that we cannot obtain a dump when running on console without being blocked by a dialog, but Windows just doesn't allow it.. (Might be possible if we ignore Windows 7? 8? and earlier)

regards

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)
On Tue, 23 Jul 2019 at 23:32, Tom Lane wrote: > > Martijn van Oosterhout writes: > > I mean tracking the listening backends specifically, so you can > > replace the loops: > > for (i=0; i < MaxBackends; i++) > > with > > for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = > > QUEUE_NEXT_LISTENING_BACKEND(i)) > > Ah ... but surely you would not put such info in AsyncQueueEntry, > where there'd be a separate copy for each message. I think you > meant to add the info to AsyncQueueControl. Umm, yeah. Got that mixed up. > It might be better to redefine the backends[] array as being mostly > contiguous (ie, a new backend takes the first free slot not the one > indexed by its own BackendId), at the price of needing to store > BackendId in each slot explicitly instead of assuming it's equal to > the array index. I suspect the existing data structure is putting too > much of a premium on making sizeof(QueueBackendStatus) a power of 2. This would require adding a "MyListenerId" to each backend, which I'm not sure helps readability. And there's a chance of mixing the ids up. The power-of-2-ness is, I think, indeed overrated. I'll give it a shot and see how it looks. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/
Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding
Sorry in advance for link-breaking message forced by gmail.. https://www.postgresql.org/message-id/flat/20190202083822.gc32...@gust.leadboat.com > 1. The result of the test is valid only until we release the SLRU ControlLock, >which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate >segments for deletion. Once we release that lock, latest_page_number can >advance. This creates a TOCTOU race condition, allowing excess deletion: > > >[local] test=# table trunc_clog_concurrency ; >ERROR: could not access status of transaction 2149484247 >DETAIL: Could not open file "pg_xact/0801": No such file or directory. It seems like some other vacuum process saw a larger cutoff page? If I'm not missing something, the missing page is no longer the "recently-populated" page at that time (as I understand it, the last page that holds valid data). Couldn't we just ignore ENOENT there? > 2. By the time the "apparent wraparound" test fires, we've already WAL-logged >the truncation. clog_redo() suppresses the "apparent wraparound" test, >then deletes too much. Startup then fails: I agree that if truncation is skipped after issuing the WAL record, it will lead to data loss at the next recovery. But the following log..: >881997 2019-02-10 02:53:32.105 GMT FATAL: could not access status of > transaction 708112327 >881997 2019-02-10 02:53:32.105 GMT DETAIL: Could not open file > "pg_xact/02A3": No such file or directory. >881855 2019-02-10 02:53:32.107 GMT LOG: startup process (PID 881997) > exited with exit code 1 If it came from the same cause as 1, the log message is simply ignorable, so stopping recovery on the error is unacceptable, and the ENOENT is ignorable for the same reason. As a result, I agree to (a) (fix rounding) and (c) (test wrap-around before writing WAL), but I'm not sure about the others. And an additional fix for the ignorable ENOENT is needed. What do you think about this? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pgbench - allow to create partitioned tables
Hello Simon, While doing some performance tests and reviewing patches, I needed to create partitioned tables. Given the current syntax this is time consuming. Good idea. I wonder why we didn't have it already. Probably because I did not have to create partitioned tables for some testing:-) sh> pgbench -i -s 1 --partition-number=$N --partition-type=hash Given current naming of options, I would call this --partitions=number-of-partitions and --partition-method=hash Ok. # then run sh> pgbench -S -M prepared -P 1 -T 10 # and look at latency: # no parts = 0.071 ms # 1 hash = 0.071 ms (did someone optimize this case?!) # 2 hash ~ 0.126 ms (+ 0.055 ms) # 50 hash ~ 0.155 ms # 100 hash ~ 0.178 ms # 150 hash ~ 0.232 ms # 200 hash ~ 0.279 ms # overhead ~ (0.050 + [0.0005-0.0008] * nparts) ms Is it linear? Good question. I would have hoped affine, but this is not very clear on these data, which are the median of about five runs, hence the bracket on the slope factor. At least it is increasing with the number of partitions. Maybe it would be clearer on the minimum of five runs. -- Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 816f9cc4c7..3e8e292e39 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -306,6 +306,32 @@ pgbench options d + + --partitions=NUM + + +Create a partitioned pgbench_accounts table with +NUM partitions of nearly equal size for +the scaled number of accounts. +Default is 0, meaning no partitioning. + + + + + + --partition-method=NAME + + +Create a partitioned pgbench_accounts table with +NAME method. +Expected values are range or hash. +This option is only taken into account if +--partitions is non-zero. +Default is range. 
+ + + + --tablespace=tablespace diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 570cf3306a..6819b4e433 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -186,6 +186,11 @@ int64 latency_limit = 0; char *tablespace = NULL; char *index_tablespace = NULL; +/* partitioning for pgbench_accounts table, 0 for no partitioning */ +int partitions = 0; +enum { PART_RANGE, PART_HASH } + partition_method = PART_RANGE; + /* random seed used to initialize base_random_sequence */ int64 random_seed = -1; @@ -617,6 +622,9 @@ usage(void) " --foreign-keys create foreign key constraints between tables\n" " --index-tablespace=TABLESPACE\n" " create indexes in the specified tablespace\n" + " --partitions=NUM partition account table in NUM parts (defaults: 0)\n" + " --partition-method=(range|hash)\n" + " partition account table with this method (default: range)\n" " --tablespace=TABLESPACE create tables in the specified tablespace\n" " --unlogged-tablescreate tables as unlogged tables\n" "\nOptions to select what to run:\n" @@ -3601,6 +3609,17 @@ initDropTables(PGconn *con) "pgbench_tellers"); } +/* + * add fillfactor percent option if not 100. + */ +static void +append_fillfactor(char *opts, int len) +{ + if (fillfactor < 100) + snprintf(opts + strlen(opts), len - strlen(opts), + " with (fillfactor=%d)", fillfactor); +} + /* * Create pgbench's standard tables */ @@ -3625,6 +3644,7 @@ initCreateTables(PGconn *con) const char *bigcols; /* column decls if accountIDs are 64 bits */ int declare_fillfactor; }; + static const struct ddlinfo DDLs[] = { { "pgbench_history", @@ -3651,11 +3671,10 @@ initCreateTables(PGconn *con) 1 } }; - int i; fprintf(stderr, "creating tables...\n"); - for (i = 0; i < lengthof(DDLs); i++) + for (int i = 0; i < lengthof(DDLs); i++) { char opts[256]; char buffer[256]; @@ -3664,9 +3683,17 @@ initCreateTables(PGconn *con) /* Construct new create table statement. 
*/ opts[0] = '\0'; - if (ddl->declare_fillfactor) + + /* Partition pgbench_accounts table */ + if (partitions >= 1 && strcmp(ddl->table, "pgbench_accounts") == 0) + { snprintf(opts + strlen(opts), sizeof(opts) - strlen(opts), - " with (fillfactor=%d)", fillfactor); + " partition by %s (aid)", + partition_method == PART_RANGE ? "range" : "hash"); + } + else if (ddl->declare_fillfactor) + append_fillfactor(opts, sizeof(opts)); + if (tablespace != NULL) { char *escape_tablespace; @@ -3686,6 +3713,54 @@ initCreateTables(PGconn *con) executeStatement(con, buffer); } + + if (partitions >= 1) + { + int64 part_size = (naccounts * (int64) scale + partitions - 1) / partitions; + char ff[64]; + ff[0] = '\0'; + append_fillfactor(ff, sizeof(ff)); + + fprintf(stderr, "creating %d
Re: pgbench tests vs Windows
Hello Andrew, Unfortunately, this isn't portable, as I've just discovered at the cost of quite a bit of time. In particular, you can't assume expr is present and in the path on Windows. The Windows equivalent would be something like: \setshell two\ @set /a c = 1 + :one && echo %c% Hmmm... Can we assume that echo is really always there on Windows? If so, the attached patch does something only with "echo". I propose to prepare a patch along these lines. Alternatively we could just drop it - I don't think the test matters all that hugely. The point is to have some minimal coverage so that unexpected changes are caught. This is the only call to a working \setshell. -- Fabien.diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index 5a2fdb9acb..b82d3f65c4 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -512,7 +512,7 @@ pgbench( qr{processed: 1/1}, qr{shell-echo-output} ], - [qr{command=8.: int 2\b}], + [qr{command=8.: int 1\b}], 'pgbench backslash commands', { '001_pgbench_backslash_commands' => q{-- run set @@ -524,10 +524,10 @@ pgbench( \sleep 0 s \sleep :zero -- setshell and continuation -\setshell two\ - expr \ -1 + :one -\set n debug(:two) +\setshell another_one\ + echo \ +:one +\set n debug(:another_one) -- shell \shell echo shell-echo-output }
Re: On the stability of TAP tests for LDAP
On Wed, Jul 24, 2019 at 05:47:13PM +1200, Thomas Munro wrote: > Thanks, here's v2. Perhaps this worked on freebsd? Now that I test it, the test gets stuck on my Debian box: # waiting for slapd to accept requests... # Running: ldapsearch -h localhost -p 49534 -s base -b dc=example,dc=net -n 'objectclass=*' SASL/DIGEST-MD5 authentication started Please enter your password: ldap_sasl_interactive_bind_s: Invalid credentials (49) additional info: SASL(-13): user not found: no secret in database pgperltidy complains about the patch indentation using perltidy v20170521 (version mentioned in tools/pgindent/README). -- Michael signature.asc Description: PGP signature
Re: benchmarking Flex practices
On Sun, Jul 21, 2019 at 3:14 AM Tom Lane wrote: > > John Naylor writes: > > The pre-existing ecpg var "state_before" was a bit confusing when > > combined with the new var "state_before_quote_stop", and the former is > > also used with C-comments, so I decided to go with > > "state_before_lit_start" and "state_before_lit_stop". Even though > > comments aren't literals, it's less of a stretch than referring to > > quotes. To keep things consistent, I went with the latter var in psql > > and core. > > Hm, what do you think of "state_before_str_stop" instead? It seems > to me that both "quote" and "lit" are pretty specific terms, so > maybe we need something a bit vaguer. Sounds fine to me. > While poking at that, I also came across this unhappiness: > > regression=# select u&'foo' uescape 'bogus'; > regression'# > > that is, psql thinks we're still in a literal at this point. That's > because the uesccharfail rule eats "'b" and then we go to INITIAL > state, so that consuming the last "'" puts us back in a string state. > The backend would have thrown an error before parsing as far as the > incomplete literal, so it doesn't care (or probably not, anyway), > but that's not an option for psql. > > My first reaction as to how to fix this was to rip the xuend and > xuchar states out of psql, and let it just lex UESCAPE as an > identifier and the escape-character literal like any other literal. > psql doesn't need to account for the escape character's effect on > the meaning of the Unicode literal, so it doesn't have any need to > lex the sequence as one big token. I think the same is true of ecpg > though I've not looked really closely. > > However, my second reaction was that maybe you were on to something > upthread when you speculated about postponing de-escaping of > Unicode literals into the grammar. 
If we did it like that then > we would not need to have this difference between the backend and > frontend lexers, and we'd not have to worry about what > psql_scan_in_quote should do about the whitespace before and after > UESCAPE, either. > > So I'm feeling like maybe we should experiment to see what that > solution looks like, before we commit to going in this direction. > What do you think? Given the above wrinkles, I thought it was worth trying. Attached is a rough patch (don't mind the #include mess yet :-) ) that works like this: The lexer returns UCONST from xus and UIDENT from xui. The grammar has rules that are effectively: SCONST { do nothing} | UCONST { esc char is backslash } | UCONST UESCAPE SCONST { esc char is from $3 } ...where UESCAPE is now an unreserved keyword. To prevent shift-reduce conflicts, I added UIDENT to the %nonassoc precedence list to match IDENT, and for UESCAPE I added a %left precedence declaration. Maybe there's a more principled way. I also added an unsigned char type to the %union, but it worked fine on my compiler without it. litbuf_udeescape() and check_uescapechar() were moved to gram.y. The former had to be massaged to give error messages similar to HEAD. They're not quite identical, but the position info is preserved. Some of the functions I moved around don't seem to have any test coverage, so I should eventually do some work in that regard. Notes: -Binary size is very close to v6. That is to say the grammar tables grew by about the same amount the scanner table shrank, so the binary is still about 200kB smaller than HEAD. -Performance is very close to v6 with the information_schema and pgbench-like queries with standard strings, which is to say also very close to HEAD. When the latter was changed to use Unicode escapes, however, it was about 15% slower than HEAD. That's a big regression and I haven't tried to pinpoint why. -psql was changed to follow suit. 
It doesn't think it's inside a string with your too-long escape char above, and it removes all blank lines from this query output: $ cat >> test-uesc-lit.sql SELECT u&'!0041' uescape '!' as col ; On HEAD and v6 I get this: $ ./inst/bin/psql -a -f test-uesc-lit.sql SELECT u&'!0041' uescape '!' as col ; col - A (1 row) -The ecpg changes here are only the bare minimum from HEAD to get it to compile, since I'm borrowing its additional token names (although they mean slightly different things). After a bit of experimentation, it's clear there's a bit more work needed to get it functional, and it's not easy to debug, so I'm putting that off until we decide whether this is the way forward. -- John Naylorhttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services v7-draft-handle-uescapes-in-parser.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 24 Jul 2019 at 16:16, David Rowley wrote: > > On Wed, 24 Jul 2019 at 15:05, David Rowley > wrote: > > To be able to reduce the threshold down again we'd need to make a > > hash_get_num_entries(LockMethodLocalHash) call before performing the > > guts of LockReleaseAll(). We could then weight that onto some running > > average counter with a weight of, say... 10, so we react to changes > > fairly quickly, but not instantly. We could then have some sort of > > logic like "rebuild the hash table if running average 4 times less > > than max_bucket" > > > > I've attached a spreadsheet of that idea and the algorithm we could > > use to track the running average. Initially, I've mocked it up a > > series of transactions that use 1000 locks, then at row 123 dropped > > that to 10 locks. If we assume the max_bucket is 1000, then it takes > > until row 136 for the running average to drop below the max_bucket > > count, i.e 13 xacts. There we'd reset there at the init size of 16. If > > the average went up again, then we'd automatically expand the table as > > we do now. To make this work we'd need an additional call to > > hash_get_num_entries(), before we release the locks, so there is more > > overhead. > > Here's a patch with this implemented. I've left a NOTICE in there to > make it easier for people to follow along at home and see when the > lock table is reset. Here's a more polished version with the debug code removed, complete with benchmarks. -- Test 1. Select 1 record from a 140 partitioned table. Tests creating a large number of locks with a fast query. 
create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 14) x; analyze hp; select3.sql: select * from hp where b = 1 - Master: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2098.628975 (excluding connections establishing) tps = 2101.797672 (excluding connections establishing) tps = 2085.317292 (excluding connections establishing) tps = 2094.931999 (excluding connections establishing) tps = 2092.328908 (excluding connections establishing) Patched: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2101.691459 (excluding connections establishing) tps = 2104.533249 (excluding connections establishing) tps = 2106.499123 (excluding connections establishing) tps = 2104.033459 (excluding connections establishing) tps = 2105.463629 (excluding connections establishing) (I'm surprised there is not more overhead in the additional tracking added to calculate the running average) -- Test 2. Tests a prepared query which will perform a generic plan on the 6th execution then fallback on a custom plan due to it pruning all but one partition. Master suffers from the lock table becoming bloated after locking all partitions when planning the generic plan. 
create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p Master: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 10207.780843 (excluding connections establishing) tps = 10205.772688 (excluding connections establishing) tps = 10214.896449 (excluding connections establishing) tps = 10157.572153 (excluding connections establishing) tps = 10147.764477 (excluding connections establishing) Patched: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 14677.636570 (excluding connections establishing) tps = 14661.437186 (excluding connections establishing) tps = 14647.202877 (excluding connections establishing) tps = 14784.165753 (excluding connections establishing) tps = 14850.355344 (excluding connections establishing) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v8.patch Description: Binary data
RE: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
Sorry in advance for link-breaking message force by gmail.. https://www.postgresql.org/message-id/flat/cy4pr2101mb0804ce9836e582c0702214e8aa...@cy4pr2101mb0804.namprd21.prod.outlook.com I assume that we are in a consensus about the problem we are to fix here. > 0a 0004`8080cc30 0004`80dcf917 postgres!PGSemaphoreLock+0x65 > [d:\orcasqlagsea10\14\s\src\backend\port\win32_sema.c @ 158] > 0b 0004`8080cc90 0004`80db025c postgres!LWLockAcquire+0x137 > [d:\orcasqlagsea10\14\s\src\backend\storage\lmgr\lwlock.c @ 1234] > 0c 0004`8080ccd0 0004`80db25db postgres!AbortBufferIO+0x2c > [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 3995] > 0d 0004`8080cd20 0004`80dbce36 postgres!AtProcExit_Buffers+0xb > [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 2479] > 0e 0004`8080cd50 0004`80dbd1bd postgres!shmem_exit+0xf6 > [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 262] > 0f 0004`8080cd80 0004`80dbccfd postgres!proc_exit_prepare+0x4d > [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 188] > 10 0004`8080cdb0 0004`80ef9e74 postgres!proc_exit+0xd > [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 141] > 11 0004`8080cde0 0004`80ddb6ef postgres!errfinish+0x204 > [d:\orcasqlagsea10\14\s\src\backend\utils\error\elog.c @ 624] > 12 0004`8080ce50 0004`80db0f59 postgres!mdread+0x12f > [d:\orcasqlagsea10\14\s\src\backend\storage\smgr\md.c @ 806] Ok, we are fixing this. The proposed patch lets LWLockReleaseAll() called before InitBufferPoolBackend() by registering the former after the latter into on_shmem_exit list. Even if it works, I think it's neither clean nor safe to register multiple order-sensitive callbacks. AtProcExit_Buffers has the following comment: > * During backend exit, ensure that we released all shared-buffer locks and > * assert that we have no remaining pins. And the only caller of it is shmem_exit. More of that, all other caller sites calls LWLockReleaseAll() just before calling it. 
If that's the case, why don't we just release all LWLocks in shmem_exit or in AtProcExit_Buffers before calling AbortBufferIO()? I think it's sufficient that AtProcExit_Buffers calls it at the beginning. (The comment for the function needs editing.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jul 24, 2019 at 11:28 AM Rushabh Lathia wrote: > > Hi, > > I have started review of > 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch, here are a few > quick comments. Thanks for the review, I will work on them soon and post the updated patch along with other comments. I have noticed some comments are pointing to code which is not part of this patch, for example > > 5) In function UndoBlockGetFirstUndoRecord() below code: > > /* Calculate the size of the partial record. */ > partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + >phdr->tuple_len + phdr->payload_len - >phdr->record_offset; > > can directly use UndoPagePartialRecSize(). UndoBlockGetFirstUndoRecord is added in the 0014 patch; I think you got confused because this code is in the undoaccess.c file, but actually a later patch in the set added some code under undoaccess.c. Basically, this comment needs to be addressed, but under another patch. I am pointing it out so that we don't miss it. > 8) > > /* > * Defines the number of times we try to wait for rollback hash table to get > * initialized. After these many attempts it will return error and the user > * can retry the operation. > */ > #define ROLLBACK_HT_INIT_WAIT_TRY 60 > > %s/error/an error > This macro is also added in 0014. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Seek failure at end of FSM file during WAL replay (in 11)
Hi all, Recently, one of the test beds we use blew up once when doing streaming replication, like this: FATAL: could not seek to end of file "base/16386/19817_fsm": No such file or directory CONTEXT: WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197 LOG: startup process (PID 44886) exited with exit code 1 All the WAL records have been wiped out since, so I don't know exactly what happened, but I could track down that this FSM file got removed a couple of hours before, as I got my hands on some FS-level logs which showed a deletion. This happens in the context of a WAL record XLOG_HEAP2_CLEAN, and the redo logic is in heap_xlog_clean(), where there are FSM lookups within XLogRecordPageWithFreeSpace() -> XLogReadBufferExtended(). At the subsequent restart, recovery was able to move on past the failing record, so the FSM has been rebuilt correctly; still, that caused an HA setup to be less... Available. However, we are rather careful in those code paths to call smgrcreate() so that the file gets created at redo if it is not around. Before blaming a lower level of the application stack, I am wondering if we have some issues with mdfd_vfd, meaning that the file has been removed but is still tracked as open. A quick lookup of the code does not show any issues; has anyone seen this particular error recently? The last commit on REL_11_STABLE which touched this area is this one FWIW: commit: 6872c2be6a97057aa736110e31f0390a53305c9e author: Alvaro Herrera date: Wed, 15 Aug 2018 18:09:29 -0300 Update FSM on WAL replay of page all-visible/frozen Also, this setup was using 11.2 (I know this one lags behind a bit, anyway...). Thanks, -- Michael signature.asc Description: PGP signature
Re: Problem during Windows service start
Sorry in advance for link-breaking message, but the original mail was too old and gmail doesn't allow me to craft the required headers to link to it. https://www.postgresql.org/message-id/CAKm4Xs71Ma8bV1fY6Gfz9Mg3AKmiHuoJNpxeDVF_KTVOKoy1WQ%40mail.gmail.com > Please find the proposed patch for review. I will attach it to > commitfest as well Pacemaker suffers from the same thing. We suggest to our customers that they "start the server alone to perform recovery, then start Pacemaker" if recovery is expected to take long enough to hit the timeout. I don't think it is a good thing to report status SERVICE_RUNNING although the server actually is not running (yet). I think the right direction here is that, if pg_ctl returns on timeout, pgwin32_ServiceMain kills the starting server and then reports something like "timed out and the server was stopped; please make sure the server does not take a long time to perform recovery.". Thoughts? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_basebackup delays closing of stdout
Hi, On 2019-07-23 22:16:26 -0400, Jeff Janes wrote: > Ever since pg_basebackup was created, it had a comment like this: > > * End of chunk. If requested, and this is the base tablespace > * write configuration file into the tarfile. When done, close the > * file (but not stdout). > > But, why make the exception for output going to stdout? If we are done > with it, why not close it? I think closing stdout is a bad idea that can cause issues in a lot of situations. E.g. because a later open() will then use that fd (the lowest unused fd always gets used), and then the next time somebody wants to write something to stdout, there's normal output interspersed with some random file. You'd at the least have to reopen /dev/null into its place or such. It also seems likely to be a trap for some future feature additions that want to write another file to stdout or such - in contrast to the normal files it can't be reopened. > After a massive maintenance operation, I want to re-seed a streaming > standby, which I start to do by: > > pg_basebackup -D - -Ft -P -X none | pxz > base.tar.xz > > But the archiver is way behind, so when it finishes the basebackup part, I > get: > > NOTICE: pg_stop_backup cleanup done, waiting for required WAL segments to > be archived > WARNING: pg_stop_backup still waiting for all required WAL segments to be > archived (60 seconds elapsed) > ... > > The base backup file is not finalized, because pg_basebackup has not closed > its stdout while waiting for the WAL segment to be archived. The file is > incomplete due to data stuck in buffers, so I can't copy it to where I want > and bring up a new streaming replica (which bypasses the WAL archive, so > would otherwise work). That seems more like an argument for sticking a fflush() there, than closing stdout. Greetings, Andres Freund
Re: POC: Cleaning up orphaned files using undo logs
On 2019-07-22 14:21:36 +0530, Amit Kapila wrote: > Another thing is changing subxids to fxids can increase the size of > two-phase file for a xact having many sub-transactions which again > might be okay, but not completely sure. I can't see that being a problem.
Re: Psql patch to show access methods info
Hi, On 2019-07-15 22:03:31 +0300, Nikita Glukhov wrote: > + > + > + \dAc[+] > +[ class="parameter">access-method-pattern > + [ class="parameter">input-type-pattern]] > + > + > + > + > +Shows info index access method operator classes listed in > +. > +If class="parameter">access-method-patttern > +is specified, only operator classes associated with access method > whose > +name matches pattern are shown. > +If input-type-pattern > +is specified, only procedures associated with families whose input > type > +matches the pattern are shown. > +If + is appended to the command name, operator > family > +and owner are listed. > + > + > + > + > + > + > + \dAo[+] > +[ class="parameter">access-method-pattern > + [ class="parameter">operator-family-pattern]] > + > + > + > + > + > +Lists operators () associated > +with access method operator families. If > +access-method-patttern > is > +specified, only operators associated with access method whose name > +matches pattern are shown. If > +operator-family-pattern > is > +specified, only operators associated with families whose name matches > +the pattern are shown. > +If + is appended to the command name, displays > +additional info. > + > + > + > + > + > + > + \dAp[+] > +[ class="parameter">access-method-pattern > + [ class="parameter">operator-family-pattern]] > + > + > + > + > +Lists procedures () > associated > +with access method operator families. > +If class="parameter">access-method-patttern > +is specified, only procedures associated with access method whose > name > +matches pattern are shown. > +If class="parameter">operator-family-pattern > +is specified, only procedures associated with families whose name > +matches the pattern are shown. > +If + is appended to the command name, procedures > +listed with its names. > Based on a quick skim of the thread - which means I most definitely missed things - there's not been discussion of why we actually want to add this. Who's the prospective user of this facility? 
And why wouldn't they just query pg_am[proc]? None of this information seems like it's going to be even remotely targeted towards even advanced users. For developers it's not clear what these add? Adding stuff to psql isn't free. It adds clutter to psql's help output, the commands need to be maintained (including cross-version code). Greetings, Andres Freund
Re: initdb recommendations
Hi, On 2019-07-22 13:02:13 -0400, Andrew Dunstan wrote: > There are a few things we could do. We could force trust auth, or we > could add an ident map that allowed $USER to login as buildfarm. Finding > all the places we would need to fix that could be a fun project ... Perhaps we could actually do so automatically when the initdb-invoking user isn't the same as the OS user? Imo that'd be generally quite useful, and not just for the regression tests. Greetings, Andres Freund
Re: ANALYZE: ERROR: tuple already updated by self
Hi, On 2019-06-18 17:08:37 -0700, Andres Freund wrote: > On 2019-06-18 18:48:58 -0500, Justin Pryzby wrote: > > Ah: the table is an inheritance parent. If I uninherit its child, there's > > no > > error during ANALYZE. MV stats on the child are ok: > > It's a "classical" inheritance parent, not a builtin-partitioning type > of parent, right? And it contains data? > > I assume it ought to not be too hard to come up with a reproducer > then... > > I think the problem is pretty plainly that for inheritance tables we'll > try to store extended statistics twice. And thus end up updating the > same row twice. > > > #6 0x00588346 in do_analyze_rel (onerel=0x7fee16140de8, options=2, > > params=0x7ffe5b6bf8b0, va_cols=0x0, acquirefunc=0x492b4, relpages=36, > > inh=true, in_outer_xact=false, elevel=13) at analyze.c:627 > > /* Build extended statistics (if there are any). */ > BuildRelationExtStatistics(onerel, totalrows, numrows, rows, > attr_cnt, > > vacattrstats); > > Note that for plain statistics we a) pass down the 'inh' flag to the > storage function, b) stainherit is part of pg_statistic's key: > > Indexes: > "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, > stainherit) > > > Tomas, I think at the very least extended statistics shouldn't be built > when building inherited stats. But going forward I think we should > probably have it as part of the key for pg_statistic_ext. Tomas, ping? Greetings, Andres Freund
Re: Psql patch to show access methods info
Hi,

On 2019-07-23 01:57:29 +0300, Alexander Korotkov wrote:
> It was always scary there is no way in psql to see am/opclass/opfamily
> information rather than query catalog directly.

What does make that scary?

> I'm going to push it. Probably, someone find that commands syntax and
> output formats are not well discussed yet. But we're pretty earlier
> in 13 release cycle. So, we will have time to work out a criticism if
> any.

Please don't before we've had some discussion as to why we want this
additional code, and who'd be helped by it.

Greetings,

Andres Freund
Re: Psql patch to show access methods info
Hi,

On 2019-07-15 22:03:31 +0300, Nikita Glukhov wrote:
> +  \dAc[+] [access-method-pattern [input-type-pattern]]
> +
> +    Shows info about index access method operator classes.
> +    If access-method-pattern is specified, only operator classes
> +    associated with access methods whose name matches the pattern are
> +    shown. If input-type-pattern is specified, only operator classes
> +    associated with families whose input type matches the pattern are
> +    shown. If + is appended to the command name, operator family and
> +    owner are listed.
> +
> +  \dAo[+] [access-method-pattern [operator-family-pattern]]
> +
> +    Lists operators associated with access method operator families.
> +    If access-method-pattern is specified, only operators associated
> +    with access methods whose name matches the pattern are shown. If
> +    operator-family-pattern is specified, only operators associated
> +    with families whose name matches the pattern are shown. If + is
> +    appended to the command name, displays additional info.
> +
> +  \dAp[+] [access-method-pattern [operator-family-pattern]]
> +
> +    Lists procedures associated with access method operator families.
> +    If access-method-pattern is specified, only procedures associated
> +    with access methods whose name matches the pattern are shown. If
> +    operator-family-pattern is specified, only procedures associated
> +    with families whose name matches the pattern are shown. If + is
> +    appended to the command name, procedures are listed with their
> +    names.

Based on a quick skim of the thread - which means I most definitely
missed things - there's not been discussion of why we actually want to
add this. Who's the prospective user of this facility?

And why wouldn't they just query pg_am[proc]? None of this information
seems like it's going to be even remotely targeted towards even advanced
users. For developers it's not clear what these add?

Adding stuff to psql isn't free. It adds clutter to psql's help output,
and the commands need to be maintained (including cross-version code).

Greetings,

Andres Freund
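For comparison, a rough catalog-only equivalent of what a \dAc command would present is a short join (a sketch, not the patch's actual query):

```sql
-- Operator classes per access method, straight from the catalogs
SELECT am.amname       AS access_method,
       opc.opcname     AS opclass,
       t.typname       AS input_type,
       opc.opcdefault  AS is_default
FROM pg_opclass opc
JOIN pg_am am ON am.oid = opc.opcmethod
JOIN pg_type t ON t.oid = opc.opcintype
ORDER BY am.amname, opc.opcname;
```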
Re: Change atoi to strtol in same place
On 2019-07-24 16:57:42 +1200, David Rowley wrote:
> On Wed, 24 Jul 2019 at 16:02, Joe Nelson wrote:
> > > In general, I think it's a good idea to fix those places, but I wonder
> > > if we need to change the error messages too.
> >
> > I'll leave that decision for the community to debate. I did, however,
> > remove the newlines for the new error messages being passed to
> > pg_log_error().
>
> I'd like to put my vote not to add this complex code to each option
> validation that requires an integer number. I'm not sure there
> currently is a home for it, but if there was, wouldn't it be better
> writing a function that takes a lower and upper bound and sets some
> output param with the value and returns a bool to indicate if it's
> within range or not?

+many
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi,

On 2019-07-21 21:37:28 +1200, David Rowley wrote:
> select.sql:
> \set p 1
> select * from ht where a = :p
>
> Master:
>
> $ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 10172.035036 (excluding connections establishing)
> tps = 10192.780529 (excluding connections establishing)
> tps = 10331.306003 (excluding connections establishing)
>
> Patched:
>
> $ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 15080.765549 (excluding connections establishing)
> tps = 14994.404069 (excluding connections establishing)
> tps = 14982.923480 (excluding connections establishing)
>
> That seems fine, 46% faster.
>
> v6 is attached.
>
> I plan to push this in a few days unless someone objects.

It does seem far less objectionable than the other case.

I hate to throw one more wrench into a topic finally making progress,
but: have either of you considered just replacing the dynahash table
with a simplehash-style one? Given the obvious speed sensitivity, and
the fact that for it (in contrast to the shared lock table) no
partitioning is needed, that seems like a good thing to try. It seems
quite possible that both the iteration and plain manipulations are going
to be faster, due to far fewer indirections - e.g. the iteration through
the array will just be an array walk with a known stride, far easier for
the CPU to prefetch.

Greetings,

Andres Freund