Re: The ultimate extension hook.
> On 10/23/2020 9:31 AM Jehan-Guillaume de Rorthais wrote:
> [...]
> * useless with encrypted traffic
>
> So, +1 for such hooks.
>
> Regards,

Ultimately PostgreSQL is supposed to be extensible. I don't see an API hook as being some crazy idea even if some may not like what I might want to use it for. It can be useful for a number of things.

- Dan
Re: The ultimate extension hook.
> On 09/23/2020 9:26 PM Tom Lane wrote:
> ...
> > The hook I'd like to see would be in the PostgresMain() loop
> > for the API "firstchar" messages.
>
> What, to invent your own protocol?  Where will you find client libraries
> buying into that?

No API/client changes are needed for: (1) API tracing/filtering, or (3) custom SQL-like commands through a trivial modification to Simple Query 'Q'. Purely optional, as you'll see at the end.

Yes, (2) the API extension "case 'A'" could be used to roll one's own protocol. When pondering API hooking in general I thought of this also, but don't let it be a distraction.

> I'm not really convinced that any of the specific use-cases you suggest
> are untenable to approach via the existing function fastpath mechanism,
> anyway.

Certainly (3) is just a command-level way to execute a function instead of 'select myfunc()'. But that does go through the SQL machinery and SQL argument type lookup and processing. I like fast and direct things. And (3) is so trivial to implement. However, even fastpath doesn't provide a protocol hook function where tracing could be done. If I had that alone I could do my own 'Q' hook and do the "!cmd" processing in my extension, even if I sold the idea based on tracing/filtering alone.

We hook all kinds of things in PG. Think big. Why should the protocol processing not have a hook? I'll bet others will think of things I haven't even yet thought of that would leverage this.

- Dan Wood
The ultimate extension hook.
Hooks exist all over PG for extensions to cover various specific usages. The hook I'd like to see would be in the PostgresMain() loop for the API "firstchar" messages. While I started out just wanting the hook for the absolute minimum overhead to execute a function, even faster than fastpath, in brainstorming with David Rowley other use cases became apparent:

API tracing within the engine. I've heard of client tools for this.
API filtering. Block/ignore manual checkpoints, for instance.
API message altering.
Anything you want to hook into at the highest level, well above ExecutorRun.

Originally I just wanted a lightweight mechanism to capture some system counters like /proc/stat without going through the SQL execution machinery. I'm picky about implementing stuff in the absolute fastest way. :-)  But I think there are other practical things that I haven't even thought of yet.

There are a few implementation mechanisms which achieve slightly different possibilities:

1) The generic mechanism would let one or more API filters be installed to directly call functions in an extension. There would be no SQL arg processing overhead based on the specific function. You'd just pass it the StringInfoData 'msg' itself. Multiple extensions might use the hook, so you'd need to rewind the StringInfo buffer. Maybe I return a boolean to indicate no further processing of this message, or fall through to the normal "switch (firstchar)" processing.

2) A new trivial API message for extensions:

    switch (firstchar)
    {
        case 'A':   /* new message type */

which would call a single extension-installed function to do whatever I wanted based on the message payload. And, yes, I know this can be done just using SQL. It is simply a variation. But this would require client support and I prefer the below.

3) A one-byte check in Simple Query:

    case 'Q':   /* simple query */
        if (pq_peekbyte() == '!' && APIHook != NULL)
        {
            (*APIHook)(msg);
            ...
            continue;
        }

I've used this last technique to do things like:

    if (!strncmp(query_string, "DIEDIEDIE", 9))
    {
        char *np = NULL;
        *np = 1;
    }
    else if (!strncmp(query_string, "PING", 4))
    {
        static const char *pong = "PONG";
        pq_putmessage('C', pong, strlen(pong) + 1);
        send_ready_for_query = true;
        continue;
    }
    else if (...)

Then I can simply type PING into psql and get back a PONG. Or during a stress test on a remote box I can execute the simple query "DIEDIEDIE" and crash the server. I did this inline for experimentation before, but it would be nice if I had the mechanism to use a "statement" to invoke a hook function in an extension. A single check for "!" in the 'Q' processing would allow user-defined commands in extensions. The dispatcher would be in the extension. I just need the "!" check.

Another example where ultimate performance might be a goal: if you are familiar with why redis/memcached/etc. exist, then imagine loading SQL results into a cache in an extension and executing, as a 'simple' query, something like !LOOKUP and getting the value back faster than SQL could.

Before I prototype I want to get some feedback. Why not have a hook at the API level?
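For concreteness, here is a minimal sketch of what mechanism (1) might look like. The hook type and variable name are my assumptions, not existing PG API; input_message and send_ready_for_query are the real locals in PostgresMain():

    /* In some header: a hook an extension can install from _PG_init(). */
    typedef bool (*ProtocolMessageHook_type) (char firstchar, StringInfo msg);
    extern ProtocolMessageHook_type ProtocolMessageHook;

    /* In PostgresMain(), just before "switch (firstchar)": */
    if (ProtocolMessageHook != NULL)
    {
        bool consumed = (*ProtocolMessageHook) (firstchar, &input_message);

        /*
         * The hook may have read from the buffer; rewind for the next
         * consumer (another hook, or the normal switch).
         */
        input_message.cursor = 0;
        if (consumed)
        {
            send_ready_for_query = true;
            continue;
        }
    }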
Re: Reduce/eliminate the impact of FPW
> On 08/03/2020 8:26 AM Robert Haas wrote:
> ...
> I think this is what's called a double-write buffer, or what was tried
> some years ago under that name. A significant problem is that you
> have to fsync() the double-write buffer before you can write the WAL.

I don't think it needs to be fsync'ed before the WAL. If a log record has a FPW reference beyond the physical log's EOF then we don't need to restore the before image, because we haven't yet done the dirty page write from the cache. The before image only needs to be flushed before the dirty page write, and usually that will already have been done.

> ... But for short transactions, such as those
> performed by pgbench, you'd probably end up with a lot of cases where
> you had to write 3 pages instead of 2, and not only that, but the
> writes have to be consecutive rather than simultaneous, and to
> different parts of the disk rather than sequential. That would likely
> suck a lot.

Wherever you write the before images, in the WAL or into a separate file, you would write the same number of pages. I don't understand the 3 pages vs. 2 pages comment. And "different parts of the disk"??? I wouldn't enable the feature on spinning media unless I had a dedicated disk for it.

NOTE: In the 90's Informix called this the physical log. Restoring it at crash time restored physical consistency, after which redo/undo recovery achieved logical consistency. From their docs: "If the before-image of a modified page is stored in the physical-log buffer, it is eventually flushed from the physical-log buffer to the physical log on disk. The before-image of the page plays a critical role in restoring data and fast recovery. For more details, see Physical-Log Buffer."

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Reduce/eliminate the impact of FPW
I thought that the biggest reason for the pgbench RW slowdown during a checkpoint was the flood of dirty page writes increasing the COMMIT latency. It turns out that the documentation which states that FPW's start "after a checkpoint" really means after a CKPT starts. And this is the real cause of the deep dip in performance. Maybe only I was fooled... :-)

If we can't eliminate FPW's, can we at least reduce their impact? Instead of writing the before images of pages inline into the WAL, which increases the COMMIT latency, write these same images to a separate physical log file. The key idea is that I don't believe COMMIT's require these buffers to be immediately flushed to the physical log. We only need to flush them before the dirty pages themselves are written. This delay allows the physical before-image IO's to be decoupled and done in an efficient manner without an impact to COMMIT's.

1. When we generate a physical image, add it to an in-memory buffer of before page images.
2. Put the physical log offset of the before image into the WAL record. This is the current physical log file size plus the offset in the in-memory buffer of pages.
3. Set a bit in the BufHdr indicating this was done.
4. COMMIT's do not need to worry about those buffers.
5. Periodically flush the in-memory buffer and clear the bit in the BufHdr.
6. During any dirty page flushing, if we see the bit set, which should be rare, then make sure we get our before image flushed first. This would be similar to our LSN-based XLogFlush(). (See the sketch at the end of this mail.)

Do we need these before images for more than one CKPT? I don't think so. Do PITR's require before images, given that a PITR is a continuous rollforward from a restore? Just some of the considerations. Do I need to back this physical log up? I likely(?) need to deal with replication.

Turning off FPW gives about a 20%, maybe more, boost on a pgbench TPC-B RW workload which fits in the buffer cache. Can I get this 20% improvement with a separate physical log of before page images? Doing IO's off on the side, decoupled from the WAL stream, doesn't seem to impact COMMIT latency on modern SSD-based storage systems. For instance, you can hammer a shared data and WAL SSD filesystem with dirty page writes from the CKPT, at near the max IOPS of the SSD, and not impact COMMIT latency. However, this presumes that the CKPT's natural spreading of dirty page writes across the CKPT target doesn't push too many outstanding IO's into the storage write Q on the OS/device.

NOTE: I don't believe the CKPT's throttling is perfect, and I think a burst of dirty pages into the cache just before a CKPT might flood the Q and further slow TPS during the CKPT. But a fix for this is off topic from the FPW issue.

Thanks to Andres Freund for both making me aware of the Q depth impact on COMMIT latency and the hint that FPW might also be causing the CKPT slowdown. FYI, I always knew about the FPW slowdown in general; I just didn't realize it was THE primary cause of the CKPT TPS slowdown on pgbench.

NOTE: I realize that spinning media might exhibit different behavior. And I did not say dirty page writing has NO impact on good SSD's. It depends, and this is a subject for a later date, as I have a theory as to why I sometimes see sawtooth performance for pgbench TPC-B and sometimes a square wave, but I want to prove it first.
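A minimal sketch of the step-6 rule, mirroring the XLogFlush() pattern. Every new name here (PhysLogFlush, BM_BEFORE_IMAGE_PENDING, phys_log_offset) is an assumption for illustration, not existing PG code; a real patch would also have to find room among the existing BM_* flag bits:

    #define BM_BEFORE_IMAGE_PENDING (1U << 21)  /* hypothetical flag bit */

    /* Called just before a dirty data page is written out. */
    static void
    EnsureBeforeImageFlushed(BufferDesc *buf)
    {
        uint32  state = pg_atomic_read_u32(&buf->state);

        if (state & BM_BEFORE_IMAGE_PENDING)
        {
            /*
             * Flush the in-memory before-image buffer up to this page's
             * recorded offset, the same way XLogFlush(lsn) guarantees
             * WAL-before-data.  Rare path: the periodic flusher usually
             * clears the bit first.
             */
            PhysLogFlush(buf->phys_log_offset);     /* hypothetical */
            pg_atomic_fetch_and_u32(&buf->state, ~BM_BEFORE_IMAGE_PENDING);
        }
    }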
Wait profiling
After nearly 5 years does something like the following yet exist?
https://www.postgresql.org/message-id/559d4729.9080...@postgrespro.ru

I feel that it would be useful to have the following two things: one PG enhancement and one standard extension.

1) An option to "explain" to produce a wait events profile.

   postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 200;
   ...
   Execution time: 23111.231 ms
   62.6% BufFileRead
   50.0% CPU
    9.3% LWLock

   It uses a PG timer to do this.

2) An extension-based function like:

   select pg_wait_profile(pid, nSeconds, timerFrequency)

   to return the same thing for an already running query. Useful if you want to examine some already long-running query that is taking too long.

Neither of these would go through the heavyweight pg_stat_activity machinery; they would directly poll the wait event in the PROC. I've already coded the EXPLAIN option. (A sketch of the sampling idea is at the end of this mail.)

Furthermore, can't we just remove the following "if" test from pgstat_report_wait_{start,end}?

    if (!pgstat_track_activities || !proc)
        return;

Just do the assignment of wait_event_info always. We should use a dummy PGPROC assigned to MyProc until we assign the one in the procarray in shared memory; that way we don't need the "!proc" test. About the only thing I'd want to verify is whether wait_event_info is on the same cache lines as anything else having to do with snapshots.

If I recall correctly, the blank lines above that I've used to make this more readable will disappear. :-(
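To illustrate the lightweight polling, here is a minimal sketch of a timer handler that samples MyProc->wait_event_info directly. The handler name and counter array are my assumptions; the classId-in-the-top-byte layout matches pgstat.h (e.g. PG_WAIT_LWLOCK = 0x01000000):

    /* Tallies sampled wait event classes for the profiled query. */
    static uint64 wait_class_counts[256];

    static void
    wait_profile_sample(void)
    {
        /* Cheap direct read of the advertised wait event; no
         * pg_stat_activity machinery involved. */
        uint32  w = MyProc->wait_event_info;

        /* The wait class lives in the high byte. */
        wait_class_counts[w >> 24]++;
    }

- Dan Wood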
Re: 'Invalid lp' during heap_xlog_delete
> On December 6, 2019 at 3:06 PM Andres Freund wrote:
> ...
> > crash
> > smgrtruncate - Not reached
>
> This seems like a somewhat confusing description to me, because
> smgrtruncate() is what calls DropRelFileNodeBuffers(). I assume what you
> mean by "smgrtruncate" is not the function, but the smgr_truncate()
> callback?

My mistake. Yes, smgr_truncate().

> > Even if we reach the truncate we don't fsync it till the next
> > checkpoint. So on filesystems which delay metadata updates a crash
> > can lose the truncate.
>
> I think that's probably fine though. Leaving the issue of a checkpoint
> completing in between the DropRelFileNodeBuffers() and the actual
> truncation aside, we'd have the WAL logged truncation truncating the
> file. I don't think it's unreasonable to expect a filesystem that claims
> to support running without full_page_writes (I've seen several such
> claims turning out not to be true under load), to preserve either the
> original page contents or the new file size after a crash. If your
> filesystem doesn't, you really ought not to use it with FPW = off.

If the physical truncate doesn't occur in the seconds after the cache invalidation, nor the fsync within the minutes till the next checkpoint, you are NOT left with a torn page on disk. You are left with an 'incorrect' page on disk: an older page, because the invalidation prevented the write of the most recent dirty page. Redo doesn't like old incorrect pages. But, yes, full page writes cover up this anomaly (to be generous).

> > Once we do the fsync(), for the truncate, the REDO read will return
> > BLK_NOTFOUND and the DELETE REDO attempt will be skipped. Without the
> > fsync() or crashing before the truncate, the delete redo depends on
> > the most recent version of the page having been written by the
> > checkpoint.
>
> We also need to correctly replay this on a standby, so I don't think
> just adding an smgrimmedsync() is the answer. We'd not replay the
> truncation on standbys / during PITR, unless I miss something. So we'd
> just end up with the same problem in slightly different situations.

I haven't mentioned to you that we have seen what appears to be the same problem during PITR's, depending on which base backup we start with. I didn't mention it because I haven't created a repro to prove it. I simply suspect it.

> To me it sounds like the fix here would be to rejigger the
> RelationTruncate() sequence for truncation of the main fork as follows:
>
> 1) MyPgXact->delayChkpt = true
> 2) XLogInsert(XLOG_SMGR_TRUNCATE)
> 3) smgrtruncate() (which, as now, first does a DropRelFileNodeBuffers(),
>    and then calls the smgr_truncate callback)
> 4) MyPgXact->delayChkpt = false
>
> I'm not worried about the increased delayChkpt = true time. Compared
> with the frequency of RecordTransactionCommit() this seems harmless.

Seems reasonable. (A sketch of that sequence is at the end of this mail.)

> I'm inclined to think that we should make the XLogFlush() in the
> RelationNeedsWAL() branch of RelationTruncate()
> unconditional. Performing the truncation on the filesystem level without
> actually having persisted the corresponding WAL record is dangerous.

Yes, I also was curious about why it was conditional.
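A sketch of what that rejiggered sequence might look like inside RelationTruncate(), following the quoted steps. This is simplified from the PG12-era code (error handling omitted); the calls, record type, and flags shown are the real ones:

    xl_smgr_truncate xlrec;
    XLogRecPtr       lsn;

    /* 1) Keep a checkpoint from completing in the middle of this. */
    MyPgXact->delayChkpt = true;

    /* 2) WAL-log the truncation and flush it, unconditionally. */
    xlrec.blkno = nblocks;
    xlrec.rnode = rel->rd_node;
    xlrec.flags = SMGR_TRUNCATE_ALL;
    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, sizeof(xlrec));
    lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE);
    XLogFlush(lsn);

    /* 3) Drop the buffers, then do the filesystem truncate. */
    smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);

    /* 4) Allow checkpoints again. */
    MyPgXact->delayChkpt = false;

- Dan Wood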
Re: 'Invalid lp' during heap_xlog_delete
> I'll try to look into that with lower page sizes for relation and WAL
> pages.

The page size is totally unrelated to this bug. When you repro the redo failure it is because the log record is being applied to an old page version. The correct newer page version never got written, because of the truncate page invalidation. The cause is not a torn write.

> What's that?

Azure PostgreSQL on Windows has proprietary mechanisms in its filesystem to guarantee that a write is atomic.

- Dan

> On November 18, 2019 at 4:58 AM Michael Paquier < mich...@paquier.xyz > wrote:
>
> On Thu, Nov 14, 2019 at 07:38:19PM -0800, Daniel Wood wrote:
> >
> > Sorry I missed one thing. Turn off full page writes.
>
> Hmm. Linux FSes typically use 4kB pages. I'll try to look into that
> with lower page sizes for relation and WAL pages.
>
> > I'm running in an env. with atomic 8K writes.
>
> What's that?
> --
> Michael
Re: 'Invalid lp' during heap_xlog_delete
Sorry I missed one thing. Turn off full page writes. I'm running in an env. with atomic 8K writes.

> On November 12, 2019 at 6:23 PM Daniel Wood wrote:
>
> It's been tedious to get it exactly right but I think I got it. FYI, I
> was delayed because today we had yet another customer hit this 'redo max
> offset' error. The system crashed as a number of autovacuums and a
> checkpoint happened and then the REDO failure
Re: 'Invalid lp' during heap_xlog_delete
It's been tedious to get it exactly right, but I think I got it. FYI, I was delayed because today we had yet another customer hit this 'redo max offset' error. The system crashed as a number of autovacuums and a checkpoint happened, and then the REDO failure.

Two tiny code changes:

bufmgr.c:BufferSync(): at the beginning of the function add

    pg_usleep(1000);

smgr.c:smgrtruncate(): add the following just after CacheInvalidateSmgr()

    if (forknum == MAIN_FORKNUM && nblocks == 0)
    {
        pg_usleep(4000);
        { char *cp = NULL; *cp = 13; }      /* simulate a crash here */
    }

Now for the heavily commented SQL repro. It requires that you execute a checkpoint in another session when instructed by the repro.sql script. You have 4 seconds to do that. The repro script explains exactly what must happen.

---
create table t (c char(1));
alter table t alter column c set storage plain;

-- Make sure there actually is an allocated page 0 and it is empty.
-- REDO of the delete would ignore a non-existent page:
--   XLogReadBufferForRedoExtended: return BLK_NOTFOUND;
-- Hopefully two row deletes don't trigger autovacuum and truncate the empty page.
insert into t values ('1'), ('2');
delete from t;
checkpoint;  -- Checkpoint the empty page to disk.

-- This insert should be before the next checkpoint 'start'. I don't want to replay it.
-- And, yes, there needs to be another checkpoint completed to skip its replay and start
-- with the replay of the delete below.
insert into t values ('1'), ('2');

-- A checkpoint needs to start in another session. However, I need to stall the checkpoint
-- to prevent it from writing the dirty page to disk until I get to the vacuum below.
select 'Please start checkpoint in another session';
select pg_sleep(4);

-- Below is the problematic delete.
-- It succeeds now (online) because the dirty page has two rows on it.
-- However, with respect to crash recovery there are 3 possible scenarios depending on timing.
-- 1) The ongoing checkpoint might write the page with the two rows on it before
--    the deletes. This leads to BLK_NEEDS_REDO for the deletes. It works
--    because the page read from disk has the rows on it.
-- 2) The ongoing checkpoint might write the page just after the deletes.
--    In that case BLK_DONE will happen and there'll be no problem. LSN check.
-- 3) The checkpoint can fail to write the dirty page because a vacuum can call
--    smgrtruncate->DropRelFileNodeBuffers(), which invalidates the dirty page.
--    If smgrtruncate safely completes the physical truncation then BLK_NOTFOUND
--    happens and we skip the redo of the delete. So the skipped dirty write is OK.
--    The problem happens if we crash after the 2nd checkpoint completes
--    but before the physical truncate 'mdtruncate()'.
delete from t;

-- The vacuum must complete DropRelFileNodeBuffers.
-- The vacuum must sleep for a few seconds to allow the checkpoint to complete
-- such that recovery starts with the Delete REDO.
-- We must crash before mdtruncate() does the physical truncate. If the physical
-- truncate happens then BLK_NOTFOUND will be returned and the Delete REDO skipped.
vacuum t;

> On November 10, 2019 at 11:51 PM Michael Paquier < mich...@paquier.xyz > wrote:
>
> On Fri, Nov 08, 2019 at 06:44:08PM -0800, Daniel Wood wrote:
> >
> > I repro'ed on PG11 and PG10 STABLE but several months old.
> > I looked at 6d05086 but it doesn't address the core issue.
> >
> > DropRelFileNodeBuffers prevents the checkpoint from writing all
> > needed dirty pages for any REDO's that exist BEFORE the truncate.
> > If we crash after a checkpoint but before the physical truncate then
> > the REDO will need to replay the operation against the dirty page
> > that the Drop invalidated.
>
> I am beginning to look at this thread more seriously, and I'd like to
> first try to reproduce that by myself. Could you share the steps you
> used to do that? This includes any manual sleep calls you may have
> added, the timing of the crash, manual checkpoints, debugger
> breakpoints, etc. It may be possible to extract some more generic
> test from that.
> --
> Michael
Re: 'Invalid lp' during heap_xlog_delete
I repro'ed on PG11 and PG10 STABLE, but several months old. I looked at 6d05086 but it doesn't address the core issue.

DropRelFileNodeBuffers prevents the checkpoint from writing all needed dirty pages for any REDO's that exist BEFORE the truncate. If we crash after a checkpoint but before the physical truncate then the REDO will need to replay the operation against the dirty page that the Drop invalidated.

Teja Mupparti, an engineer I work with, suggested moving DropRelFileNodeBuffers to the bottom of smgrtruncate(), after the physical truncate. Doing that, along with an fsync() after the truncate, seems to plug the hole.

> On November 8, 2019 at 5:39 PM Michael Paquier < mich...@paquier.xyz > wrote:
>
> On Fri, Nov 08, 2019 at 12:46:51PM -0800, Daniel Wood wrote:
> >
> > Is DropRelFileNodeBuffers purely for performance or would there be
> > any correctness problems if not done.
>
> On which version did you find that? Only HEAD or did you use a
> version on a stable branch? There has been some work done in this
> area lately as of 6d05086.
> --
> Michael
'Invalid lp' during heap_xlog_delete
Page on disk has empty lp 1

*  Insert into page lp 1
   checkpoint START.  (Redo eventually starts here.)
** Delete all rows on page.
   autovac truncate
   DropRelFileNodeBuffers - dirty page NOT written.  lp 1 on disk still empty.
   checkpoint completes
   crash
   smgrtruncate - Not reached

heap_xlog_delete reads the page with empty lp 1 and the delete fails. For this to repro, the checkpoint must not yet have written * or ** before DropRelFileNodeBuffers invalidates either of those dirty page versions.

Even if we reach the truncate we don't fsync it till the next checkpoint. So on filesystems which delay metadata updates a crash can lose the truncate. Once we do the fsync() for the truncate, the REDO read will return BLK_NOTFOUND and the DELETE REDO attempt will be skipped. Without the fsync(), or crashing before the truncate, the delete redo depends on the most recent version of the page having been written by the checkpoint.

Found during a stress test and verified with pg_usleep's to test the hypothesis.

Is DropRelFileNodeBuffers purely for performance, or would there be any correctness problems if it were not done?
Re: BTP_DELETED leaf still in tree
> On October 10, 2019 at 1:18 PM Peter Geoghegan wrote:
>
> On Thu, Oct 10, 2019 at 12:48 PM Daniel Wood wrote:
> > Update query stuck in a loop. Looping in _bt_moveright().
>
> You didn't say which PostgreSQL versions were involved, and if the
> database was ever upgraded using pg_upgrade. Those details could
> matter.

PG_VERSION says 10. I suspect we are running 10.9. I have no idea if pg_upgrade was ever done.

> > ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright
> >
> > Mid tree node downlink path taken by _bt_search points to a BTP_DELETED
> > leaf.
>
> This should hardly ever happen -- it is barely possible for an index
> scan to land on a BTP_DELETED leaf page (or a half-dead page) when
> following a downlink in its parent. Recall that nbtree uses Lehman &
> Yao's design, so _bt_search() does not "couple" buffer locks on the
> way down. It would probably be impossible to observe this happening
> without carefully setting breakpoints in multiple sessions.
>
> If this happens reliably for you, which it sounds like, then you can
> already assume that the index is corrupt.
>
> > btpo_next is also DELETED but not in the tree.
> >
> > btpo_next->btpo_next is NOT deleted but is in the mid tree as a lesser
> > key value.
> >
> > Thus creating an endless loop in moveright.
>
> Offhand, these other details sound normal. The side links are still
> needed in fully deleted (BTP_DELETED) pages. And, moving right and
> finding lesser key values (not greater key values) is normal with
> deleted pages, since page deletion makes the keyspace move right, not
> left (moving the keyspace left is how the source Lanin & Shasha paper
> does it, though).
>
> Actually, I take it back -- the looping part is not normal. The
> btpo_next->btpo_next page has no business linking back to the
> original/first deleted page you mentioned. That's just odd.

btpo_next->btpo_next does NOT link directly back to the 1st deleted page. It simply links to some in-use page which is 50 or so leaf pages back in the tree. Eventually we do reach the two deleted pages again. Only the first one is in the 'tree'.

> Can you provide me with a dump of the page images? The easiest way of
> getting a page dump is described here:

Customer data. It looks like meaningless customer data (5-digit key values), but too much paperwork. :-)

The hard part for me to understand isn't just why the DELETED leaf node is still referenced in the mid tree node. It is that the step which sets BTP_DELETED should have also linked its left and right siblings together, but this hasn't been done. Could the page already have been dirty but, because of "target != leafblkno", we didn't stamp a new LSN on it? Could this allow us to write the DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and XLOG_BTREE_UNLINK_PAGE records being flushed? Of course, I don't understand the "target != leafblkno".

In any case, thanks.

> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump
>
> If I had to guess, I'd guess that this was due to a generic storage problem.
>
> --
> Peter Geoghegan
pgbench prints suspect tps numbers
Short benchmark runs are bad if the runs aren't long enough to produce consistent results. Having to do long runs because a benchmarking tool 'converges to reality' over time in reporting a tps number, due to a miscalculation, is also bad.

I want to measure TPS at a particular connection count. A fully cached select-only pgbench produces fairly consistent numbers over short runs of a few minutes. pgbench's "including connections establishing" number is polluted by the fact that for many seconds the benchmark is running with fewer than the expected number of connections. I thought that was why the 'excluding' number was also printed, and I had been relying on that number.

pgbench's "excluding connections establishing" number can be total garbage, way, way over the actual tps. During a period when I had a bug causing slow connections I noticed a consistent value of about 100K tps over the measurement intervals. At the end of a 5 minute run it reported 450K tps! There was no point anywhere during the benchmark that it ran anywhere near that number. I had been using 'excluding' because it 'seemed' about right in the past. It was only when I got crazy numbers that I looked at the calculation to find:

    tps_exclude = total->cnt /
        (time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));

The 'cnt' is the total across the entire run, including the period when connections are ramping up. I don't see how dividing by the total time minus the average connection time produces the correct result. Even without buggy slow connections, when connecting 1000 clients, I've wondered why the 'excluding' number seemed a bit higher than any given reporting interval's numbers over a 5 minute run. I now understand why. (A worked example with made-up numbers is at the end of this mail.)

NOTE: When the system hits 100% CPU utilization (after about the first 100 connections) on a fully cached select-only pgbench, further connections can struggle to get connected, which really skews the results. How about a patch which offered the option to wait on an advisory lock as a mechanism to let the main thread delay the start of the workload until all clients have connected and entered a READY state? This would produce a much cleaner number.
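To see how the formula overshoots, plug in made-up numbers of the same shape as that buggy run: a 300 second run, clients that on average spend 240 seconds getting connected, and 27M transactions completed in total. True throughput averaged 90K tps and never exceeded about 100K, yet:

    tps_exclude = 27,000,000 / (300 - 240) = 450,000

All 27M transactions, including those executed while other clients were still ramping up, get credited to the 60 seconds of "excluded-connection" time.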
Re: Skylake-S warning
One other thought. Could we update pgxact->xmin less often? What would be the impact of this lower bound being lower than it would normally be with the existing scheme? Yes, it needs to be moved forward "occasionally".

FYI, be careful with padding PGXACT's to a full cache line. With 1024 clients you'd completely blow out the L1 cache. The benefit with fewer than 200 clients is dubious. One problem with multiple (5) pgxact's in a single cache line is that you may get a miss fetching xmin and then a second miss fetching xid, because an invalidation occurs between the 2 fetches from updating any of the other 5 pgxact's on the same cache line. I've found some improvement fetching both xmin and xid with a single 64 bit fetch; this prevents the invalidation between the two fetches. Similarly, updating them with a single 64 bit write helps. A sketch of the combined fetch follows. Yes, this is uber tweaky.
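A minimal sketch of the combined fetch, assuming (as in PG10's PGXACT) that xid and xmin are adjacent 32-bit fields; the union type and function are mine, not PG's:

    typedef union
    {
        struct
        {
            TransactionId xid;      /* matches PGXACT field order */
            TransactionId xmin;
        }       f;
        uint64  combined;
    } XidXminFetch;

    static inline void
    FetchXidAndXmin(volatile PGXACT *pgxact,
                    TransactionId *xid, TransactionId *xmin)
    {
        XidXminFetch fetch;

        /*
         * One 8-byte load instead of two 4-byte loads that an invalidation
         * can split.  This ignores the alignment and endianness caveats a
         * real patch would need to handle: PGXACT is 12 bytes, so some
         * array entries straddle a cache line boundary.
         */
        fetch.combined = *((volatile uint64 *) &pgxact->xid);
        *xid = fetch.f.xid;
        *xmin = fetch.f.xmin;
    }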
Re: Skylake-S warning
> On October 3, 2018 at 3:55 PM Andres Freund wrote:
>
> In the thread around
> https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de
> I'd found doing more aggressive padding helped a lot. Unfortunately I
> didn't pursue this further :(

Interesting. Looks like you saw the same thing I see now.

> I really suspect we're going to have to change the layout of PGXACT data
> in a way that makes a contiguous scan possible. That'll probably require
> some ugliness because a naive "use first free slot" scheme obviously is
> sensitive to there being holes. But I suspect that a scheme where
> backends would occasionally try to move themselves to an earlier slot
> wouldn't be too hard to implement.

I've also thought of this. But someone could create a thousand connections and then disconnect all but the first and last, creating a huge hole. One thought would be to have a bit array of used entries and to bias toward using the free lower-indexed entries. Only two cache lines of 1024 bits would need to be scanned to handle 1024 connections. ProcArrayAdd() would just do an atomic xchg to set the used bit. (A sketch follows at the end of this mail.)

I was even thinking of decomposing PGXACT into separate arrays of XID's, XMIN's, etc. Then have a bit map of those entries where the XID is valid. When you set/clear your XID you'd also set/clear this bit. With the select-only workload the XID's are all "Invalid", thus scanning this bit array of zeroed 64 bit entries would be quite efficient. The vacuum and logical decoding flags could be done directly as a bit map. Having an array of 1-byte subtxn counts creates another opportunity for improvement. Scanning for a non-zero entry in environments which use few subtxn's is efficient. If subtxn's are used a lot then the repeated PGXACT cache line invalidations when setting nxids are offset by the fact that you can grab 8 of them in a single fetch.
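A minimal sketch of the "prefer the lowest free slot" idea. The function and array names are mine; pg_atomic_read_u64/pg_atomic_compare_exchange_u64 are PG's real atomics API, and __builtin_ctzll is the GCC/Clang builtin that finds the lowest set bit:

    #define SLOT_WORDS (1024 / 64)  /* bitmap for 1024 slots: 2 cache lines */

    static pg_atomic_uint64 slot_used[SLOT_WORDS];

    static int
    AllocLowestFreeSlot(void)
    {
        for (int w = 0; w < SLOT_WORDS; w++)
        {
            uint64  used = pg_atomic_read_u64(&slot_used[w]);

            while (used != ~UINT64CONST(0))
            {
                int     bit = __builtin_ctzll(~used);   /* lowest clear bit */
                uint64  newval = used | (UINT64CONST(1) << bit);

                /* On failure, 'used' is refreshed and we retry. */
                if (pg_atomic_compare_exchange_u64(&slot_used[w], &used, newval))
                    return w * 64 + bit;
            }
        }
        return -1;              /* all 1024 slots in use */
    }

- Dan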
Skylake-S warning
If running benchmarks, or you are a customer currently impacted by GetSnapshotData() on high end multisocket systems, be wary of Skylake-S.

Performance differences of nearly 2X can be seen on select-only pgbench due to nothing else but unlucky choices for max_connections. Scale 1000, 192 local clients on a 2 socket 48 core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz) system, pgbench -S. Results from runs varying max_connections from 400 to 405:

max conn    TPS
400          677639
401         1146776
402         1122140
403          765664
404          671455
405         1190277

perf top shows about 21% GetSnapshotData() with the good numbers and 48% with the bad numbers. This problem is not seen on a 2 socket 32 core Haswell system.

Being a one man show I lack some of the diagnostic tools to drill down further. My suspicion is that the fact that Intel has lowered the L2 associativity from 8 (Haswell) to 4 (Skylake-S) may be the cause. The other possibility is that at higher core counts the shared 16-way inclusive associative L3 cache becomes insufficient. Perhaps that is why Intel has moved to an exclusive L3 cache on Skylake-SP. If this is indeed just disadvantageous placement of structures/arrays in memory, then you might also find that after upgrading, a previously good choice for max_connections becomes a bad choice if things move around.

NOTE: "int pgprocno = pgprocnos[index];" is where the big increase in time occurs in GetSnapshotData(). This array is largely read-only once all connections are established, easily fits in the L1, and is not next to anything else causing invalidations.

NOTE2: It is unclear why PG needs to support over 64K sessions. At about 10MB per backend (at the low end) the empty backends alone would consume 640GB's of memory! Changing pgprocnos from int to short (a sketch of the change follows) gives me the following results:

max conn    TPS
400          780119
401         1129286
402         1263093
403          887021
404          679891
405         1218118

While this change is significant on large Skylake systems, it is likely just a trivial improvement on other systems or workloads.
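The experiment itself is tiny; a sketch against the PG10-era ProcArrayStruct, abbreviated to the relevant field, so treat this as illustration rather than a patch:

    typedef struct ProcArrayStruct
    {
        int         numProcs;   /* number of valid procs entries */

        /* ... other bookkeeping fields unchanged ... */

        /*
         * Indexes into allPgXact.  Was "int pgprocnos[...]"; int16 is
         * plenty for any practical max_connections and halves the bytes
         * GetSnapshotData must scan.
         */
        int16       pgprocnos[FLEXIBLE_ARRAY_MEMBER];
    } ProcArrayStruct;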
GetSnapshotData round two(for me)
I was about to suggest creating a single shared snapshot instead of having multiple backends compute what is essentially the same snapshot. Luckily, before posting, I discovered "Avoiding repeated snapshot computation"
https://www.postgresql.org/message-id/caboikdmsj4osxta7xbv2quhkyuo_4105fjf4n+uyroybazs...@mail.gmail.com
from Pavan and "POC: Cache data in GetSnapshotData()"
https://www.postgresql.org/message-id/20150202152706.gd9...@alap3.anarazel.de
from Andres.

Andres, could I get a short summary of the biggest drawback that may have prevented this from being released?

Before I saw those threads I had done my own implementation and saw some promising results (25% on 48 cores). I do need to do some mixed RO and RW workloads to see how the invalidations of the shared copy, at EOT time, affect the results. There are some differences in my implementation. I chose, perhaps incorrectly?, to busy-spin other users trying to get a snapshot while the first guy in builds the shared copy. My thinking is to not increase the latency of using the snapshot. The improvement doesn't come from getting off the CPU, by using a WAIT, but from not reading the PGXACT cache lines, which are constantly being dirtied, on every CPU acquiring the snapshot. One backend can do the heavy lifting and the others can immediately jump on the shared copy once created. (A sketch of the scheme appears at the end of this mail.)

And something else quite weird: as I was evolving a standard setup for benchmark runs and getting baselines, I was getting horrible numbers sometimes (680K) and other times I'd get over 1 million QPS. I was thinking I had a bad machine. What I found was that even though I was running a fixed 192 clients, I had set max_connections to 600 sometimes and 1000 on other runs. Here is what I see running select-only scale 1000 pgbench with 192 clients on a 48 core box (2 sockets) using different values for max_connections:

 200 tps = 1092043
 250 tps = 1149490
 300 tps =  732080
 350 tps =  719611
 400 tps =  681170
 450 tps =  687527
 500 tps =  859978
 550 tps =  927161
 600 tps = 1092283
 650 tps = 1154916
 700 tps = 1237271
 750 tps = 1195968
 800 tps = 1162221
 850 tps = 1140626
 900 tps =  749519
 950 tps =  648398
1000 tps =  653460

This is on the base 12.x codeline. The only thought I've had so far is that each PGXACT in use (192) is being scattered across the full set of max_connections, instead of being physically contiguous in the first 192 slots. This would cause more cache lines to be scanned. It doesn't make a lot of sense given that it goes back up again from 500, peaking at 700. Also, this is after a fresh restart, so the procs in the freelist shouldn't have been scrambled yet in terms of ordering.

NOTE: I believe you'll only see this huge difference on a dual socket machine. It'd probably only take 30 minutes or so on a big machine to confirm, with a couple of few-minute runs at different values for max_connections. I'll be debugging this soon, but I've been postponing it while experimenting with my shared snapshot code.
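A minimal sketch of the busy-spin scheme described above. SharedSnapshotSlot, BuildSnapshotFromProcArray, and the state protocol are my inventions for illustration; the atomics calls are PG's real API, and the xip array handling and invalidation at EOT are elided:

    typedef struct SharedSnapshotSlot
    {
        pg_atomic_uint32 state;     /* 0 = stale, 1 = building, 2 = valid */
        SnapshotData     snap;      /* the shared copy */
    } SharedSnapshotSlot;

    static void
    GetSharedSnapshot(SharedSnapshotSlot *slot, Snapshot out)
    {
        uint32  expected = 0;

        if (pg_atomic_compare_exchange_u32(&slot->state, &expected, 1))
        {
            /* First one in: scan the proc array once for everybody. */
            BuildSnapshotFromProcArray(&slot->snap);    /* hypothetical */
            pg_atomic_write_u32(&slot->state, 2);
        }
        else
        {
            /* Busy-spin: latency matters more here than yielding the CPU. */
            while (pg_atomic_read_u32(&slot->state) != 2)
                ;   /* a real patch would insert a pause/SPIN_DELAY here */
        }
        memcpy(out, &slot->snap, sizeof(SnapshotData));
    }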
Re: On the need for a snapshot in exec_bind_message()
> > Queries stop getting re-optimized after 5 times, unless better plans are to
> > be had. In the absence of schema changes or changing search path why is
> > the snapshot needed?
>
> The snapshot has little to do with the query plan, usually. It's about
> what view of the database the executed query will see, and particularly
> about what view the parameter input functions will see, if they look.
>
> You could maybe argue that immutable input functions shouldn't need
> snapshots, but there are enough not-immutable ones that I don't think
> that gets you very far. In any case, we'd still need a snapshot for
> query execution. The set of queries that could possibly never need
> a snapshot at all is probably not large enough to be interesting.

I would think that the vast majority of bind variables in customer queries involve built-in data types whose input functions are immutable. As far as the "view of the database the executed query will see" goes, either the database itself changes, which should trigger catcache invalidations that can be accounted for, or the "view" changes, as with search_path, which can be tested for. I would think the optimization would be applicable most of the time. It isn't the number of obscure user-created non-immutable input functions that exist; it is the frequency with which immutable input functions are used in real apps. This is a 50% performance improvement in point lookups on larger boxes.
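Checking the "immutable input functions" condition is cheap to do once, at first parse. A sketch using the real catalog helpers getTypeInputInfo() and func_volatile(); the surrounding loop, param_types array, and flag are mine:

    /*
     * Decide once whether any parameter's input function could need a
     * snapshot; cache the answer on the prepared statement.
     */
    bool    needs_snapshot = false;

    for (int i = 0; i < numParams; i++)
    {
        Oid     typinput;
        Oid     typioparam;

        getTypeInputInfo(param_types[i], &typinput, &typioparam);
        if (func_volatile(typinput) != PROVOLATILE_IMMUTABLE)
        {
            needs_snapshot = true;
            break;
        }
    }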
Re: On the need for a snapshot in exec_bind_message()
> > exec_bind_message()
> >     PushActiveSnapshot(GetTransactionSnapshot());
>
> > If there were no input functions, that needed this, nor reparsing or
> > reanalyzing needed, and we knew this up front, it'd be a huge win.
>
> Unfortunately, that's not the case, so I think trying to get rid of
> this call is a nonstarter.

Queries stop getting re-optimized after 5 times, unless better plans are to be had. In the absence of schema changes or a changing search path, why is the snapshot needed?

- Dan
On the need for a snapshot in exec_bind_message()
In particular:

    exec_bind_message()
        PushActiveSnapshot(GetTransactionSnapshot());

Suppressing this, I've achieved over 1.9M TXN's a second on select-only pgbench on a 48 core box. It is about 50% faster with this change. The CPU usage of GetSnapshotData drops from about 22% to 4.5%.

If there were no input functions that needed this, no reparsing or reanalyzing needed, and we knew this up front, it'd be a huge win. We could test for a number of conditions on the first parse/optimization of the query and set a flag to suppress this for subsequent executions. (A sketch is at the end of this mail.)

NOTE: In GetSnapshotData, because pgxact is declared volatile, the compiler will not reduce the following two "if" tests into a single test:

    if (pgxact->vacuumFlags & PROC_IN_LOGICAL_DECODING)
        continue;
    if (pgxact->vacuumFlags & PROC_IN_VACUUM)
        continue;

You can reduce the code path in the inner loop by coding this as:

    if (pgxact->vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
        continue;

I'm still working on quantifying any gain. Note it isn't just one L1 cache fetch and one conditional branch eliminated. Due to the update frequency of the pgxact cache line, for single-statement TXN's, there is a certain number of full cache misses, due to invalidation, that occur when a given pgxact is updated between the first fetch of vacuumFlags and the 2nd fetch.
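A sketch of the "test once, then skip" shape in exec_bind_message(); the snapshot_not_needed flag, and computing it at first parse/plan time, are my assumptions, not existing fields:

    /*
     * snapshot_not_needed would be set at first parse/plan time, after
     * verifying: all parameter input functions immutable, no replan or
     * reanalyze pending, no relevant invalidations since.
     */
    if (!psrc->snapshot_not_needed)             /* hypothetical flag */
        PushActiveSnapshot(GetTransactionSnapshot());

    /* ... bind parameters, fetch the cached plan, etc. ... */

    if (!psrc->snapshot_not_needed)
        PopActiveSnapshot();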
First steps to being a contributor
Having quit Amazon, where I was doing Postgres development, I've started looking at various things I might work on for fun. One thought is to start with something easy like the scalability of GetSnapshotData(). :-)

I recently found it interesting to examine performance while running near 1 million pgbench selects per second on a 48 core/96 HT Skylake box. I noticed that additional sessions trying to connect were timing out when they got stuck in ProcArrayAdd trying to get the ProcArrayLock in EXCLUSIVE mode. FYI, scale 1 with 2048 clients.

The question is whether the problem with GetSnapshotData() has reached a critical point, with respect to snapshot scaling, on the newest high end systems. I didn't have time to complete my analysis as I lost access to the hardware on my last day. It shouldn't cost me much more than about $6 per hour to do experiments on a 48 core system.

What I'd like is a shortcut to any of the current discussions of various ideas to improve snapshot scaling. I have some of my own ideas but want to review things before posting them.