Re: The ultimate extension hook.

2020-10-25 Thread Daniel Wood
> On 10/23/2020 9:31 AM Jehan-Guillaume de Rorthais  wrote:
> [...]
> * useless with encrypted traffic
> 
> So, +1 for such hooks.
> 
> Regards,

Ultimately, PostgreSQL is supposed to be extensible.
I don't see an API hook as some crazy idea, even if some may not like what 
I might want to use it for.  It can be useful for a number of things.

- Dan




Re: The ultimate extension hook.

2020-09-23 Thread Daniel Wood


> On 09/23/2020 9:26 PM Tom Lane  wrote:
> ...
> > The hook I'd like to see would be in the PostgresMain() loop
> > for the API "firstchar" messages.
> 
> What, to invent your own protocol?  Where will you find client libraries
> buying into that?

No API/client changes are needed for:
1) API tracing/filtering; or
3) custom SQL-like commands through a trivial modification to Simple Query 
'Q'.  Purely optional, as you'll see at the end.

Yes, (2) the API extension "case 'A'" could be used to roll one's own protocol.  
When pondering API hooking in general I thought of this too, but don't let it 
be a distraction.

> I'm not really convinced that any of the specific use-cases you suggest
> are untenable to approach via the existing function fastpath mechanism,
> anyway.

Certainly (3) is just a command-level way to execute a function instead of 
'select myfunc()'.  But 'select myfunc()' goes through the SQL machinery and SQL 
argument-type lookup and processing.  I like fast and direct things.  And (3) is 
so trivial to implement.

However, even fastpath doesn't provide a protocol hook function where tracing 
could be done.  If I had that alone, I could do my own 'Q' hook and do the 
"!cmd" processing in my extension, even if I only sold the idea on 
tracing/filtering.

We hook all kinds of things in PG.  Think big.  Why should the protocol 
processing not have a hook?  I'll bet some others will think of things I 
haven't even yet thought of that would leverage this.

- Dan Wood




The ultimate extension hook.

2020-09-23 Thread Daniel Wood
Hooks exist all over PG for extensions to cover various specific usages.

The hook I'd like to see would be in the PostgresMain() loop
for the API "firstchar" messages.

I started out just wanting the hook for the absolute minimum overhead to 
execute a function, even faster than fastpath, but in brainstorming with David 
Rowley other use cases became apparent.

API tracing within the engine.  I've heard of client tools for this.
API filtering.  Block/ignore manual checkpoints for instance.
API message altering.
Anything you want to hook into at the highest level well above ExecutorRun.

Originally I just wanted a lightweight mechanism to capture some system 
counters like /proc/stat without going through the SQL execution machinery.  
I'm picky about implementing stuff in the absolute fastest way.  :-)  But I 
think there are other practical things that I haven't even thought of yet.

There are a few implementation mechanisms which achieve slightly different 
possibilities:

1) The generic mechanism would let one or more API filters be installed to 
directly call functions in an extension.  There would be no SQL argument 
processing overhead for the specific function; you'd just pass it the 
StringInfoData 'msg' itself.  Multiple extensions might use the hook, so you'd 
need to rewind the StringInfo buffer.  Maybe the hook returns a boolean 
indicating either that no further processing of this message is needed or that 
we should fall through to the normal "switch (firstchar)" processing, as 
sketched below.

2) A new trivial API message for extensions, e.g. "case 'A':" in the 
switch (firstchar), which would call a single extension-installed function to do 
whatever I wanted based on the message payload.  And, yes, I know this can be 
done just using SQL; it is simply a variation.  But this would require client 
support, and I prefer the option below.

3) case 'Q':  /* simple query */
       if (pq_peekbyte() == '!' && APIHook != NULL)
       {
           (*APIHook)(msg);
           ...
           continue;
       }

I've used this last technique to do things like:

    if (!strncmp(query_string, "DIEDIEDIE", 9))
    {
        char   *np = NULL;

        *np = 1;                /* deliberately crash the server */
    }
    else if (!strncmp(query_string, "PING", 4))
    {
        static const char *pong = "PONG";

        pq_putmessage('C', pong, strlen(pong) + 1);
        send_ready_for_query = true;
        continue;
    }
    else if (...)

Then I can simply type PING into psql and get back a PONG.
Or during a stress test on a remote box I can execute the simple query 
"DIEDIEDIE" and crash the server.  I did this inline for experimentation before, 
but it would be nice if I had the mechanism to use a "statement" to invoke a 
hook function in an extension.  A single check for "!" in the 'Q' processing 
would allow user-defined commands in extensions.  The dispatcher would be in 
the extension; I just need the "!" check.
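
A hedged sketch of what the extension-side dispatcher for such "!" commands
could look like (the hook signature and the command names are made up;
pq_getmsgstring()/pq_putmessage() are the real calls):

static void
my_bang_dispatcher(StringInfo msg)
{
    const char *cmd = pq_getmsgstring(msg);     /* the raw "!..." query text */

    if (strncmp(cmd, "!PING", 5) == 0)
        pq_putmessage('C', "PONG", 5);          /* cheap liveness check */
    else if (strncmp(cmd, "!STAT", 5) == 0)
    {
        /* e.g. read /proc/stat and send the counters back, no SQL involved */
    }
    else
        ereport(ERROR, (errmsg("unrecognized ! command: %s", cmd)));
}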

Another example where ultimate performance might be a goal: if you are 
familiar with why redis/memcached/etc. exist, then imagine loading SQL results 
into a cache in an extension and executing, as a 'simple' query, something like
!LOOKUP 
and getting the value back faster than SQL could return it.

Before I prototype I want to get some feedback.  Why not have a hook at the API 
level?

Re: Reduce/eliminate the impact of FPW

2020-08-03 Thread Daniel Wood


> On 08/03/2020 8:26 AM Robert Haas  wrote:
...
> I think this is what's called a double-write buffer, or what was tried
> some years ago under that name.  A significant problem is that you
> have to fsync() the double-write buffer before you can write the WAL.

I don't think it does need to be fsync'ed before the WAL.  If the
log record has a FPW reference beyond the physical log EOF then we
don't need to restore the before image, because we haven't yet done
the dirty page write from the cache.  The before image only needs
to be flushed before the dirty page write.  Usually this will
already have been done.

> ... But for short transactions, such as those
> performed by pgbench, you'd probably end up with a lot of cases where
> you had to write 3 pages instead of 2, and not only that, but the
> writes have to be consecutive rather than simultaneous, and to
> different parts of the disk rather than sequential. That would likely
> suck a lot.

Wherever you write the before images, in the WAL or into a separate
file you would write the same number of pages.  I don't understand
the 3 pages vs 2 pages comment.

And, "different parts of the disk"???  I wouldn't enable the feature
on spinning media unless I had a dedicated disk for it.

NOTE:
In the 90's Informix called this the physical log.  Restoring it at
crash time restored physical consistency, after which redo/undo
recovery achieved logical consistency.  From their docs:
"If the before-image of a modified page is stored in the physical-log 
buffer, it is eventually flushed from the physical-log buffer to the physical 
log on disk. The before-image of the page plays a critical role in restoring 
data and fast recovery. For more details, see Physical-Log Buffer."

> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Reduce/eliminate the impact of FPW

2020-08-03 Thread Daniel Wood
I thought that the biggest reason for the pgbench RW slowdown during a 
checkpoint was the flood of dirty page writes increasing the COMMIT latency.  
It turns out that the documentation which states that FPW's start "after a 
checkpoint" really means after a CKPT starts.  And this is the really cause of 
the deep dip in performance.  Maybe only I was fooled... :-)

If we can't eliminate FPW's can we at least solve the impact of it?  Instead of 
writing the before images of pages inline into the WAL, which increases the 
COMMIT latency, write these same images to a separate physical log file.  The 
key idea is that I don't believe that COMMIT's require these buffers to be 
immediately flushed to the physical log.  We only need to flush these before 
the dirty pages are written.  This delay allows the physical before image IO's 
to be decoupled and done in an efficient manner without an impact to COMMIT's.

1. When we generate a physical image add it to an in memory buffer of before 
page images.
2. Put the physical log offset of the before image into the WAL record.  This 
is the current physical log file size plus the offset in the in-memory buffer 
of pages.
3. Set a bit in the bufhdr indicating this was done.
4. COMMIT's do not need to worry about those buffers.
5. Periodically flush the in-memory buffer and clear the bit in the BufHdr.
6. During any dirty page flushing if we see the bit set, which should be rare, 
then make sure we get our before image flushed.  This would be similar to our 
LSN based XLogFlush().
Do we need these before images for more than one CKPT?  I don't think so.  Do 
PITRs require before images, since a PITR is a continuous roll-forward from a 
restore?  Just some considerations.
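
A minimal sketch of step 6, by analogy with the LSN-based XLogFlush() done
today before a buffer write (the flag and the physical-log routines are
hypothetical names for the pieces described in steps 1-5):

static void
FlushBeforeImageIfNeeded(BufferDesc *buf)
{
    if (pg_atomic_read_u32(&buf->state) & BM_HAS_BEFORE_IMAGE)  /* the bit from step 3 */
    {
        /* flush the in-memory before-image buffer up to this page's offset */
        PhysicalLogFlush(buf->before_image_offset);     /* hypothetical field/routine */
        ClearBeforeImageFlag(buf);                      /* step 5 bookkeeping */
    }
    /* only now is the normal smgrwrite() of the dirty data page safe */
}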

Do I need to back this physical log up?  I likely(?) need to deal with 
replication.

Turning off FPW gives about a 20%, maybe more, boost on a pgbench TPC-B RW 
workload which fits in the buffer cache.  Can I get this 20% improvement with a 
separate physical log of before page images?

Doing IO's off on the side, but decoupled from the WAL stream, doesn't seem to 
impact COMMIT latency on modern SSD based storage systems.  For instance, you 
can hammer a shared data and WAL SSD filesystem with dirty page writes from the 
CKPT, at near the MAX IOPS of the SSD, and not impact COMMIT latency.  However, 
this presumes that the CKPT's natural spreading of dirty page writes across the 
CKPT target doesn't push too many outstanding IO's into the storage write Q on 
the OS/device.
NOTE: I don't believe the CKPT's throttling is perfect and I think a burst of 
dirty pages into the cache just before a CKPT might cause the Q to be flooded 
and this would then also further slow TPS during the CKPT.  But a fix to this 
is off topic from the FPW issue.

Thanks to Andres Freund for both making me aware of the Q depth impact on 
COMMIT latency and the hint that FPW might also be causing the CKPT slowdown.  
FYI, I always knew about FPW slowdown in general but I just didn't realize it 
was THE primary cause of CKPT TPS slowdown on pgbench.  NOTE: I realize that 
spinning media might exhibit different behavior.  And I did not say dirty 
page writing has NO impact on good SSDs.  It depends, and this is a subject 
for a later date, as I have a theory as to why I sometimes see sawtooth 
performance for pgbench TPC-B and sometimes a square wave, but I want to prove 
it first.



Wait profiling

2020-07-10 Thread Daniel Wood
After nearly 5 years, does something like the following exist yet?
https://www.postgresql.org/message-id/559d4729.9080...@postgrespro.ru

I feel that it would be useful to have the following two things: one PG 
enhancement and one standard extension.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 
where aid < 200;
...
Execution time: 23111.231 ms

62.6% BufFileRead
50.0% CPU
9.3% LWLock

It uses a PG timer to do this.

2) An extension-based function like select pg_wait_profile(pid, nSeconds, 
timerFrequency) to return the same thing for an already-running query.  Useful 
if you want to examine a long-running query that is taking too long.

Neither of these would go through the heavyweight pg_stat_activity machinery; 
they would directly poll the wait event in the PGPROC.  I've already coded the 
EXPLAIN option.
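
For what it's worth, the sampling side can be very small.  A hedged sketch
(names are mine, not necessarily how I coded it; a real version would arm a
repeating timer via RegisterTimeout()/enable_timeout_after(), and target_proc
would be MyProc for the EXPLAIN case or the PGPROC of the given pid):

static uint64 wait_samples[256];    /* one counter per wait-event class byte */
static uint64 total_samples;

static void
waitprofile_sample(void)
{
    /* wait_event_info is a plain uint32 in PGPROC; 0 means "on CPU" */
    uint32      wei = *((volatile uint32 *) &target_proc->wait_event_info);

    wait_samples[wei >> 24]++;      /* high byte is the wait-event class */
    total_samples++;
}

At the end of the query (or the requested nSeconds) the counters are turned
into percentages like the ones shown above.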

Furthermore, can't we just remove the following "IF" test from 
pgstat_report_wait_{start,end}?
    if (!pgstat_track_activities || !proc)
        return;
Just do the assignment of wait_event_info always.  We should use a dummy PGPROC 
assigned to MyProc until we assign the one in the procarray in shared memory.  
That way we don't need the "!proc" test.
About the only thing I'd want to verify is whether wait_event_info is on the 
same cache lines as anything else having to do with snapshots.
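
Concretely, the simplified inline would shrink to a single unconditional store
once MyProc always points at something, even if only a static placeholder
during early startup (a sketch of the proposal, not a patch; the init function
name is made up):

static PGPROC earlyStartupProc;

void
early_process_init(void)
{
    /* before the shared-memory PGPROC is attached, point MyProc at a dummy */
    MyProc = &earlyStartupProc;
}

static inline void
pgstat_report_wait_start(uint32 wait_event_info)
{
    /* a plain aligned 4-byte store; no branches, no NULL test */
    MyProc->wait_event_info = wait_event_info;
}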

If I recall correctly, the blank lines above that I've used to make this more 
readable will disappear.  :-(

- Dan Wood

Re: 'Invalid lp' during heap_xlog_delete

2019-12-10 Thread Daniel Wood
> On December 6, 2019 at 3:06 PM Andres Freund  wrote:
...
> > crash
> > smgrtruncate - Not reached
> 
> This seems like a somewhat confusing description to me, because
> smgrtruncate() is what calls DropRelFileNodeBuffers(). I assume what you
> mean by "smgrtruncate" is not the function, but the smgr_truncate()
> callback?

My mistake.  Yes, smgr_truncate()


> > Even if we reach the truncate we don't fsync it till the next
> > checkpoint.  So on filesystems which delay metadata updates a crash
> > can lose the truncate.
> 
> I think that's probably fine though. Leaving the issue of checkpoint
> completing inbetween the DropRelFileNodeBuffers() and the actual
> truncation aside, we'd have the WAL logged truncation truncating the
> file. I don't think it's unreasonable to except a filesystem that claims
> to support running without full_page_writes (I've seen several such
> claims turning out not to be true under load), to preserve either the
> original page contents or the new file size after a a crash.  If your
> filesystem doesn't, you really ought not to use it with FPW = off.

If the physical truncate doesn't occur in the seconds after the cache 
invalidation, nor the fsync within the minutes until the next checkpoint, you 
are NOT left with a torn page on disk.  You are left with an 'incorrect' page 
on disk: an older page, because the invalidation prevented the write of the 
most recent dirty page.  Redo doesn't like old, incorrect pages.
But, yes, full-page writes cover up this anomaly (to be generous).

> > Once we do the fsync(), for the truncate, the REDO read will return
> > BLK_NOTFOUND and the DELETE REDO attempt will be skipped.  WIthout the
> > fsync() or crashing before the truncate, the delete redo depends on
> > the most recent version of the page having been written by the
> > checkpoint.
> 
> We also need to correctly replay this on a standby, so I don't think
> just adding an smgrimmedsync() is the answer. We'd not replay the
> truncation on standbys / during PITR, unless I miss something. So we'd just
> end up with the same problem in slightly different situations.

I haven't mentioned to you that we have seen what appears to be the same
problem during PITR's depending on which base backup we start with.  I didn't
mention it because I haven't created a repro to prove it.  I simply suspect it.

> To me it sounds the fix here would be to rejigger the RelationTruncate()
> sequence for truncation of the main fork as follows:
> 
> 1) MyPgXact->delayChkpt = true
> 2) XLogInsert(XLOG_SMGR_TRUNCATE)
> 3) smgrtruncate() (which, as now, first does a DropRelFileNodeBuffers(),
>and then calls the smgr_truncate callback)
> 4) MyPgXact->delayChkpt = false
> 
> I'm not worried about the increased delayChkpt = true time. Compared
> with the frequency of RecordTransactionCommit() this seems harmless.

Seems reasonable.

> I'm inclined to think that we should make the XLogFlush() in the
> RelationNeedsWAL() branch of RelationTruncate()
> unconditional. Performing the truncation on the filesystem level without
> actually having persisted the corresponding WAL record is dangerous.

Yes, I also was curious about why it was conditional.

- Dan Wood




Re: 'Invalid lp' during heap_xlog_delete

2019-11-18 Thread Daniel Wood
>   I'll try to look into that  with lower page sizes for relation and WAL 
> pages.

The page size is totally unrelated to this bug.  When you repro the redo 
failure it is because the log record is being applied to an old page version.  
The correct newer page version never got written because of the truncate page 
invalidation.  The cause is not a torn write.

> What's that?

Azure PostgreSQL on Windows has proprietary mechanisms in its filesystem to 
guarantee that a write is atomic.

- Dan

> On November 18, 2019 at 4:58 AM Michael Paquier <mich...@paquier.xyz> wrote:
> 
> On Thu, Nov 14, 2019 at 07:38:19PM -0800, Daniel Wood wrote:
> > Sorry I missed one thing.  Turn off full page writes.
> 
> Hmm.  Linux FSes use typically 4kB pages.  I'll try to look into that
> with lower page sizes for relation and WAL pages.
> 
> > I'm running in an env. with atomic 8K writes.
> 
> What's that?
> --
> Michael
> 


Re: 'Invalid lp' during heap_xlog_delete

2019-11-14 Thread Daniel Wood
Sorry I missed one thing.  Turn off full page writes.  I'm running in an env. 
with atomic 8K writes.

> On November 12, 2019 at 6:23 PM Daniel Wood  wrote:
> 
> It's been tedious to get it exactly right but I think I got it.  FYI, I 
> was delayed because today we had yet another customer hit this:  'redo max 
> offset' error.  The system crashed as a number of autovacuums and a 
> checkpoint happened and then the REDO failure
> 


Re: 'Invalid lp' during heap_xlog_delete

2019-11-12 Thread Daniel Wood
It's been tedious to get it exactly right but I think I got it.  FYI, I was 
delayed because today we had yet another customer hit this:  'redo max offset' 
error.  The system crashed as a number of autovacuums and a checkpoint happened 
and then the REDO failure.

Two tiny code changes:

bufmgr.c, BufferSync():  add  pg_usleep(1000);  at the beginning of the function.

smgr.c, smgrtruncate():  add the following just after CacheInvalidateSmgr():

    if (forknum == MAIN_FORKNUM && nblocks == 0)
    {
        pg_usleep(4000000);             /* 4 seconds; pg_usleep() takes microseconds */
        { char *cp = NULL; *cp = 13; }  /* force a crash via a null-pointer write */
    }

Now for the heavily commented SQL repro.  It will require that you execute a 
checkpoint in another session when instructed by the repro.sql script.  You 
have 4 seconds to do that. The repro script explains exactly what must happen.

---
create table t (c char());
alter table t alter column c set storage plain;
-- Make sure there actually is an allocated page 0 and it is empty.
-- REDO Delete would ignore a non-existent page: XLogReadBufferForRedoExtended: return BLK_NOTFOUND;
-- Hopefully two row deletes don't trigger autovacuum and truncate the empty page.
insert into t values ('1'), ('2');
delete from t;
checkpoint; -- Checkpoint the empty page to disk.
-- This insert should be before the next checkpoint 'start'.  I don't want to replay it.
-- And, yes, there needs to be another checkpoint completed to skip its replay and start
-- with the replay of the delete below.
insert into t values ('1'), ('2');
-- Checkpoint needs to start in another session.  However, I need to stall the checkpoint
-- to prevent it from writing the dirty page to disk until I get to the vacuum below.
select 'Please start checkpoint in another session';
select pg_sleep(4);
-- Below is the problematic delete.
-- It succeeds now (online) because the dirty page has two rows on it.
-- However, with respect to crash recovery there are 3 possible scenarios depending on timing.
-- 1) The ongoing checkpoint might write the page with the two rows on it before
--    the deletes.  This leads to BLK_NEEDS_REDO for the deletes.  It works
--    because the page read from disk has the rows on it.
-- 2) The ongoing checkpoint might write the page just after the deletes.
--    In that case BLK_DONE will happen and there'll be no problem.  LSN check.
-- 3) The checkpoint can fail to write the dirty page because a vacuum can call
--    smgrtruncate->DropRelFileNodeBuffers() which invalidates the dirty page.
--    If smgrtruncate safely completes the physical truncation then BLK_NOTFOUND
--    happens and we skip the redo of the delete.  So the skipped dirty write is OK.
--    The problem happens if we crash after the 2nd checkpoint completes
--    but before the physical truncate 'mdtruncate()'.
delete from t;
-- The vacuum must complete DropRelFileNodeBuffers.
-- The vacuum must sleep for a few seconds to allow the checkpoint to complete
-- such that recovery starts with the Delete REDO.
-- We must crash before mdtruncate() does the physical truncate.  If the physical
-- truncate happens the BLK_NOTFOUND will be returned and the Delete REDO skipped.
vacuum t;



> On November 10, 2019 at 11:51 PM Michael Paquier <mich...@paquier.xyz> wrote:
> 
> 
> On Fri, Nov 08, 2019 at 06:44:08PM -0800, Daniel Wood wrote:
> > I repro'ed on PG11 and PG10 STABLE but several months old.
> > I looked at 6d05086 but it doesn't address the core issue.
> > 
> > DropRelFileNodeBuffers prevents the checkpoint from writing all
> > needed dirty pages for any REDO's that exist BEFORE the truncate.
> > If we crash after a checkpoint but before the physical truncate then
> > the REDO will need to replay the operation against the dirty page
> > that the Drop invalidated.
> 
> I am beginning to look at this thread more seriously, and I'd like to
> first try to reproduce that by myself.  Could you share the steps you
> used to do that?  This includes any manual sleep calls you may have
> added, the timing of the crash, manual checkpoints, debugger
> breakpoints, etc.  It may be possible to extract some more generic
> test from that.
> --
> Michael
> 


Re: 'Invalid lp' during heap_xlog_delete

2019-11-08 Thread Daniel Wood
I repro'ed on PG11 and PG10 STABLE but several months old.
I looked at 6d05086 but it doesn't address the core issue.

DropRelFileNodeBuffers prevents the checkpoint from writing all needed dirty 
pages for any REDO's that exist BEFORE the truncate.  If we crash after a 
checkpoint but before the physical truncate then the REDO will need to replay 
the operation against the dirty page that the Drop invalidated.

Teja Mupparti, an engineer I work with, suggested moving DropRelFileNodeBuffers 
to the bottom of smgrtruncate(), after the physical truncate.  Doing that, along 
with an fsync() after the truncate, seems to plug the hole.
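
In code the suggestion is roughly this reordering of the PG 10/11-era
smgrtruncate() (a sketch of the idea, not a reviewed patch):

void
smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
{
    /* physically shrink the file first ... */
    smgrsw[reln->smgr_which].smgr_truncate(reln, forknum, nblocks);

    /* ... and make the new length durable ... */
    smgrimmedsync(reln, forknum);

    /* ... before discarding the (possibly newer) page versions in shared buffers */
    DropRelFileNodeBuffers(reln->smgr_rnode, forknum, nblocks);

    CacheInvalidateSmgr(reln->smgr_rnode);
}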


> On November 8, 2019 at 5:39 PM Michael Paquier <mich...@paquier.xyz> wrote:
> 
> 
> On Fri, Nov 08, 2019 at 12:46:51PM -0800, Daniel Wood wrote:
> > Is DropRelFileNodeBuffers purely for performance or would there be
> > any correctness problems if not done.
> 
> On which version did you find that?  Only HEAD or did you use a
> version on a stable branch?  There has been some work done in this
> area lately as of 6d05086.
> --
> Michael
> 


'Invalid lp' during heap_xlog_delete

2019-11-08 Thread Daniel Wood
Page on disk has empty lp 1
* Insert into page lp 1

checkpoint START.  Redo eventually starts here.
** Delete all rows on page.
autovac truncate
DropRelFileNodeBuffers - dirty page NOT written. lp 1 on disk still empty
checkpoint completes
crash
smgrtruncate - Not reached

heap_xlog_delete reads page with empty lp 1 and the delete fails.

The checkpoint can not have yet written * or ** before DropRelFileNodeBuffers 
invalidates either of those dirty page versions for this to repro.

Even if we reach the truncate we don't fsync it till the next checkpoint.  So 
on filesystems which delay metadata updates a crash can lose the truncate.

Once we do the fsync(), for the truncate, the REDO read will return 
BLK_NOTFOUND and the DELETE REDO attempt will be skipped.
Without the fsync() or crashing before the truncate, the delete redo depends on 
the most recent version of the page having been written by the checkpoint.

Found during stress test and verified with pg_usleep's to test hypothesis.

Is DropRelFileNodeBuffers purely for performance, or would there be any 
correctness problems if it were not done?



Re: BTP_DELETED leaf still in tree

2019-10-10 Thread Daniel Wood
> On October 10, 2019 at 1:18 PM Peter Geoghegan  wrote:
> 
> 
> On Thu, Oct 10, 2019 at 12:48 PM Daniel Wood  wrote:
> > Update query stuck in a loop.  Looping in _bt_moveright().
> 
> You didn't say which PostgreSQL versions were involved, and if the
> database was ever upgraded using pg_upgrade. Those details could
> matter.

PG_VERSION says 10.  I suspect we are running 10.9.  I have no idea if 
pg_upgrade was ever done.

> > ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright
> >
> > Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED 
> > Leaf.
> 
> This should hardly ever happen -- it is barely possible for an index
> scan to land on a BTP_DELETED leaf page (or a half-dead page) when
> following a downlink in its parent. Recall that nbtree uses Lehman &
> Yao's design, so _bt_search() does not "couple" buffer locks on the
> way down. It would probably be impossible to observe this happening
> without carefully setting breakpoints in multiple sessions.
> 
> If this happens reliably for you, which it sounds like, then you can
> already assume that the index is corrupt.
> 
> > btpo_next is also DELETED but not in the tree.
> >
> > btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key 
> > value.
> >
> > Thus creating an endless loop in moveright.
> 
> Offhand, these other details sound normal. The side links are still
> needed in fully deleted (BTP_DELETED) pages. And, moving right and
> finding lesser key values (not greater key values) is normal with
> deleted pages, since page deletion makes the keyspace move right, not
> left (moving the keyspace left is how the source Lanin & Shasha paper
> does it, though).
> 
> Actually, I take it back -- the looping part is not normal. The
> btpo_next->btpo_next page has no business linking back to the
> original/first deleted page you mentioned. That's just odd.

btpo_next->btpo_next does NOT link directly back to the 1st deleted page.  It 
simply links to some in-use page which is 50 or so leaf pages back in the tree. 
 Eventually we do reach the two deleted pages again.  Only the first one is in 
the 'tree'.

> Can you provide me with a dump of the page images? The easiest way of
> getting a page dump is described here:

Customer data.  Looks like meaningless customer data (5 digit key values).  But 
too much paperwork. :-)

The hard part for me to understand isn't just why the DELETED leaf node is 
still referenced in the mid-tree node.
It is that the step which sets BTP_DELETED should also have linked its left and 
right siblings together.  But this hasn't been done.

Could the page already have been dirty, but because of "target != 
leafblkno" we didn't stamp a new LSN on it?  Could this allow us to write the 
DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and 
XLOG_BTREE_UNLINK_PAGE records being flushed?  Of course, I don't understand the 
"target != leafblkno".

In any case, thanks.

> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump
> 
> If I had to guess, I'd guess that this was due to a generic storage problem.
> 
> -- 
> Peter Geoghegan




pgbench prints suspect tps numbers

2019-06-24 Thread Daniel Wood
Short benchmark runs are bad if the runs aren't long enough to produce 
consistent results.

Having to do long runs because a benchmarking tool 'converges to reality' over 
time in reporting a tps number, due to miscalculation, is also bad.


I want to measure TPS at a particular connection count.  A fully cached Select 
Only pgbench produces fairly consistent numbers over short runs of a few 
minutes.

pgbench's "including connections establishing" number is polluted by fact that 
for many seconds the benchmark is running with less than the expected number of 
connections.  I thought that was why the 'excluding' number was also printed 
and I had been relying on that number.

pgbench's "excluding connections establishing" number seems to be a total 
garbage number which can be way way over the actual tps.  During a period when 
I had a bug causing slow connections I noticed a consistent value of about 100K 
tps over the measurement intervals.  At the end of a 5 minute run it reported 
450K tps!  There was no point anywhere during the benchmark that it ran 
anywhere near that number.


I had been using 'excluding' because it 'seemed' perhaps right in the past.  It 
was only when I got crazy numbers that I looked at the calculation and found:

tps_exclude = total->cnt /
    (time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));


The 'cnt' is the total across the entire run including the period when 
connections are ramping up.  I don't see how dividing by the total time minus 
the average connection time produces the correct result.
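
To make the distortion concrete with made-up numbers: in a 300 second run
where slow connections mean INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients
works out to 150 seconds, the denominator becomes 300 - 150 = 150, yet
total->cnt still counts every transaction executed during the whole 300
seconds, ramp-up included.  The reported "excluding" tps is then roughly
double anything the per-interval output ever showed, which is the same shape
as the 450K-vs-100K discrepancy above.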


Even without buggy slow connections, when connecting 1000 clients, I've 
wondered why the 'excluding' number seemed a bit higher than any given 
reporting-interval number over a 5 minute run.  I now understand why.  NOTE: 
When the system hits 100% CPU utilization (after about the first 100 
connections) on a fully cached select-only pgbench, further connections can 
struggle to get connected, which really skews the results.


How about a patch which offered the option to wait on an advisory lock as a 
mechanism to let the main thread delay the start of the workload after all 
clients have connected and entered a READY state?  This would produce a much 
cleaner number.

Re: Skylake-S warning

2018-10-03 Thread Daniel Wood
One other thought.  Could we update pgxact->xmin less often?  What would be the 
impact of this lower bound being lower than it would normally be with the 
existing scheme?  Yes, it needs to be moved forward "occasionally".

FYI, be careful with padding PGXACTs to a full cache line.  With 1024 clients 
you'd completely blow out the L1 cache.  The benefit with fewer than 200 
clients is dubious.  One problem with multiple (5) PGXACTs in a single cache 
line is that you may get a miss fetching xmin and then a second miss fetching 
xid, because an invalidation occurs between the two fetches when any of the 
other PGXACTs on the same cache line is updated.  I've found some improvement 
fetching both xmin and xid with a single 64-bit fetch; this prevents the 
invalidation between the two fetches.  Similarly, updating them with a single 
64-bit write helps.

Yes, this is uber tweaky.
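
The combined fetch is something like the following sketch (it assumes xid and
xmin are adjacent 32-bit fields at the start of PGXACT and a little-endian
layout for the shifts; not proposed as-is):

    uint64      packed;
    TransactionId xid;
    TransactionId xmin;

    /* one 8-byte load: no window for an invalidation between two 4-byte loads */
    memcpy(&packed, &pgxact->xid, sizeof(packed));
    xid  = (TransactionId) (packed & 0xFFFFFFFF);   /* first field (low half) */
    xmin = (TransactionId) (packed >> 32);          /* second field (high half) */

With optimization the memcpy collapses to a single aligned 8-byte load.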



Re: Skylake-S warning

2018-10-03 Thread Daniel Wood


> On October 3, 2018 at 3:55 PM Andres Freund  wrote:

> In the thread around 
> https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de
> I'd found doing more aggressive padding helped a lot.  Unfortunately I
> didn't pursue this further :(

Interesting.  Looks like you saw the same thing I see now.

> I really suspect we're going to have to change the layout of PGXACT data
> in a way that makes a contiguous scan possible. That'll probably require
> some uglyness because a naive "use first free slot" scheme obviously is
> sensitive to there being holes.  But I suspect that a scheme were
> backends would occasionally try to move themselves to an earlier slot
> wouldn't be too hard to implement.

I've also thought of this.  But someone could create a thousand connections and
then disconnect all but the first and last, creating a huge hole.  One thought
would be to have a bit array of used entries and to bias toward using free
lower-indexed entries.  Only two cache lines of 1024 bits would need to be
scanned to handle 1024 connections.  ProcArrayAdd() would just do an atomic
xchg to set the used bit.
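
A sketch of what that scan could look like (a hypothetical helper, not existing
PG code; 1024 slots are 16 uint64 words, i.e. two cache lines):

static int
claim_lowest_free_slot(pg_atomic_uint64 *used, int nwords)
{
    for (int w = 0; w < nwords; w++)
    {
        uint64      word = pg_atomic_read_u64(&used[w]);

        while (word != ~UINT64CONST(0))         /* any clear bit in this word? */
        {
            int         bit = __builtin_ctzll(~word);   /* lowest clear (free) bit */
            uint64      mask = UINT64CONST(1) << bit;

            if ((pg_atomic_fetch_or_u64(&used[w], mask) & mask) == 0)
                return w * 64 + bit;            /* we claimed it */
            word = pg_atomic_read_u64(&used[w]); /* lost the race; retry */
        }
    }
    return -1;                                  /* no free slot */
}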

I was even thinking of decomposing PGXACT into separate arrays of XIDs, XMINs,
etc.  Then have a bitmap of those entries where the XID is valid.  When you
set/clear your XID you'd also set/clear this bit.  With the select-only workload
the XIDs are all invalid, so scanning this bitmap of zeroed 64-bit words would
be quite efficient.  The vacuum and logical decoding flags could be done
directly as a bitmap.  Having an array of 1-byte subtxn counts creates another
opportunity for improvement: scanning for a non-zero value in environments which
use few subtransactions is efficient.  If subtransactions are used a lot, then
the repeated PGXACT cache line invalidations when setting nxids are offset by
the fact that you can grab 8 of them in a single fetch.

- Dan



Skylake-S warning

2018-10-03 Thread Daniel Wood
If you run benchmarks, or you are a customer currently impacted by 
GetSnapshotData() on high-end multi-socket systems, be wary of Skylake-S.


Performance differences of nearly 2X can be seen on select-only pgbench due to 
nothing else but unlucky choices for max_connections.  Scale 1000, 192 local 
clients on a 2-socket 48-core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz) system, 
pgbench -S.


Results from 5 runs varying max_connections from 400 to 405:

max conn      TPS
     400   677639
     401  1146776
     402  1122140
     403   765664
     404   671455
     405  1190277

...


perf top shows about 21% GetSnapshotData() with the good numbers and 48% with 
the bad numbers.


This problem is not seen on a 2-socket 32-core Haswell system.  Being a one-man 
show I lack some of the diagnostic tools to drill down further.  My suspicion 
is that the fact that Intel has lowered the L2 associativity from 8 (Haswell) to 
4 (Skylake-S) may be the cause.  The other possibility is that at higher core 
counts the shared 16-way inclusive associative L3 cache becomes insufficient.  
Perhaps that is why Intel has moved to an exclusive L3 cache on Skylake-SP.


If this is indeed just disadvantageous placement of structures/arrays in memory 
then you might also find that after upgrading a previous good choice for 
max_connections becomes a bad choice if things move around.


NOTE: the line

    int pgprocno = pgprocnos[index];

is where the big increase in time occurs in GetSnapshotData().  The pgprocnos 
array is largely read-only once all connections are established, easily fits 
in the L1 cache, and is not next to anything else causing invalidations.


NOTE2: It is unclear why PG needs to support over 64K sessions.  At about 10MB 
per backend (at the low end) the empty backends alone would consume 640 GB of 
memory!  Changing pgprocnos from int to short gives me the following results.


max conn      TPS
     400   780119
     401  1129286
     402  1263093
     403   887021
     404   679891
     405  1218118


While this change is significant on large Skylake systems it is likely just a 
trivial improvement on other systems or workloads.

GetSnapshotData round two(for me)

2018-09-24 Thread Daniel Wood
I was about to suggest creating a single shared snapshot instead of having 
multiple backends compute what is essentially the same snapshot.  Luckily, 
before posting, I discovered "Avoiding repeated snapshot computation" 
(https://www.postgresql.org/message-id/caboikdmsj4osxta7xbv2quhkyuo_4105fjf4n+uyroybazs...@mail.gmail.com)
from Pavan and "POC: Cache data in GetSnapshotData()" 
(https://www.postgresql.org/message-id/20150202152706.gd9...@alap3.anarazel.de)
from Andres.


Andres, could I get a short summary of the biggest drawback that may have 
prevented this from being released?  Before I saw this I had done my own 
implementation and saw some promising results (25% on 48 cores).  I do need to 
run some mixed RO and RW workloads to see how the invalidations of the shared 
copy at EOT time affect the results.  There are some differences in my 
implementation.  I chose, perhaps incorrectly, to busy-spin other users trying 
to get a snapshot while the first one in builds the shared copy.  My thinking is 
to not increase the latency of using the snapshot.  The improvement doesn't come 
from getting off the CPU by waiting, but from not having every CPU that is 
acquiring a snapshot read PGXACT cache lines that are constantly being dirtied.  
One backend can do the heavy lifting and the others can immediately jump on the 
shared copy once it is created.
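
For reference, the shape of my implementation is roughly this (the names and
the slot layout are mine, boiled down to just the build/spin handshake):

#define SNAP_INVALID   0
#define SNAP_BUILDING  1
#define SNAP_VALID     2

    uint32      expected = SNAP_INVALID;

    if (pg_atomic_compare_exchange_u32(&shared->state, &expected, SNAP_BUILDING))
    {
        BuildSnapshotFromProcArray(&shared->snap);      /* scan the PGXACTs once */
        pg_atomic_write_u32(&shared->state, SNAP_VALID);
    }
    else
    {
        /* busy-spin on purpose: stay on-CPU so snapshot latency doesn't grow */
        while (pg_atomic_read_u32(&shared->state) != SNAP_VALID)
            ;
    }
    CopySharedSnapshot(&shared->snap, snapshot);        /* everyone copies the shared one */

At end of transaction the committing backend resets state to SNAP_INVALID so
the next GetSnapshotData() call rebuilds it; that invalidation cost is what the
mixed RO/RW runs need to measure.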


And something else quite weird:  As I was evolving a standard setup for 
benchmark runs and getting baselines I was getting horrible numbers 
sometimes (680K) and other times I'd get over 1 million QPS.  I was thinking I 
had a bad machine.  What I found was that even though I was running a fixed 192 
clients, I had set max_connections to 600 on some runs and 1000 on others.  
Here is what I see running select-only scale 1000 pgbench with 192 clients on a 
48-core box (2 sockets) using different values for max_connections:


 200 tps = 1092043
 250 tps = 1149490
 300 tps =  732080
 350 tps =  719611
 400 tps =  681170
 450 tps =  687527
 500 tps =  859978
 550 tps =  927161
 600 tps = 1092283
 650 tps = 1154916
 700 tps = 1237271
 750 tps = 1195968
 800 tps = 1162221
 850 tps = 1140626
 900 tps =  749519
 950 tps =  648398
1000 tps =  653460


This is on the base 12x codeline.  The only thought I've had so far is that 
each PGXACT in use (192) is being scattered across the full set of 
max_connections, instead of being physically contiguous in the first 192 slots. 
This would cause more cache lines to be scanned.  It doesn't make a lot of 
sense given that it goes back up again from 500, peaking at 700.  Also, this is 
after a fresh restart, so the procs in the freelist shouldn't have been 
scrambled yet in terms of ordering.


NOTE: I believe you'll only see this huge difference on a dual socket machine.  
It'd probably only take 30 minutes or so on a big machine to confirm with a 
couple of few minute runs at different values for max_connections.  I'll be 
debugging this soon.  But I've been postponing it while experimenting with my 
shared snapshot code.






Re: On the need for a snapshot in exec_bind_message()

2018-09-05 Thread Daniel Wood
> > Queries stop getting re-optimized after 5 times, unless better plans are to 
> > be had.  In the absence of schema changes or changing search path why is 
> > the snapshot needed?
> 
> The snapshot has little to do with the query plan, usually.  It's about
> what view of the database the executed query will see, and particularly
> about what view the parameter input functions will see, if they look.
> 
> You could maybe argue that immutable input functions shouldn't need
> snapshots, but there are enough not-immutable ones that I don't think
> that gets you very far.  In any case, we'd still need a snapshot for
> query execution.  The set of queries that could possibly never need
> a snapshot at all is probably not large enough to be interesting.

I would think that the vast majority of bind variables in customer queries 
would involve built-in data types whose input functions are immutable.  As far 
as the "view of the database the executed query will see" goes, either the 
database itself changes, which should trigger catcache invalidations that can 
be accounted for, or the "view" changes, as with search_path, which can be 
tested for.  I would think the optimization would be applicable most of the 
time.  What matters isn't the number of obscure user-created non-immutable 
input functions that exist; it is the frequency with which immutable input 
functions are used in real apps.  This is a 50% performance improvement in 
point lookups on larger boxes.



Re: On the need for a snapshot in exec_bind_message()

2018-09-05 Thread Daniel Wood
> > exec_bind_message()
> > PushActiveSnapshot(GetTransactionSnapshot());
> 
> > If there were no input functions, that needed this, nor reparsing or
> > reanalyzing needed, and we knew this up front, it'd be a huge win.
> 
> Unfortunately, that's not the case, so I think trying to get rid of
> this call is a nonstarter.

Queries stop getting re-optimized after 5 times, unless better plans are to be 
had.  In the absence of schema changes or changing search path why is the 
snapshot needed?

- Dan



On the need for a snapshot in exec_bind_message()

2018-09-05 Thread Daniel Wood
In particular:
exec_bind_message()
PushActiveSnapshot(GetTransactionSnapshot());


Suppressing this I've achieved over 1.9M TXNs a second on select-only pgbench 
on a 48-core box.  It is about 50% faster with this change.  The CPU usage of 
GetSnapshotData drops from about 22% to 4.5%.

If there were no input functions that needed this, no reparsing or 
reanalyzing needed, and we knew this up front, it'd be a huge win.  We could 
test for a number of conditions on the first parse/optimization of the query 
and set a flag to suppress this for subsequent executions.
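
In code form the idea is just this (a sketch; 'snapshot_not_needed' is a
hypothetical flag that would be computed once, at first parse/plan time, when
every parameter input function is immutable and no reparse/replan is pending):

    bool        need_snapshot = !psrc->snapshot_not_needed;

    if (need_snapshot)
        PushActiveSnapshot(GetTransactionSnapshot());

    /* ... parameter binding, portal setup ... */

    if (need_snapshot)
        PopActiveSnapshot();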


NOTE:

In GetSnapshotData, because pgxact is declared volatile, the compiler will not 
reduce the following two if tests into a single test:

    if (pgxact->vacuumFlags & PROC_IN_LOGICAL_DECODING)
        continue;

    if (pgxact->vacuumFlags & PROC_IN_VACUUM)
        continue;

You can reduce the code path in the inner loop by coding this as:

    if (pgxact->vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
        continue;


I'm still working on quantifying any gain.  Note it isn't just one L1 cache 
fetch and one conditional branch eliminated.  Due to the update frequency of 
the pgxact cache line for single-statement TXNs, there are a certain number 
of full cache misses, due to invalidation, that occur when a given pgxact is 
updated between the first fetch of vacuumFlags and the second fetch.

First steps to being a contributor

2018-08-27 Thread Daniel Wood
Having quit Amazon, where I was doing Postgres development, I've started 
looking at various things I might work on for fun. One thought is to start with 
something easy like the scalability of GetSnapshotData(). :-)


I recently found it interesting to examine performance while running near 1 
million pgbench selects per sec on a 48 core/96 HT Skylake box. I noticed that 
additional sessions trying to connect were timing out when they got stuck in 
ProcArrayAdd trying to get the ProcArrayLock in EXCLUSIVE mode. FYI, scale 
1 with 2048 clients.


The question is whether it is possible that the problem with GetSnapshotData() 
has reached a critical point, with respect to snapshot scaling, on the newest 
high end systems.


I didn't have time to complete my analysis as I lost access to the hardware on 
my last day. It shouldn't cost me much more than about $6 per hour to do 
experiments on a 48 core system.


What I'd like is a shortcut to any of the current discussions of various ideas 
to improve snapshot scaling. I have some of my own ideas but want to review 
things before posting them.