Re: POC: Extension for adding distributed tracing - pg_tracing

2024-03-12 Thread Craig Ringer
On Tue, 12 Mar 2024, 23:45 Anthonin Bonnefoy, <
anthonin.bonne...@datadoghq.com> wrote:

>
> > - I don't think it's a good idea to do memory allocations in the middle
> of a
> > PG_CATCH. If the error was due to out-of-memory, you'll throw another
> error.
> Good point. I was wondering what were the risks of generating spans
> for errors. I will try to find a better way to handle this.
>

The usual approach is to use pre-allocated memory. It must actually have been
written to (usually zeroed), or it may be lazily allocated only on page
fault. And it can't be copy-on-write memory allocated once at postmaster
startup and then inherited by fork.

That imposes an overhead for every single postgres backend. So maybe
there's a better solution.
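For concreteness, the pre-allocation approach looks something like this (a
minimal sketch, not from the pg_tracing patch; the buffer name and size are
invented):

#include "postgres.h"
#include "utils/memutils.h"

#define ERROR_SPAN_BUF_SIZE 8192

static char *error_span_buf;

/*
 * Called once per backend, after fork - not from the postmaster.
 * MemoryContextAllocZero() writes (zeroes) the whole allocation, so the pages
 * are actually faulted in here rather than being lazily allocated, and since
 * this runs in the backend the memory is not copy-on-write memory inherited
 * from the postmaster.
 */
static void
reserve_error_span_buf(void)
{
	error_span_buf = MemoryContextAllocZero(TopMemoryContext,
											ERROR_SPAN_BUF_SIZE);
}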


Re: POC: Extension for adding distributed tracing - pg_tracing

2023-12-11 Thread Craig Ringer
Hi all

Just saw this thread.

I cooked up a PoC distributed tracing support in Pg years ago as part of
the 2ndQuadrant BDR project.

I used GUCs to set the `opentelemetry_trace_id` and
`opentelemetry_span_id`. These can be sent as protocol-level messages by
the client driver if the client driver has native trace integration
enabled, in which case the user doesn't need to see or care. And for other
drivers, the application can use `SET` or `SET LOCAL` to assign them.

This approach avoids the need to rewrite SQL or give special meaning to SQL
comments.
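For illustration, registering such settings as custom GUCs from an extension
would look roughly like this (a sketch only; custom GUC names must be
qualified, so it uses dotted names rather than the exact names above, and the
real BDR-era code is not reproduced here):

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static char *otel_trace_id = NULL;
static char *otel_span_id = NULL;

void _PG_init(void);

void
_PG_init(void)
{
	/* Settable per session/transaction via SET / SET LOCAL or the protocol */
	DefineCustomStringVariable("opentelemetry.trace_id",
							   "Trace id propagated from the client",
							   NULL,
							   &otel_trace_id,
							   "",
							   PGC_USERSET,
							   0,
							   NULL, NULL, NULL);

	DefineCustomStringVariable("opentelemetry.span_id",
							   "Parent span id propagated from the client",
							   NULL,
							   &otel_span_id,
							   "",
							   PGC_USERSET,
							   0,
							   NULL, NULL, NULL);
}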

Within the server IIRC I used the existing postgres resource owner and
memory context stacks to maintain some internal nested traces.

My implementation used the C++ opentelemetry client library to emit traces,
so it was never going to be able to land in core. But IIRC the current
BDR/PGD folks are now using an OpenTelemetry sidecar process instead. I
think they send it UDP packets to emit metrics and events. Petr or Markus
might be able to tell you more about how they're doing that.

I'd love to see OpenTelemetry integration in Pg.


Re: POC: Better infrastructure for automated testing of concurrency issues

2022-10-17 Thread Craig Ringer
On Tue, 23 Feb 2021 at 08:09, Peter Geoghegan  wrote:

> On Tue, Dec 8, 2020 at 2:42 AM Alexander Korotkov 
> wrote:
> > Thank you for your feedback!
>
> It would be nice to use this patch to test things that are important
> but untested inside vacuumlazy.c, such as the rare
> HEAPTUPLE_DEAD/tupgone case (grep for "Ordinarily, DEAD tuples would
> have been removed by..."). Same is true of the closely related
> heap_prepare_freeze_tuple()/heap_tuple_needs_freeze() code.
>

That's what the PROBE_POINT()s functionality I referenced is for, too.

The proposed stop events feature has finer grained control over when the
events fire than PROBE_POINT()s do. That's probably the main limitation in
the PROBE_POINT()s functionality right now - controlling it at runtime is a
bit crude unless you opt for using a C test extension or a systemtap
script, and both those have other downsides.

On the other hand, PROBE_POINT()s are lighter weight when not actively
turned on, to the point where they can be included in production builds to
facilitate support and runtime diagnostics. They interoperate very nicely
with static tracepoint markers (SDTs), the TRACE_POSTGRESQL_FOO(...) stuff,
so there's no need for yet another separate set of debug markers scattered
through the code. They can perform a wider set of actions useful for
testing and diagnostics - PANIC the current backend, self-deliver an
arbitrary signal, force a LOG message, introduce an interruptible or
uninterruptible sleep, send a message to the client if any (handy for
regress tests), or fire an extension-defined callback function.

I'd like to find a way to get the best of both worlds if possible.

Rather than completely sidetrack the thread on this patch I posted the
PROBE_POINT()s patch on a separate thread here.


Re: [PATCH] ProcessInterrupts_hook

2021-10-11 Thread Craig Ringer
On Sat, 2 Oct 2021 at 01:24, Jaime Casanova 
wrote:

> On Tue, Jun 29, 2021 at 01:32:26PM +0800, Craig Ringer wrote:
> > On Sat, 20 Mar 2021 at 03:46, Tom Lane  wrote:
> >
> > > Robert Haas  writes:
> > > > On Fri, Mar 19, 2021 at 3:25 PM Tom Lane  wrote:
> > > >> I'm not very comfortable about the idea of having the postmaster set
> > > >> child processes' latches ... that doesn't sound terribly safe from
> the
> > > >> standpoint of not allowing the postmaster to mess with shared memory
> > > >> state that could cause it to block or crash.  If we already do that
> > > >> elsewhere, then OK, but I don't think we do.
> > >
> > > > It should be unnecessary anyway. We changed it a while back to make
> > > > any SIGUSR1 set the latch 
> > >
> > > Hmm, so the postmaster could send SIGUSR1 without setting any
> particular
> > > pmsignal reason?  Yeah, I suppose that could work.  Or we could recast
> > > this as being a new pmsignal reason.
> > >
> >
> > I'd be fine with either way.
> >
> > I don't expect to be able to get to working on a concrete patch for this
> > any time soon, so I'll be leaving it here unless someone else needs to
> pick
> > it up for their extension work. The in-principle agreement is there for
> > future work anyway.
>
> Hi Craig,
>
> There is still a CF entry for this. Should we close it as withdrawn? or
> maybe RwF?
>

I'm not going to get time for it now, so I think marking it withdrawn is
reasonable.

I think it's well worth doing and Tom seems to think it's not a crazy idea,
but I'm no longer working on the software that needed it, and don't see a
lot of other people calling for it, so it can wait until someone else needs
it.


Re: Windows crash / abort handling

2021-10-06 Thread Craig Ringer
On Wed, 6 Oct 2021, 03:30 Andres Freund,  wrote:

> Hi,
>
>
> My first attempt was to try to use the existing crashdump stuff in
> pgwin32_install_crashdump_handler(). That's not really quite what I want,
> because it only handles postmaster rather than any binary, but I thought
> it'd
> be a good start. But outside of toy situations it didn't work for me.
>

Odd. It usually has for me, and definitely not limited to the postmaster.
But it will fall down for OOM, smashed stack, and other states where
in-process self-debugging is likely to fail.

>
> A bunch of debugging later I figured out that the reason neither the
> SetUnhandledExceptionFilter() nor JIT debugging works is that the
> SEM_NOGPFAULTERRORBOX in the
>   SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
> we do in startup_hacks() prevents the paths dealing with crashes from being
> reached.
>

Right.

I patch this out when working on Windows because it's a real pain.

I keep meaning to propose that we remove this functionality entirely. It's
obsolete. It was introduced back in the days when DrWatson.exe ("Windows
error reporting") used to launch an interactive prompt asking the user what
to do when a process crashed. This would block the crashed process from
exiting, making everything grind to a halt until the user interacted with
the UI. Even for a service process.

Not fun on a headless or remote server.

These days Windows handles all this a lot more sensibly, and blocking crash
reporting is quite obsolete and unhelpful.

I'd like to just remove it.

If we can't do that I'd like to at least make it optional.

> Alternatively we can generate "minidumps" [6], but that doesn't appear to
> be more
> helpful for CI purposes at least - all we'd do is to create a backtrace
> using
> the same tool. But it might be helpful for local development, to e.g.
> analyze
> crashes in more detail.
>

They're immensely helpful when a bt isn't enough, but BTs are certainly the
first step for CI use.


Re: [PATCH] More docs on what to do and not do in extension code

2021-08-29 Thread Craig Ringer
On Tue, 29 Jun 2021 at 13:30, Craig Ringer 
wrote:

> Laurenz,
>
> Thanks for your comments. Sorry it's taken me so long to get back to you.
> Commenting inline below on anything I think needs comment; other proposed
> changes look good.
>

I'm not going to get back to this anytime soon.

If anybody wants to pick it up that's great. Otherwise at least it's there
in the mailing lists to search.


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-27 Thread Craig Ringer
On Fri, 27 Aug 2021 at 08:59, Julien Rouhaud  wrote:

> On Fri, Aug 27, 2021 at 3:42 AM Magnus Hagander 
> wrote:
> >
> > Yeah, but that does move the problem to the other side doesn't it? So
> > if you (as a pure test of course) were to remove the variable
> > completely from the included header and just declare it manually with
> > PGDLLSPEC in your file, it should work?
> >
> > Ugly as it is, I wonder if there's a chance we could just process all
> > the headers at install times and inject the PGDLLIMPORT. We know which
> > symbols it is on account of what we're getting in the DEF file.
> >
> > Not saying that's not a very ugly solution, but it might work?
>
> It's apparently not enough.  I tried with autovacuum_max_workers GUC,
> and it still errors out.
> If I add a PGDLLIMPORT, there's a link error when trying to access the
> variable:
> error LNK2019: unresolved external symbol __imp_autovacuum_max_workers
> referenced in function...
>
> If I use PGDLLEXPORT I simply get:
> error LNK2001: unresolved external symbol aytovacuum_max_workers
>

It works, but you can't use PGDLLIMPORT, you have to implement the import
yourself without the help of __declspec(dllimport) .

Where you want

autovacuum_max_workers

you must instead write

*((int*)__imp_autovacuum_max_workers)

Here's the comment I wrote on the topic in something I was working on:

/*
 * On Windows, a symbol is not accessible outside the executable or shared
 * library (PE object) that it is defined in unless explicitly exported in
 * the DLL interface.
 *
 * It must then be explicitly imported by objects that use it; Windows
 * doesn't do ELF-style fixups.
 *
 * The export part is usually accomplished by a __declspec(dllexport)
 * annotation on the symbol or a .DEF file supplied as linker input.
 * Postgres uses the .DEF approach, auto-exporting all symbols using
 * src\tools\msvc\gendef.pl . Internally this hides "symname" from the DLL
 * interface and instead generates an export symbol "__imp_symname" that is
 * a pointer to the value of "symname".
 *
 * The import part is usually done with the __declspec(dllimport)
 * annotation on the symbol. The PGDLLIMPORT macro expands to
 * __declspec(dllimport) when postgres headers are included during extension
 * compilation. But not all the symbols that pglogical needs are annotated
 * with PGDLLIMPORT. Attempting to access a symbol that is not so-annotated
 * will fail at link time with an error like
 *
 *     error LNK2001: unresolved external symbol criticalSharedRelcachesBuilt
 *
 * Because of gendef.pl, postgres still exports the symbol even if it isn't
 * annotated PGDLLIMPORT. So we can just do the shorthand that
 * __declspec(dllimport) does for us in the preprocessor instead: replace
 * each symbol with its __imp_symbol indirection and dereference it.
 *
 * There's one wrinkle in this though. MSVC doesn't generate a definition
 * for a global data symbol that is neither initialized nor explicitly marked
 * __declspec(dllexport). gendef.pl will think such symbols are references
 * to a symbol defined in another object file and will skip them without
 * emitting a DATA entry for them in the DEF file, so no __imp_ stub is
 * generated in the DLL interface. We can't use (*__imp_symbolname) if
 * there's no import stub.
 *
 * If they're GUCs, we can round-trip them through their text values
 * to read them. Nothing should ever be assigning to GUC storage and there's
 * no reason to take the address of GUC storage, so this should work fine,
 * albeit slower. If we find any that aren't GUCs we're in trouble but so far
 * there haven't been any.
 *
 * See also:
 *
 * - https://docs.microsoft.com/en-us/cpp/build/importing-data-using-declspec-dllimport
 * - https://docs.microsoft.com/en-us/cpp/build/importing-using-def-files
 * - https://docs.microsoft.com/en-us/cpp/build/exporting-from-a-dll-using-def-files
 * - https://docs.microsoft.com/en-us/cpp/build/determining-which-exporting-method-to-use
 */
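For what it's worth, the manual import the comment describes boils down to
something like this (a sketch, assuming an x64 MSVC build, where the plain C
identifier __imp_autovacuum_max_workers binds to the import-table slot that
holds the variable's address):

/* The DLL interface exposes a pointer to the variable, not the variable. */
extern int *__imp_autovacuum_max_workers;

/* Substitute the dereference wherever the symbol would normally be used. */
#define autovacuum_max_workers (*__imp_autovacuum_max_workers)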


This is gruesome and I hadn't planned to mention it, but now that someone
has noticed that the .DEF file exports the symbols, I guess it does no harm.

So can we just fix PGDLLIMPORT now?


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-27 Thread Craig Ringer
On Thu, 26 Aug 2021 at 01:51, Alvaro Herrera 
wrote:

> On 2021-Aug-25, Magnus Hagander wrote:
>
> > The thing we need the PGDLLIMPORT definition for is to *import* them
> > on the other end?
>
> Oh ... so modules that are willing to cheat can include their own
> declarations of the variables they need, and mark them __declspec
> (dllimport)?
>
>
Damn. I was hoping nobody would notice that.

I do exactly that in some extensions to work around some of this mess, but
it is quite awkward and has its limitations.


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-27 Thread Craig Ringer
On Wed, 25 Aug 2021 at 22:36, Magnus Hagander  wrote:

> On Wed, Aug 25, 2021 at 4:06 PM Robert Haas  wrote:
> >
> > On Tue, Aug 24, 2021 at 5:06 PM Chapman Flack 
> wrote:
> > > The thing is, I think I have somewhere a list of all the threads on
> this
> > > topic that I've read through since the first time I had to come with
> my own
> > > hat in hand asking for a PGDLLIMPORT on something, years ago now, and
> > > I don't think I have ever seen one where it was as uncontroversial
> > > as you suggest.
> >
> > It does tend to be controversial, but I think that's basically only
> > because Tom Lane has reservations about it. I think if Tom dropped his
> > opposition to this, nobody else would really care. And I think that
> > would be a good thing for the project.
>
>
> I have only one consideration about it, and that's a technical one :)
>
> Does this in some way have an effect on the size of the binary and/or
> the time it takes to load it?
>

On *nix, no.

On Windows, very, very minimally.

We *should* be looking into giving private symbols that we can't make
non-static hidden visibility at link time, i.e. making them DSO-private.
This can have a huge impact on link-time optimisation and inlining.

But doing so is quite orthogonal to the matter of fixing a linkage issue on
Windows. By making select symbols hidden we'd be *reducing* the exposed set
of functions and data symbols in a disciplined and progressive way on all
platforms. Useful but different.


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-27 Thread Craig Ringer
On Wed, 25 Aug 2021 at 03:13, Robert Haas  wrote:

> On Tue, Aug 24, 2021 at 2:52 PM Chapman Flack 
> wrote:
> > I don't think that's true of the second proposal in [0]. I don't foresee
> > a noticeable runtime cost unless there is a plausible workload that
> > involves very frequent updates to GUC settings that are also of interest
> > to a bunch of extensions. Maybe I'll take a stab at a POC.
>
> I'm not sure I fully understand that proposal, but I find it hard to
> believe that we would seriously consider replacing every direct GUC
> reference in the backend with something that goes through an API. Even
> if didn't hurt performance, I think it would uglify the code a whole
> lot.
>

It'd probably have to be something that resolves the GUC storage addresses
at compile-time or once at runtime, if it's going to be used by core code.
While some code doesn't hit a lot of GUCs, some *really* hammers some
common GUCs.

There are various issues with cache lines and pointer chasing that are
beyond my low-level fu at work here. Adding a level of pointer indirection
can be very expensive in the wrong situations.

So you're probably looking at some kind of mess with token pasting, macros
and static inlines. Ew.
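Something along these lines, purely as an illustration of the shape (none of
these names exist in postgres; work_mem is just a stand-in for a hot GUC):

#include "postgres.h"
#include "miscadmin.h"		/* declares work_mem */

/* Generate a zero-overhead inline accessor bound to the GUC's storage. */
#define DECLARE_GUC_ACCESSOR(type, name) \
	static inline type guc_get_##name(void) { return name; }

DECLARE_GUC_ACCESSOR(int, work_mem)

/* Callers would then write guc_get_work_mem() instead of touching work_mem. */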


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-23 Thread Craig Ringer
On Mon, 23 Aug 2021 at 22:45, Julien Rouhaud  wrote:

> On Mon, Aug 23, 2021 at 10:22 PM Tom Lane  wrote:
> >
> > Bruce Momjian  writes:
> > > On Mon, Aug 23, 2021 at 10:15:04AM -0400, Tom Lane wrote:
> > >> By that argument, *every* globally-visible variable should be marked
> > >> PGDLLIMPORT.  But the mere fact that two backend .c files need to
> access
> >
> > > No, Julien says 99% need only the GUCs, so that is not the argument I
> am
> > > making.
> >
> > That's a claim unbacked by any evidence that I've seen.  More to
> > the point, we already have a mechanism that extensions can/should
> > use to read and write GUC settings, and it's not direct access.
>
> I clearly didn't try all single extension available out there.  It's
> excessively annoying to compile extensions on Windows, and also I
> don't have a lot of dependencies installed so there are some that I
> wasn't able to test since I'm lacking some other lib and given my
> absolute lack of knowledge of that platform I didn't spent time trying
> to install those.
>
>
Plenty of them are closed source, too.

While that's not the Pg community's problem, as such, it'd be nice not to
arbitrarily break them all for no actual benefit.


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-23 Thread Craig Ringer
On Tue, 24 Aug 2021 at 05:08, Chapman Flack  wrote:

> On 08/23/21 14:30, Robert Haas wrote:
> > it seems likely that this proposed
> > API would have the exact same problem, because it would let people do
> > exactly the same thing. And, going through this proposed API would
> > still be significantly more expensive than just accessing the bare
> > variables, because you'd at least have to do some kind of lookup based
> > on the GUC name
>
> I think the API ideas in [0] would not let people do exactly the same
> thing.
>
> They would avoid exposing the bare variables to overwrite. Not that
> there has been any plague of extensions going and overwriting GUCs,
> but I think in some messages on this thread I detected a sense that
> in principle it's better if an API precludes it, and that makes sense
> to me.
>
> The second idea also avoids the expense of name-based lookup (except once
> at extension initialization), and in fact minimizes the cost of obtaining
> the current value when needed, by slightly increasing the infrequent cost
> of updating values.
>

I'd be generally in favour of something that reduced our reliance on the
current chaotic and inconsistent jumble of globals which are a semi-random
combination of compilation-unit-scoped, globally-scoped-except-on-Windows
and globally scoped.

Tackling GUCs would be a good start. Especially given the number of GUCs
where the actual GUC value isn't the state that anyone should be
interacting with directly since a hook maintains the "real" state derived
from the GUC storage.

And preventing direct writes to GUCs seems like a clearly correct thing to
do.

Some consideration of performance would be important for some of the hotter
GUCs, of course, but most extensions won't be hammering GUC access much.


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-23 Thread Craig Ringer
On Tue, 24 Aug 2021 at 13:21, Craig Ringer 
wrote:

>
> There is not even the thinnest pretense of Pg having a dedicated extension
> API or any sort of internal vs public API separation.
>

Oh, and if we do want such a separation, we'll need to introduce a MUCH
lower-pain-and-overhead way to get related patches in. Otherwise it'll take
decades to add any necessary function wrappers for currently exported data
symbols, add necessary hooks, wrap or hide internal symbols and state, etc.

But ... what is the actual goal and expected outcome of such a hypothetical
public/private API separation?

It won't help meaningfully with server maintenance: We already break
extensions freely in major releases, and sometimes even minor releases. We
don't make any stable API promise at all. So any argument that it would
make maintenance of the core server easier is weak at best.

It won't help protect server runtime stability: The server is written in C,
and makes heavy use of non-opaque / non-hidden types. Many of which would
not be practical to hide without enormous refactoring if at all. Writing
any sort of "safe" C API is very difficult even when the exposed
functionality is very narrow and well defined. Even then such an API can
only help prevent inadvertent mistakes, since C programs are free to grovel
around in memory. Look at the mess with EXPORT_SYMBOL_GPL in the kernel for
just how ugly that can get. So I don't think there's any realistic way to
claim that narrowing the exposed API surface would make it safer to load
and run extensions that the user has not separately checked and reviewed or
obtained from a trusted source with robust testing practices. Certainly it
offers no benefit at all against a bad actor.

It won't make it safer to use untrusted extensions.

What will it do? Not much, in the short term, except cripple existing
extensions or add a pile of questionably useful code annotations. The only
real benefits I see are some improvement in link-time optimization and
export symbol table size. Both of which are nice, but IMO not worth the
pain by themselves for a pure C program. In C++, with its enormous symbol
tables it's absolutely worth it. But not really for Pg.

To be clear, I actually love the idea of starting to define a solid public
API, with API, ABI and semantic promises and associated tests. But to say
it's a nontrivial amount of work is an enormous understatement. And unless
done by an existing committer who is trusted to largely define a
provisional API without bike-shedding and arguing the merits of every
single part of it, it's nearly impossible to do with the way Pg is
currently developed.

It's completely beyond me why it's OK to export all function symbols on
Windows, but not all data symbols. Or why it's OK to export all data
symbols on *nix, but not on Windows.


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-23 Thread Craig Ringer
On Tue, 24 Aug 2021 at 02:31, Robert Haas  wrote:

> On Mon, Aug 23, 2021 at 11:40 AM Alvaro Herrera 
> wrote:
> > In that case, why not improve the API with functions that return the
> > values in some native datatype?  For scalars with native C types (int,
> > floats, Boolean etc) this is easy enough; I bet it'll solve 99% of the
> > problems or more.
>
> Sure, but ... why bother?
>
> The entire argument rests on the presumption that there is some harm
> being done by people accessing the values directly, but I don't think
> that's true. And, if it were true, it seems likely that this proposed
> API would have the exact same problem, because it would let people do
> exactly the same thing. And, going through this proposed API would
> still be significantly more expensive than just accessing the bare
> variables, because you'd at least have to do some kind of lookup based
> on the GUC name to find the corresponding variable. It's just a
> solution in search of a problem.
>
> Nothing bad happens when people write extensions that access GUC
> variables directly. It works totally, completely fine. Except on
> Windows.
>

Not only that, but postgres already exports every non-static function
symbol on both *nix and Windows, and every data symbol on *nix. A lot of
those function symbols are very internal and give anything that can call
them the ability to wreck absolute havoc on the server's state.

There is not even the thinnest pretense of Pg having a dedicated extension
API or any sort of internal vs public API separation. This ongoing pain
with PGDLLIMPORT on Windows is hard to see as anything but an excuse to
make working with and supporting Windows harder because some of us don't
like it. I happen to rather dislike working with Windows myself, but I get
to do it anyway, and I'd be very happy to remove this particular source of
pain.


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-23 Thread Craig Ringer
On Mon, 23 Aug 2021 at 22:15, Tom Lane  wrote:

> Bruce Momjian  writes:
> > On Sun, Aug 22, 2021 at 09:29:01PM +0800, Julien Rouhaud wrote:
> >> Then shouldn't we try to prevent direct access on all platforms rather
> than
> >> only one?
>
> > Agreed.  If Julian says 99% of the non-export problems are GUCs, and we
> > can just export them all, why not do it?  We already export every global
> > variable on Unix-like systems, and we have seen no downsides.
>
> By that argument, *every* globally-visible variable should be marked
> PGDLLIMPORT.  But the mere fact that two backend .c files need to access
> some variable doesn't mean that we want any random bit of code doing so.
>
> And yes, I absolutely would prohibit extensions from accessing many
> of them, if there were a reasonable way to do it.  It would be a good
> start towards establishing a defined API for extensions.
>

There is: -fvisibility=hidden and __attribute__((visibility("default"))).
Or if you prefer to explicitly mark private symbols, use
__attribute__((visibility("hidden"))).

In addition to API surface control this also gives you a smaller export
symbol table for faster dynamic linking, and it improves link-time
optimization.

I could've sworn I proposed its use in the past, but I can't find a relevant
list thread except quite a recent one. All I can find is [1]. But that is
where we should start: switch from a linker script for libpq to using
PGDLLIMPORT (actually it'd be LIBPQDLLIMPORT) in libpq. When -DBUILDING_LIBPQ
is defined this would expand to __declspec(dllexport) on Windows and
__attribute__((visibility("default"))) on gcc/clang. Otherwise it expands to
__declspec(dllimport) on Windows and empty on other targets. This would also
be a good time to rename the confusingly named BUILDING_DLL macro to
BUILDING_POSTGRES.
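For concreteness, the macro I have in mind would look roughly like this (a
sketch only; LIBPQDLLIMPORT and BUILDING_LIBPQ don't exist today, and the
non-Windows branch assumes gcc/clang):

#ifdef _WIN32
  #ifdef BUILDING_LIBPQ
    #define LIBPQDLLIMPORT __declspec(dllexport)
  #else
    #define LIBPQDLLIMPORT __declspec(dllimport)
  #endif
#elif defined(__GNUC__) && defined(BUILDING_LIBPQ)
  /* export from the library itself on gcc/clang */
  #define LIBPQDLLIMPORT __attribute__((visibility("default")))
#else
  /* no annotation needed when consuming the library on other targets */
  #define LIBPQDLLIMPORT
#endif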

The next step would be to have PGDLLIMPORT expand to
__attribute__((visibility("default"))) on gcc/clang when building the
server itself. This won't do anything by itself since all symbols are
already default visibility. But once the "public API" of both function and
data symbol is so-annotated, we could switch to building Pg with
-fvisibility=hidden by default, and on Windows, we'd disable the linker
script that exports all functions using a generated .DEF file.


[1]
https://www.postgresql.org/message-id/CAMsr%2BYHa3TfA4rKtnZuzurwCSmxxXNQHFE3UE29BoDEQcwfuxA%40mail.gmail.com


Re: Mark all GUC variable as PGDLLIMPORT

2021-08-23 Thread Craig Ringer
On Sun, 22 Aug 2021 at 21:29, Julien Rouhaud  wrote:

> On Sun, Aug 22, 2021 at 09:19:42AM -0400, Tom Lane wrote:
> >
> > Uh, no, it's exactly *not* clear.  There are a lot of GUCs that are only
> > of interest to particular subsystems.  I do not see why being a GUC makes
> > something automatically more interesting than any other global variable.
> > Usually, the fact that one is global is only so the GUC machinery itself
> > can get at it, otherwise it'd be static in the owning module.
> >
> > As for "extensions should be able to get at the values", the GUC
> machinery
> > already provides uniform mechanisms for doing that safely.  Direct access
> > to the variable's internal value would be unsafe in many cases.
>
> Then shouldn't we try to prevent direct access on all platforms rather than
> only one?
>

Yes. That's what I think we should be doing if we're not going to
PGDLLIMPORT them all.

The current API is painful because it round-trips via a text
representation. We'd at least want some kind of

GetConfigOptionBool(...)
GetConfigOptionEnum(...)

etc.
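Roughly what an extension has to do today, and what such a typed accessor
would wrap up (a sketch only; GetConfigOption() and parse_bool() are existing
server functions, while GetConfigOptionBool() here is the hypothetical
wrapper):

#include "postgres.h"
#include "utils/builtins.h"
#include "utils/guc.h"

static bool
GetConfigOptionBool(const char *name)
{
	const char *textval = GetConfigOption(name, false, false);
	bool		result;

	if (textval == NULL || !parse_bool(textval, &result))
		elog(ERROR, "could not read boolean GUC \"%s\"", name);

	return result;
}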

I don't understand the objection to marking them all PGDLLIMPORT anyway
though.

Any pretense that Pg has any sort of public/private API divide is pure
fantasy. Whether things are static or not, in public headers or "internal"
headers, etc., is nearly random. We have absolutely no API surface control
such as __attribute__((visibility("hidden"))) annotations, and proposals to
add them have been soundly rejected in the past.

If we have a defined API, where is it defined and annotated?

If we don't, then Windows being different and incompatible is a bug, and
that bug should be fixed.


Quick tip on building pg with Visual Studio Build Tools 2019 + small patches

2021-07-13 Thread Craig Ringer
Hi all

If you're trying to build postgres with Visual Studio Build Tools 16 2019
using the optional v140 toolset, which installs the Visual Studio 14 (2015)
C/C++ toolchain, to get binaries that are fully compatible with the EDB
postgres builds, you may run into some confusing issues.

Use this incantation in cmd.exe (not a powershell.exe or pwsh.exe session)
to select the VS 16 msbuild with vs 14 compiler:

"%PROGRAMFILES(x86)%\Microsoft Visual
Studio\2019\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" amd64
-vcvars_ver=14.0

all on one line, then run src\tools\msvc\build.bat as normal.

If you instead attempt to use the vcvarsall.bat from the v140 toolchain
that VS Build Tools 2019 installed for you, it'll appear to work, but
compilation will then fail by spamming:

some.vcxproj(17,3): error MSB4019: The imported project
"C:\Microsoft.Cpp.Default.props" was not found. Confirm that the path in
the <Import> declaration is correct, and that the file exists on disk.

This is because the v140 toolset does not include the v140 msbuild. You're
expected to use the v160 msbuild and configure it to use the v140 toolset
instead.

Similar issues occur when you try to use the CMake generator "Visual Studio
14 2015" with a VS Build Tools 2019-installed version of the 140 toolchain;
you have to instead use -G "Visual Studio 16 2019" -T "v140" to select the
VS 16 msbuild and tell it to use the v140 toolchain. Crazy stuff.

If you instead just run:

"%PROGRAMFILES(x86)%\Microsoft Visual
Studio\2019\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" amd64

Then compilation will run fine, but the resulting binary will use the
version 16 MSVC compilers, runtime library and redist, etc.


Note that all these builds will target the default Windows 10 SDK. That
should be fine; we're very conservative in postgres about new Windows
features and functions, and we do dynamic lookups for a few symbols when
we're not sure if they'll be available. But you can force it to compile for
Windows 7 and higher by editing Mk.pm and adding the definitions

   WINVER=0x0601
   _WIN32_WINNT=0x0601

to your project. I didn't find a way to add custom preprocessor definitions
in config.pl so for testing purposes I hacked it into MSBuildProject.pm in
the <PreprocessorDefinitions> clause as

;WINVER=0x0601;_WIN32_WINNT=0x0601

I've attached a patch that teaches config.pl about a new 'definitions'
option to make this more graceful.

See
https://docs.microsoft.com/en-us/cpp/porting/modifying-winver-and-win32-winnt?view=msvc-160


If you don't have the toolchain installed, you can install Chocolatey
(there's a one-liner on their website) then:

choco install -y visualstudio2019buildtools

choco install -y visualstudio2019-vc++ --packageparameters "--add
Microsoft.VisualStudio.Component.VC.140"

You may also want

choco install -y winflexbison

(I've attached a patch that teaches pgflex.pl and pgbison.pl to use
win_flex.exe and win_bison.exe if they're found, and to accept full paths
for these tools in buildenv.pl).
From ad9625026bcb7768c636fdf3a37a3403db195ae2 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 13 Jul 2021 14:18:44 +1000
Subject: [PATCH v1 1/2] Teach pgflex.pl and pgbison.pl to read buildenv.pl
 for tool names

Some distributions of flex and bison on Windows use alternate executable
names such as win_flex.exe and win_bison.exe. Teach our pgflex.pl and
pgbison.pl wrappers how to handle them by reading the executables to use
from the new $flex and $bison variables in src/tools/msvc/buildenv.pl .

These may be bare names of commands on the PATH or they may be a fully
qualified path to the target executable.

While we're at it, notice when the test execution of flex or bison to
check the version fails and complain with a more informative error.
---
 src/tools/msvc/build.pl|  4 
 src/tools/msvc/buildenv_default.pl | 14 ++
 src/tools/msvc/pgbison.pl  | 27 +++
 src/tools/msvc/pgflex.pl   | 25 +
 4 files changed, 62 insertions(+), 8 deletions(-)
 create mode 100644 src/tools/msvc/buildenv_default.pl

diff --git a/src/tools/msvc/build.pl b/src/tools/msvc/build.pl
index de50554e7e..dedd515307 100644
--- a/src/tools/msvc/build.pl
+++ b/src/tools/msvc/build.pl
@@ -28,6 +28,10 @@ elsif (-e "./buildenv.pl")
 {
 	do "./buildenv.pl";
 }
+elsif (-e "src/tools/msvc/buildenv_default.pl")
+{
+	do "src/tools/msvc/buildenv_default.pl";
+}
 
 # set up the project
 our $config;
diff --git a/src/tools/msvc/buildenv_default.pl b/src/tools/msvc/buildenv_default.pl
new file mode 100644
index 00..b3868f2145
--- /dev/null
+++ b/src/tools/msvc/buildenv_default.pl
@@ -0,0 +1,14 @@
+# Copy this file to src\test\msvc\buildenv.pl and modify it as required
+# for your build toolchain, path, etc.
+#
+# Note that there's no way to run vcvarsall.bat from here to s

Re: Detecting File Damage & Inconsistencies

2021-07-01 Thread Craig Ringer
On Fri, 2 Jul 2021 at 00:19, Simon Riggs 
wrote:


> > So yeah. I think it'd be better to log the info you want at start-of-txn
> unless there's a compelling reason not so, and I don't see one yet.
>
> AFAIK, XLOG_XACT_ASSIGNMENT does not occur for normal top-level
> transactions, only for subxids.
>
> I don't really want to add an extra record just for this because it
> will slow down applications and it won't get turned on as often.
>

OK, that makes sense - I was indeed operating on an incorrect assumption.

I wouldn't want to add a new record either. I thought we could piggyback on
XLOG_XACT_ASSIGNMENT with a new chunk that's only added when the feature is
required, much like we do for replication origin info on commit records.

Is it worth considering forcing XLOG_XACT_ASSIGNMENT to be logged if this
feature is enabled?

If you don't think the sorts of use cases I presented are worth the trouble
that's fair enough. I'm not against adding it on the commit record. It's
just that with logical decoding moving toward a streaming model I suspect
only having it at commit time may cause us some pain later.


Re: RFC: Detailed reorder buffer stats dumps

2021-06-29 Thread Craig Ringer
On Thu, 6 May 2021 at 19:40, Amit Kapila  wrote:

> On Thu, May 6, 2021 at 9:54 AM Craig Ringer  wrote:
> >

> At the least it'd be helpful to have pg_stat_replication (or a new
> related auxiliary view like pg_stat_logical_decoding) show
> >
> > - walsender total bytes sent this session
> > - number of txns processed this txn
> >
>
> You might be able to derive some of the above sorts of stats from the
> newly added pg_stat_replication_slots [1].
>
>
That's a huge improvement, and I managed to completely miss both the
discussion of it and the work on it. Thanks. It'll be a significant help.

> > - number txns filtered out by output plugin this session
> > - oldest xid in reorder buffer
> > - reorder buffer number of txns
> > - reorder buffer total size (in-memory and total inc spilled)
> > - reorderbuffercommit current xid, last change lsn, total buffered size
> of current tx, total bytes of buffer processed so far within the current
> txn, and commit lsn if known, only when currently streaming a txn from
> reorderbuffercommit
>

These are less statistical in nature, and more about the current activity
of the walsender and logical decoding state. I'm not sure if it'd make much
sense to tack them on to pg_stat_replication_slots, especially as that'd
also mean they were quite delayed.

But it probably isn't worth the effort of exposing this info in a new view.

With the higher level info now available in pg_stat_replication_slots, I
think I might look into exposing these finer details via trace markers for
use with perf / systemtap  / etc instead.

A digression, but: It's a real shame that such tools don't give us a way to
read specific tagged regions of memory with the same ease they let us probe
function calls though. You generally need gdb to read the value of a
global, or a moderately funky systemtap script. There's no convenient
equivalent to SDT markers (TRACE_FOO) to tag variables. Wouldn't it be nice
if we could

 perf watch postgres:walsender_reorderbuffer_oldest_xid

or something like that?


Re: How to expose session vs txn lock info in pg_locks view?

2021-06-28 Thread Craig Ringer
On Mon, 1 Feb 2021 at 18:42, Craig Ringer 
wrote:

> On Sun, 24 Jan 2021 at 09:12, Andres Freund  wrote:
>
>> Hi,
>>
>> On 2021-01-19 14:16:07 +0800, Craig Ringer wrote:
>> > AFAICS it'd be necessary to expand PROCLOCK to expose this in shmem.
>> > Probably by adding a small bitfield where bit 0 is set if there's a txn
>> > level lock and bit 1 is set if there's a session level lock. But I'm not
>> > convinced that expanding PROCLOCK is justifiable for this.
>> sizeof(PROCLOCK)
>> > is 64 on a typical x64 machine. Adding anything to it increases it to 72
>> > bytes.
>>
>> Indeed - I really don't want to increase the size, it's already a
>> problem.
>>
>>
>> > It's frustrating to be unable to tell the difference between
>> session-level
>> > and txn-level locks in diagnostic output.
>>
>> It'd be useful, I agree.
>>
>>
>> > And the deadlock detector has no way to tell the difference when
>> > selecting a victim for a deadlock abort - it'd probably make sense to
>> > prefer to send a deadlock abort for txn-only lockers.
>>
>> I'm doubtful this is worth going for.
>>
>>
>> > But I'm not sure I see a sensible way to add the info - PROCLOCK is
>> > already free of any padding, and I wouldn't want to use hacks like
>> > pointer-tagging.
>>
>> I think there's an easy way to squeeze out space: make groupLeader be an
>> integer index into allProcs instead. That requires only 4 bytes...
>>
>> Alternatively, I think it'd be reasonably easy to add the scope as a bit
>> in LOCKMASK - there's plenty space.
>>
>
> I was wondering about that, but concerned that there would be impacts I
> did not understand.
>
> I'm happy to pursue that angle.
>

Just so this thread isn't left dangling: I'm not going to get time to
follow up on this work with a concrete patch and test suite change.

If anyone else later on wants to differentiate between session and txn
locks, they could start with the approach proposed here.


Re: [PATCH] ProcessInterrupts_hook

2021-06-28 Thread Craig Ringer
On Sat, 20 Mar 2021 at 03:46, Tom Lane  wrote:

> Robert Haas  writes:
> > On Fri, Mar 19, 2021 at 3:25 PM Tom Lane  wrote:
> >> I'm not very comfortable about the idea of having the postmaster set
> >> child processes' latches ... that doesn't sound terribly safe from the
> >> standpoint of not allowing the postmaster to mess with shared memory
> >> state that could cause it to block or crash.  If we already do that
> >> elsewhere, then OK, but I don't think we do.
>
> > It should be unnecessary anyway. We changed it a while back to make
> > any SIGUSR1 set the latch 
>
> Hmm, so the postmaster could send SIGUSR1 without setting any particular
> pmsignal reason?  Yeah, I suppose that could work.  Or we could recast
> this as being a new pmsignal reason.
>

I'd be fine with either way.

I don't expect to be able to get to working on a concrete patch for this
any time soon, so I'll be leaving it here unless someone else needs to pick
it up for their extension work. The in-principle agreement is there for
future work anyway.


Re: [PATCH] More docs on what to do and not do in extension code

2021-06-28 Thread Craig Ringer
Laurenz,

Thanks for your comments. Sorry it's taken me so long to get back to you.
Commenting inline below on anything I think needs comment; other proposed
changes look good.

On Sun, 30 May 2021 at 19:20, Laurenz Albe  wrote:

> +   always use  linkend="xfunc-sleeping-interrupts-blocking">PostgreSQL's
> +   interupt-aware APIs for the purpose. Do not
> usleep(),
> +   system(), make blocking system calls, etc.
> +  
>
> "system" is not a verb.
>

When it's a function call it is, but I'm fine with your revision:

  Do not use usleep(), system()
>   or other blocking system calls.
>
> +and should only use the main thread. PostgreSQL generally uses
> subprocesses
>
> Hm.  If the extension does not start threads, it automatically uses the
> main thread.
> I think that should be removed for clarity.
>

IIRC I intended that to apply to the section that tries to say how to
survive running your own threads in the backend if you really must do so.

+primitives like WaitEventSetWait where
> necessary. Any
> +potentially long-running loop should periodically call 
> +CHECK_FOR_INTERRUPTS() to give PostgreSQL a chance to
> interrupt
> +the function in case of a shutdown request, query cancel, etc.
> This means
>
> Are there other causes than shutdown or query cancellation?
> At any rate, I am not fond of enumerations ending with "etc".
>

I guess. I wanted to emphasise that if you mess this up postgres might fail
to shut down or your backend might fail to respond to SIGTERM /
pg_terminate_backend, as those are the most commonly reported symptoms when
such issues are encountered.


+by PostgreSQL if any function that it calls makes a
> +CHECK_FOR_INTERRUPTS() check. It may not
>
> "makes" sound kind of clumsy in my ears.
>

Yeah. I didn't come up with anything better right away but will look when I
get the chance to return to this patch.


> Attached is a new version that has my suggested changes, plus a few from
> Bharath Rupireddy (I do not agree with many of his suggestions).
>

Thanks very much. I will try to return to this soon and review the diff
then rebase and update the patch.

I have a large backlog to get through, and I've recently had the pleasure
of having to work on windows/powershell build system stuff, so it may still
take me a while.


Adding more PGDLLIMPORTs

2021-06-24 Thread Craig Ringer
Hi all

As a result of some recent work on Windows I have a list of PGDLLIMPORTs
I'd like to add to existing exported globals.

All affected variables are already extern, so this doesn't expose any new
API not already available to non-Windows extensions.

I've split the patch up for clarity:

* v1-0001: PGDLLIMPORTs for xlog.c's XactLastRecEnd, ProcLastRecPtr and
reachedConsistency . I only really need XactLastRecEnd but think it's
sensible to expose all of them.

* v1-0002: PGDLLIMPORT for struct WalRecv . Allows extensions to observe
WAL receiver state and behaviour.

* v1-0003: PGDLLIMPORT criticalSharedRelcachesBuilt and
criticalRelcachesBuilt . I only really need criticalSharedRelcachesBuilt
but it seems sensible to export both. Useful when extension code may run
during early startup (_PG_init in shared_preload_libraries, shmem init,
etc) and later during normal running.

* v1-0004: PGDLLIMPORT a set of useful GUCs and vars containing GUC-derived
state in xlog.h and walreceiver.h

I will follow up soon with a patch that marks every GUC as PGDLLIMPORT
including any vars derived from the GUC by hooks. I don't see much point
doing this piecemeal since they're all externs anyway. That patch will
replace patch 4 above, but not patches 1-3.

I'd love to see these PGDLLIMPORTs backported to pg13.
From f4c1abfbfc33ad95b5c216f1fcf132938c6377bc Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 17 Jun 2021 11:46:17 +0800
Subject: [PATCH v1 1/5] Make xlog.c vars PGDLLIMPORT

Allow extensions to see XactLastRecEnd, ProcLastRecPtr and
reachedConsistency.
---
 src/include/access/xlog.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12be..e0b3a75d4d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -97,11 +97,11 @@ typedef enum
 	RECOVERY_TARGET_TIMELINE_NUMERIC
 } RecoveryTargetTimeLineGoal;
 
-extern XLogRecPtr ProcLastRecPtr;
-extern XLogRecPtr XactLastRecEnd;
+extern PGDLLIMPORT XLogRecPtr ProcLastRecPtr;
+extern PGDLLIMPORT XLogRecPtr XactLastRecEnd;
 extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd;
 
-extern bool reachedConsistency;
+extern PGDLLIMPORT bool reachedConsistency;
 
 /* these variables are GUC parameters related to XLOG */
 extern int	wal_segment_size;
-- 
2.31.1

From 9284627abdcee2b851531a55f82288a73285c9fd Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 17 Jun 2021 11:47:18 +0800
Subject: [PATCH v1 2/5] Make struct WalRecv PGDLLIMPORT for extensions

This allows extensions on a standby to observe the state of the wal
receiver.
---
 src/include/replication/walreceiver.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25ea7..016814658e 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -160,7 +160,7 @@ typedef struct
 	sig_atomic_t force_reply;	/* used as a bool */
 } WalRcvData;
 
-extern WalRcvData *WalRcv;
+extern PGDLLIMPORT WalRcvData *WalRcv;
 
 typedef struct
 {
-- 
2.31.1

From 6ef9e486e829b9e96a622bf60d2fe6a7044a6321 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 17 Jun 2021 11:50:11 +0800
Subject: [PATCH v1 3/5] PGDLLIMPORT criticalSharedRelcachesBuilt and
 criticalRelcachesBuilt

Allow extensions to see the state of relcache init so they can tell if
relcache access is safe. Useful for extensions that register at
shared_preload_libraries time.
---
 src/include/utils/relcache.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index f772855ac6..202f2b10d9 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -145,9 +145,9 @@ extern void RelationCacheInitFilePostInvalidate(void);
 extern void RelationCacheInitFileRemove(void);
 
 /* should be used only by relcache.c and catcache.c */
-extern bool criticalRelcachesBuilt;
+extern PGDLLIMPORT bool criticalRelcachesBuilt;
 
 /* should be used only by relcache.c and postinit.c */
-extern bool criticalSharedRelcachesBuilt;
+extern PGDLLIMPORT bool criticalSharedRelcachesBuilt;
 
 #endif			/* RELCACHE_H */
-- 
2.31.1

From f5bc0d2fc74eb2855196f7aa02295d29059f742f Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 17 Jun 2021 11:53:21 +0800
Subject: [PATCH v1 4/5] PGDLLIMPORT a set of useful GUCs and derived variables

---
 src/include/access/xlog.h | 76 +--
 src/include/replication/walreceiver.h |  6 +--
 2 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e0b3a75d4d..3a6dbbaef4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -104,47 +104,47 @@ extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd;
 extern PGDLLIMPORT bool reachedConsistency;
 
 /* these variables are GUC parameters related to XLOG */
-extern int

Re: Detecting File Damage & Inconsistencies

2021-06-21 Thread Craig Ringer
On Tue, 22 Jun 2021 at 00:24, Simon Riggs 
wrote:

> On Thu, Mar 18, 2021 at 6:20 AM Craig Ringer
>  wrote:
> >
> > On Mon, 15 Mar 2021 at 21:01, David Steele  wrote:
> >>
> >> On 11/18/20 5:23 AM, Simon Riggs wrote:
> >> > On Wed, 18 Nov 2020 at 06:42, Craig Ringer
> >> >  wrote:
> >> >>
> >> >> On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs 
> wrote:
> >> >>>
> >> >>>
> >> >>> What I'm proposing is an option to add 16 bytes onto each COMMIT
> >> >>> record
> >> >>
> >> >>
> >> >> Would it make sense to write this at the time we write a topxid
> assignment to WAL instead?
> >> >>
> >> >> Otherwise it won't be accessible to streaming-mode logical decoding.
> >> >
> >> > Do you mean extend the xl_xact_assignment record? My understanding is
> >> > that is not sent in all cases, so not sure what you mean by "instead".
> >>
> >> Craig, can you clarify?
> >
> >
> > Right. Or write a separate WAL record when the feature is enabled. But
> it's probably sufficient to write it as an optional chunk on
> xl_xact_assignment records. We often defer writing them so we can optimise
> away xacts that never actually wrote anything, but IIRC we still write one
> before we write any WAL that references the xid. That'd be fine, since we
> don't need the info any sooner than that during decoding. I'd have to
> double check that we write it in all cases and won't get to that too soon,
> but I'm pretty sure we do...
>
> The commit record is optimized away if no xid is assigned, though is
> still present if we didn't write any WAL records.
>
> But if a commit record exists in the WAL stream, we want to know where
> it came from.
>
> A later patch will add PITR capability based on this information so
> attaching it directly to the commit record is fairly important, IMHO.
>

Why?

All the proposed info:

* 8-byte session start time (from MyStartTime)
* 2-byte pid (from MyProcPid)
* 4-byte user oid

are available at topxid assignment time. If we defer writing them until
commit, we lose the ability to use this information during streaming
logical decoding. That's something I believe you've wanted for other
functionality in the past, such as logical decoding based audit
functionality.

IIRC the restart_lsn horizon already ensures that we can't miss the
xl_xact_assignment at the start of a txn. We would ensure that the desired
info is available throughout decoding of the txn, including at commit
record processing time, by adding it to the toplevel ReorderBufferTxn.

The only advantage I can see to annotating the commit record instead is
that we don't have to spend a few bytes per reorder-buffered topxid to
track this info between start of decoding for the tx and processing of the
commit record. I don't think that's worth caring about.The advantages that
having it earlier would give us are much more significant.

A few examples:

* Skip reorder buffering of non-target transactions early, so we can decode
the WAL stream to find the target transactions much faster using less
memory and I/O;

* Read the database change stream and use the session info to stream info
into an intrusion detection system and/or audit engine in real time, using
txn streaming to avoid the need to create huge reorder buffers;

* Re-decode the WAL stream to identify a target txn you know was aborted,
and commit it instead, so you can recover data from aborted txns from the
WAL stream using logical decoding. (Only possible if the catalog_xmin
hasn't advanced past that point already though)

So yeah. I think it'd be better to log the info you want at start-of-txn
unless there's a compelling reason not so, and I don't see one yet.


Re: RFC: Detailed reorder buffer stats dumps

2021-05-05 Thread Craig Ringer
On Thu, 6 May 2021 at 02:28, Andres Freund  wrote:

> Hi,
>
> On 2021-05-05 18:33:27 +0800, Craig Ringer wrote:
> > I'm thinking of piggy-backing on the approach used in the "Get memory
> > contexts of an arbitrary backend process" patch in order to provide
> access
> > to detailed reorder buffer content statistics from walsenders on request.
> >
> > Right now the reorder buffer is mostly a black-box. I mostly rely on gdb
> or
> > on dynamic probes (perf, systemtap) to see what it's doing. I intend a
> > patch soon to add a couple of fields to struct WalSnd to report some very
> > coarse reorder buffer stats - at least oldest buffered xid, number of
> > buffered txns, total bytes of buffered txns in memory, total bytes of
> > buffered txns spilled to disk.
> >
> > But sometimes what I really want is details on the txns that're in the
> > reorder buffer, and that's not feasible to export via always-enabled
> > reporting like struct WalSnd. So I'm thinking that the same approach used
> > for the memory context stats patch might work well for asking the
> walsender
> > for a detailed dump of reorder buffer contents. Something like a
> > per-buffered-txn table of txn topxid, start-lsn, most recent change lsn,
> > number of changes, number of subxids, number of invalidations, number of
> > catalog changes, buffer size in memory, buffer size spilled to disk.
> >
> > Anyone drastically opposed to the idea?
>
> I am doubtful. The likelihood of ending with effectively unused code
> seems very substantial here.
>

I can't rule that out, but the motivation for this proposal isn't
development convenience. It's production support and operations. The
reorder buffer is a black box right now, and when you're trying to answer
the questions "what is the walsender doing," "is meaningful progress being
made," and "what is slowing down replication" it's ... not easy.

I currently rely on some fairly hairy gdb scripts, which I'm not keen on
running on production systems if I can avoid it.

I'm far from set on the approach of asking the walsender to dump a reorder
buffer state summary to a file. But I don't think the current state of
affairs is much fun for production use. Especially since we prevent the
pg_stat_replication sent_lsn from going backwards, so reorder buffering can
cause replication to appear to completely cease to progress for long
periods unless you identify the socket and monitor traffic on it, or you
intrude on the process with gdb.

At the least it'd be helpful to have pg_stat_replication (or a new related
auxiliary view like pg_stat_logical_decoding) show

- walsender total bytes sent this session
- number of txns processed this session
- number of txns filtered out by output plugin this session
- oldest xid in reorder buffer
- reorder buffer number of txns
- reorder buffer total size (in-memory and total inc spilled)
- reorderbuffercommit current xid, last change lsn, total buffered size of
current tx, total bytes of buffer processed so far within the current txn,
and commit lsn if known, only when currently streaming a txn from
reorderbuffercommit

That way it'd be possible to tell if a logical walsender is currently
processing a commit and get a much better sense of its progress within the
commit.

Perhaps output plugins could do some of this and expose their own custom
views. But then each plugin would have to add its own. Plus they don't get
a particularly good view into the reorder buffer state; they'd have a hard
time maintaining good running stats.

Some basic monitoring exposed for logical decoding and reorder buffering
would help a lot. Does that sound more palatable?

If so, I'd probably still want to be able to hook a few places in logical
decoding to allow an extension to instrument it when greater insight into
the inner workings is required.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?

2021-05-05 Thread Craig Ringer
On Wed, 5 May 2021 at 23:15, Craig Ringer  wrote:

> Which was fine as far as it went, but I failed to account for the xid
> assignment not necessarily being durable when the client calls
> txid_status().

Ahem, I meant "when the client calls txid_current()"

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?

2021-05-05 Thread Craig Ringer
On Tue, 9 Feb 2021 at 05:52, Andres Freund  wrote:

>
> Craig, it kind of looks to me like you assumed it'd be guaranteed that
> the xid at this point would show in-progress?
>

At the time I wrote that code, I don't think I understood that xid
assignment wasn't necessarily durable until either (a) the next checkpoint;
or (b) commit of some txn with a greater xid.

IIRC I expected that after crash and recovery the tx would always be
treated as aborted, because the xid had been assigned but no corresponding
commit was found before end-of-recovery. No explicit abort records are
written to WAL for such txns since we crashed, but the server's oldest
in-progress txn threshold is used to determine that they must be aborted
rather than in-progress even though their clog entries aren't set to
aborted.

Which was fine as far as it went, but I failed to account for the xid
assignment not necessarily being durable when the client calls
txid_status().


> I don't think the use of txid_status() described in the docs added in
> the commit is actually ever safe?
>

I agree. The client can query for its xid with txid_current() but as you
note there's no guarantee that the assigned xid is durable.

The client would have to ensure that an xid was assigned, then ensure that
the WAL was durably flushed past the point of the xid assignment before
relying on the xid.

If we do a txn that performs a small write, calls txid_current(), and sends
a commit that the server crashes before completing, we can't know for sure
that the xid we recorded client-side before the server crash is the same
txn we check the status of after crash recovery. Some other txn could've
re-used the xid after crash so long as no other txn with a greater xid
durably committed before the crash.

That scenario isn't hugely likely, but it's definitely possible on systems
that don't do a lot of concurrent txns or do mostly long, heavyweight txns.

The txid_status() function was originally intended to be paired with a way
to report topxid assignment to the client automatically, NOTIFY or
GUC_REPORT-style. But that would not make this usage safe either, unless we
delayed the report until WAL was flushed past the LSN of the xid assignment
*or* some other txn with a greater xid committed.

This could be made safe with a variant of txid_current() that forced the
xid assignment to be logged immediately if it was not already, and did not
return until WAL flushed past the point of the assignment. If the client
did most of the txn's work before requesting a guaranteed-durable xid, it
would in practice not land up having to wait for a flush. But we'd have to
keep track of when we assigned the xid in every single topxact in order to
be able to promise we'd flushed it without having to immediately force a
flush. That's pointless overhead all the rest of the time, just in case
someone wants to get an xid for later use with txid_status().

The simplest option with no overhead on anything that doesn't care about
txid_status() is to expose a function to force flush of WAL up to the
current insert LSN. Then update the docs to say you have to call it after
txid_current(), and before sending your commit. But at that point you might
as well use 2PC, since you're paying the same double flush and double
round-trip costs. The main point of txid_status() was to avoid the cost of
that double-flush.
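
To make the ordering concrete, here's a minimal libpq sketch of that
"flush before relying on the xid" pattern. It assumes a hypothetical
pg_flush_wal_upto(lsn) function along the lines described above - nothing
like it exists in core today - and omits all error handling:

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Record an xid that will be safe to pass to txid_status() after a crash. */
    static void
    work_then_commit_with_durable_xid(PGconn *conn, char *xidbuf, size_t buflen)
    {
        PGresult   *res;

        res = PQexec(conn, "BEGIN");
        PQclear(res);

        res = PQexec(conn, "INSERT INTO t VALUES (1)");    /* the txn's work */
        PQclear(res);

        /* Assign the topxid and remember it client-side. */
        res = PQexec(conn, "SELECT txid_current()");
        snprintf(xidbuf, buflen, "%s", PQgetvalue(res, 0, 0));
        PQclear(res);

        /*
         * Hypothetical: wait until WAL containing the xid assignment is
         * durably flushed, so the xid can't be silently reused after a crash.
         */
        res = PQexec(conn,
                     "SELECT pg_flush_wal_upto(pg_current_wal_insert_lsn())");
        PQclear(res);

        /* If we crash after this point, txid_status(xid) is meaningful. */
        res = PQexec(conn, "COMMIT");
        PQclear(res);
    }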

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?

2021-05-05 Thread Craig Ringer
On Wed, 10 Feb 2021 at 04:28, Robert Haas  wrote:

> On Mon, Feb 8, 2021 at 4:52 PM Andres Freund  wrote:
> > The 011_crash_recovery.pl test starts a transaction, creates a
> > table, fetches the transaction's xid. Then shuts down the server in
> > immediate mode. It then asserts that after crash recovery the previously
> > assigned xid is shown as aborted, and that new xids are assigned after
> > the xid.
> >
> > But as far as I can tell there's no guarantee that that is the case.
>
> I think you are right.
>
>
Andres, I missed this mail initially. I'll look into it shortly.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


RFC: Detailed reorder buffer stats dumps

2021-05-05 Thread Craig Ringer
Hi all

I'm thinking of piggy-backing on the approach used in the "Get memory
contexts of an arbitrary backend process" patch in order to provide access
to detailed reorder buffer content statistics from walsenders on request.

Right now the reorder buffer is mostly a black box. I mostly rely on gdb or
on dynamic probes (perf, systemtap) to see what it's doing. I intend to post
a patch soon that adds a couple of fields to struct WalSnd to report some very
coarse reorder buffer stats - at least oldest buffered xid, number of
buffered txns, total bytes of buffered txns in memory, total bytes of
buffered txns spilled to disk.

But sometimes what I really want is details on the txns that're in the
reorder buffer, and that's not feasible to export via always-enabled
reporting like struct WalSnd. So I'm thinking that the same approach used
for the memory context stats patch might work well for asking the walsender
for a detailed dump of reorder buffer contents. Something like a
per-buffered-txn table of txn topxid, start-lsn, most recent change lsn,
number of changes, number of subxids, number of invalidations, number of
catalog changes, buffer size in memory, buffer size spilled to disk.
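
For concreteness, the per-txn entry might look something like this (purely
illustrative field names, not a struct I'm actually proposing yet):

    #include "postgres.h"
    #include "access/xlogdefs.h"    /* XLogRecPtr */

    typedef struct ReorderBufferTxnStatsEntry
    {
        TransactionId topxid;           /* toplevel xid of the buffered txn */
        XLogRecPtr    start_lsn;        /* LSN of the first buffered change */
        XLogRecPtr    last_change_lsn;  /* LSN of the most recent change */
        uint64        nchanges;         /* number of buffered changes */
        uint32        nsubxids;         /* known subtransactions */
        uint32        ninvalidations;   /* queued invalidation messages */
        uint32        ncatalog_changes; /* changes that touch catalogs */
        Size          mem_bytes;        /* buffer size held in memory */
        Size          spilled_bytes;    /* buffer size spilled to disk */
    } ReorderBufferTxnStatsEntry;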

Anyone drastically opposed to the idea?

(I know I have to finish up with the LWLock tracepoint patchset first, this
is a RFC at this stage).

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


[PATCH] Faster, easier valgrind runs with make USE_VALGRIND=1 check

2021-05-05 Thread Craig Ringer
Hi all

The attached patch adds support for running any temp-install based tests
(check, isolationcheck, src/test/recovery, etc) under the control of
valgrind with a simple

 make USE_VALGRIND=1 check

It's based on a script I've been using for some time to run faster, simpler
Valgrind checks on the codebase with much less irrelevant noise than the
usual approaches.

There are no C code changes at all in this patch, it only touches
Makefile.global and adds a new src/tools/valgrind_wrapper tool.

When you specify USE_VALGRIND=1, the PATH used by $(with_temp_install) is
prefixed with a tmp_install/bin_valgrind_wrapper/ directory. Each binary in
$(bindir) gets a corresponding wrapper script in bin_valgrind_wrapper in
the temp install. The wrappers intercept execution of every command in the
bindir and exec them under the control of valgrind (or skip valgrind and
exec that target directly, if desired).

This has many advantages over the usual approaches of an installcheck-based
valgrind run or "valgrind make check":

* There's no custom setup, it works out of the box
* It produces much less irrelevant log output and runs a lot faster because
it only runs postgres-related binaries under valgrind, not irrelevant noise
like perl interpreters, make, shell scripts, etc.
* It's much more targeted and selective - if you're only interested in some
extension or new backend feature, you can trivially set it to target just
the backend, skip checking of initdb, and skip checking of psql, so you get
more relevant log output and faster runs.

I'll follow up to this post with some timing and log output numbers but
wanted to share what I had first.

-DUSE_VALGRIND is also added to CFLAGS at compile time when USE_VALGRIND=1
is passed to make. This gets rid of the need to hack pg_config_manual.h or
fiddle with configure re-runs when you want to build with valgrind support.
Arguably it'd be better to add a --enable-valgrind option to configure. LMK
if that's preferable.

Note that there's a bit of a hack in the wrapper script to work around
Valgrind's inability to set the argv[0] of a process run under valgrind to
anything other than the exact command-name to be executed. I have a patch
for valgrind pending to add that capability (like "exec -a" in bash) but a
workaround is necessary for now. It's made a bit more complicated by
PostgreSQL's determination to canonicalize paths and follow symlinks in
find_my_exec(). The script's hardlink based workarounds for this could be
removed if we could agree to support a debug env-var or command-line option
that could be used to supply an override path to be returned by
find_my_exec() instead of performing normal discovery. If you'd prefer that
approach to the current workaround in the script let me know.

I'm also willing to add valgrind-support-detection logic that will cause
valgrind launched via "make USE_VALGRIND=1" to refuse to run if it detects
that the target postgres was not built with -DUSE_VALGRIND for proper
instrumentation. This can be done with the valgrind --require-text-symbol
option and a dummy export symbol added to the symbol table only when
compiled with -DUSE_VALGRIND. If that's desirable let me know, it should be
quick to add.

You can find more detail in the patch commit message (attached) and in the
src/test/valgrind_wrapper script it adds. If you're wondering why the
valgrind options --trace-children=yes --trace-children-skip=pattern
--trace-children-skip-by-arg=pattern don't solve this problem, read the
script's comments.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
From e730e862a0731dc3d2786c5004a146aff7dd6bf7 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 4 May 2021 10:34:01 +0800
Subject: [PATCH v3] Make Valgrind runs simpler with make USE_VALGRIND=1

To run Valgrind based tests on postgres, it's generally been necessary
to either:

* Initdb and start your own cluster under the control of valgrind then
  use "make installcheck". This won't work with TAP tests, and
  it's cumbersome to set up and run.
  -or-
* Run "make check" under the control of valgrind. This approach
  runs all the uninteresting processes under valgrind, with all
  the associated overhead. Every make, shell, utility command,
  perl interpreter, etc gets run under valgrind, which slows
  everything down a lot and produces a great deal of uninteresting
  valgrind output. There's no practical way to target just the
  server backend this way.

Provide a faster, simpler and more targeted way to run valgrind on
postgres by adding support for a "USE_VALGRIND=1" parameter to
the makefiles. This has two main effects:

* It adds -DUSE_VALGRIND to CFLAGS, so enhanced valgrind runtime
  support is compiled into postgres; and
* It interposes a wrapper script before all executions of any binary
  installed in the $(bindir) of a temp-install. The wrapper sc

Re: [PATCH] Identify LWLocks in tracepoints

2021-05-04 Thread Craig Ringer
On Wed, 5 May 2021 at 09:15, Craig Ringer  wrote:

>> warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
>>  1322 |TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
>>   |  ^
>
> Odd that I didn't get that.

This compiler complaint is not due to the _ENABLED() test as such.
It's due to the fact that we completely define out the
TRACE_POSTGRESQL_ macros with src/backend/utils/Gen_dummy_probes.sed.

While explicit braces could be added around each test, I suggest
fixing Gen_dummy_probes.sed to emit the usual dummy statement instead.
Patch attached.
From a1ca7d752b56717854cc47b31cd299e651de0430 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 5 May 2021 12:06:10 +0800
Subject: [PATCH v1] Emit dummy statements for probes.d probes when disabled
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When building without --enable-dtrace, emit dummy

do {} while (0)

statements for the stubbed-out TRACE_POSTGRESQL_foo() macros
instead of empty macros that totally elide the original probe
statement.

This fixes the

warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]

introduced by b94409a02f.
---
 src/backend/utils/Gen_dummy_probes.sed | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/utils/Gen_dummy_probes.sed b/src/backend/utils/Gen_dummy_probes.sed
index aa3db59cce..6e29d86afa 100644
--- a/src/backend/utils/Gen_dummy_probes.sed
+++ b/src/backend/utils/Gen_dummy_probes.sed
@@ -19,5 +19,6 @@ s/([^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2,
 s/([^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2, INT3, INT4, INT5, INT6)/
 s/([^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2, INT3, INT4, INT5, INT6, INT7)/
 s/([^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2, INT3, INT4, INT5, INT6, INT7, INT8)/
+s/$/ do {} while (0)/
 P
 s/(.*$/_ENABLED() (0)/
-- 
2.30.2



Re: [PATCH] Identify LWLocks in tracepoints

2021-05-04 Thread Craig Ringer
On Wed, 5 May 2021, 06:15 Andres Freund,  wrote:

> Hi,
>
> warning: suggest braces around empty body in an ‘if’ statement
> [-Wempty-body]
>  1322 |TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
>   |  ^
>


Odd that I didn't get that.

I'll send a patch to revise shortly.


Re: [PATCH] Identify LWLocks in tracepoints

2021-04-29 Thread Craig Ringer
On Wed, 14 Apr 2021, 22:29 Robert Haas,  wrote:

> On Tue, Apr 13, 2021 at 10:42 PM Craig Ringer
>  wrote:
> > I'd really love it if a committer could add an explanatory comment or
> > two in the area though. I'm happy to draft a comment patch if anyone
> > agrees my suggestion is sensible. The key things I needed to know when
> > studying the code were:
> > [...]
>
> I'm willing to review a comment patch along those lines.
>

Cool. I'll draft soon.

I've since noticed that some of the info is present, but it's in lwlock.h,
whereas in Pg the comment detail is more often than not in the .c file.

I prefer it in headers myself anyway, since it's more available to tools
like doxygen. I'll add a few "see lwlock.h" hints and a short para about
appropriate lwlock use to the .c file's comments, and post on a separate
thread soon.


> > I'm actually inclined to revise the patch I sent in order to *remove*
> > the LWLock name from the tracepoint argument.
>

> Reducing the overheads is good, but I have no opinion on what's
> important for people doing tracing, because I am not one of those
> people.
>

Truthfully I'm not convinced anyone is "those people" right now. I don't
think anyone is likely to be making serious use of them due to their
limitations.

Certainly that'll be the case for the txn ones, which are almost totally
useless. They only track the local xid lifecycle; they don't track real txid
allocation, WAL writing, commit (WAL or shmem), etc.


Re: [PATCH] Identify LWLocks in tracepoints

2021-04-29 Thread Craig Ringer
On Thu, 29 Apr 2021 at 15:31, Peter Eisentraut
 wrote:
> > So if you could produce a separate patch that adds the
> > _ENABLED guards targeting PG14 (and PG13), that would be helpful.
>
> Here is a proposed patch for this.

LGTM.

Applies and builds fine on master and (with default fuzz) on
REL_13_STABLE. Works as expected.

This does increase the size of LWLockAcquire() etc. slightly, but since
it skips those function calls and the semaphore tests are easily
branch-predicted, I don't have any doubt it's a net win. +1 for merge.




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Wed, 14 Apr 2021 at 10:41, Craig Ringer
 wrote:
> On Wed, 14 Apr 2021 at 02:25, Robert Haas  wrote:
> > You could try to identify locks by pointer addresses, but that's got
> > security hazards and the addreses aren't portable across all the
> > backends involved in the parallel query because of how DSM works, so
> > it's not really that helpful in terms of matching stuff up.
>
> What I'm doing now is identifying them by LWLock* across backends. I
> keep track of DSM segment mappings in each backend inside the trace
> script and I relocate LWLock* pointers known to be inside DSM segments
> relative to a dummy base address so they're equal across backends.

BTW, one of the reasons I did this was to try to identify BDR and
pglogical code that blocks or sleeps while holding a LWLock. I got
stuck on that for other reasons, so it didn't go anywhere, but those
issues are now resolved so I should probably return to it at some
point.

It'd be a nice thing to be able to run on postgres itself too.




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Wed, 14 Apr 2021 at 02:25, Robert Haas  wrote:

> So before the commit in question --
> 3761fe3c20bb040b15f0e8da58d824631da00caa -- T_ID() used to compute an
> offset for a lock within the tranche that was supposed to uniquely
> identify the lock. However, the whole idea of an array per tranche
> turns out to be broken by design.

Yeah, I understand that.

I'd really love it if a committer could add an explanatory comment or
two in the area though. I'm happy to draft a comment patch if anyone
agrees my suggestion is sensible. The key things I needed to know when
studying the code were:

* A LWLock* is always part of a tranche, but locks within a given
tranche are not necessarily co-located in memory, cross referenced or
associated in any way.
* A LWLock tranche does not track how many LWLocks are in the tranche
or where they are in memory. It only groups up LWLocks into categories
and maps the tranche ID to a name.
* Not all LWLocks are part of the main LWLock array; others can be
embedded in shmem structs elsewhere, including in DSM segments.
* LWLocks in DSM segments may not have the same address between
different backends, because the DSM base address can vary, so a
LWLock* cannot be reliably compared between backends unless you know
it's in the main LWLock array or in static shmem.

Having that info in lwlock.c near the tranche management code or the
tranche and main lwlock arrays would've been very handy.


> You could try to identify locks by pointer addresses, but that's got
> security hazards and the addreses aren't portable across all the
> backends involved in the parallel query because of how DSM works, so
> it's not really that helpful in terms of matching stuff up.

What I'm doing now is identifying them by LWLock* across backends. I
keep track of DSM segment mappings in each backend inside the trace
script and I relocate LWLock* pointers known to be inside DSM segments
relative to a dummy base address so they're equal across backends.

It's a real pain, but it works. The main downside is that the trace
script has to observe the DSM attach; if it's started once a backend
already has the DSM segment attached, it has no idea the LWLock is in
a DSM segment or how to remap it. But that's not a serious issue.

> On a broader level, I agree that being able to find out what the
> system is doing is really important. But I'm also not entirely
> convinced that having really fine-grained information here to
> distinguish between one lock and another is the way to get there.

At the start of this thread I would've disagreed with you.

But yeah, you and Andres are right, because the costs outweigh the
benefits here.

I'm actually inclined to revise the patch I sent in order to *remove*
the LWLock name from the tracepoint argument. At least for the
fast-path tracepoints on start-of-acquire and end-of-acquire. I think
it's probably OK to report it in the lock wait tracepoints, which is
where it's most important to have anyway. So the tracepoint will
always report the LWLock* and tranche ID, but it won't report the
tranche name for the fast-path. I'll add trace events for tranche ID
registration, so trace tools can either remember the tranche ID->name
mappings from there, or capture them from lock wait events and
remember them.

Reasonable? That way we retain the most important trace functionality,
but we reduce the overheads.
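
Something as simple as this at registration time would be enough for tools
to build the mapping (the probe name here is made up for this sketch):

    /* hypothetical probe, fired e.g. from LWLockRegisterTranche() */
    TRACE_POSTGRESQL_LWLOCK_REGISTER_TRANCHE(tranche_id, tranche_name);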




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Wed, 14 Apr 2021 at 04:46, Andres Freund  wrote:
>
> On 2021-04-13 14:25:23 -0400, Robert Haas wrote:
> > On Mon, Apr 12, 2021 at 11:06 PM Andres Freund  wrote:
> > You could identify every lock by a tranche ID + an array offset + a
> > "tranche instance ID". But where would you store the tranche instance
> > ID to make it readily accessible, other than in the lock itself?
> > Andres wasn't thrilled about using even 2 bytes to identify the
> > LWLock, so he'll probably like having more bytes in there for that
> > kind of thing even less.
>
> I still don't like the two bytes, fwiw ;). Especially because it's 4
> bytes due to padding right now.

Aha, did I hear you say "there are two free bytes for me to shove
something marginally useful and irrelevant into"?

(*grin*)

> I'd like to move the LWLock->waiters list to outside of the lwlock
> itself - at most TotalProcs LWLocks can be waited for, so we don't need
> millions of empty proclist_heads. That way we could also remove the
> proclist indirection - which shows up a fair bit in contended workloads.
>
> And if we had a separate "lwlocks being waited for" structure, we could
> also add more information to it if we wanted to...

Having the ability to observe LWLock waiters would be nice. But you're
right to constantly emphasise that LWLocks need to be very slim. We
don't want to turn them into near-heavyweight locks by saddling them
with overhead that's not strictly necessary. A simple pg_regress run
(with cassert, admittedly) takes 25,000,000 LWLocks and does 24,000
LWLock waits and 130,000 condvar waits. All in less than a minute.
OTOH, once someone's waiting we don't care nearly as much about
bookkeeping cost, it's the un-contested fast paths that're most
critical.

> - We could make it work if we restricted MAX_BACKENDS to be 2**14 - but
>   while I personally think that's a sane upper limit, I already had a
>   hard time getting consensus to lower the limit to 2^18-1.

16384 backends is totally fine in sane real world deployments. But
it'll probably upset marketing people when OtherDatabaseVendor starts
shouting that they support 14 million connections, and postgres only
has 16k. Sigh.

The real answer here in the long term probably needs to be decoupling
of executors from connection state inside postgres. But while we're on
that topic, how about we convert the entire codebase to Rust while
riding on a flying rainbow unicorn? We're stuck with the 1:1
connection to executor mapping for the foreseeable future.

> - Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as
>   well as an offset to one of MAX_BACKENDS "wait lists" into LWLock.

You know much more than me about the possible impacts of that on
layout and caching, but I gather that it's probably undesirable to
make LWLocks any bigger.

> I think it's quite useful for relatively simple things like analyzing
> the total amount of time spent in individual locks, without incuring
> much overhead when not doing so (for which you need to identify
> individual locks, otherwise your end - start time is going to be
> meaningless).

Yep.

That's why the removal of the lock offset is a bit frustrating,
because you can't identify an LWLock instance-wide by LWLock* due to
the possibility of different per-backend DSM base-address mappings.
Well, and ASLR on EXEC_BACKEND systems, but who cares about those?

The removal was for good reasons though. And it only affects LWLocks
in DSM, for everything else the LWLock* is good enough. If identifying
LWLocks in DSM ever matters enough to bother to solve that problem, we
can emit trace events on DSM mapping attach in each backend, and trace
tools can do the work to track which LWLocks are in DSM and convert
their addresses to a reference base address. Pg shouldn't have to pay
the price for that unless it's something a lot of people need.

> And, slightly more advanced, for analyzing what the stack
> was when the lock was released - which then allows you to see what work
> you're blocked on, something I found hard to figure out otherwise.
>
> I found that that's mostly quite doable with dynamic probes though.

Yeah, it is.

That's part of why my patchset here doesn't try to do a lot to LWLock
tracepoints - I didn't think it was necessary to add a lot.
The LWLock code is fairly stable, not usually something you have to
worry about in production unless you're debugging badly behaved
extensions, and usually somewhat probe-able with DWARF based dynamic
probes. However, the way the wait-loop and fast-path are in the same
function is a serious pain for dynamic probing; you can't tell the
difference between a fast-path acquire and an acquire after a wait
without using probes on function+offset or probing by source line.
Both those are fine for dev work but useless in tool/library scripts.

I almost wonder if we should test out moving the LWLock wait-loops out
of LWLockAcquire(), LWLockAcquireOrWait() and LWLockWaitForVar()
anyway, so the hot parts of the 

Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Tue, 13 Apr 2021 at 21:40, Craig Ringer
 wrote:

> Findings:
>
> * A probe without arguments or with simple arguments is just a 'nop' 
> instruction
> * Probes that require function calls, pointer chasing, other
> expression evaluation etc may impose a fixed cost to collect up
> arguments even if the probe is disabled.
> * SDT semaphores can avoid that cost but add a branch, so should
> probably be avoided unless preparing probe arguments is likely to be
> expensive.

Back to the topic directly at hand.

Attached a disassembly of what LWLockAcquire looks like now on my
current build of git master @ 5fe83adad9efd5e3929f0465b44e786dc23c7b55
. This is an --enable-debug --enable-cassert --enable-dtrace build
with -Og -ggdb3.

The three tracepoints in it are all of the form:

   movzwl 0x0(%r13),%edi
   call   0x801c49 
   nop

so it's clear we're doing redundant calls to GetLWTrancheName(), as
expected. Not ideal.

Now if I patch it to add the _ENABLED() guards on all the tracepoints,
the probes look like this:

   0x00803176 <+200>:   cmpw   $0x0,0x462da8(%rip)#
0xc65f26 
   0x0080317e <+208>:   jne0x80328b 
    other interleaved code ...
   0x0080328b <+477>:   movzwl 0x0(%r13),%edi
   0x00803290 <+482>:   call   0x801c49 
   0x00803295 <+487>:   nop
   0x00803296 <+488>:   jmp0x803184 

so we avoid the GetLWTrancheName() call at the cost of a test and
possible branch, and a small expansion in total function size. Without
the semaphores, LWLockAcquire is 463 bytes. With them, it's 524 bytes,
which is nothing to sneeze at for code that doesn't do anything
99.999% of the time, but we avoid a bunch of GetLWTrancheName() calls.

If I instead replace T_NAME() with NULL for all tracepoints in
LWLockAcquire, the disassembly shows that the tracepoints now become a
simple

   0x00803176 <+200>:   nop

which is pretty hard to be concerned about.

So at the very least we should be calling GetLWTrancheName() once at
the start of the function if built with dtrace support and remembering
the value, instead of calling it for each tracepoint.

And omitting the tranche name looks like it might be sensible for the
LWLock code. In most places it won't matter, but LWLocks are hot
enough that it possibly might. A simple pg_regress run hits
LWLockAcquire 25 million times, so that's 75 million calls to
GetLWTrancheName().
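
As a sketch of that first option - evaluating the name once per call rather
than once per tracepoint, ignoring the #ifdef needed for non-dtrace builds,
and using the four-argument probes from the patch earlier in this thread:

    /* fragment of LWLockAcquire(), a sketch only, not the final patch */
    const char *tranche_name = T_NAME(lock);    /* single GetLWTrancheName() call */

    /* ... on the wait path ... */
    TRACE_POSTGRESQL_LWLOCK_WAIT_START(tranche_name, mode, lock, lock->tranche);
    /* ... after being woken ... */
    TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(tranche_name, mode, lock, lock->tranche);
    /* ... once the lock is held ... */
    TRACE_POSTGRESQL_LWLOCK_ACQUIRE(tranche_name, mode, lock, lock->tranche);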




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Tue, 13 Apr 2021 at 21:05, Craig Ringer
 wrote:
> On Tue, 13 Apr 2021 at 11:06, Andres Freund  wrote:
> > IIRC those aren't really comparable - the kernel actually does modify
> > the executable code to replace the tracepoints with nops.
>
> Same with userspace static trace markers (USDTs).
>
> A followup mail will contain a testcase and samples to demonstrate this.

Demo follows, with source attached too. gcc 10.2 compiling with -O2,
using dtrace and sys/sdt.h from systemtap 4.4.

Trivial empty function definition:

__attribute__((noinline))
void
no_args(void)
{
SDT_NOOP_NO_ARGS();
}

Disassembly when SDT_NOOP_NO_ARGS is defined as

#define SDT_NOOP_NO_ARGS()

is:

:
retq

When built with a probes.d definition processed by the dtrace script
instead, the disassembly becomes:

:
nop
retq

So ... yup, it's a nop.

Now, if we introduce semaphores that changes.

__attribute__((noinline))
void
no_args(void)
{
if (SDT_NOOP_NO_ARGS_ENABLED())
SDT_NOOP_NO_ARGS();
}

disassembles to:

:
cmpw   $0x0,0x2ec4(%rip)# 
jne
retq
nopl   0x0(%rax,%rax,1)
nop
retq

so the semaphore test is actually quite harmful and wasteful in this
case. That's not surprising since this SDT is a simple marker point.
But what if we supply arguments to it? It turns out that the
disassembly is the same if args are passed, whether locals or globals,
including globals assigned based on program input that can't be
determined at compile time. Still just a nop.

If I pass a function call as an argument expression to a probe, e.g.

__attribute__((noinline)) static int
compute_probe_argument(void)
{
return 100;
}

void
with_computed_arg(void)
{
SDT_NOOP_WITH_COMPUTED_ARG(compute_probe_argument());
}

then the disassembly with SDTs is:

:
callq  
nop
retq

so the function call isn't elided even if it's unused. That's somewhat
expected. The same will be true if the arguments to a probe require
pointer chasing or non-trivial marshalling.

If a semaphore guard is added this becomes:

:
cmpw   $0x0,0x2e2e(%rip)# 
jne
retq
nopl   0x0(%rax,%rax,1)
callq  
nop
retq

so now the call to compute_probe_argument() is skipped unless the
probe is enabled, but the function is longer and requires a test and
jump.

If I dummy up a function that does some pointer chasing, without
semaphores I get

:
mov(%rdi),%rax
mov(%rax),%rax
mov(%rax),%rax
nop
retq

so the arguments are marshalled then ignored.

with semaphores I get:

:
cmpw   $0x0,0x2d90(%rip)# 
jne
retq
nopl   0x0(%rax,%rax,1)
mov(%rdi),%rax
mov(%rax),%rax
mov(%rax),%rax
nop
retq

so again the probe's argument marshalling is inline in the function
body, but at the end, and skipped over.

Findings:

* A probe without arguments or with simple arguments is just a 'nop' instruction
* Probes that require function calls, pointer chasing, other
expression evaluation etc may impose a fixed cost to collect up
arguments even if the probe is disabled.
* SDT semaphores can avoid that cost but add a branch, so should
probably be avoided unless preparing probe arguments is likely to be
expensive.

Hideous but effective demo code attached.
provider sdt_noop {
	probe no_args();
	probe with_args(int arg1, int arg2, int arg3);
	probe with_global_arg(int arg1);
	probe with_volatile_arg(int arg1);
	probe with_many_args(int arg1, int arg2, int arg3, int64_t arg4, int64_t arg5, int64_t arg6, int64_t arg7, int64_t arg8);
	probe with_computed_arg(int arg1);
	probe with_pointer_chasing(int arg1);
};


Makefile
Description: Binary data
#include 

#ifdef USE_SDT
#include "sdt_noop_probes_enabled.h"
#else
#include "sdt_noop_probes_disabled.h"
#endif

void no_args(void);
int with_args(void);
int with_global_arg(void);
int with_volatile_arg(void);
void with_many_args(void);
void with_computed_arg(void);
void with_pointer_chasing(int arg);

__attribute__((noinline))
void
no_args(void)
{
#ifdef USE_SDT_SEMAPHORES
	if (SDT_NOOP_NO_ARGS_ENABLED())
#endif
		SDT_NOOP_NO_ARGS();
}

__attribute__((noinline))
int
with_args(void)
{
	int arg1 = 0;
	int arg2 = 1;
	int arg3 = 2;
#ifdef USE_SDT_SEMAPHORES
	if (SDT_NOOP_WITH_ARGS_ENABLED())
#endif
		SDT_NOOP_WITH_ARGS(arg1, arg2, arg3);

	return arg1 + arg2 + arg3;
}

int some_global;

__attribute__((noinline))
int
with_global_arg(void)
{
#ifdef USE_SDT_SEMAPHORES
	if (SDT_NOOP_WITH_GLOBAL_ARG_ENABLED())
#endif
		SDT_NOOP_WITH_GLOBAL_ARG(some_global);

	return some_global;
}

__attribute__((noinline))
int
with_volatile_arg(void)
{
	volatile int arg1;
	arg1 = 42;
#ifdef USE_SDT_SEMAPHORES
	if (SDT_NOOP_WITH_VOLATILE_ARG_ENABLED())
#endif
		

Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Tue, 13 Apr 2021 at 11:06, Andres Freund  wrote:

> > Each backend can have different tranche IDs (right?)
>
> No, they have to be the same in each. Note how the tranche ID is part of
> struct LWLock. Which is why LWLockNewTrancheId() has to acquire a lock
> etc.

Ah. I misunderstood that at some point.

That makes it potentially more sensible to skip reporting tranche
names. Not great, because it's much less convenient to work with trace
data full of internal ordinals that must be re-mapped in
post-processing. But I'm generally OK with deferring runtime costs to
tooling rather than the db itself so long as doing so is moderately
practical.

In this case, I think we could likely get away with removing the
tranche names from the tracepoints if we instead emit a trace event on
each dynamic tranche registration that reports the tranche id -> name
mapping. It still sucks for tools, since they have to scrape up the
static tranche registrations from somewhere else, but ... it'd be
tolerable.

> > The kernel is packed with extremely useful trace events, and for very
> > good reasons. Some on very hot paths.
>
> IIRC those aren't really comparable - the kernel actually does modify
> the executable code to replace the tracepoints with nops.

Same with userspace static trace markers (USDTs).

A followup mail will contain a testcase and samples to demonstrate this.

> Although I still don't really buy that static tracepoints are the best
> way to measure this kind of thing, given the delay introducing them and
> the cost of having them around. I think I pointed out
> https://postgr.es/m/20200813004233.hdsdfvufqrbdwzgr%40alap3.anarazel.de
> before.

Yeah. Semaphores are something hot enough that I'd hesitate to touch them.

> > LWLock lock-ordering deadlocks
>
> This seems unrelated to tracepoints to me.

If I can observe which locks are acquired in which order by each proc,
I can then detect excessive waits and report the stack of held locks
of both procs and their order of acquisition.

Since LWLock shmem state doesn't, AFAICS, track any information on the
lock holder(s), I don't see a way to do this in-process.

It's not vital, it's just one of the use cases I have in mind. I
suspect that any case where such deadlocks are possible represents a
misuse of LWLocks anyway.

> > and there's no way to know what a given non-built-in tranche ID means
> > for any given backend without accessing backend-specific in-memory
> > state. Including for non-user-accessible backends like bgworkers and
> > auxprocs, where it's not possible to just query the state from a view
> > directly.
>
> The only per-backend part is that some backends might not know the
> tranche name for dynamically registered tranches where the
> LWLockRegisterTranche() hasn't been executed in a backend. Which should
> pretty much never be an aux process or such. And even for bgworkers it
> seems like a pretty rare thing, because those need to be started by
> something...
>
> It might be worth proposing a shared hashtable with tranche names and
> just reserving enough space for ~hundred entries...

Yeah, that'd probably work and be cheap enough not to really matter.
Might even save us a chunk of memory by not turning CoW pages into
private mappings for each backend during registration.

> > And you can always build without `--enable-dtrace` and ... just not care.
>
> Practically speaking, distributions enable it, which then incurs the
> cost for everyone.

Yep. That's part of why I was so surprised to notice the
GetLWTrancheName() function call in LWLock tracepoints. Nearly
anywhere else it wouldn't matter at all, but LWLocks are hot enough
that it just might matter for the no-wait fastpath.




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-12 Thread Craig Ringer
On Tue, 13 Apr 2021 at 02:23, Andres Freund  wrote:

[I've changed the order of the quoted sections a little to prioritize
the key stuff]

>
> On 2021-04-12 14:31:32 +0800, Craig Ringer wrote:
>
> > It's annoying that we have to pay the cost of computing the tranche name
> > though. It never used to matter, but now that T_NAME() expands to
> > GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more
> > on such a hot path. I might see if I can do a little comparison and see how
> > much.  I could add TRACE_POSTGRESQL_<>_ENABLED() guards 
> > since we
> > do in fact build with SDT semaphore support. That adds a branch for each
> > tracepoint, but they're already marshalling arguments and making a function
> > call that does lots more than a single branch, so that seems pretty
> > sensible.
>
> I am against adding any overhead for this feature. I honestly think the
> probes we have right now in postgres do not provide a commensurate
> benefit.

I agree that the probes we have now are nearly useless, if not
entirely useless. The transaction management ones are misplaced and
utterly worthless. The LWLock ones don't carry enough info to be much
use and are incomplete. I doubt anybody uses any of them at all, or
would even notice their absence.

In terms of overhead, what is in place right now is not free. It used
to be very cheap, but since 29c3e2dd5a6 it's not. I'd like to reduce
the current cost and improve functionality at the same time, so it's
actually useful.


> > * There is no easy way to look up the tranche name by ID from outside the
> > backend
>
> But it's near trivial to add that.

Really?

We can expose a pg_catalog.lwlock_tranches view that lets you observe
the current mappings for any given user backend I guess.

But if I'm looking for performance issues caused by excessive LWLock
contention or waits, LWLocks held too long, LWLock lock-ordering
deadlocks, or the like, it's something I want to capture across the
whole postgres instance. Each backend can have different tranche IDs
(right?) and there's no way to know what a given non-built-in tranche
ID means for any given backend without accessing backend-specific
in-memory state. Including for non-user-accessible backends like
bgworkers and auxprocs, where it's not possible to just query the
state from a view directly.

So we'd be looking at some kind of shm based monstrosity. That doesn't
sound appealing. Worse, there's no way to solve races with it - is a
given tranche ID already allocated when you see it? If not, can you
look it up from the backend before the backend exits/dies? For that
matter, how do you do that, since the connection to the backend is
likely under the control of an application, not your monitoring and
diagnostic tooling.

Some trace tools can poke backend memory directly, but it generally
requires debuginfo, is fragile and Pg version specific, slow, and a
real pain to use. If we don't attach the LWLock names to the
tracepoints in some way they're pretty worthless.

Again, I don't plan to add new costs here. I'm actually proposing to
reduce an existing cost.

And you can always build without `--enable-dtrace` and ... just not care.

Anyway - I'll do some `perf` runs shortly to quantify this:

* With/without tracepoints at all
* With/without names in tracepoints
* With/without tracepoint refcounting (_ENABLED() semaphores)

so as to rely less on handwaving.

> > (Those can also be used with systemtap guru mode scripts to do things
> > like turn a particular elog(DEBUG) into a PANIC at runtime for
> > diagnostic purposes).
>
> Yikes.
>

Well, it's not like it can happen by accident. You have to
deliberately write a script that twiddles process memory, using a tool
that requires special privileges.

I recently had to prepare a custom build for a customer that converted
an elog(DEBUG) into an elog(PANIC) in order to capture a core with
much better diagnostic info for a complex, hard to reproduce and
intermittent memory management issue. It would've been rather nice to
be able to do so with a trace marker instead of a custom build.

> > There are a TON of probes I want to add, and I have a tree full of them
> > waiting to submit progressively. Yes, ability to probe all GUCs is in
> > there. So is detail on walsender, reorder buffer, and snapshot builder
> > activity. Heavyweight lock SDTs. A probe that identifies the backend type
> > at startup. SDT probe events emitted for every wait-event. Probes in elog.c
> > to let probes observe error unwinding, capture error messages,
> > etc. [...] A probe that fires whenever debug_query_string
> > changes. Lots. But I can't submit them all at once, especially without
> > some supporting use cases and scripts that other people can use so
> > they can understand why these 

Re: [PATCH] Identify LWLocks in tracepoints

2021-04-12 Thread Craig Ringer
On Mon, 22 Mar 2021 at 16:38, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

>
> First, a problem:  0002 doesn't build on macOS, because uint64 has been
> used in the probe definitions.  That needs to be handled like the other
> nonnative types in that file.
>

Will fix.

> All the probe changes and additions should be accompanied by
> documentation changes.
>

Agreed, will fix.

> The probes used to have an argument to identify the lock, which was
> removed by 3761fe3c20bb040b15f0e8da58d824631da00caa.


Huh. That's exactly the functionality I was looking for. Damn. I understand
why Robert removed it, but its removal makes it much harder to identify an
LWLock since it might fall in a DSM segment that could be mapped at
different base addresses in different backends.

Robert's patch didn't replace the offset within tranche with anything else
to identify the lock. A LWLock* is imperfect due to ASLR and DSM but it's
better than nothing. In theory we could even remap them in trace tools if
we had tracepoints on DSM attach and detach that showed their size and base
address too.

CC'ing Andres, as he expressed interest in being able to globally identify
LWLocks too.


> The 0001 patch is
> essentially trying to reinstate that, which seems sensible.  Perhaps we
> should also use the argument order that used to be there.  It used to be
>
> probe lwlock__acquire(const char *, int, LWLockMode);
>
> and now it would be
>
> probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);
>
> Also, do we need both the tranche name and the tranche id?


Reasons to have the name:

* There is no easy way to look up the tranche name by ID from outside the
backend
* A tranche ID by itself is pretty much meaningless especially for dynamic
tranches
* Any existing scripts will rely on the tranche name

So the tranche name is really required to generate useful output for any
dynamic tranches, or simple and readable output from things like perf.

Reasons to have the tranche ID:

* The tranche name is not guaranteed to have the same address for a given
value across backends in the presence of ASLR, even for built-in tranches.
So tools need to read tranche names as user-space strings, which is much
more expensive than consuming an int argument from the trace args. Storing
and reporting maps of events by tranche name (string) in tools is also more
expensive than having a tranche id.
* When the trace tool or script wants to filter for only one particular
tranche,particularly when it's a built-in tranche where the tranche ID is
known, having the ID is much more useful and efficient.
* If we can avoid computing the tranche name, emitting just the tranche ID
would be much faster.

It's annoying that we have to pay the cost of computing the tranche name
though. It never used to matter, but now that T_NAME() expands to
GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more
on such a hot path. I might see if I can do a little comparison and see how
much.

I could add TRACE_POSTGRESQL_<>_ENABLED() guards since we
do in fact build with SDT semaphore support. That adds a branch for each
tracepoint, but they're already marshalling arguments and making a function
call that does lots more than a single branch, so that seems pretty
sensible. The main downside of using _ENABLED() USDT semaphore guards is
that not all tools are guaranteed to understand or support them. So an
older perf, for example, might simply fail to fire events on guarded
probes. That seems OK to me, the onus should be on the probe tool to pay
any costs, not on PostgreSQL. Despite that I don't want to mark the
_ENABLED() guards unlikely(), since that'd increase the observer effect
where probing LWLocks changes their timing and behaviour. Branch prediction
should do a very good job in such cases without being forced.
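
For clarity, the guarded form I mean is just:

    /* semaphore-guarded tracepoint; argument order per the patch in this thread */
    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode, lock, lock->tranche);

so T_NAME() is only evaluated when something is actually attached to the probe.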

I wonder a little about the possible cache costs of the _ENABLED() macros
though. Their data is in a separate ELF segment and separate .o, with no
locality to the traced code. It might be worth checking that before
proceeding; I guess it's even possible that the GetLWTrancheName() calls
could be cheaper. Will see if I can run some checks and report back.

BTW, if you want some of the details on how userspace SDTs work,
https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html
is interesting and useful. It helps explain uprobes, ftrace, bcc, etc.

> Or maybe we
> don't need the name, or can record it differently, which might also
> address your other concern that it's too expensive to compute.  In any
> case, I think an argument order like
>
> probe lwlock__acquire(const char *, int, LWLock*, LWLockMode);
>
> would make more sense.
>

OK.

> In 0004, you add a probe to record the application_name setting?  Would
> there be any value in making that a generic probe that can record any
> GUC change?
>

Yes, there would, but I didn't want to go and do that in the same patch,
and a named probe on application_name is useful 

Re: [PATCH] Identify LWLocks in tracepoints

2021-04-11 Thread Craig Ringer
On Mon, 22 Mar 2021 at 17:00, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 20.03.21 01:29, Craig Ringer wrote:
> > There is already support for that.  See the documentation at the end
> of
> > this page:
> >
> https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS
> > <
> https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS
> >
> >
> >
> > Pretty sure it won't work right now.
> >
> > To use systemtap semaphores (the _ENABLED macros) you need to run dtrace
> > -g to generate a probes.o then link that into postgres.
> >
> > I don't think we do that. I'll double check soon.
>
> We do that.  (It's -G.)
>

Huh. I could've sworn we didn't. My mistake, it's there in
src/backend/Makefile.

In that case I'll amend the patch to use semaphore guards.

(On a side note, systemtap's semaphore support is actually a massive pain.
The way it's implemented in sys/sdt.h means that a single compilation
unit may not mix probes.d-style markers produced by the dtrace script
with regular DTRACE_PROBE(providername, probename) preprocessor macros.
If it attempts to do so, the DTRACE_PROBE macros will emit inline asm that
tries to reference probename_semaphore symbols that will not exist,
resulting in linker errors or runtime link errors. But that's really a
systemtap problem. Core PostgreSQL doesn't use any explicit
DTRACE_PROBE(...), STAP_PROBE(...) etc.)


Re: [PATCH] More docs on what to do and not do in extension code

2021-03-26 Thread Craig Ringer
On Fri, 26 Mar 2021 at 06:15, Bruce Momjian  wrote:

> On Thu, Mar 25, 2021 at 08:49:44AM -0400, David Steele wrote:
> > On 1/22/21 1:36 AM, Craig Ringer wrote:
> > >
> > > Would you mind attaching a revised version of the patch with your
> edits?
> > > Otherwise I'll go and merge them in once you've had your say on my
> > > comments inline below.
> >
> > Bharath, do the revisions in [1] look OK to you?
> >
> > > Bruce, Robert, can I have an opinion from you on how best to locate and
> > > structure these docs, or whether you think they're suitable for the
> main
> > > docs at all? See patch upthread.
> >
> > Bruce, Robert, any thoughts here?
>
> I know I sent an email earlier this month saying we shouldn't
> over-document the backend hooks because the code could drift away from
> the README content:
>
>
> https://www.postgresql.org/message-id/20210309172049.GD26575%40momjian.us
>
> Agreed.  If you document the hooks too much, it allows them to drift
> away from matching the code, which makes the hook documentation actually
> worse than having no hook documentation at all.
>
> However, for this doc patch, the content seems to be more strategic, so
> less likely to change, and hard to figure out from the code directly.
> Therefore, I think this would be a useful addition to the docs.
>

Thanks for the kind words. It's good to hear that it may be useful. Let me
know if anything further is needed.


Re: [PATCH] Identify LWLocks in tracepoints

2021-03-19 Thread Craig Ringer
On Sat, 20 Mar 2021, 04:21 Peter Eisentraut, <
peter.eisentr...@enterprisedb.com> wrote:

>
> On 18.03.21 07:34, Craig Ringer wrote:
> > The main reason I didn't want to add more tracepoints than were strictly
> > necessary is that Pg doesn't enable the systemtap semaphores feature, so
> > right now we do a T_NAME(lock) evaluation each time we pass a tracepoint
> > if --enable-dtrace is compiled in, whether or not anything is tracing.
> > This was fine on pg11 where it was just:
> >
> > #define T_NAME(lock) \
> >  (LWLockTrancheArray[(lock)->tranche])
> >
> > but since pg13 it instead expands to
> >
> >  GetLWTrancheName((lock)->tranche)
> >
> > where GetLWTrancheName isn't especially trivial. We'll run that function
> > every single time we pass any of these tracepoints and then discard the
> > result, which is ... not ideal. That applies so long as Pg is compiled
> > with --enable-dtrace. I've been meaning to look at enabling the
> > systemtap semaphores feature in our build so these can be wrapped in
> > unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted
> > to wrap this patch set up first as there are some complexities around
> > enabling the semaphores feature.
>
> There is already support for that.  See the documentation at the end of
> this page:
>
> https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS


Pretty sure it won't work right now.

To use systemtap semaphores (the _ENABLED macros) you need to run dtrace -g
to generate a probes.o then link that into postgres.

I don't think we do that. I'll double check soon.


Re: [PATCH] Identify LWLocks in tracepoints

2021-03-18 Thread Craig Ringer
On Thu, 11 Mar 2021 at 15:57, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 10.03.21 06:38, Craig Ringer wrote:
> > On Wed, 3 Mar 2021 at 20:50, David Steele  wrote:
> >
> > On 1/22/21 6:02 AM, Peter Eisentraut wrote:
> >
> > This patch set no longer applies:
> > http://cfbot.cputube.org/patch_32_2927.log.
> >
> > Can we get a rebase? Also marked Waiting on Author.
> >
> >
> > Rebased as requested.
>
> In patch 0001, why was the TRACE_POSTGRESQL_LWLOCK_RELEASE() call moved?
>   Is there some correctness issue?  If so, we should explain that (at
> least in the commit message, or as a separate patch).
>

If you want I can split it out, or drop that change. I thought it was
sufficiently inconsequential, but you're right to check.

The current tracepoint TRACE_POSTGRESQL_LWLOCK_RELEASE really means
"releaseD". It's appropriate to emit this as soon as the lock could be
acquired by anything else. By deferring it until we'd processed the
waitlist and woken other backends the window during which the lock was
reported as "held" was longer than it truly was, and it was easy to see one
backend acquire the lock while another still appeared to hold it.

It'd possibly make more sense to have a separate
TRACE_POSTGRESQL_LWLOCK_RELEASING just before the `pg_atomic_sub_fetch_u32`
call. But I didn't want to spam the tracepoints too hard, and there's
always going to be some degree of overlap because tracing tools cannot
intercept and act during the atomic swap, so they'll always see a slightly
premature or slightly delayed release. This window should be as short as
possible though, hence moving the tracepoint.
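
In other words, something like this ordering in LWLockRelease() (a sketch
only, with a hypothetical LWLOCK_RELEASING probe; not the committed code):

    uint32      oldstate;

    TRACE_POSTGRESQL_LWLOCK_RELEASING(T_NAME(lock));    /* hypothetical */
    oldstate = pg_atomic_sub_fetch_u32(&lock->state,
                    mode == LW_EXCLUSIVE ? LW_VAL_EXCLUSIVE : LW_VAL_SHARED);
    /* the lock is now observably free to other backends */
    TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
    /* ... only then walk the wait list and wake any waiters ... */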

Side note:

The main reason I didn't want to add more tracepoints than were strictly
necessary is that Pg doesn't enable the systemtap semaphores feature, so
right now we do a T_NAME(lock) evaluation each time we pass a tracepoint if
--enable-dtrace is compiled in, whether or not anything is tracing. This
was fine on pg11 where it was just:

#define T_NAME(lock) \
(LWLockTrancheArray[(lock)->tranche])

but since pg13 it instead expands to

GetLWTrancheName((lock)->tranche)

where GetLWTrancheName isn't especially trivial. We'll run that function
every single time we pass any of these tracepoints and then discard the
result, which is ... not ideal. That applies so long as Pg is compiled with
--enable-dtrace. I've been meaning to look at enabling the systemtap
semaphores feature in our build so these can be wrapped in
unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted to
wrap this patch set up first as there are some complexities around enabling
the semaphores feature.


Re: Detecting File Damage & Inconsistencies

2021-03-18 Thread Craig Ringer
On Mon, 15 Mar 2021 at 21:01, David Steele  wrote:

> On 11/18/20 5:23 AM, Simon Riggs wrote:
> > On Wed, 18 Nov 2020 at 06:42, Craig Ringer
> >  wrote:
> >>
> >> On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs 
> wrote:
> >>>
> >>>
> >>> What I'm proposing is an option to add 16 bytes onto each COMMIT
> >>> record
> >>
> >>
> >> Would it make sense to write this at the time we write a topxid
> assignment to WAL instead?
> >>
> >> Otherwise it won't be accessible to streaming-mode logical decoding.
> >
> > Do you mean extend the xl_xact_assignment record? My understanding is
> > that is not sent in all cases, so not sure what you mean by "instead".
>
> Craig, can you clarify?
>

Right. Or write a separate WAL record when the feature is enabled. But it's
probably sufficient to write it as an optional chunk on xl_xact_assignment
records. We often defer writing them so we can optimise away xacts that
never actually wrote anything, but IIRC we still write one before we write
any WAL that references the xid. That'd be fine, since we don't need the
info any sooner than that during decoding. I'd have to double check that we
write it in all cases and won't get to that too soon, but I'm pretty sure
we do...


Re: [PATCH] Identify LWLocks in tracepoints

2021-03-09 Thread Craig Ringer
On Wed, 3 Mar 2021 at 20:50, David Steele  wrote:

> On 1/22/21 6:02 AM, Peter Eisentraut wrote:
>
> This patch set no longer applies:
> http://cfbot.cputube.org/patch_32_2927.log.
>
> Can we get a rebase? Also marked Waiting on Author.
>

Rebased as requested.

I'm still interested in whether Andres will be able to do anything about
identifying LWLocks in a cross-backend manner. But this work doesn't really
depend on that; it'd benefit from it, but would be easily adapted to it
later if needed.
From 36c7ddcbca2dbbcb2967f01cb92aa1f61620c838 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 19 Nov 2020 17:38:45 +0800
Subject: [PATCH 1/4] Pass the target LWLock* and tranche ID to LWLock
 tracepoints

Previously the TRACE_POSTGRESQL_LWLOCK_ tracepoints only received a
pointer to the LWLock tranche name. This made it impossible to identify
individual locks.

Passing the lock pointer itself isn't perfect. If the lock is allocated inside
a DSM segment then it might be mapped at a different address in different
backends. It's safe to compare lock pointers between backends (assuming
!EXEC_BACKEND) if they're in the individual lock tranches or an
extension-requested named tranche, but not necessarily for tranches in
BuiltinTrancheIds or tranches >= LWTRANCHE_FIRST_USER_DEFINED that were
directly assigned with LWLockNewTrancheId(). Still, it's better than nothing;
the pointer is stable within a backend, and usually between backends.
---
 src/backend/storage/lmgr/lwlock.c | 35 +++
 src/backend/utils/probes.d| 18 +---
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8cb6a6f042..5c8744d316 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1321,7 +1321,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 #endif
 
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+lock->tranche);
 
 		for (;;)
 		{
@@ -1343,7 +1344,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		}
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+lock->tranche);
 		LWLockReportWaitEnd();
 
 		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");
@@ -1352,7 +1354,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		result = false;
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode, lock, lock->tranche);
 
 	/* Add lock to list of locks held by this backend */
 	held_lwlocks[num_held_lwlocks].lock = lock;
@@ -1403,14 +1405,16 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 		RESUME_INTERRUPTS();
 
 		LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode, lock,
+lock->tranche);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode, lock,
+lock->tranche);
 	}
 	return !mustwait;
 }
@@ -1482,7 +1486,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 #endif
 
 			LWLockReportWaitStart(lock);
-			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+	lock->tranche);
 
 			for (;;)
 			{
@@ -1500,7 +1505,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 Assert(nwaiters < MAX_BACKENDS);
 			}
 #endif
-			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+	lock->tranche);
 			LWLockReportWaitEnd();
 
 			LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened");
@@ -1530,7 +1536,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
 		LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode, lock,
+lock->tranche);
 	}
 	else
 	{
@@ -1538,7 +1545,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode, lock,
+lock->tranche);
 	}
 
 	return !mustwait;
@@ -1

Re: libpq compression

2021-02-21 Thread Craig Ringer
On Thu, 11 Feb 2021, 21:09 Daniil Zakhlystov, 
wrote:

>
> 3. Chunked compression allows to compress only well compressible messages
> and save the CPU cycles by not compressing the others
> 4. Chunked compression introduces some traffic overhead compared to the
> permanent (1.2810G vs 1.2761G TX data on pg_restore of IMDB database dump,
> according to results in my previous message)
> 5. From the protocol point of view, chunked compression seems a little bit
> more flexible:
>  - we can inject some uncompressed messages at any time without the need
> to decompress/compress the compressed data
>  - we can potentially switch the compression algorithm at any time (but I
> think that this might be over-engineering)
>

Chunked compression also potentially makes it easier to handle non blocking
sockets, because you aren't worrying about yet another layer of buffering
within the compression stream. This is a real pain with SSL, for example.

It simplifies protocol analysis.

It permits compression to be decided on the fly, heuristically, based on
message size and potential compressibility.
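
The heuristic can be very simple. For example, something along these lines
(purely illustrative; the function name, threshold and message-type choices
are made up here, not taken from the patch):

/* Sketch: decide per message whether compression is worth attempting. */
static bool
should_compress_message(char msgtype, size_t payload_len)
{
    /* Tiny messages: framing overhead outweighs any plausible savings. */
    if (payload_len < 1024)
        return false;

    /* DataRow ('D') and CopyData ('d') payloads tend to compress well. */
    if (msgtype == 'D' || msgtype == 'd')
        return true;

    /* Leave everything else uncompressed by default. */
    return false;
}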

It could relatively easily be extended to compress a group of pending small
messages, e.g. by PQflush. That'd help mitigate the downsides with small
messages.

So while stream compression offers better compression ratios, I'm inclined
to suspect we'll want message level compression.

>


Re: PATCH: Batch/pipelining support for libpq

2021-02-19 Thread Craig Ringer
On Wed, 17 Feb 2021, 07:13 Alvaro Herrera,  wrote:

> Here's a new version, where I've renamed everything to "pipeline".


Wow. Thanks.

> I think the docs could use some additional tweaks now in order to make a
> coherent story on pipeline mode, how it can be used in a batched
> fashion, etc.
>

I'll do that soon and send an update.

>
>


libpq PQresultErrorMessage vs PQerrorMessage API issue

2021-02-15 Thread Craig Ringer
Hi all

This morning for the umpteenth time I saw:

 some error message: [blank here]

output from a libpq program.

That's because passing a NULL PGresult to PQresultErrorMessage() returns
"". But a NULL PGresult is a normal result from PQexec when it fails to
submit a query due to an invalid connection, when PQgetResult can't get a
result from an invalid connection, etc.

E.g. this pattern:

res = PQexec(conn, "something");

if (PQresultStatus(res) != PGRES_TUPLES_OK)
{
   report_an_error("some error message: %s",
 PQresultErrorMessage(res));
}

... where "res" is NULL because the connection was invalid, so
PQresultErrorMessage(res) emits the empty string.

As a result, using PQerrorMessage(conn) is actually better in most cases,
despite the static error buffer issues. It'll do the right thing when the
connection itself is bad. Alternately you land up with the pattern

 res == NULL ? PQerrorMessage(conn) : PQresultErrorMessage(res)
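
For what it's worth, that pattern tends to end up wrapped in a little helper
like the following (an illustrative sketch only, not an existing libpq
function; it needs libpq-fe.h):

/* Prefer the result's message; fall back to the connection's. */
static const char *
friendly_error_message(PGconn *conn, PGresult *res)
{
    const char *msg = res ? PQresultErrorMessage(res) : PQerrorMessage(conn);

    if (msg == NULL || msg[0] == '\0')
        msg = PQerrorMessage(conn);
    return msg;
}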

I'm not quite sure what to do about this. Ideally PQresultErrorMessage()
would take the PGconn* too, but it doesn't, and it's not too friendly to
add an extra argument. Plus arguably they mean different things.

Maybe it's as simple as changing the docs to say that you should prefer
PQerrorMessage() if you aren't using multiple PGresult s at a time, and
mentioning the need to copy the error string.

But I'd kind of like to instead return a new non-null PGresult
PGRES_RESULT_ERROR whenever we'd currently return a NULL PGresult due to a
failure. Embed the error message into the PGresult, so
PQresultErrorMessage() can fetch it. Because a PGresult needs to be owned
by a PGconn and a NULL PGconn can't own anything, we'd instead return a
pointer to a static const global PGresult with value PGRES_NO_CONNECTION if
any function is passed a NULL PGconn*. That way it doesn't matter if it
gets PQclear()ed or not. And PQclear() could test for (res ==
PGresult_no_connection) and not try to free it if found.

The main issue I see there is that existing code may expect a NULL PGresult
and may test for (res == NULL) as an indicator of a query-sending failure
from PQexec etc. So I suspect we'd need a libpq-global option to enable
this behaviour for apps that are aware of it - we wouldn't want to add new
function signature variants after all.

Similar changes would make sense for returning NULL when there are no
result sets remaining after a PQsendQuery, and for returning NULL after
row-by-row fetch mode runs out of rows.


Re: PATCH: Batch/pipelining support for libpq

2021-02-15 Thread Craig Ringer
On Tue, 16 Feb 2021 at 09:19, Craig Ringer 
wrote:
>
> So long as there is a way to "send A", "send B", "send C", "read results
> from A", "send D", and there's a way for the application to associate some
> kind of state (an application specific id or index, a pointer to an
> application query-queue struct, whatever)  so it can match queries to
> results ... then I'm happy.
>

FWIW I'm also thinking of revising the docs to mostly use the term
"pipeline" instead of "batch". Use "pipelining and batching" in the chapter
topic, and mention "batch" in the index, and add a para that explains how
to run batches on top of pipelining, but otherwise use the term "pipeline"
not "batch".

That's because pipelining isn't actually batching, and using it as a naïve
batch interface will get you in trouble. If you PQsendQuery(...) a long
list of queries without consuming results, you'll block unless you're in
libpq's nonblocking-send mode. You'll then be deadlocked because the server
can't send results until you consume some (tx buffer full) and can't
consume queries until it can send some results, but you can't consume
results because you're blocked on a send that'll never end.
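
To spell out the shape of the loop an application actually needs: keep the
socket in nonblocking mode and alternate between queueing work and draining
results, roughly like this. It's only a sketch - the helper functions
(have_queries_to_send() and friends) are hypothetical, it assumes a libpq
that allows more than one query to be queued (i.e. this patch), and only the
standard asynchronous calls are shown:

PQsetnonblocking(conn, 1);

while (have_queries_to_send() || have_results_to_read())
{
    /* Queue more queries while libpq will accept them. */
    while (have_queries_to_send() &&
           PQsendQueryParams(conn, next_query(),
                             0, NULL, NULL, NULL, NULL, 0))
        mark_query_sent();

    (void) PQflush(conn);           /* push buffered output */

    wait_for_socket_ready(conn);    /* select()/poll() on PQsocket(conn) */

    if (!PQconsumeInput(conn))
        report_connection_error(conn);

    /* Drain whatever results are already complete. */
    while (!PQisBusy(conn))
    {
        PGresult *res = PQgetResult(conn);

        if (res == NULL)
            break;                  /* end of one query's results */
        handle_result(res);         /* match to the oldest unanswered query */
        PQclear(res);
    }
}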

An actual batch interface where you can bind and send a long list of
queries might be worthwhile, but should be addressed separately, and it'd
be confusing if this pipelining interface claimed the term "batch". I'm not
convinced enough application developers actually code directly against
libpq for it to be worth creating a pretty batch interface where you can
submit (say) an array of struct PQbatchEntry { const char * query, int
nparams, ... } to a PQsendBatch(...) and let libpq handle the socket I/O
for you. But if someone wants to add one later it'll be easier if we don't
use the term "batch" for the pipelining feature.

I'll submit a reworded patch in a bit.


Re: PATCH: Batch/pipelining support for libpq

2021-02-15 Thread Craig Ringer
On Thu, 11 Feb 2021 at 07:51, Alvaro Herrera 
wrote:

> On 2021-Jan-21, Alvaro Herrera wrote:
>
> > As you can see in an XXX comment in the libpq test program, the current
> > implementation has the behavior that PQgetResult() returns NULL after a
> > batch is finished and has reported PGRES_BATCH_END.  I don't know if
> > there's a hard reason to do that, but I'd like to supress it because it
> > seems weird and out of place.
>
> Hello Craig, a question for you since this API is your devising.  I've
> been looking at changing the way this works to prevent those NULL
> returns from PQgetResult.  That is, instead of having what seems like a
> "NULL separator" of query results, you'd just get the PGRES_BATCH_END to
> signify a batch end (not a NULL result after the BATCH_END); and the
> normal PGRES_COMMAND_OK or PGRES_TUPLES_OK etc when the result of a
> command has been sent.  It's not working yet so I'm not sending an
> updated patch, but I wanted to know if you had a rationale for including
> this NULL return "separator" or was it just a convenience because of how
> the code grew together.
>

The existing API for libpq actually specifies[1] that you should call
PQgetResult() until it returns NULL:

> After successfully calling PQsendQuery, call PQgetResult one or more
times to obtain the results. PQsendQuery cannot be called again (on the
same connection) until PQgetResult has returned a null pointer, indicating
that the command is done.

Similarly, in single-row mode, the existing API specifies that you should
call PQgetResult() until it returns NULL.
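
That is, the canonical consumption loop with today's libpq already looks like
this (handle_result() and report_connection_error() are placeholders for
application code):

if (!PQsendQuery(conn, "SELECT ..."))
    report_connection_error(conn);

for (;;)
{
    PGresult *res = PQgetResult(conn);

    if (res == NULL)
        break;              /* command done; safe to send the next query */
    handle_result(res);     /* PGRES_TUPLES_OK, PGRES_SINGLE_TUPLE, errors... */
    PQclear(res);
}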

Also, IIRC the protocol already permits multiple result sets to be
returned, and the caller cannot safely assume that a single PQsendQuery()
will produce only a single result set. (I really should write a test
extension that exercises that and how libpq reacts to it.)

That's why I went for the NULL return. I think. It's been a while now so
it's a bit fuzzy.

I would definitely like an API that does not rely on testing for a NULL
return. Especially since NULL return can be ambiguous in the context of
row-at-a-time mode. New explicit enumerations for PGresult would make a lot
more sense.

So +1 from me for the general idea. I need to look at the patch as it has
evolved soon too.

Remember that the original patch I submitted for this was a 1-day weekend
hack and proof of concept to show that libpq could be modified to support
query pipelining (and thus batching too), so I could illustrate the
performance benefits that could be attained by doing so. I'd been aware of
the benefits and the protocol's ability to support it for some time thanks
to working on PgJDBC, but couldn't get anyone interested without some C
code to demonstrate it, so I wrote some. I am not going to argue that the
API I added for it is ideal in any way, and happy to see improvements.

The only change I would very strongly object to would be anything that
turned this into a *batch* mode without query-pipelining support. If you
have to queue all the queries up in advance then send them as a batch and
wait for all the results, you miss out on a lot of the potential
round-trip-time optimisations and you add initial latency. So long as there
is a way to "send A", "send B", "send C", "read results from A", "send D",
and there's a way for the application to associate some kind of state (an
application specific id or index, a pointer to an application query-queue
struct, whatever)  so it can match queries to results ... then I'm happy.
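
For concreteness, the application-side state I have in mind is nothing more
elaborate than a small FIFO of entries like this (hypothetical sketch; needs
<stdint.h>):

/* One entry per query sent and not yet answered; matched to results FIFO. */
typedef struct AppPendingQuery
{
    uint64_t    query_id;       /* application's own identifier */
    void       *state;          /* whatever the caller wants handed back */
    struct AppPendingQuery *next;
} AppPendingQuery;

Each send appends an entry and each completed result pops the head, since the
server answers strictly in the order the queries were sent.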

[1]
https://www.postgresql.org/docs/current/libpq-async.html#LIBPQ-PQSENDQUERY


Re: How to expose session vs txn lock info in pg_locks view?

2021-02-01 Thread Craig Ringer
On Sun, 24 Jan 2021 at 09:12, Andres Freund  wrote:

> Hi,
>
> On 2021-01-19 14:16:07 +0800, Craig Ringer wrote:
> > AFAICS it'd be necessary to expand PROCLOCK to expose this in shmem.
> > Probably by adding a small bitfield where bit 0 is set if there's a txn
> > level lock and bit 1 is set if there's a session level lock. But I'm not
> > convinced that expanding PROCLOCK is justifiable for this.
> sizeof(PROCLOCK)
> > is 64 on a typical x64 machine. Adding anything to it increases it to 72
> > bytes.
>
> Indeed - I really don't want to increase the size, it's already a
> problem.
>
>
> > It's frustrating to be unable to tell the difference between
> session-level
> > and txn-level locks in diagnostic output.
>
> It'd be useful, I agree.
>
>
> > And the deadlock detector has no way to tell the difference when
> > selecting a victim for a deadlock abort - it'd probably make sense to
> > prefer to send a deadlock abort for txn-only lockers.
>
> I'm doubtful this is worth going for.
>
>
> > But I'm not sure I see a sensible way to add the info - PROCLOCK is
> > already free of any padding, and I wouldn't want to use hacks like
> > pointer-tagging.
>
> I think there's an easy way to squeeze out space: make groupLeader be an
> integer index into allProcs instead. That requires only 4 bytes...
>
> Alternatively, I think it'd be reasonably easy to add the scope as a bit
> in LOCKMASK - there's plenty space.
>

I was wondering about that, but concerned that there would be impacts I did
not understand.

I'm happy to pursue that angle.
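
For the record, what I have in mind for the LOCKMASK approach is roughly the
following. This is only an untested sketch, not a patch: the lock modes only
occupy the low bits of the mask (LOCKBIT_ON(1) .. LOCKBIT_ON(MAX_LOCKMODES-1)),
so a high bit is free, and a real patch would still need to decide whether one
flag per PROCLOCK is enough or per-mode tracking is wanted.

/* Sketch only: steal a high holdMask bit to record "held at session level". */
#define LOCKBIT_SESSION     (1 << 30)

/* ... wherever the lock is granted and sessionLock is in scope ... */
if (sessionLock)
    proclock->holdMask |= LOCKBIT_SESSION;

/* ... and in pg_lock_status() or deadlock victim selection ... */
bool    session_level = (proclock->holdMask & LOCKBIT_SESSION) != 0;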


Re: Preventing hangups in bgworker start/stop during DB shutdown

2021-01-22 Thread Craig Ringer
On Fri, 25 Dec 2020 at 06:07, Tom Lane  wrote:

> I wrote:
> > Bharath Rupireddy  writes:
> >> 4) IIUC, in the patch we mark slot->terminate = true only for
> >> BGW_NEVER_RESTART kind bg workers, what happens if a bg worker has
> >> bgw_restart_time seconds and don't we hit the hanging issue(that we
> >> are trying to solve here) for those bg workers?
>
> > The hang problem is not with the worker itself, it's with anything
> > that might be waiting around for the worker to finish.  It doesn't
> > seem to me to make a whole lot of sense to wait for a restartable
> > worker; what would that mean?
>
> Upon further looking around, I noted that autoprewarm's
> autoprewarm_start_worker() function does that, so we can't really
> dismiss it.
>
> However, what we can do instead is to change the condition to be
> "cancel pending bgworker requests if there is a waiting process".
> Almost all of the time, that means it's a dynamic bgworker with
> BGW_NEVER_RESTART, so there's no difference.  In the exceptional
> cases like autoprewarm_start_worker, this would result in removing
> a bgworker registration record for a restartable worker ... but
> since we're shutting down, that record would have no effect before
> the postmaster exits, anyway.  I think we can live with that, at
> least till such time as somebody redesigns this in a cleaner way.
>
> I pushed a fix along those lines.
>
>
Thanks for the change.

Cleanups like this in the BGW API definitely make life easier.


Re: [PATCH] More docs on what to do and not do in extension code

2021-01-21 Thread Craig Ringer
Hi

Thanks so much for reading over this!

Would you mind attaching a revised version of the patch with your edits?
Otherwise I'll go and merge them in once you've had your say on my comments
inline below.

Bruce, Robert, can I have an opinion from you on how best to locate and
structure these docs, or whether you think they're suitable for the main
docs at all? See patch upthread.

On Tue, 19 Jan 2021 at 22:33, Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

>
> Here are some comments:
>
> [1]
>background worker's main function, and must be unblocked by it; this is
> to
> allow the process to customize its signal handlers, if necessary.
> -   Signals can be unblocked in the new process by calling
> -   BackgroundWorkerUnblockSignals and blocked by
> calling
> -   BackgroundWorkerBlockSignals.
> +   It is important that all background workers set up and unblock signal
> +   handling before they enter their main loops. Signal handling in
> background
> +   workers is discussed separately in .
>
>

I wasn't sure either way on that, see your [3] below.

[2]
> +   interrupt-aware APIs for the purpose. Do not
> usleep(),
> +   system(), make blocking system calls, etc.
> +  
>
> Is it "Do not use usleep(),
> system() or make blocking system calls etc." ?
>

Right. Good catch.

[3] IMO, we can remove following from "bgworker-signals" if we retain
> it where currently it is, as discussed in [1].
> +Signals can be unblocked in the new process by calling
> +BackgroundWorkerUnblockSignals and blocked by
> calling
> +BackgroundWorkerBlockSignals.
>

If so, need to mention that they start blocked and link to the main text
where that's mentioned.

That's part of why I moved this chunk into the signal section.

[4] Can we say
> +The default signal handlers set up for background workers do
> +default background worker signal handlers, it should call
>
> instead of
> +The default signal handlers installed for background workers
> do
> +default background worker signal handling it should call
>

Huh?

I don't understand this proposal.

s/install/set up/g?

[5] IMO, we can have something like below
> +request, etc. Set up these handlers before unblocking signals as
> +shown below:
>
> instead of
> +request, etc. To install these handlers, before unblocking interrupts
> +run:
>

Ditto

[6] I think logs and errors either elog() or ereport can be used, so how
> about
> +Use elog() or ereport()
> for
> +logging output or raising errors instead of any direct stdio
> calls.
>
> instead of
> +Use elog() and
> ereport() for
> +logging output and raising errors instead of any direct stdio
> calls.
>

OK.

[7] Can we use child processes instead of subprocess ? If okay in
> other places in the patch as well.
>

Fine with me. The point is really that they're non-postgres processes being
spawned by a backend, and that doing so must be done carefully.

[8] Why should file descriptor manager API be used to execute
> subprocesses/child processes?
> +To execute subprocesses and open files, use the routines provided
> by
> +the file descriptor manager like
> BasicOpenFile
> +and OpenPipeStream instead of a direct
>

Yeah, that wording is confusing, agreed. The point was that you shouldn't
use system() or popen(), you should OpenPipeStream(). And similarly, you
should avoid fopen() etc and instead use the Pg wrapper BasicOpenFile().

"
PostgreSQL backends are required to limit the number of file descriptors they
open. To open files, use postgres file descriptor manager routines like
BasicOpenFile() instead of directly using open() or fopen(). To open pipes to
or from external processes, use OpenPipeStream() instead of popen().
"

?
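
For instance, the practical difference is just (illustrative snippet; the
command string is an arbitrary example):

/* Instead of popen()/pclose(), which bypass Postgres fd accounting: */
FILE   *pipe = OpenPipeStream("gzip -c > /tmp/dump.gz", "w");

if (pipe == NULL)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not run command: %m")));
/* ... write to the pipe ... */
ClosePipeStream(pipe);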


> [9] "should always be"? even if it's a blocking extesion, does it
> work? If our intention is to recommend the developers, maybe we should
> avoid using the term "should" in the patch in other places as well.
>

The trouble is that it's a bit ... fuzzy.

You can get away with blocking for short periods without responding to
signals, but it's a "how long is a piece of string" issue.

"should be" is fine.

A hard "must" or "must not" would be misleading. But this isn't the place
to go into all the details of how time sensitive (or not) interrupt
handling of different kinds is in different places for different worker
types.


> [11] I think it is
> +block if this happens. So cleanup of resources is not
> entirely managed by PostgreSQL, it
> +   must be handled using appropriate callbacks provided by PostgreSQL
>
> instead of
> +block if this happens. So all cleanup of resources not already
> +managed by the PostgreSQL runtime must be handled using
> appropriate
>

I don't agree with the proposed new wording here.

Delete the "So all" from my original, or

... Cleanup of any resources that are not managed
> by the PostgreSQL runtime must be handled using 

Re: Printing backtrace of postgres processes

2021-01-20 Thread Craig Ringer
On Thu, 21 Jan 2021 at 09:56, Tom Lane  wrote:

> Craig Ringer  writes:
> > On Wed, 20 Jan 2021 at 05:23, Tom Lane  wrote:
> >> BTW, it also looks like the patch is doing nothing to prevent the
> >> backtrace from being sent to the connected client.
>
> > I don't see a good reason to send a bt to a client. Even though these
> > backtraces won't be analysing debuginfo and populating args, locals, etc,
> > it should still just go to the server log.
>
> Yeah.  That's easier than I was thinking, we just need to
> s/LOG/LOG_SERVER_ONLY/.
>
> >> Maybe, given all of these things, we should forget using elog at
> >> all and just emit the trace with fprintf(stderr).
>
> > That causes quite a lot of pain with MemoryContextStats() already
>
> True.  Given the changes discussed in the last couple messages, I don't
> see any really killer reasons why we can't ship the trace through elog.
> We can always try that first, and back off to fprintf if we do find
> reasons why it's too unstable.
>

Yep, works for me.

Thanks for being open to considering this.

I know lots of this stuff can seem like a pointless sidetrack because the
utility of it is not obvious on dev systems or when you're doing your own
hands-on expert support on systems you own and operate yourself. These
sorts of things really only start to make sense when you're touching many
different postgres systems "in the wild" - such as commercial support
services, helping people on -general, -bugs or stackoverflow, etc.

I really appreciate your help with it.


Re: Add docs stub for recovery.conf

2021-01-20 Thread Craig Ringer
On Wed, 20 Jan 2021 at 02:45, Stephen Frost  wrote:

> Greetings,
>
> * Stephen Frost (sfr...@snowman.net) wrote:
> > * Craig Ringer (craig.rin...@enterprisedb.com) wrote:
> > > On Thu, 14 Jan 2021 at 03:44, Stephen Frost 
> wrote:
> > > > Alright, how does this look?  The new entries are all under the
> > > > 'obsolete' section to keep it out of the main line, but should work
> to
> > > > 'fix' the links that currently 404 and provide a bit of a 'softer'
> > > > landing for the other cases that currently just forcibly redirect
> using
> > > > the website doc alias capability.
> > >
> > > Thanks for expanding the change to other high profile obsoleted or
> renamed
> > > features and tools.
> >
> > Thanks for taking the time to review it and comment on it!
> >
> > > One minor point. I'm not sure this is quite the best way to spell the
> index
> > > entries:
> > >
> > > +   <indexterm>
> > > +     <primary>obsolete</primary>
> > > +     <secondary>pg_receivexlog</secondary>
> > > +   </indexterm>
> > >
> > > as it will produce an index term "obsolete" with a list of various
> > > components under it. While that concentrates them nicely, it means
> people
> > > won't actually find them if they're using the index alphabetically.
> >
> > Ah, yeah, that's definitely a good point and one that I hadn't really
> > spent much time thinking about.
> >
> > > I'd slightly prefer
> > >
> > > +   <indexterm>
> > > +     <primary>pg_receivexlog</primary>
> > > +     <seealso>pg_receivewal</seealso>
> > > +   </indexterm>
> > >
> > > even though that bulks the index up a little, because then people are
> a bit
> > > more likely to find it.
> >
> > Yup, makes sense, updated patch attached which makes that change.
> >
> > > > I ended up not actually doing this for the catalog -> view change of
> > > > pg_replication_slots simply because I don't really think folks will
> > > > misunderstand or be confused by that redirect since it's still the
> same
> > > > relation.  If others disagree though, we could certainly change that
> > > > too.
> > >
> > > I agree with you.
> >
> > Ok, great.
> >
> > How does the attached look then?
>

Pretty good to me. Thanks so much for your help and support with this.


Index entries render as e.g.

pg_xlogdump, The pg_xlogdump command
(see also pg_waldump)

whereas with the obsolete subhead they would render as something like:

obsolete, Obsolete or renamed features, settings and files
pg_xlogdump, The pg_xlogdump command

The see also spelling is much easier to find in the index but doesn't make
it as obvious that it's obsoleted/replaced.

A look at the docbook docs suggests we should use <see> not <seealso> for
these.

A quick

sed -i -e 's/<seealso>/<see>/g' -e 's/<\/seealso>/<\/see>/g' \
doc/src/sgml/appendix-obsolete*

causes them to render much better:

pg_receivexlog, The pg_receivexlog command (see pg_receivewal)

It might be worth changing the <title>s too, so I've done so in the
attached. The terms now render as:
attached. The terms now render as:

pg_receivexlog, pg_receivexlog renamed to pg_receivewal (see
pg_receivewal)

which is good enough in my opinion. The duplication is messy but an
expected artifact of index generation. I don't see any docbook <indexterm>
attribute that lets you suppress insertion of the <title> of the section
containing the <indexterm>, and it's not worth fiddling to try to eliminate
it with structural hacks.

The attached changes the titles, changes <seealso> to <see>, and also
updates the comments in the obsolete entries SGML docs to specify that the
id must be unchanged + give a recommended index term format.
From c11e0f079c07c669abac0f00ae3f1ebdfc18eae7 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 21 Jan 2021 10:01:29 +0800
Subject: [PATCH v3] Add a docs section for obsoleted and renamed functions and
 settings

The new appendix groups information on renamed or removed settings,
commands, etc into an out-of-the-way part of the docs.

The original id elements are retained in each subsection to ensure that
the same filenames are produced for HTML docs. This prevents /current/
links on the web from breaking, and allows users of the web docs
to follow links from old version pages to info on the changes in the
new version. Prior to this change, a link to /current/ for renamed
sections like the recovery.conf docs would just 404. Similarly if
someone searched for recovery.conf they would find the pg11 docs,
but there would be no /12/ or /current/ link, so they couldn't easily
find out that it was removed in pg12 or how to adapt.

Index entries are also added so that there's a breadcrumb trail for
users to follow when they know the old name, but not w

Re: Printing backtrace of postgres processes

2021-01-20 Thread Craig Ringer
On Wed, 20 Jan 2021 at 05:23, Tom Lane  wrote:

>
> Recursion is scary, but it should (I think) not be possible if this
> is driven off CHECK_FOR_INTERRUPTS.  There will certainly be none
> of those in libbacktrace.
>

We can also hold interrupts for the call, and it might be wise to do so.

One point here is that it might be a good idea to suppress elog.c's
> calls to functions in the error context stack.  As we saw in another
> connection recently, allowing that to happen makes for a *very*
> large increase in the footprint of code that you are expecting to
> work at any random CHECK_FOR_INTERRUPTS call site.
>

I strongly agree. Treat it as errhidecontext().

BTW, it also looks like the patch is doing nothing to prevent the
> backtrace from being sent to the connected client.  I'm not sure
> what I think about whether it'd be okay from a security standpoint
> to do that on the connection that requested the trace, but I sure
> as heck don't want it to happen on connections that didn't.


I don't see a good reason to send a bt to a client. Even though these
backtraces won't be analysing debuginfo and populating args, locals, etc,
it should still just go to the server log.
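
Concretely I'd expect the emitting call to look something like this
(backtrace_text here is a placeholder for whatever we get out of
backtrace_symbols()/libbacktrace):

ereport(LOG_SERVER_ONLY,
        (errhidestmt(true),
         errhidecontext(true),
         errmsg("current backtrace:\n%s", backtrace_text)));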


> Maybe, given all of these things, we should forget using elog at
> all and just emit the trace with fprintf(stderr).


That causes quite a lot of pain with MemoryContextStats() already as it's
frequently difficult to actually locate the output given the variations
that exist in customer logging configurations. Sometimes stderr goes to a
separate file or to journald. It's also much harder to locate the desired
output since there's no log_line_prefix. I have a WIP patch floating around
somewhere that tries to teach MemoryContextStats to write to the ereport
channel when not called during an actual out-of-memory situation for that
reason; an early version is somewhere in the archives.

This is one of those "ok in development, painful in production" situations.

So I'm not a big fan of pushing it to stderr, though I'd rather have that
than not have the ability at all.


Re: Printing backtrace of postgres processes

2021-01-20 Thread Craig Ringer
On Wed, 20 Jan 2021 at 01:31, Robert Haas  wrote:

> On Sat, Jan 16, 2021 at 3:21 PM Tom Lane  wrote:
> > I'd argue that backtraces for those processes aren't really essential,
> > and indeed that trying to make the syslogger report its own backtrace
> > is damn dangerous.
>
> I agree. Ideally I'd like to be able to use the same mechanism
> everywhere and include those processes too, but surely regular
> backends and parallel workers are going to be the things that come up
> most often.
>
> > (Personally, I think this whole patch fails the safety-vs-usefulness
> > tradeoff, but I expect I'll get shouted down.)
>
> You and I are frequently on opposite sides of these kinds of
> questions, but I think this is a closer call than many cases. I'm
> convinced that it's useful, but I'm not sure whether it's safe. On the
> usefulness side, backtraces are often the only way to troubleshoot
> problems that occur on production systems. I wish we had better
> logging and tracing tools instead of having to ask for this sort of
> thing, but we don't.


Agreed.

In theory we should be able to do this sort of thing using external trace
and diagnostic tools like perf, systemtap, etc. In practice, these tools
tend to be quite version-sensitive and hard to get right without multiple
rounds of back-and-forth to deal with specifics of the site's setup,
installed debuginfo or lack thereof, specific tool versions, etc.

It's quite common to have to fall back on attaching gdb with a breakpoint
on a function in the export symbol table (so it works w/o debuginfo),
saving a core, and then analysing the core on a separate system on which
debuginfo is available for all the loaded modules. It's a major pain.

The ability to get a basic bt from within Pg is strongly desirable. IIRC
glibc's basic unwinder works without external debuginfo, if not especially
well. libunwind produces much better results, but that didn't pass the
extra-dependency bar when backtracing support was introduced to core
postgres.

On a side note, to help get better diagnostics I've also been meaning to
look into building --enable-debug with -ggdb3 so we can embed macro info,
and using dwz to deduplicate+compress the debuginfo so we can encourage
people to install it by default on production. I also want to start
exporting pointers to all the important data symbols for diagnostic use,
even if we do so in a separate ELF section just for debug use.


Re: [PATCH] ProcessInterrupts_hook

2021-01-18 Thread Craig Ringer
On Tue, 19 Jan 2021 at 12:44, Craig Ringer 
wrote:

>
> > We're about halfway there already, see 7e784d1dc.  I didn't do the
>> > other half because it wasn't necessary to the problem, but exposing
>> > the shutdown state more fully seems reasonable.
>>
>
> Excellent, I'll take a look. Thanks.
>

That looks very handy already.

Extending it to be set before SIGTERM too would be handy.

My suggestion, which I'm happy to post in patch form if you think it's
reasonable:

* Change QuitSignalReason to ExitSignalReason to cover both SIGTERM (fast)
and SIGQUIT (immediate)

* Rename PMQUIT_FOR_STOP to PMEXIT_IMMEDIATE_SHUTDOWN

* Add enumeration values PMEXIT_SMART_SHUTDOWN  and PMEXIT_FAST_SHUTDOWN

* For a fast shutdown, in pmdie()'s SIGINT case call
SetExitSignalReason(PMEXIT_FAST_SHUTDOWN), so that when
PostmasterStateMachine() calls SignalSomeChildren(SIGTERM, ...) in response
to PM_STOP_BACKENDS, the reason is already available.

* For smart shutdown, in pmdie()'s SIGTERM case call
SetExitSignalReason(PMEXIT_SMART_SHUTDOWN) and set the latch of every live
backend. There isn't any appropriate PROCSIG, so unless we want to overload
PROCSIG_WALSND_INIT_STOPPING (ick), I think it'd generally be
sufficient to check GetExitSignalReason() in backend main loops.
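
In code the proposal amounts to roughly this (suggested names only, none of
these exist today; the current code has QuitSignalReason, the PMQUIT_* values
and GetQuitSignalReason()):

typedef enum ExitSignalReason
{
    PMEXIT_NOT_SENT = 0,        /* postmaster hasn't signalled an exit */
    PMEXIT_FOR_CRASH,           /* some other backend crashed */
    PMEXIT_IMMEDIATE_SHUTDOWN,  /* was PMQUIT_FOR_STOP */
    PMEXIT_FAST_SHUTDOWN,       /* new: set before the SIGTERM broadcast */
    PMEXIT_SMART_SHUTDOWN       /* new: set when a smart shutdown begins */
} ExitSignalReason;

extern void SetExitSignalReason(ExitSignalReason reason);
extern ExitSignalReason GetExitSignalReason(void);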

The fast shutdown case seems like a no-brainer extension of your existing
patch.

I'm not entirely sure about the smart shutdown case. I don't want to add a
ProcSignal slot just for this and the info isn't urgent anyway. I think
that by checking for postmaster shutdown in the backend main loop we'd be
able to support eager termination of idle backends on smart shutdown
(immediately, or after an idle grace period), which is something I've
wanted for quite some time. It shouldn't be significantly expensive
especially in the idle loop.

Thoughts?

(Also I want a hook in PostgresMain's idle loop for things like this).


How to expose session vs txn lock info in pg_locks view?

2021-01-18 Thread Craig Ringer
Presently there doesn't seem to be a way to tell whether a lock is
session-level or transaction-level in the pg_locks view.

I was expecting this to be a quick patch, but the comment on the definition
of PROCLOCKTAG in lock.h notes that shmem state for heavyweight locks does
not track whether the lock is session-level or txn-level. That explains why
it's not already exposed in pg_locks.

AFAICS it'd be necessary to expand PROCLOCK to expose this in shmem.
Probably by adding a small bitfield where bit 0 is set if there's a txn
level lock and bit 1 is set if there's a session level lock. But I'm not
convinced that expanding PROCLOCK is justifiable for this. sizeof(PROCLOCK)
is 64 on a typical x64 machine. Adding anything to it increases it to 72
bytes.

(gdb) ptype /o struct PROCLOCK
/* offset|  size */  type = struct PROCLOCK {
/*0  |16 */PROCLOCKTAG tag;
/*   16  | 8 */PGPROC *groupLeader;
/*   24  | 4 */LOCKMASK holdMask;
/*   28  | 4 */LOCKMASK releaseMask;
/*   32  |16 */SHM_QUEUE lockLink;
/*   48  |16 */SHM_QUEUE procLink;
/*   64  | 1 */unsigned char locktypes;
/* XXX  7-byte padding  */

   /* total size (bytes):   72 */
 }

Going over 64 sets off possible alarm bells about cache line sizing to me,
but maybe it's not that critical? It'd also require (8 * max_locks_per_xact
* (MaxBackends+max_prepared_xacts)) extra shmem space; that could land up
being 128k on a default setup and a couple of megabytes on a big system.
Not huge, but not insignificant if it's hot data.

It's frustrating to be unable to tell the difference between session-level
and txn-level locks in diagnostic output. And the deadlock detector has no
way to tell the difference when selecting a victim for a deadlock abort -
it'd probably make sense to prefer to send a deadlock abort for txn-only
lockers. But I'm not sure I see a sensible way to add the info - PROCLOCK
is already free of any padding, and I wouldn't want to use hacks like
pointer-tagging.

Thoughts anyone?


Re: [PATCH] ProcessInterrupts_hook

2021-01-18 Thread Craig Ringer
On Tue, 19 Jan 2021, 02:01 Robert Haas,  wrote:

> On Mon, Jan 18, 2021 at 11:56 AM Tom Lane  wrote:
> > > I've wanted this in the past, too, so +1 from me.
> >
> > I dunno, this seems pretty scary and easily abusable.  There's not all
> > that much that can be done safely in ProcessInterrupts(), and we should
> > not be encouraging extensions to think they can add random processing
> > there.
>
> We've had this disagreement before about other things, and I just
> don't agree. If somebody uses a hook for something wildly unsafe, that
> will break their stuff, not ours.


Generally yeah.

And we have no shortage of hooks with plenty of error or abuse potential
and few safeguards already. I'd argue that in C code any external code is
inherently unsafe anyway. So it's mainly down to whether the hook actively
encourages unsafe actions without providing commensurate benefits, and
whether there's a better/safer way to achieve the same thing.

> That's not to say I endorse adding
> hooks for random purposes in random places. In particular, if it's
> impossible to use a particular hook in a reasonably safe way, that's a
> sign that the hook is badly-designed and that we should not have it.
>

Yep. Agreed.

Any hook is possible to abuse or write incorrectly, from simple fmgr
loadable functions right on up.

The argument that a hook could be abused would apply just as well to
exposing pqsignal() itself to extensions. Probably more so. Also to
anything like ProcessUtility_hook.


> > We're about halfway there already, see 7e784d1dc.  I didn't do the
> > other half because it wasn't necessary to the problem, but exposing
> > the shutdown state more fully seems reasonable.
>

Excellent, I'll take a look. Thanks.


[PATCH] ProcessInterrupts_hook

2021-01-18 Thread Craig Ringer
Hi folks

A few times lately I've been doing things in extensions that've made me
want to be able to run my own code whenever InterruptPending is true and
CHECK_FOR_INTERRUPTS() calls ProcessInterrupts()

So here's a simple patch to add ProcessInterrupts_hook. It follows the
usual pattern like ProcessUtility_hook and standard_ProcessUtility.

Why? Because sometimes I want most of the behaviour of die(), but the
option to override it with some bgworker-specific choices occasionally.
HOLD_INTERRUPTS() is too big a hammer.

What I really want to go along with this is a way for any backend to
observe the postmaster's pmState and its "Shutdown" variable's value, so
any backend can tell if we're in FastShutdown, SmartShutdown, etc. Copies
in shmem only obviously. But I'm not convinced it's right to just copy
these vars as-is to shmem, and I don't want to use the memory for a
ProcSignal slot for something that won't be relevant for most backends for
most of the postmaster lifetime. Ideas welcomed.
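
For illustration, the intended usage from an extension looks like this (a
sketch; my_interrupt_flag and the rest of the extension scaffolding are
hypothetical, and the usual PG_MODULE_MAGIC boilerplate is omitted):

static ProcessInterrupts_hook_type prev_ProcessInterrupts = NULL;
static volatile sig_atomic_t my_interrupt_flag = false;

static void
my_ProcessInterrupts(void)
{
    if (my_interrupt_flag)
    {
        my_interrupt_flag = false;
        /* ... react to the extension's own interrupt here ... */
    }

    /* Chain to the previous hook, or to the standard handling. */
    if (prev_ProcessInterrupts)
        prev_ProcessInterrupts();
    else
        standard_ProcessInterrupts();
}

void
_PG_init(void)
{
    prev_ProcessInterrupts = ProcessInterrupts_hook;
    ProcessInterrupts_hook = my_ProcessInterrupts;
}
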
From 1c8c477a5814420011fa034323e82d8eabc6bc5f Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 18 Jan 2021 14:14:46 +0800
Subject: [PATCH v1 1/4] Provide a hook for ProcessInterrupts()

Code may now register a ProcessInterrupts_hook to fire its own
logic before and/or after the main ProcessInterrupts handler
for signal processing. The hook must call standard_ProcessInterrupts
to execute the normal interrupt handling or set InterruptPending to
true to cause the interrupt to be re-processed at the next opportunity.

This follows the consistent pattern used by other PostgreSQL hooks like
the ProcessUtility_hook and standard_ProcessUtility() function.

The purpose of this hook is to allow extensions to react to their own
custom interrupt flags during CHECK_FOR_INTERRUPTS() calls invoked
in main PostgreSQL code paths.
---
 src/backend/tcop/postgres.c | 20 
 src/include/miscadmin.h |  3 +++
 2 files changed, 23 insertions(+)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 28055680aa..8cf1359e33 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -102,6 +102,8 @@ int			max_stack_depth = 100;
 /* wait N seconds to allow attach from a debugger */
 int			PostAuthDelay = 0;
 
+/* Intercept CHECK_FOR_INTERRUPTS() responses */
+ProcessInterrupts_hook_type ProcessInterrupts_hook = NULL;
 
 
 /* 
@@ -3064,8 +3066,26 @@ ProcessInterrupts(void)
 	/* OK to accept any interrupts now? */
 	if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
 		return;
+
 	InterruptPending = false;
 
+	if (ProcessInterrupts_hook)
+		ProcessInterrupts_hook();
+	else
+		standard_ProcessInterrupts();
+}
+
+/*
+ * Implement the default signal handling behaviour for most backend types
+ * including user backends and bgworkers.
+ *
+ * This is where CHECK_FOR_INTERRUPTS() eventually lands up unless intercepted
+ * by ProcessInterrupts_hook.
+ */
+void
+standard_ProcessInterrupts(void)
+{
+
 	if (ProcDiePending)
 	{
 		ProcDiePending = false;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e308..f082d04448 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,6 +94,9 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
+typedef void (*ProcessInterrupts_hook_type)(void);
+extern ProcessInterrupts_hook_type ProcessInterrupts_hook;
+extern void standard_ProcessInterrupts(void);
 
 #ifndef WIN32
 
-- 
2.29.2

From 58b9ac884ef5bd35a2aac9da85079e24097612be Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 18 Jan 2021 15:26:43 +0800
Subject: [PATCH v1 2/4] Test for ProcessInterrupts_hook

---
 src/test/modules/Makefile |   3 +-
 src/test/modules/test_hooks/.gitignore|   4 +
 src/test/modules/test_hooks/Makefile  |  28 
 .../expected/test_processinterrupt_hook.out   |   0
 src/test/modules/test_hooks/hooks.conf|   1 +
 .../sql/test_processinterrupt_hook.sql|  16 ++
 .../modules/test_hooks/test_hooks--1.0.sql|   8 +
 src/test/modules/test_hooks/test_hooks.c  |  37 +
 .../modules/test_hooks/test_hooks.control |   4 +
 src/test/modules/test_hooks/test_hooks.h  |  20 +++
 .../test_hooks/test_process_interrupts_hook.c | 140 ++
 11 files changed, 260 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_hooks/.gitignore
 create mode 100644 src/test/modules/test_hooks/Makefile
 create mode 100644 src/test/modules/test_hooks/expected/test_processinterrupt_hook.out
 create mode 100644 src/test/modules/test_hooks/hooks.conf
 create mode 100644 src/test/modules/test_hooks/sql/test_processinterrupt_hook.sql
 create mode 100644 src/test/modules/test_hooks/test_hooks--1.0.sql
 create mode 100644 src/test/modules/test_hooks/test_hooks.c
 create mode 100644 src/test/modules/test_hooks/test_hooks.control
 create mode 100644 src/test/modules/

[PATCH] More docs on what to do and not do in extension code

2021-01-17 Thread Craig Ringer
Hi folks

The attached patch expands the xfunc docs and bgworker docs a little,
providing a starting point for developers to learn how to do some common
tasks the Postgres Way.

It mentions in brief these topics:

* longjmp() based exception handling with elog(ERROR), PG_CATCH() and
PG_RE_THROW() etc
* Latches, spinlocks, LWLocks, heavyweight locks, condition variables
* shm, DSM, DSA, shm_mq
* syscache, relcache, relation_open(), invalidations
* deferred signal handling, CHECK_FOR_INTERRUPTS()
* Resource cleanup hooks and callbacks like on_exit, before_shmem_exit, the
resowner callbacks, etc
* signal handling in bgworkers

All very superficial, but all things I really wish I'd known a little
about, or even that I needed to learn about, when I started working on
postgres.
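
To pick a single concrete example of the kind of idiom these sections point
at: a background worker that wants to sleep for ten seconds should do roughly
this rather than calling usleep() or sleep() (minimal sketch; needs
miscadmin.h, storage/latch.h and pgstat.h):

/* Interrupt-aware sleep in a bgworker main loop. */
(void) WaitLatch(MyLatch,
                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                 10 * 1000L,            /* timeout in milliseconds */
                 PG_WAIT_EXTENSION);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();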

I'm not sure it's in quite the right place. I wonder if there should be a
separate part of xfunc.sgml that covers the slightly more advanced bits of
postgres backend and function coding like this, lists relevant README files
in the source tree, etc.

I avoided going into details like how resource owners work. I don't want
the docs to have to cover all that in detail; what I hope to do is start
providing people with clear references to the right place in the code,
READMEs, etc to look when they need to understand specific topics.
From 96ce89cb7e1a97d9d247fbacba73ade85a01ea38 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 18 Jan 2021 14:05:15 +0800
Subject: [PATCH v1 2/2] Expand docs on PostgreSQL extension coding

Expand the docs on PostgreSQL extension coding and background worker
development a little so that key topics like allocation, interrupt
handling, exit callbacks, transaction callbacks, PG_TRY()/PG_CATCH(),
resource owners, transaction and snapshot state, etc are at least
briefly mentioned with a few pointers to where to learn more.

Add some detail on background worker signal handling.
---
 doc/src/sgml/bgworker.sgml |  86 ++--
 doc/src/sgml/xfunc.sgml| 162 -
 2 files changed, 239 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 7fd673ab54..9216b8e0ea 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -1,6 +1,6 @@
 
 
-
+
  Background Worker Processes
 
  
@@ -29,6 +29,12 @@
   
  
 
+ 
+  All code that will be executed in PostgreSQL background workers is expected
+  to follow the basic rules set out for all PostgreSQL extension code in .
+ 
+
  
   Background workers can be initialized at the time that
   PostgreSQL is started by including the module name in
@@ -95,6 +101,11 @@ typedef struct BackgroundWorker
buffers, or any custom data structures which the worker itself may
wish to create and use.
   
+  
+   See  for information on how to
+   request extension shared memory allocations, LWLocks,
+   etc.
+  
  
 
 
@@ -212,9 +223,9 @@ typedef struct BackgroundWorker
Signals are initially blocked when control reaches the
background worker's main function, and must be unblocked by it; this is to
allow the process to customize its signal handlers, if necessary.
-   Signals can be unblocked in the new process by calling
-   BackgroundWorkerUnblockSignals and blocked by calling
-   BackgroundWorkerBlockSignals.
+   It is important that all background workers set up and unblock signal
+   handling before they enter their main loops. Signal handling in background
+   workers is discussed separately in .
   
 
   
@@ -296,13 +307,74 @@ typedef struct BackgroundWorker
   
 
   
-   The src/test/modules/worker_spi module
-   contains a working example,
-   which demonstrates some useful techniques.
+   Background workers that require inter-process communication and/or
+   synchronisation should use PostgreSQL's built-in IPC and concurrency
+   features as discussed in 
+   and .
+  
+
+  
+   If a background worker needs to sleep or wait for activity it should
+   always use PostgreSQL's
+   interrupt-aware APIs for the purpose. Do not usleep(),
+   system(), make blocking system calls, etc.
+  
+
+  
+   The src/test/modules/worker_spi and
+   src/test/modules/test_shm_mq contain working examples
+   that demonstrate some useful techniques.
   
 
   
The maximum number of registered background workers is limited by
.
   
+
+  
+   Signal Handling in Background Workers
+
+   
+Signals can be unblocked in the new process by calling
+BackgroundWorkerUnblockSignals and blocked by calling
+BackgroundWorkerBlockSignals.
+The default signal handlers installed for background workers do
+not interrupt queries or blocking calls into other postgres code
+so they are only suitable for simple background workers that frequently and
+predictably return control to their main loop. If your worker uses the
+default background worker signal handling it should call
+HandleMainLoopInterrupts() on each pass

[PATCH] Cross-reference comments on signal handling logic

2021-01-17 Thread Craig Ringer
Hi all

The attached comments-only patch expands the signal handling section in
miscadmin.h a bit so that it mentions ProcSignal, deferred signal handling
during blocking calls, etc. It adds cross-refs between major signal
handling routines and the miscadmin comment to help readers track the
various scattered but inter-related code.

I hope this helps some new developers in future.
From a16d0a0f8f502ba3631d20d51c7bb50cedea6d57 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 18 Jan 2021 12:21:18 +0800
Subject: [PATCH v1 1/2] Comments and cross-references for signal handling

To help new developers come to terms with the complexity of signal
handling in PostgreSQL's various types of backend, expand the
main comment on signal handling in miscadmin.h. Cover the ProcSignal
mechanism for SIGUSR1 multiplexing. Mention that blocking calls into 3rd
party code and uninterruptible syscalls should be avoided in backend
code because of Pg's deferred signal processing logic. Provide a set of
cross-references to key routines related to signal handling.

In various signal handling routines, cross-reference the high level
overview comment in miscadmin.c.

This should possibly become part of the developer documentation rather
than a comment in a header file, but since we already have the comment,
it seemed sensible to start by updating it and making it more
discoverable.
---
 src/backend/postmaster/interrupt.c | 21 
 src/backend/tcop/postgres.c| 26 -
 src/include/miscadmin.h| 84 --
 src/port/pgsleep.c |  2 +-
 4 files changed, 113 insertions(+), 20 deletions(-)

diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..e5256179ec 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -10,6 +10,15 @@
  *	  src/backend/postmaster/interrupt.c
  *
  *-
+ *
+ * This file defines bare-bones interrupt handlers for secondary helper
+ * processes run by the postmaster - the walwriter and bgwriter, and
+ * potentially some background workers.
+ *
+ * These handlers are NOT used by normal user backends as they do not support
+ * interruption of normal execution to respond to a signal, query cancellation,
+ * etc. See miscadmin.h for details on interrupt handling used by normal
+ * postgres backends - CHECK_FOR_INTERRUPTS(), ProcessInterrupts(), die(), etc.
  */
 
 #include "postgres.h"
@@ -28,6 +37,9 @@ volatile sig_atomic_t ShutdownRequestPending = false;
 
 /*
  * Simple interrupt handler for main loops of background processes.
+ *
+ * See also CHECK_FOR_INTERRUPTS() and ProcessInterrupts() for the user-backend
+ * variant of this function.
  */
 void
 HandleMainLoopInterrupts(void)
@@ -51,6 +63,8 @@ HandleMainLoopInterrupts(void)
  * Normally, this handler would be used for SIGHUP. The idea is that code
  * which uses it would arrange to check the ConfigReloadPending flag at
  * convenient places inside main loops, or else call HandleMainLoopInterrupts.
+ *
+ * Most backends use this handler.
  */
 void
 SignalHandlerForConfigReload(SIGNAL_ARGS)
@@ -67,6 +81,9 @@ SignalHandlerForConfigReload(SIGNAL_ARGS)
  * Simple signal handler for exiting quickly as if due to a crash.
  *
  * Normally, this would be used for handling SIGQUIT.
+ *
+ * See also quickdie() and die() for the separate signal handling logic
+ * used by normal user backends.
  */
 void
 SignalHandlerForCrashExit(SIGNAL_ARGS)
@@ -99,6 +116,10 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
+ *
+ * See also die() for the extended version of this handler that's used in
+ * backends that may need to be interrupted while performing long-running
+ * actions.
  */
 void
 SignalHandlerForShutdownRequest(SIGNAL_ARGS)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8cf1359e33..907b524e0f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2852,8 +2852,15 @@ quickdie(SIGNAL_ARGS)
 }
 
 /*
- * Shutdown signal from postmaster: abort transaction and exit
- * at soonest convenient time
+ * Shutdown signal from postmaster: set flags to ask ProcessInterrupts() to
+ * abort the current transaction and exit at the soonest convenient time.
+ *
+ * This handler is used as the SIGTERM handler by most backend types that may
+ * run arbitrary backend code, user queries, etc. Some backends use different
+ * handlers and sometimes different flags.
+ *
+ * See "System interrupt and critical section handling" in miscadmin.h for a
+ * higher level explanation of signal handling.
  */
 void
 die(SIGNAL_ARGS)
@@ -2924,6 +2931,9 @@ FloatExceptionHandler(SIGNAL_ARGS)
  * handling following receipt of SIGUSR1. Designed 

Re: Printing backtrace of postgres processes

2021-01-17 Thread Craig Ringer
On Mon, 18 Jan 2021 at 00:56, vignesh C  wrote:

>
> Thanks for your comments Andres, I will ignore it for the processes
> which do not have access to ProcSignal. I will make the changes and
> post a patch for this soon.
>

I think that's sensible.

I've had variable results with glibc's backtrace(), especially on older
platforms and/or with external debuginfo, but it's much better than
nothing. It's often not feasible to get someone to install gdb and run
commands on their production systems - they can be isolated and firewalled
or hobbled by painful change policies. Something basic built-in to
postgres, even if basic, is likely to come in very handy.


Re: Add docs stub for recovery.conf

2021-01-17 Thread Craig Ringer
On Thu, 14 Jan 2021 at 03:44, Stephen Frost  wrote:

>
> Alright, how does this look?  The new entries are all under the
> 'obsolete' section to keep it out of the main line, but should work to
> 'fix' the links that currently 404 and provide a bit of a 'softer'
> landing for the other cases that currently just forcibly redirect using
> the website doc alias capability.
>

Thanks for expanding the change to other high profile obsoleted or renamed
features and tools.

One minor point. I'm not sure this is quite the best way to spell the index
entries:

+   <indexterm>
+     <primary>obsolete</primary>
+     <secondary>pg_receivexlog</secondary>
+   </indexterm>

as it will produce an index term "obsolete" with a list of various
components under it. While that concentrates them nicely, it means people
won't actually find them if they're using the index alphabetically.

I'd slightly prefer


+   <indexterm>
+     <primary>pg_receivexlog</primary>
+     <seealso>pg_receivewal</seealso>
+   </indexterm>

even though that bulks the index up a little, because then people are a bit
more likely to find it.

Your extended and revised patch retains the above style for

+   
+ trigger_file
+ promote_trigger_file
+
...
+
+ standby_mode
+ standby.signal
+

so if you intend to change it, that entry needs changing too.


> I ended up not actually doing this for the catalog -> view change of
> pg_replication_slots simply because I don't really think folks will
> misunderstand or be confused by that redirect since it's still the same
> relation.  If others disagree though, we could certainly change that
> too.
>

I agree with you.


Re: [PATCH] Identify LWLocks in tracepoints

2021-01-14 Thread Craig Ringer
On Thu, 14 Jan 2021 at 15:56, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 2020-12-19 06:00, Craig Ringer wrote:
> > Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be
> > fired from LWLockWaitForVar, despite that function never actually
> > acquiring the lock.
>
> This was added in 68a2e52bbaf when LWLockWaitForVar() was first
> introduced.  It looks like a mistake to me too, but maybe Heikki wants
> to comment.
>

I'm certain it's a copy/paste bug.


Re: [PATCH] Identify LWLocks in tracepoints

2021-01-14 Thread Craig Ringer
On Wed, 13 Jan 2021 at 19:19, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Sat, Dec 19, 2020 at 01:00:01PM +0800, Craig Ringer wrote:
> >
> > The attached patch set follows on from the discussion in [1] "Add LWLock
> > blocker(s) information" by adding the actual LWLock* and the numeric
> > tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
> >
> > This does not provide complete information on blockers, because it's not
> > necessarily valid to compare any two LWLock* pointers between two process
> > address spaces. The locks could be in DSM segments, and those DSM
> segments
> > could be mapped at different addresses.
> >
> > I wasn't able to work out a sensible way to map a LWLock* to any sort of
> > (tranche-id, lock-index) because there's no requirement that locks in a
> > tranche be contiguous or known individually to the lmgr.
> >
> > Despite that, the patches improve the information available for LWLock
> > analysis significantly.
>
> Thanks for the patches, this could be indeed useful. I've looked through
> and haven't noticed any issues with either the tracepoint extensions or
> commentaries, except that I find it is not that clear how trance_id
> indicates a re-initialization here?
>
> /* Re-initialization of individual LWLocks is not permitted */
> Assert(tranche_id >= NUM_INDIVIDUAL_LWLOCKS || !IsUnderPostmaster);
>

There should be no reason for anything to call LWLockInitialize(...) on an
individual LWLock, since they are all initialized during postmaster startup.

Doing so must be a bug.

But that's a trivial change that can be done separately.


> > Patch 2 adds the tranche id and lock pointer for each trace hit. This
> makes
> > it possible to differentiate between individual locks within a tranche,
> and
> > (so long as they aren't tranches in a DSM segment) compare locks between
> > processes. That means you can do lock-order analysis etc, which was not
> > previously especially feasible.
>
> I'm curious in which kind of situations lock-order analysis could be
> helpful?
>

If code-path 1 does

LWLockAcquire(LockA, LW_EXCLUSIVE);
...
LWLockAcquire(LockB, LW_EXCLUSIVE);

and code-path 2 does:

LWLockAcquire(LockB, LW_EXCLUSIVE);
...
LWLockAcquire(LockA, LW_EXCLUSIVE);

then they're subject to deadlock. But you might not actually hit that often
in test workloads if the timing required for the deadlock to occur is tight
and/or occurs on infrequent operations.

It's not always easy to reason about or prove things about lock order when
they're potentially nested deep within many layers of other calls and
callbacks. Obviously something we try to avoid with LWLocks, but not
impossible.

If you trace a workload and derive all possible nestings of lock acquire
order, you can then prove things about whether there are any possible
ordering conflicts and where they might arise.

A PoC to do so is on my TODO.

> Traces also don't have to do userspace reads for the tranche name all
> > the time, so the trace can run with lower overhead.
>
> This one is also interesting. Just for me to clarify, wouldn't there be
> a bit of overhead anyway (due to switching from kernel context to user
> space when a tracepoint was hit) that will mask name read overhead? Or
> are there any available numbers about it?
>

I don't have numbers on that. Whether it matters will depend way too much
on how you're using the probe points and collecting/consuming the data
anyway.

It's a bit unfortunate (IMO) that we make a function call for each
tracepoint invocation to get the tranche names. Ideally I'd prefer to be
able to omit the tranche names lookups for these probes entirely for
something as hot as LWLocks. But it's a bit of a pain to look up the
tranche names from an external trace tool, so instead I'm inclined to see
if we can enable systemtap's semaphores and only compute the tranche name
if the target probe is actually enabled. But that'd be separate to this
patch and require a build change in how systemtap support is compiled and
linked.
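
For illustration, a call site gated on the probe semaphore might look like the
snippet below. The _ENABLED() macros are what dtrace -h (and systemtap's
dtrace shim) generate alongside each probe macro; whether the semaphores are
actually usable with our current build arrangements is exactly the open
question, so treat this as a sketch of the idea rather than something that
works today. The extra arguments follow patch 2 in this series.

    /* Hypothetical call site in LWLockAcquire with SDT semaphores enabled:
     * skip the T_NAME() tranche-name lookup unless a tool is attached. */
    if (TRACE_POSTGRESQL_LWLOCK_WAIT_START_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode,
                                           lock, lock->tranche);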

BTW, a user->kernel->user context switch only occurs when the trace tool's
probes use kernel space - such as for perf based probes, or for systemtap's
kernel-runtime probes. The same markers can be used by e.g. systemtap's
"dyninst" runtime that runs entirely in userspace.


Re: Logical decoding without slots: decoding in lockstep with recovery

2021-01-12 Thread Craig Ringer
On Sat, 26 Dec 2020 at 06:51, Andres Freund  wrote:

> Hi,
>
> On 2020-12-23 14:56:07 +0800, Craig Ringer wrote:
> > I want to share an idea I've looked at a few times where I've run into
> > situations where logical slots were inadvertently dropped, or where it
> > became necessary to decode changes in the past on a slot.
> >
> > As most of you will know you can't just create a logical slot in the
> past.
> > Even if it was permitted, it'd be unsafe due to catalog_xmin retention
> > requirements and missing WAL.
> >
> > But if we can arrange a physical replica to replay the WAL of interest
> and
> > decode each commit as soon as it's replayed by the startup process, we
> know
> > the needed catalog rows must all exist, so it's safe to decode the
> change.
> >
> > So it should be feasible to run logical decoding in standby, even
> without a
> > replication slot, so long as we:
> >
> > * pause startup process after each xl_xact_commit
> > * wake the walsender running logical decoding
> > * decode and process until ReorderBufferCommit for the just-committed
> xact
> > returns
> > * wake the startup process to replay up to the next commit
>
> I don't think it's safe to just do this for each xl_xact_commit - we can
> remove needed rows at quite a few places, not just around transaction
> commit.


Good point.

I vaguely recall spotting a possible decoding-on-standby issue with eager
removal of rows that are still ahead of the global xmin if the primary
"knows" can't be needed based on info about currently running backends. But
when looking over code related to HOT, visibility, and vacuum now I can't
for the life of me remember exactly what it was or find it. Hopefully I
just misunderstood at the time or was getting confused between decoding on
standby and xact streaming.


> Rows needed to correctly decode rows earlier in the transaction
> might not be available by the time the commit record was logged.
>

When can that happen?

I think you'd basically have to run logical decoding in lockstep with
> WAL replay, i.e. replay one record, then call logical decoding for that
> record, replay the next record, ...
>

That sounds likely to be unusably slow. The only way I can see it having
any hope of moving at a reasonable rate would be to run a decoding session
inside the startup process itself so we don't have to switch back/forth for
each record. But I imagine that'd probably cause a whole other set of
problems.

> Can anyone see any obvious problem with this?
>
> The patch for logical decoding on the standby
> https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de
> should provide some of the infrastructure to do this properly. Should
> really commit it. /me adds an entry to the top of the todo list.
>

That would certainly be helpful for quite a number of things.


Re: [PATCH] Identify LWLocks in tracepoints

2021-01-07 Thread Craig Ringer
On Sat, 19 Dec 2020 at 13:00, Craig Ringer 
wrote:

> Hi all
>
> The attached patch set follows on from the discussion in [1] "Add LWLock
> blocker(s) information" by adding the actual LWLock* and the numeric
> tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
>
>
I've attached a systemtap script that makes use of the information exported
by the enhanced LWLock tracepoints. It offers something akin to dynamic
-DLWLOCK_STATS with automatic statistical aggregation and some selective
LWLOCK_DEBUG output.

The script isn't super pretty. I didn't try to separate event-data
collection from results output, and there's some duplication in places. But
it gives you an idea what's possible when we report lock pointers and
tranche IDs to tracepoints and add entry/exit tracepoints.

Key features:

* Collect statistical aggregates on lwlock hold and wait durations across
all processes. Stats are grouped by lockmode (shared or exclusive) and by
tranche name, as well as rollup stats across all tranches.
* Report lock wait and hold durations for each process when that process
exits. Again, reported by mode and tranche name.
* For long lock waits, print the waiter pid and waiting pid, along with
each process's backend type and application_name if known, the acquire
mode, and the acquire function

The output is intended to be human-readable, but it'd be quite simple to
convert it into raw tsv-style output suitable for ingestion into
statistical postprocessing or graphing tools.

It should be fairly easy to break down the stats by acquire function if
desired, so LWLockAcquire(), LWLockWaitForVar(), and LWLockAcquireOrWait()
are reported separately. They're combined in the current output.

Capturing the current query string is pretty simple if needed, but I didn't
think it was likely to be especially useful.

Sample output for a pg_regress run attached. Abridged version follows. Here
the !!W!! lines are "waited a long time", the !!H!! lines are "held a long
time". Then [pid]:MyBackendType tranche_name wait_time_us (wait_time) in
wait_func (application_name) => [blocker_pid] (blocker_application_name).
If blocker pid wasn't identified it won't be reported - I know how to fix
that and will do so soon.

!!W!! [ 93030]:3  BufferContent  12993 (0m0.012993s) in lwlock__acquire__start (pg_regress/text)
!!W!! [ 93036]:3  LockManager    14540 (0m0.014540s) in lwlock__acquire__start (pg_regress/float8) => [ 93045] (pg_regress/regproc)
!!W!! [ 93035]:3  BufferContent  12608 (0m0.012608s) in lwlock__acquire__start (pg_regress/float4) => [ 93034] (pg_regress/oid)
!!W!! [ 93036]:3  LockManager    10301 (0m0.010301s) in lwlock__acquire__start (pg_regress/float8)
!!W!! [ 93043]:3  LockManager    10356 (0m0.010356s) in lwlock__acquire__start (pg_regress/pg_lsn)
!!H!! [ 93033]:3  BufferContent  20579 (0m0.020579s) (pg_regress/int8)
!!W!! [ 93027]:3  BufferContent  10766 (0m0.010766s) in lwlock__acquire__start (pg_regress/char) => [ 93037] (pg_regress/bit)
!!W!! [ 93036]:3  OidGen         12876 (0m0.012876s) in lwlock__acquire__start (pg_regress/float8)
...

Then the summary rollup at the end of the run. This can also be output
periodically during the run. Abbreviated for highlights:

wait locks: all procstranche modecount
 total  avg variance  min  max
  W LW_EXCLUSIVE  (all)E54185
14062734  259  1850265144177
  W LW_SHARED (all)S 3668
 1116022  304  1527261218642

held locks: all procstranche modecount
 total  avg variance  min  max
  H LW_EXCLUSIVE  (all)E 10438060
 153077259   14370351   195043
  H LW_SHARED (all)S 14199902
654669344 5318144030

all procs by tranchetranche modecount
 total  avg variance  min  max
  W tranche   (all)S 3668
 1116022  304  1527261218642
  W tranche   (all)E54185
14062734  259  1850265144177
  W tranche   WALInsertE 9839
 2393229  243  1180294214209
  W tranche   BufferContentE 3012
 1726543  573  3869409228186
  W tranche   BufferContentS 1664
657855  395  2185694218642
  W trancheLockFastPathE28314
 6327801  223  1278053126133
  W trancheLockFastPathS   87
 59401  682  3703217   19 9454
  W tranche LockManagerE 7223
 2764392

Re: [PATCH] Identify LWLocks in tracepoints

2021-01-06 Thread Craig Ringer
On Mon, 28 Dec 2020 at 20:09, Masahiko Sawada  wrote:

> Hi Craig,
>
> On Sat, Dec 19, 2020 at 2:00 PM Craig Ringer
>  wrote:
> >
> > Hi all
> >
> > The attached patch set follows on from the discussion in [1] "Add LWLock
> blocker(s) information" by adding the actual LWLock* and the numeric
> tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
> >
> > This does not provide complete information on blockers, because it's not
> necessarily valid to compare any two LWLock* pointers between two process
> address spaces. The locks could be in DSM segments, and those DSM segments
> could be mapped at different addresses.
> >
> > I wasn't able to work out a sensible way to map a LWLock* to any sort of
> (tranche-id, lock-index) because there's no requirement that locks in a
> tranche be contiguous or known individually to the lmgr.
> >
> > Despite that, the patches improve the information available for LWLock
> analysis significantly.
> >
> > Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be
> fired from LWLockWaitForVar, despite that function never actually acquiring
> the lock.
> >
> > Patch 2 adds the tranche id and lock pointer for each trace hit. This
> makes it possible to differentiate between individual locks within a
> tranche, and (so long as they aren't tranches in a DSM segment) compare
> locks between processes. That means you can do lock-order analysis etc,
> which was not previously especially feasible. Traces also don't have to do
> userspace reads for the tranche name all the time, so the trace can run
> with lower overhead.
> >
> > Patch 3 adds a single-path tracepoint for all lock acquires and
> releases, so you only have to probe the lwlock__acquired and
> lwlock__release events to see all acquires/releases, whether conditional or
> otherwise. It also adds start markers that can be used for timing wallclock
> duration of LWLock acquires/releases.
> >
> > Patch 4 adds some comments on LWLock tranches to try to address some
> points I found confusing and hard to understand when investigating this
> topic.
> >
>
> You sent in your patch to pgsql-hackers on Dec 19, but you did not
> post it to the next CommitFest[1].  If this was intentional, then you
> need to take no action.  However, if you want your patch to be
> reviewed as part of the upcoming CommitFest, then you need to add it
> yourself before 2021-01-01 AoE[2]. Thanks for your contributions.
>
> Regards,
>
> [1] https://commitfest.postgresql.org/31/
> [2] https://en.wikipedia.org/wiki/Anywhere_on_Earth
>

Thanks.

CF entry created at https://commitfest.postgresql.org/32/2927/ . I don't
think it's urgent and will have limited review time so I didn't try to
wedge it into the current CF.


Re: [PATCH] Runtime control of CLOBBER_CACHE_ALWAYS

2021-01-06 Thread Craig Ringer
On Wed, 6 Jan 2021 at 18:23, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2021-01-06 00:30, Craig Ringer wrote:
> > Perhaps debug_invalidate_system_caches_always ? It's a bit long but we
> > use the debug prefix elsewhere too.
>
> Committed with that name.
>

Thanks very much.


> In your original patch, this hunk in pg_config_manual.h:
>
> + * You can define these by editing pg_config_manual.h but it's usually
> + * sufficient to pass CFLAGS to configure, e.g.
> + *
> + * ./configure --enable-debug --enable-cassert CFLAGS="-DUSE_VALGRIND"
> + *
> + * The flags are persisted in Makefile.global so they will be correctly
> + * applied to extensions, including those built by PGXS.
>
> I don't think that necessarily works for all settings.  Maybe we should
> make a pass through it and ensure it all works sensibly, but that seems
> like a separate project.
>

It should work for everything, since we persist the CFLAGS string, rather
than individual settings.

But I didn't mean to leave that in the same patch anyway, it's separate.


Re: [PATCH] Runtime control of CLOBBER_CACHE_ALWAYS

2021-01-05 Thread Craig Ringer
On Tue, 5 Jan 2021, 22:41 Peter Eisentraut, <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-12-03 07:01, Craig Ringer wrote:
> > To try it out, apply the patch (git am), build with --enable-cassert,
> > then compare:
> >
> > make -C src/test/regress check
> >
> > and
> >
> > PGOPTIONS="-c debug_clobber_cache_depth=1" \
> > make -C src/test/regress check
> >
> > The speed difference will be obvious if nothing else!
>
> This is a really useful feature change.  I have a version that I'm happy
> to commit, but I wanted to check on the name of the setting.  The
> proposed name arose during the discussion when it was just to set the
> recursion depth but not enable the feature altogether, so I think that
> name is a bit misleading now.  We could reuse the old macro name, as in
> clobber_cache_always=N, which is very recognizable.  But the feature
> itself doesn't clobber anything (that's done by CLOBBER_FREED_MEMORY),
> so most accurate would be something like
> invalidate_system_caches_always=N.  Thoughts?
>

Modulo typo, I think that's a better name.

Perhaps debug_invalidate_system_caches_always ? It's a bit long but we use
the debug prefix elsewhere too.



> --
> Peter Eisentraut
> 2ndQuadrant, an EDB company
> https://www.2ndquadrant.com/
>
>
>


Re: [PATCH] LWLock self-deadlock detection

2021-01-04 Thread Craig Ringer
On Wed, 30 Dec 2020 at 10:11, Andres Freund  wrote:

> Hi,
>
> On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> > On 26/11/2020 04:50, Tom Lane wrote:
> > > Craig Ringer  writes:
> > > > On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat <
> ashutosh.bapat@gmail.com>
> > > > wrote:
> > > > > I'd prefer to make the lock self deadlock check run for production
> > > > > builds, not just cassert builds.
> > >
> > > I'd like to register a strong objection to spending any cycles
> whatsoever
> > > on this.  If you're using LWLocks in a way that could deadlock, you're
> > > doing it wrong.  The entire point of that mechanism is to be Light
> Weight
> > > and not have all the overhead that the heavyweight lock mechanism has.
> > > Building in deadlock checks and such is exactly the wrong direction.
> > > If you think you need that, you should be using a heavyweight lock.
> > >
> > > Maybe there's some case for a cassert-only check of this sort, but
> running
> > > it in production is right out.
> > >
> > > I'd also opine that checking for self-deadlock, but not any more
> general
> > > deadlock, seems pretty pointless.  Careless coding is far more likely
> > > to create opportunities for the latter.  (Thus, I have little use for
> > > the existing assertions of this sort, either.)
> >
> > I've made the mistake of forgetting to release an LWLock many times,
> leading
> > to self-deadlock. And I just reviewed a patch that did that this week
> [1].
> > You usually find that mistake very quickly when you start testing
> though, I
> > don't think I've seen it happen in production.
>
> I think something roughly along Craig's original patch, basically adding
> assert checks against holding a lock already, makes sense. Compared to
> the other costs of running an assert build this isn't a huge cost, and
> it's helpful.
>
> I entirely concur that doing this outside of assertion builds is a
> seriously bad idea.
>

Yeah, given it only targets developer error that's sensible.


Re: Logical decoding without slots: decoding in lockstep with recovery

2020-12-25 Thread Craig Ringer
On Wed, 23 Dec 2020, 18:57 Amit Kapila,  wrote:

> On Wed, Dec 23, 2020 at 12:26 PM Craig Ringer
>  wrote:
> >
> > Hi all
> >
> > I want to share an idea I've looked at a few times where I've run into
> situations where logical slots were inadvertently dropped, or where it
> became necessary to decode changes in the past on a slot.
> >
> > As most of you will know you can't just create a logical slot in the
> past. Even if it was permitted, it'd be unsafe due to catalog_xmin
> retention requirements and missing WAL.
> >
> > But if we can arrange a physical replica to replay the WAL of interest
> and decode each commit as soon as it's replayed by the startup process, we
> know the needed catalog rows must all exist, so it's safe to decode the
> change.
> >
> > So it should be feasible to run logical decoding in standby, even
> without a replication slot, so long as we:
> >
> > * pause startup process after each xl_xact_commit
> > * wake the walsender running logical decoding
> > * decode and process until ReorderBufferCommit for the just-committed
> xact returns
> > * wake the startup process to replay up to the next commit
> >
>
> How will you deal with subscriber restart?  I think you need some way
> to remember confirmed_flush_lsn and restart_lsn and then need to teach
> WAL machinery to restart from some previous point.
>

The simplest option, albeit slow, would be to require the subscriber to
confirm flush before allowing more WAL redo. That's what I'd initially
assumed. This is a bit of a corner-case situation after all; it's never
going to be fast with the switching back and forth.

More efficient would be to decode and output to a local spool file and
fsync it. Then separately send that to the subscriber, like has been
discussed in other work on more efficient logical decoding. So the output
plugin output would go to a local spool.

That can be implemented with pg_recvlogical if needed.


> --
> With Regards,
> Amit Kapila.
>


Re: Improving LWLock wait events

2020-12-22 Thread Craig Ringer
On Wed, 23 Dec 2020 at 15:51, Craig Ringer 
wrote:

>
> I've struggled with this quite a bit myself.
>
>
By the way, I sent in a patch to enhance the static tracepoints available
for LWLocks. See
https://www.postgresql.org/message-id/cagry4nxjo+-hcc2i5h93ttsz4gzo-fsddcwvkb-qafq1zdx...@mail.gmail.com
.

It'd benefit significantly from the sort of changes you mentioned in #4.
For most purposes I've been able to just use the raw LWLock* but having a
nice neat (tranche,index) value would be ideal.

The trace patch has helped me identify some excessively long LWLock waits
in tools I work on. I'll share another of the systemtap scripts I used with
it soon.


Re: Improving LWLock wait events

2020-12-22 Thread Craig Ringer
On Mon, 21 Dec 2020 at 05:27, Andres Freund  wrote:

> Hi,
>
> The current wait events are already pretty useful. But I think we could
> make them more informative without adding real runtime overhead.
>
>
All 1-3 sound pretty sensible to me.

I also think there's a 4, but I think the tradeoffs are a bit more
> complicated:
>

> 4) For a few types of lwlock just knowing the tranche isn't
> sufficient. E.g. knowing whether it's one or different buffer mapping locks
> are being waited on is important to judge contention.
>

I've struggled with this quite a bit myself.

In particular, for tools that validate acquire-ordering safety it's
desirable to be able to identify a specific lock in a backend-independent
way.

The hardest part would be to know how to identify individual locks. The
> easiest would probably be to just mask in a parts of the lwlock address
> (e.g. shift it right by INTALIGN, and then mask in the result into the
> eventId). That seems a bit unsatisfying.
>

It also won't work reliably for locks in dsm segments, since the lock can
be mapped to a different address in different backends.

We could probably do a bit better: We could just store the information about
> tranche / offset within tranche at LWLockInitialize() time, instead of
> computing something just before waiting.  While LWLock.tranche is only
> 16bits
> right now, the following two bytes are currently padding...
>
> That'd allow us to have proper numerical identification for nearly all
> tranches, without needing to go back to the complexity of having tranches
> specify base & stride.
>

That sounds appealing. It'd work for any lock in MainLWLockArray - all
built-in individual LWLocks, LWTRANCHE_BUFFER_MAPPING,
LWTRANCHE_LOCK_MANAGER, LWTRANCHE_PREDICATE_LOCK_MANAGER, any lock
allocated by RequestNamedLWLockTranche().

Some of the other tranches allocate locks in contiguous fixed blocks or in
ways that would let them maintain a counter.

We'd need some kind of "unknown" placeholder value for LWLocks where that
doesn't make sense, though, like most locks allocated by callers that make
their own LWLockNewTrancheId() call and locks in some of the  built-in
tranches not allocated in MainLWLockArray.

So I suggest retaining the current LWLockInitialize() and making it a
wrapper for LWLockInitializeWithIndex() or similar. Use a 1-index and keep
0 as unknown, or use 0-index and use (max-1) as unknown.
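
A minimal sketch of what I mean, purely illustrative (the new field name and
the exact struct layout are assumptions, not a worked patch):

/*
 * Illustrative only: store the per-tranche lock index at init time in the
 * two padding bytes after lock->tranche, reserving 0 for "unknown".
 */
void
LWLockInitializeWithIndex(LWLock *lock, int tranche_id, uint16 index_in_tranche)
{
    pg_atomic_init_u32(&lock->state, LW_FLAG_RELEASE_OK);
    lock->tranche = tranche_id;
    lock->index_in_tranche = index_in_tranche;  /* hypothetical new field */
    proclist_init(&lock->waiters);
}

void
LWLockInitialize(LWLock *lock, int tranche_id)
{
    /* callers with no stable per-tranche numbering report "unknown" */
    LWLockInitializeWithIndex(lock, tranche_id, 0);
}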


TAP PostgresNode function to gdb stacks and optional cores for all backends

2020-12-22 Thread Craig Ringer
Hi all

I recently wrote a utility that adds a $node->gdb_backends() method to
PostgresNode instances - figured I'd share it here in case anyone finds it
useful, or wants to adopt it into the features of the TAP tools.

This function provides a one-line way to dump stacks for all running
backends to per-pid files or to the main test log, as well as the values of
various global variables that are potentially of interest. A default set of
globals will be dumped for each backend and the caller can specify
additional expressions of interest.

If requested, cores will be dumped for each running backend.

A subset of backends may be passed by pid instead, so you can easily target
specific backends you're interested in.

I initially wrote this to help debug a variety of issues with shutdown,
where I hacked the PostgresNode stop() method to trap failed shutdowns and
report stacks for all surviving processes + the postmaster in my wrapper
class for PostgresNode:

sub stop {
    my ($self, $mode) = @_;
    local($@);
    eval {
        PostgresNode::stop($self, $mode);
    };
    if ($@) {
        $self->gdb_backends(want_cores => 1);
        die $@;
    }
}
#
# This is an excerpt from a subclass of PostgresNode
#

# Generate backtraces and optionally core files for all user backends and
# walsenders associated with this node. Requires gdb to be present. Cores
# will be labeled by node name.
sub gdb_backends {
my ($self, %kwargs) = @_;
$kwargs{backtrace_timeout_s} //= '60';
$kwargs{core_timeout_s} //= '60';
$kwargs{want_cores} //= 0;
$kwargs{core_name_pattern} //= 'core.{{pid}}';
$kwargs{gdb_logfile_pattern} //= '';

my $postmaster_pid = $self->{_pid};
my $pgname = $self->name;

# Globals
# TODO make these conditional on an expression to filter them.
# TODO handle statics that vary across files
# TODO add typecasts for when we don't have debuginfo
# TODO useful GUCs
#
my @print_exprs = (
# All backends
'IsPostmasterEnvironment',
'IsUnderPostmaster',
'PostmasterPid',
'LocalRecoveryInProgress',
'*MyProc',
'MyAuxProcType',
'*XLogCtl',
'*ControlFile',

# Generic signal handling
'InterruptPending',
'ProcDiePending',
'ShutdownRequestPending',
'ConfigReloadPending',

# user backend / postgres
'xact_started',
'doing_extended_query_message',
'ignore_till_sync',

# startup process
'ThisTimeLineID',
'LastRec',
'ArchiveRecoveryRequested',
'InArchiveRecovery',
'PrimaryConnInfo',
'PrimarySlotName',
'StandbyMode',

# autovac
'am_autovacuum_launcher',
'am_autovacuum_worker',
'got_SIGHUP',
'got_SIGUSR2',
'got_SIGTERM',
"'autovacuum.c':got_SIGTERM",

# for walsenders
'am_walsender',
'am_cascading_walsender',
'am_db_walsender',
'*MyWalSnd',
'*xlogreader',
'sendTimeLine',
'sentPtr',
'streamingDoneSending',
'streamingDoneReceiving',
"'walsender.c':got_SIGTERM",
'got_STOPPING',
'got_SIGUSR2',
'replication_active',
'*logical_decoding_ctx',
'logical_startptr',

# walreceiver
'recvFileTLI',
'*wrconn',

# checkpointer
'*CheckpointerShmem',
'last_checkpoint_time',
'ckpt_active',

# for bgworkers
'IsBackgroundWorker',

# for pgl backends
'*MyPGLogicalWorker',
'*MyPGLSubscription',

# for bdr backends
'*MyBdrSubscription',

# postmaster
'pmState',
);

# Add your own print expressions by passing print_exprs => ['var1', 'var2']
push @print_exprs, @{$kwargs{print_exprs}}
    if (defined($kwargs{print_exprs}));

my @pids;
if (defined($kwargs{pids})) {
if (ref($kwargs{pids}) eq 'ARRAY') {
# arrayref pid-list
@pids = @{$kwargs{pids}};
} elsif (ref($kwargs{pids}) eq '') {
# Scalar pid-list
@pids = split(qr/[\r\n]/, $kwargs{pids});
} else {
die("keyword argument 'pids' must be undef, an 
arrayref, 

Logical decoding without slots: decoding in lockstep with recovery

2020-12-22 Thread Craig Ringer
Hi all

I want to share an idea I've looked at a few times where I've run into
situations where logical slots were inadvertently dropped, or where it
became necessary to decode changes in the past on a slot.

As most of you will know you can't just create a logical slot in the past.
Even if it was permitted, it'd be unsafe due to catalog_xmin retention
requirements and missing WAL.

But if we can arrange a physical replica to replay the WAL of interest and
decode each commit as soon as it's replayed by the startup process, we know
the needed catalog rows must all exist, so it's safe to decode the change.

So it should be feasible to run logical decoding in standby, even without a
replication slot, so long as we:

* pause startup process after each xl_xact_commit
* wake the walsender running logical decoding
* decode and process until ReorderBufferCommit for the just-committed xact
returns
* wake the startup process to replay up to the next commit

Can anyone see any obvious problem with this?

I don't think the potential issues with WAL commit visibility order vs
shmem commit visibility order should be a concern.

I see this as potentially useful in data recovery, where you might want to
be able to extract a change stream for a subset of tables from PITR
recovery, for example. Also for audit use.


Re: Preventing hangups in bgworker start/stop during DB shutdown

2020-12-22 Thread Craig Ringer
On Wed, 23 Dec 2020 at 05:40, Tom Lane  wrote:

> Here's an attempt at closing the race condition discussed in [1]
> (and in some earlier threads, though I'm too lazy to find them).
>
> The core problem is that the bgworker management APIs were designed
> without any thought for exception conditions, notably "we're not
> gonna launch any more workers because we're shutting down the database".
> A process waiting for a worker in WaitForBackgroundWorkerStartup or
> WaitForBackgroundWorkerShutdown will wait forever, so that the database
> fails to shut down without manual intervention.
>
> I'd supposed that we would need some incompatible changes in those APIs
> in order to fix this, but after further study it seems like we could
> hack things by just acting as though a request that won't be serviced
> has already run to completion.  I'm not terribly satisfied with that
> as a long-term solution --- it seems to me that callers should be told
> that there was a failure.  But this looks to be enough to solve the
> lockup condition for existing callers, and it seems like it'd be okay
> to backpatch.
>
> Thoughts?
>

Callers who launch bgworkers already have to cope with conditions such as
the worker failing immediately after launch, or before attaching to the
shmem segment used for worker management by whatever extension is launching
it.

So I think it's reasonable to lie and say we launched it. The caller must
already cope with this case to behave correctly.

Patch specifics:

> This function should only be called from the postmaster

It'd be good to

Assert(IsPostmasterEnvironment && !IsUnderPostmaster)

in these functions.

Otherwise at first read the patch and rationale looks sensible to me.

(When it comes to the bgw APIs in general I have a laundry list of things
I'd like to change or improve around signal handling, error trapping and
recovery, and lots more, but that's for another thread.)


[PATCH] Identify LWLocks in tracepoints

2020-12-18 Thread Craig Ringer
Hi all

The attached patch set follows on from the discussion in [1] "Add LWLock
blocker(s) information" by adding the actual LWLock* and the numeric
tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.

This does not provide complete information on blockers, because it's not
necessarily valid to compare any two LWLock* pointers between two process
address spaces. The locks could be in DSM segments, and those DSM segments
could be mapped at different addresses.

I wasn't able to work out a sensible way to map a LWLock* to any sort of
(tranche-id, lock-index) because there's no requirement that locks in a
tranche be contiguous or known individually to the lmgr.

Despite that, the patches improve the information available for LWLock
analysis significantly.

Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be
fired from LWLockWaitForVar, despite that function never actually acquiring
the lock.

Patch 2 adds the tranche id and lock pointer for each trace hit. This makes
it possible to differentiate between individual locks within a tranche, and
(so long as they aren't tranches in a DSM segment) compare locks between
processes. That means you can do lock-order analysis etc, which was not
previously especially feasible. Traces also don't have to do userspace
reads for the tranche name all the time, so the trace can run with lower
overhead.

Patch 3 adds a single-path tracepoint for all lock acquires and releases,
so you only have to probe the lwlock__acquired and lwlock__release events
to see all acquires/releases, whether conditional or otherwise. It also
adds start markers that can be used for timing wallclock duration of LWLock
acquires/releases.

Patch 4 adds some comments on LWLock tranches to try to address some points
I found confusing and hard to understand when investigating this topic.



[1]
https://www.postgresql.org/message-id/CAGRY4nz%3DSEs3qc1R6xD3max7sg3kS-L81eJk2aLUWSQAeAFJTA%40mail.gmail.com
.
From 583c818e3121c0f7c6506b434497c81ae94ee9cb Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 19 Nov 2020 17:30:47 +0800
Subject: [PATCH v1 4/4] Comments on LWLock tranches

---
 src/backend/storage/lmgr/lwlock.c | 49 +--
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cfdfa7f328..123bcc463e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -112,11 +112,14 @@ extern slock_t *ShmemLock;
  *
  * 1. The individually-named locks defined in lwlocknames.h each have their
  * own tranche.  The names of these tranches appear in IndividualLWLockNames[]
- * in lwlocknames.c.
+ * in lwlocknames.c. The LWLock structs are allocated in MainLWLockArray.
  *
  * 2. There are some predefined tranches for built-in groups of locks.
  * These are listed in enum BuiltinTrancheIds in lwlock.h, and their names
- * appear in BuiltinTrancheNames[] below.
+ * appear in BuiltinTrancheNames[] below. The LWLock structs are allocated
+ * elsewhere under the control of the subsystem that manages the tranche. The
+ * LWLock code does not know or care where in shared memory they are allocated
+ * or how many there are in a tranche.
  *
  * 3. Extensions can create new tranches, via either RequestNamedLWLockTranche
  * or LWLockRegisterTranche.  The names of these that are known in the current
@@ -196,6 +199,13 @@ static int	LWLockTrancheNamesAllocated = 0;
  * This points to the main array of LWLocks in shared memory.  Backends inherit
  * the pointer by fork from the postmaster (except in the EXEC_BACKEND case,
  * where we have special measures to pass it down).
+ *
+ * This array holds individual LWLocks and LWLocks allocated in named tranches.
+ *
+ * It does not hold locks for any LWLock that's separately initialized with
+ * LWLockInitialize(). Locks in tranches listed in BuiltinTrancheIds or
+ * allocated with LWLockNewTrancheId() can be embedded in other structs
+ * anywhere in shared memory.
  */
 LWLockPadded *MainLWLockArray = NULL;
 
@@ -593,6 +603,12 @@ InitLWLockAccess(void)
  * Caller needs to retrieve the requested number of LWLocks starting from
  * the base lock address returned by this API.  This can be used for
  * tranches that are requested by using RequestNamedLWLockTranche() API.
+ *
+ * The locks are already initialized.
+ *
+ * This function can not be used for locks in builtin tranches or tranches
+ * registered with LWLockRegisterTranche(). There is no way to look those locks
+ * up by name.
  */
 LWLockPadded *
 GetNamedLWLockTranche(const char *tranche_name)
@@ -647,6 +663,14 @@ LWLockNewTrancheId(void)
  *
  * The tranche name will be user-visible as a wait event name, so try to
  * use a name that fits the style for those.
+ *
+ * The tranche ID should be a user-defined tranche ID acquired from
+ * LWLockNewTrancheId(). It is not necessary to call this for tranches
+ *

Re: [PATCH] Runtime control of CLOBBER_CACHE_ALWAYS

2020-12-09 Thread Craig Ringer
On Thu, 3 Dec 2020 at 15:53, Craig Ringer  wrote:

>
> On Tue, 27 Oct 2020 at 16:34, Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>
>> On 2020-09-25 09:40, Craig Ringer wrote:
>> > While working on extensions I've often wanted to enable cache
>> clobbering
>> > for a targeted piece of code, without paying the price of running it
>> for
>> > all backends during postgres startup and any initial setup tasks that
>> > are required.
>> >
>> > So here's a patch that, when CLOBBER_CACHE_ALWAYS or
>> > CLOBBER_CACHE_RECURSIVE are defined, adds a GUC named
>> > clobber_cache_depth . It defaults to 1 for CLOBBER_CACHE_ALWAYS or 3
>> for
>> > CLOBBER_CACHE_RECURSIVE to match the existing compiled-in behaviour.
>> But
>> > with this change it's now possible to run Pg with clobber_cache_depth=0
>> > then set it to 1 only for targeted tests.
>>
>> I think it would be great if the cache clobber code is always compiled
>> in (but turned off) when building with assertions.  The would reduce the
>> number of hoops to jump through to actually use this locally and reduce
>> the number of surprises from the build farm.
>
>
>
Updated per your comments Peter, see mail immediately up-thread.

Moved to this CF as https://commitfest.postgresql.org/31/2749/ .


Re: Logical archiving

2020-12-08 Thread Craig Ringer
On Tue, 8 Dec 2020 at 17:54, Andrey Borodin  wrote:


> > In pglogical3 we already support streaming decoded WAL data to
> alternative writer downstreams including RabbitMQ and Kafka via writer
> plugins.
> Yes, Yandex.Cloud Transfer Manger supports it too. But it has to be
> resynced after physical failover. And internal installation of YC have
> mandatory drills: few times in a month one datacenter is disconnected and
> failover happens for thousands a DBS.
>

You'll want to look at
https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover#All-logical-replication_HA
then.


Re: Blocking I/O, async I/O and io_uring

2020-12-08 Thread Craig Ringer
On Tue, 8 Dec 2020 at 15:04, Andres Freund  wrote:

> Hi,
>
> On 2020-12-08 04:24:44 +, tsunakawa.ta...@fujitsu.com wrote:
> > I'm looking forward to this from the async+direct I/O, since the
> > throughput of some write-heavy workload decreased by half or more
> > during checkpointing (due to fsync?)
>
> Depends on why that is. The most common, I think, cause is that your WAL
> volume increases drastically just after a checkpoint starts, because
> initially all page modification will trigger full-page writes.  There's
> a significant slowdown even if you prevent the checkpointer from doing
> *any* writes at that point.  I got the WAL AIO stuff to the point that I
> see a good bit of speedup at high WAL volumes, and I see it helping in
> this scenario.
>
> There's of course also the issue that checkpoint writes cause other IO
> (including WAL writes) to slow down and, importantly, cause a lot of
> jitter leading to unpredictable latencies.  I've seen some good and some
> bad results around this with the patch, but there's a bunch of TODOs to
> resolve before delving deeper really makes sense (the IO depth control
> is not good enough right now).
>
> A third issue is that sometimes checkpointer can't really keep up - and
> that I think I've seen pretty clearly addressed by the patch. I have
> managed to get to ~80% of my NVMe disks top write speed (> 2.5GB/s) by
> the checkpointer, and I think I know what to do for the remainder.
>
>
Thanks for explaining this. I'm really glad you're looking into it. If I
get the chance I'd like to try to apply some wait-analysis and blocking
stats tooling to it. I'll report back if I make any progress there.


Re: Blocking I/O, async I/O and io_uring

2020-12-07 Thread Craig Ringer
On Tue, 8 Dec 2020 at 12:02, Andres Freund  wrote:

> Hi,
>
> On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> > A new kernel API called io_uring has recently come to my attention. I
> > assume some of you (Andres?) have been following it for a while.
>
> Yea, I've spent a *lot* of time working on AIO support, utilizing
> io_uring. Recently Thomas also joined in the fun. I've given two talks
> referencing it (last pgcon, last pgday brussels), but otherwise I've not
> yet written much about. Things aren't *quite* right yet architecturally,
> but I think we're getting there.
>

That's wonderful. Thankyou.

I'm badly behind on the conference circuit due to geographic isolation and
small children. I'll hunt up your talks.

The current state is at https://github.com/anarazel/postgres/tree/aio
> (but it's not a very clean history at the moment).
>

Fantastic!

Have you done much bpf / systemtap / perf based work on measurement and
tracing of latencies etc? If not that's something I'd be keen to help with.
I've mostly been using systemtap so far but I'm trying to pivot over to
bpf.

I hope to submit a big tracepoints patch set for PostgreSQL soon to better
expose our wait points and latencies, improve visibility of blocking, and
help make activity traceable through all the stages of processing. I'll Cc
you when I do.


> > io_uring appears to offer a way to make system calls including reads,
> > writes, fsync()s, and more in a non-blocking, batched and pipelined
> manner,
> > with or without O_DIRECT. Basically async I/O with usable buffered I/O
> and
> > fsync support. It has ordering support which is really important for us.
>
> My results indicate that we really want to have have, optional & not
> enabled by default of course, O_DIRECT support. We just can't benefit
> fully of modern SSDs otherwise. Buffered is also important, of course.
>

Even more so for NVDRAM, Optane and all that, where zero-copy and low
context switches becomes important too.

We're a long way from that being a priority but it's still not to be
dismissed.

I'm pretty sure that I've got the basics of this working pretty well. I
> don't think the executor architecture is as big an issue as you seem to
> think. There are further benefits that could be unlocked if we had a
> more flexible executor model (imagine switching between different parts
> of the query whenever blocked on IO - can't do that due to the stack
> right now).
>

Yep, that's what I'm talking about being an issue.

Blocked on an index read? Move on to the next tuple and come back when the
index read is done.

I really like what I see of the io_uring architecture so far. It's ideal
for callback-based event-driven flow control. But that doesn't fit postgres
well for the executor. It's better for redo etc.



> The way it currently works is that things like sequential scans, vacuum,
> etc use a prefetching helper which will try to use AIO to read ahead of
> the next needed block. That helper uses callbacks to determine the next
> needed block, which e.g. vacuum uses to skip over all-visible/frozen
> blocks. There's plenty other places that should use that helper, but we
> already can get considerably higher throughput for seqscans, vacuum on
> both very fast local storage, and high-latency cloud storage.
>
> Similarly, for writes there's a small helper to manage a write-queue of
> configurable depth, which currently is used to by checkpointer and
> bgwriter (but should be used in more places). Especially with direct IO
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.
>

Sure sounds like a useful interim step. That's great.

I've got asynchronous writing of WAL mostly working, but need to
> redesign the locking a bit further. Right now it's a win in some cases,
> but not others. The latter to a significant degree due to unnecessary
> blocking
>

That's where io_uring's I/O ordering operations looked interesting. But I
haven't looked closely enough to see if they're going to help us with I/O
ordering in a multiprocessing architecture like postgres.

In an ideal world we could tell the kernel about WAL-to-heap I/O
dependencies and even let it apply WAL then heap changes out-of-order so
long as they didn't violate any ordering constraints we specify between
particular WAL records or between WAL writes and their corresponding heap
blocks. But I don't know if the io_uring interface is that capable.

I did some basic experiments a while ago with using write barriers between
WAL records and heap writes instead of fsync()ing, but as you note, the
increased blocking and reduction in the kernel's ability to do I/O
reordering is generally worse than the costs of the fsync()s we do now.

> I'm thinking that redo is probably a good first candidate. It doesn't
> > de

Re: Blocking I/O, async I/O and io_uring

2020-12-07 Thread Craig Ringer
References to get things started:

* https://lwn.net/Articles/810414/
* https://unixism.net/loti/what_is_io_uring.html
*
https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
*
https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

You'll probably notice how this parallels my sporadic activities around
pipelining in other areas, and the PoC libpq pipelining patch I sent in a
few years ago.


Blocking I/O, async I/O and io_uring

2020-12-07 Thread Craig Ringer
Hi all

A new kernel API called io_uring has recently come to my attention. I
assume some of you (Andres?) have been following it for a while.

io_uring appears to offer a way to make system calls including reads,
writes, fsync()s, and more in a non-blocking, batched and pipelined manner,
with or without O_DIRECT. Basically async I/O with usable buffered I/O and
fsync support. It has ordering support which is really important for us.
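
For anyone who hasn't looked at it yet, here's a trivial standalone liburing
sketch of the batching and ordering I mean - nothing to do with how Postgres
would actually structure its I/O, and the file path is just a placeholder:

#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

int
main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    static char readbuf[2][8192];
    static char walbuf[8192];
    int         fd = open("datafile", O_RDWR);  /* placeholder path */

    if (fd < 0 || io_uring_queue_init(32, &ring, 0) < 0)
        return 1;

    /* two buffered reads, no ordering between them */
    for (int i = 0; i < 2; i++)
    {
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, readbuf[i], sizeof(readbuf[i]), i * 8192);
        io_uring_sqe_set_data(sqe, (void *) (long) i);
    }

    /* a write, then an fsync that won't start until the write completes */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, walbuf, sizeof(walbuf), 16384);
    sqe->flags |= IOSQE_IO_LINK;
    io_uring_sqe_set_data(sqe, (void *) 2L);
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);
    io_uring_sqe_set_data(sqe, (void *) 3L);

    io_uring_submit(&ring);     /* one syscall submits the whole batch */

    for (int i = 0; i < 4; i++)
    {
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            break;
        printf("op %llu completed, res=%d\n",
               (unsigned long long) cqe->user_data, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}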

This should be on our radar. The main barriers to benefiting from linux-aio
based async I/O in postgres in the past has been its reliance on direct
I/O, the various kernel-version quirks, platform portability, and its
maybe-async-except-when-it's-randomly-not nature.

The kernel version and portability remain an issue with io_uring so it's
not like this is something we can pivot over to completely. But we should
probably take a closer look at it.

PostgreSQL spends a huge amount of time waiting, doing nothing, for
blocking I/O. If we can improve that then we could potentially realize some
major increases in I/O utilization especially for bigger, less concurrent
workloads. The most obvious candidates to benefit would be redo, logical
apply, and bulk loading.

But I have no idea how to even begin to fit this into PostgreSQL's executor
pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative
in nature, with a push/pull executor pipeline. It seems to have been
recognised for some time that this is increasingly hurting our performance
and scalability as platforms become more and more parallel.

To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we
have to be able to dispatch I/O and do something else while we wait for the
results. So we need the ability to pipeline the executor and pipeline redo.

I thought I'd start the discussion on this and see where we can go with it.
What incremental steps can be done to move us toward parallelisable I/O
without having to redesign everything?

I'm thinking that redo is probably a good first candidate. It doesn't
depend on the guts of the executor. It is much less sensitive to ordering
between operations in shmem and on disk since it runs in the startup
process. And it hurts REALLY BADLY from its single-threaded blocking
approach to I/O - as shown by an extension written by 2ndQuadrant that can
double redo performance by doing read-ahead on btree pages that will soon
be needed.

Thoughts anybody?


Re: POC: Better infrastructure for automated testing of concurrency issues

2020-12-06 Thread Craig Ringer
On Wed, 25 Nov 2020 at 22:11, Alexander Korotkov 
wrote:

> Hackers,
>
> PostgreSQL is a complex multi-process system, and we are periodically
> faced with complicated concurrency issues. While the postgres community
> does a great job on investigating and fixing the problems, our ability to
> reproduce concurrency issues in the source code test suites is limited.
>
> I think we currently have two general ways to reproduce the concurrency
> issues.
> 1. A text scenario for manual reproduction of the issue, which could
> involve psql sessions, gdb sessions etc. Couple of examples are [1] and
> [2]. This method provides reliable reproduction of concurrency issues. But
> it's  hard to automate, because it requires external instrumentation
> (debugger) and it's not stable in terms of postgres code changes (that is
> particular line numbers for breakpoints could be changed). I think this is
> why we currently don't have such scenarios among postgres test suites.
> 2. Another way is to reproduce the concurrency issue without directly
> touching the database internals using pgbench or other way to simulate the
> workload (see [3] for example). This way is easier to automate, because it
> doesn't need external instrumentation and it's not so sensitive to source
> code changes. But at the same time this way is not reliable and is
> resource-consuming.
>

Agreed.

For a useful but limited set of cases there's (3) the isolation tester and
pg_isolation_regress. But IIRC the patches to teach it to support multiple
upstream nodes never got in, so it's essentially useless for any
replication related testing.

There's also (4), write a TAP test that uses concurrent psql sessions via
IPC::Run. Then play games with heavyweight or advisory lock waits to order
events, use instance starts/stops, change ports or connstrings to simulate
network issues, use SIGSTOP/SIGCONTs, add src/test/modules extensions that
inject faults or provide custom blocking wait functions for the event you
want, etc. I've done that more than I'd care to, and I don't want to do it
any more than I have to in future.

In some cases I've gone further and written tests that use systemtap in
"guru" mode (read/write, with embedded C enabled) to twiddle the memory of
the target process(es) when a probe is hit, e.g. to modify a function
argument or return value or inject a fault. Not exactly portable or
convenient, though very powerful.


In the view of above, I'd like to propose a POC patch, which implements new
> builtin infrastructure for reproduction of concurrency issues in automated
> test suites.  The general idea is so-called "stop events", which are
> special places in the code, where the execution could be stopped on some
> condition.  Stop event also exposes a set of parameters, encapsulated into
> jsonb value.  The condition over stop event parameters is defined using
> jsonpath language.
>

The patched PostgreSQL used by 2ndQuadrant internally has a feature called
PROBE_POINT()s that is somewhat akin to this. Since it's not a customer
facing feature I'm sure I can discuss it here, though I'll need to seek
permission before I can show code.

TL;DR: PROBE_POINT()s let you inject ERRORs, sleeps, crashes, and various
other behaviour at points in the code marked by name, using GUCs, hooks
loaded from test extensions, or even systemtap scripts to control what
fires and when. Performance impact is essentially zero when no probes are
currently enabled at runtime, so they're fine for cassert builds.

Details:

A PROBE_POINT() is a macro that works as a marker, a bit like a
TRACE_POSTGRESQL_ dtrace macro. But instead of the super lightweight
tracepoint that SDT marker points emit, a PROBE_POINT tests an
unlikely(probe_points_enabled) flag, and if true, it prepares arguments for
the probe handler: A probe name, probe action, sleep duration, and a hit
counter.

The default probe action and sleep duration come from GUCs. So your control
of the probe is limited to the granularity you can easily manage GUCs at.
That's often sufficient.

But if you want finer control for something, there are two ways to achieve
it.

After picking the default arguments to the handler, the probe point checks
for a hook. If defined, it calls it with the probe point name and pointers
to the action and sleep duration values, so the hook function can modify
them per probe-point hit. That way you can use it from src/test/modules
extensions or your own test extensions, with the probe point name as an
argument and the action and sleep duration as out-params, as well as any
accessible global state, custom GUCs you define in your test extension,
etc. That's usually enough to target a probe very specifically, but it's a
bit of a hassle.
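
To give a feel for the shape of it without reproducing the actual code, it's
roughly along these lines (all names illustrative, simplified):

/*
 * Hypothetical sketch only -- not the 2ndQuadrant implementation, just the
 * general shape described above.
 */
typedef enum ProbePointAction
{
    PROBE_ACTION_NOOP,
    PROBE_ACTION_SLEEP,
    PROBE_ACTION_ELOG,
    PROBE_ACTION_ERROR,
    PROBE_ACTION_PANIC
} ProbePointAction;

/* hook lets a test extension adjust action/sleep for each hit, by name */
typedef void (*probe_point_hook_type) (const char *name,
                                       ProbePointAction *action,
                                       int *sleep_ms,
                                       uint64 hits);
extern probe_point_hook_type probe_point_hook;

extern bool probe_points_enabled;   /* near-zero cost when false */
extern void probe_point_handle(const char *name, uint64 *hit_counter);

#define PROBE_POINT(name) \
    do { \
        if (unlikely(probe_points_enabled)) \
        { \
            static uint64 probe_hits = 0; \
            probe_point_handle(name, &probe_hits); \
        } \
    } while (0)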

Another option is to use a systemtap script. You can write your code in
systemtap with its language. When the systemtap marker for a probe point
event fires, decide if it's the one you want and twiddle the target process
variables that store the probe 

Re: Single transaction in the tablesync worker?

2020-12-06 Thread Craig Ringer
On Mon, 7 Dec 2020 at 11:44, Peter Smith  wrote:

>
> Basically, I was wondering why can't the "tablesync" worker just
> gather messages in a similar way to how the current streaming feature
> gathers messages into a "changes" file, so that they can be replayed
> later.
>
>
See the related thread "Logical archiving"

https://www.postgresql.org/message-id/20d9328b-a189-43d1-80e2-eb25b9284...@yandex-team.ru

where I addressed some parts of this topic in detail earlier today.

A) The "tablesync" worker (after the COPY) does not ever apply any of
> the incoming messages, but instead it just gobbles them into a
> "changes" file until it decides it has reached SYNCDONE state and
> exits.
>

This has a few issues.

Most importantly, the sync worker must cooperate with the main apply worker
to achieve a consistent end-of-sync cutover. The sync worker must have
replayed the pending changes in order to make this cutover, because the
main apply worker may need to start applying changes on top of the resync'd
table as soon as the next transaction it applies, so it needs to see the
rows there.

Doing this would also add another round of write multiplication since the
data would get spooled then applied to WAL then heap. Write multiplication
is already an issue for logical replication so adding to it isn't
particularly desirable without a really compelling reason. With the write
multiplication come disk space management issues for big transactions, as
well as the obvious performance/throughput impact.

It adds even more latency between upstream commit and downstream apply,
something that is again already an issue for logical replication.

Right now we don't have any concept of a durable and locally flushed spool.

It's not impossible to do as you suggest but the cutover requirement makes
it far from simple. As discussed in the logical archiving thread I think
it'd be good to have something like this, and there are times the write
multiplication price would be well worth paying. But it's not easy.

B) Then, when the "apply" worker proceeds, if it detects the existence
> of the "changes" file it will replay/apply_dispatch all those gobbled
> messages before just continuing as normal.
>

That's going to introduce a really big stall in the apply worker's progress
in many cases. During that time it won't be receiving from upstream (since
we don't spool logical changes to disk at this time) so the upstream lag
will grow. That will impact synchronous replication, pg_wal size
management, catalog bloat, etc. It'll also leave the upstream logical
decoding session idle, so when it resumes it may create a spike of I/O and
CPU load as it catches up, as well as a spike of network traffic. And
depending on how close the upstream write rate is to the max decode speed,
network throughput max, and downstream apply speed max, it may take some
time to catch up over the resulting lag.

Not a big fan of that approach.


RFC: Deadlock detector hooks for edge injection

2020-12-06 Thread Craig Ringer
Hi all

Related to my other post about deadlock detector hooks for victim
selection, I'd like to gauge opinions here about whether it's feasible to
inject edges into the deadlock detector's waits-for graph.

Doing so would help with detecting deadlocks relating to shm_mq waits, help
with implementing distributed deadlock detection solutions, make it
possible to spot deadlocks relating to condition-variable waits, etc.

I'm not sure quite how the implementation would look yet, this is an early
RFC and sanity check so I don't invest any work into it if it has no hope
of going anywhere.

I expect we'd want to build the graph only when the detector is triggered,
rather than proactively maintain such edges, so the code implementing the
hook would be responsible for keeping track of whatever state it needs to
in order to do so.

When called, it'd append "struct EDGE" s to the deadlock detector's
waits-for list.

We'd need to support a node representation other than a LOCK* for struct
EDGE, and to abstract edge sorting (TopoSort etc) to allow for other edge
types. So it wouldn't be a trivial change to make, hence opening with this
RFC.

I expect it'd be fine to require each EDGE* to have a PGPROC and to require
the PGPROC for waits-for and waiting-for not be the same proc. Distributed
systems that use libpq connections to remote nodes, or anything else, would
have to register the local-side PGPROC as the involved waiter or waited-on
object, and handle any mapping between the remote object and local resource
holder/acquirer themselves, probably using their own shmem state.

Bonus points if the callback could assign weights to the injected edges to
bias victim selection more gently. Or a way to tag a waited-for node as
not a candidate victim for cancellation.
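
To make that concrete, the hook surface might look vaguely like the following
- hypothetical names only, intended to anchor the discussion rather than
propose an exact API:

/* Hypothetical shapes only, not from any existing patch. */
typedef struct DeadlockExtraEdge
{
    PGPROC     *waiter;         /* local proc doing the waiting */
    PGPROC     *blocker;        /* local proc it waits on */
    int         weight;         /* optional bias for victim selection */
    bool        never_victim;   /* don't choose the waiter as the victim */
} DeadlockExtraEdge;

/*
 * Called while the deadlock detector builds its waits-for graph; returns the
 * number of extra edges written into the caller-supplied array.
 */
typedef int (*deadlock_extra_edges_hook_type) (DeadlockExtraEdge *edges,
                                               int max_edges);

extern PGDLLIMPORT deadlock_extra_edges_hook_type deadlock_extra_edges_hook;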

General thoughts?


RFC: Deadlock detector hooks for victim selection and edge injection

2020-12-06 Thread Craig Ringer
Hi folks

Now that we're well on track for streaming logical decoding, it's becoming
more practical to look at parallel logical apply.

The support for this in pglogical3 benefits from a deadlock detector hook.
It was added in the optional patched postgres pglogical can use to enable
various extra features that weren't possible without core changes, but
isn't present in community postgres yet.

I'd like to add it.

The main benefit is that it lets the logical replication support tell the
deadlock detector that it should prefer to kill the victim whose txn has a
higher upstream commit lsn. That helps encourage parallel logical apply to
make progress in the face of deadlocks between concurrent txns.

Any in-principle objections?
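
For discussion's sake the hook could be as simple as something like this
(hypothetical names, not an actual patch yet):

/*
 * Hypothetical sketch: given the procs in the detected cycle, return the
 * index of the preferred victim, or -1 to keep the detector's default choice.
 * A logical-apply implementation would pick the proc applying the txn with
 * the highest upstream commit LSN.
 */
typedef int (*deadlock_choose_victim_hook_type) (PGPROC **cycle_procs,
                                                 int ncycle_procs);

extern PGDLLIMPORT deadlock_choose_victim_hook_type deadlock_choose_victim_hook;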


RFC: Giving bgworkers walsender-like grace during shutdown (for logical replication)

2020-12-06 Thread Craig Ringer
Hi folks

TL;DR: Anyone object to a new bgworker flag that exempts background workers
(such as logical apply workers) from the first round of fast shutdown
signals, and instead lets them to finish their currently in-progress txn
and exit?

This is a change proposal raised for comment before patch submission so
please consider it. Explanation of why I think we need it comes first, then
proposed implementation.

Rationale:

Currently a fast shutdown causes logical replication subscribers to abort
their currently in-progress transaction and terminate along with user
backends. This means they cannot finish receiving and flushing the
currently in-progress transaction, possibly wasting a very large amount of
work.

After restart the subscriber must reconnect, the upstream must re-decode and
reorder-buffer everything from the restart_lsn up to the current
confirmed_flush_lsn, and the subscriber must receive the whole txn on the
wire all over again and apply the whole txn again locally. We
don't currently spool received txn change-streams to disk on the subscriber
and flush them so we can't repeat just the local apply part (see the
related thread "Logical archiving" for relevant discussion there). This can
create a lot of bloat, a lot of excess WAL, etc, if a big txn was in
progress at the time.

I'd like to add a bgworker flag that tells the postmaster to treat the
logical apply bgworker (or extension equivalents) somewhat like a walsender
for the purpose of fast shutdown. Instead of immediately terminating it
like user backends on fast shutdown, the bgworker should be sent a
ProcSignal warning that shutdown is pending and instructing it to finish
receiving and applying its current transaction, then exit gracefully.

It's not quite the same as the walsender, since there we try to flush
changes to downstreams up to the end of the last commit before shutting
down. That doesn't make sense on a subscriber because the upstream is
likely still generating txns. We just want to avoid wasting our effort on
any in-flight txn.

Any objections?

Proposed implementation:

* Add new bgworker flag like BGW_DELAYED_SHUTDOWN

* Define new ProcSignal PROCSIG_SHUTDOWN_REQUESTED. On fast shutdown send
this instead of a SIGTERM to bgworker backends flagged
BGW_DELAYED_SHUTDOWN. On smart shutdown send it to all backends when the
shutdown request arrives, since that could be handy for other uses too.

* A flagged bgworker is expected to finish its current txn and exit promptly.
Impose a grace period after which it gets SIGTERM'd anyway. Also send a
SIGTERM if the postmaster receives a second fast shutdown request.

* Defer sending PROCSIG_WALSND_INIT_STOPPING to walsenders until all
BGW_DELAYED_SHUTDOWN flagged bgworkers have exited, so we can ensure that
cascaded downstreams receive any txns applied from the upstream.

This doesn't look likely to be particularly complicated to implement.
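
For illustration, here's a minimal sketch of how a logical apply worker might
opt in and react, assuming the proposed BGW_DELAYED_SHUTDOWN flag and
PROCSIG_SHUTDOWN_REQUESTED reason existed (neither does today); the handler
wiring is hand-waved and the library/function names are placeholders:

#include "postgres.h"

#include <signal.h>

#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/latch.h"

static volatile sig_atomic_t shutdown_requested = false;

/* Would run from the ProcSignal machinery when the new reason arrives. */
static void
handle_shutdown_requested(void)
{
    shutdown_requested = true;
    SetLatch(MyLatch);
}

/* Registration: opt in to the proposed delayed-shutdown behaviour. */
static void
register_apply_worker(void)
{
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
        BGWORKER_BACKEND_DATABASE_CONNECTION |
        BGW_DELAYED_SHUTDOWN;   /* proposed flag, does not exist yet */
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_name, BGW_MAXLEN, "logical apply worker");
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "apply_worker_main");
    RegisterBackgroundWorker(&worker);
}

/* In the apply main loop: finish the in-progress txn, then exit cleanly. */
static void
apply_loop_check_shutdown(void)
{
    if (shutdown_requested && !IsTransactionState())
        proc_exit(0);
    /* otherwise keep receiving and applying changes */
}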

It might be better to use a flag in PGPROC rather than the bgworker struct,
in case we want to extend this to other backend types in future. Also to
make it easier for the postmaster to check the flag during shutdown. Could
just claim a bit from statusFlags for the purpose. Thoughts?


Re: Logical archiving

2020-12-06 Thread Craig Ringer
Actually CC'd Petr this time.


Re: Logical archiving

2020-12-06 Thread Craig Ringer
Reply follows inline. I addressed your last point first, so it's out of
order.

On Fri, 4 Dec 2020 at 15:33, Andrey Borodin  wrote

> If OLAP cannot consume data fast enough - we are out of space due to repl
slot.

There is a much simpler solution to this than logical PITR.

What we should be doing is teaching xlogreader how to invoke the
restore_command to fetch archived WALs for decoding.

Replication slots already have a WAL retention limit, but right now when
that limit is reached the slot is invalidated and becomes useless, it's
effectively dropped. Instead, if WAL archiving is enabled, we should leave
the slot as valid. If a consumer of the slot needs WAL that no longer
exists in pg_wal, we should have the walsender invoke the restore_command
to read the missing WAL segment, decode it, and remove it again.

This would not be a technically difficult patch, and it's IMO one of the
more important ones for improving logical replication.
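
Roughly, the walsender's page-read callback could fall back to the archive
when a segment is no longer in pg_wal. A sketch of the idea follows, mirroring
the shape of the existing read_local_xlog_page() callback; the
wal_segment_present_in_pg_wal(), restore_wal_segment_from_archive() and
read_page_from_restored_segment() helpers are hypothetical stand-ins for
running restore_command into a temporary location, not existing functions:

#include "postgres.h"

#include "access/xlogreader.h"
#include "access/xlogutils.h"

/* Hypothetical helpers, not existing functions: */
extern bool wal_segment_present_in_pg_wal(XLogRecPtr recptr);
extern bool restore_wal_segment_from_archive(XLogRecPtr recptr, char *path);
extern int  read_page_from_restored_segment(const char *path,
                                            XLogRecPtr targetPagePtr,
                                            int reqLen, char *cur_page);

/*
 * Sketch of a page-read callback that prefers pg_wal but falls back to
 * fetching the segment via restore_command when it has been removed.
 */
static int
logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
                       int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
{
    if (!wal_segment_present_in_pg_wal(targetPagePtr))
    {
        char        path[MAXPGPATH];

        if (!restore_wal_segment_from_archive(targetPagePtr, path))
            ereport(ERROR,
                    (errmsg("WAL for %X/%X is in neither pg_wal nor the archive",
                            (uint32) (targetPagePtr >> 32),
                            (uint32) targetPagePtr)));

        return read_page_from_restored_segment(path, targetPagePtr,
                                               reqLen, cur_page);
    }

    /* Normal case: read from pg_wal exactly as today. */
    return read_local_xlog_page(state, targetPagePtr, reqLen,
                                targetRecPtr, cur_page);
}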

> I was discussing problems of CDC with scientific community and they asked
this simple question: "So you have efficient WAL archive on a very cheap
storage, why don't you have a logical archive too?"

I've done work in this area, as has Petr (CC'd).

In short, logical archiving and PITR is very much desirable, but we're not
nearly ready for it yet and we're missing a lot of the foundations needed
to make it really useful.

IMO the strongest pre-requisite is that we need integrated DDL capture and
replication in Pg. While this could be implemented in the
publisher/subscriber logic for logical replication, it would make much more
sense (IMO) to make it easier to feed DDL events into any logical
replication output plugin.

pglogical3 (the closed one) has quite comprehensive DDL replication
support. Doing it is not simple though - there are plenty of complexities:

* Reliably identifying the target objects and mapping them to replication
set memberships for DML-replication
* Capturing, replicating and managing the search_path and other DDL
execution context (DateStyle and much more) reliably

   - Each statement type needs specific logic to indicate whether it needs
   DDL replication (and often filter functions since we have lots of sub-types
   where some need replication and some don't)
   - Handling DDL affecting global objects in pg_global correctly, like
   those affecting roles, grants, database security labels etc. There's no one
   right answer for this, it depends on the deployment and requires the user
   to cooperate.
   - Correct handling of transactions that mix DDL and DML (mostly only an
   issue for multimaster).
   - Identifying statements that target a mix of replicated and
   non-replicated objects and handling them appropriately, including for
   CASCADEs
   - Gracefully handling DDL statements that mix TEMPORARY and persistent
   targets. We can do this ok for DROPs but it still requires care. Anything
   else gets messier.
   - Lack of hooks into table rewrite operations and the extremely clumsy
   and inefficient way logical decoding currently exposes decoding of the
   temp-table data during decoding of rewrites means handling table-rewriting
   DDL is difficult and impractical to do correctly. In pglogical we punt on
   it entirely and refuse to permit DDL that would rewrite a table except
   where we can prove it's reliant only on immutable inputs so we can discard
   the upstream rewrite and rely on statement replication.
   - As a consequence of the above, reliably determining whether a given
   statement will cause a table rewrite.
   - Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
   - Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing
   something sensible for truncate cascade.
   - Probably more I've forgotten


If we don't handle these, then any logical change-log archives will become
largely useless as soon as there's any schema change.

So we kind of have to solve DDL replication first IMO.

Some consideration is also required for metadata management. Right now
relation and type metadata has session-lifetime, but you'd want to be able
to discard old logical change-stream archives and have the later ones still
be usable. So we'd need to define some kind of restartpoint where we repeat
the metadata, or we'd have to support externalizing the metadata so it can
be retained when the main change archives get aged out.

We'd also need to separate the existing apply worker into a "receiver" and
"apply/writer" part, so the wire-protocol handling isn't tightly coupled
with the actual change apply code, in order to make it possible to actually
consume those archives and apply them to the database. In pglogical3 we did
that by splitting them into two processes, connected by a shm_mq.
Originally the process split was optional and you could run a combined
receiver/writer process without the shm_mq if you wanted, but we quickly
found it difficult to reliably handle locking issues etc that way so the
writers all moved 

Re: Single transaction in the tablesync worker?

2020-12-06 Thread Craig Ringer
On Sat, 5 Dec 2020, 10:03 Amit Kapila,  wrote:

> On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
>  wrote:
> >
> > On Thu, Dec 3, 2020 at 7:24 PM Amit Kapila 
> wrote:
> > >
> > > On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
> > >  wrote:
> > > >
> > > > On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila 
> wrote:
> > > > >
> > > > > The tablesync worker in logical replication performs the table data
> > > > > sync in a single transaction which means it will copy the initial
> data
> > > > > and then catch up with apply worker in the same transaction. There
> is
> > > > > a comment in LogicalRepSyncTableStart ("We want to do the table
> data
> > > > > sync in a single transaction.") saying so but I can't find the
> > > > > concrete theory behind the same. Is there any fundamental problem
> if
> > > > > we commit the transaction after initial copy and slot creation in
> > > > > LogicalRepSyncTableStart and then allow the apply of transactions
> as
> > > > > it happens in apply worker? I have tried doing so in the attached
> (a
> > > > > quick prototype to test) and didn't find any problems with
> regression
> > > > > tests. I have tried a few manual tests as well to see if it works
> and
> > > > > didn't find any problem. Now, it is quite possible that it is
> > > > > mandatory to do the way we are doing currently, or maybe something
> > > > > else is required to remove this requirement but I think we can do
> > > > > better with respect to comments in this area.
> > > >
> > > > If we commit the initial copy, the data upto the initial copy's
> > > > snapshot will be visible downstream. If we apply the changes by
> > > > committing changes per transaction, the data visible to the other
> > > > transactions will differ as the apply progresses.
> > > >
> > >
> > > It is not clear what you mean by the above.  The way you have written
> > > appears that you are saying that instead of copying the initial data,
> > > I am saying to copy it transaction-by-transaction. But that is not the
> > > case. I am saying copy the initial data by using REPEATABLE READ
> > > isolation level as we are doing now, commit it and then process
> > > transaction-by-transaction till we reach sync-point (point till where
> > > apply worker has already received the data).
> >
> > Craig in his mail has clarified this. The changes after the initial
> > COPY will be visible before the table sync catches up.
> >
>
> I think the problem is not that the changes are visible after COPY
> rather it is that we don't have a mechanism to restart if it crashes
> after COPY unless we do all the sync up in one transaction. Assume we
> commit after COPY and then process transaction-by-transaction and it
> errors out (due to connection loss) or crashes, in-between one of the
> following transactions after COPY then after the restart we won't know
> from where to start for that relation. This is because the catalog
> (pg_subscription_rel) will show the state as 'd' (data is being
> copied) and the slot would have gone as it was a temporary slot. But
> as mentioned in one of my emails above [1] we can solve these problems
> which Craig also seems to be advocating for as there are many
> advantages of not doing the entire sync (initial copy + stream changes
> for that relation) in one single transaction. It will allow us to
> support decode of prepared xacts in the subscriber. Also, it seems
> pglogical already does processing transaction-by-transaction after the
> initial copy. The only thing which is not clear to me is why we
> haven't decided to go ahead initially and it would be probably better
> if the original authors would also chime-in to at least clarify the
> same.
>

It's partly a resource management issue.

Replication origins are a limited resource. We need to use a replication
origin for any sync we want to be durable across restarts.

Then again so are slots and we use temp slots for each sync.

If a sync fails, cleanup on the upstream side is simple with a temp slot.
With persistent slots we have more risk of creating upstream issues. But
then, so long as the subscriber exists, it can deal with that. And if the
subscriber no longer exists, its primary slot is an issue too.

It'd help if we could register pg_shdepend entries between catalog entries
and slots, and from a main subscription slot to any extra slots used for
resynchronization.

And I should write a patch for a resource retention summarisation view.


> I am not sure why but it seems acceptable to the original authors that the
> data of transactions is partially visible during the initial
> synchronization phase for a subscription.


I don't think there's much alternative there.

Pg would need some kind of cross-commit visibility control mechanism that
separates durable commit from visibility.


Re: Single transaction in the tablesync worker?

2020-12-03 Thread Craig Ringer
On Thu, 3 Dec 2020 at 17:25, Amit Kapila  wrote:

> Is there any fundamental problem if
> we commit the transaction after initial copy and slot creation in
> LogicalRepSyncTableStart and then allow the apply of transactions as
> it happens in apply worker?

No fundamental problem. Both approaches are fine. Committing the
initial copy then doing the rest in individual txns means an
incomplete sync state for the table becomes visible, which may not be
ideal. Ideally we'd do something like sync the data into a clone of
the table then swap the table relfilenodes out once we're synced up.

IMO the main advantage of committing as we go is that it would let us
use a non-temporary slot and support recovering an incomplete sync and
finishing it after interruption by connection loss, crash, etc. That
would be advantageous for big table syncs or where the sync has lots
of lag to replay. But it means we have to remember sync states, and
give users a way to cancel/abort them. Otherwise forgotten temp slots
for syncs will cause a mess on the upstream.

It also allows the sync slot to advance, freeing any held upstream
resources before the whole sync is done, which is good if the upstream
is busy and generating lots of WAL.

Finally, committing as we go means we won't exceed the cid increment
limit in a single txn.

> The reason why I am looking into this area is to support the logical
> decoding of prepared transactions. See the problem [1] reported by
> Peter Smith. Basically, when we stream prepared transactions in the
> tablesync worker, it will simply commit the same due to the
> requirement of maintaining a single transaction for the entire
> duration of copy and streaming of transactions. Now, we can fix that
> problem by disabling the decoding of prepared xacts in tablesync
> worker.

Tablesync should indeed only receive a txn when the commit arrives; it
should not attempt to handle uncommitted prepared xacts.

> But that will arise to a different kind of problems like the
> prepare will not be sent by the publisher but a later commit might
> move lsn to a later step which will allow it to catch up till the
> apply worker. So, now the prepared transaction will be skipped by both
> tablesync and apply worker.

I'm not sure I understand. If what you describe is possible then
there's already a bug in prepared xact handling. Prepared xact commit
progress should be tracked by commit lsn, not by prepare lsn.

Can you set out the ordering of events in more detail?

> I think apart from unblocking the development of 'logical decoding of
> prepared xacts', it will make the code consistent between apply and
> tablesync worker and reduce the chances of future bugs in this area.
> Basically, it will reduce the checks related to am_tablesync_worker()
> at various places in the code.

I think we made similar changes in pglogical to switch to applying
sync work in individual txns.




Re: pg_ctl.exe file deleted automatically

2020-12-03 Thread Craig Ringer
On Thu, 3 Dec 2020, 13:06 Joel Mariadasan (jomariad), 
wrote:

> Hi,
>
> We are using Windows 2019 server.
>
> Sometimes we see the *pg_ctl.exe* getting automatically deleted.
>
> Due to this, while starting up Postgres windows service we are getting the
> error.
>
> “Error 2: The system cannot find the file specified”
>
> Can you let us know the potential causes for this pg_ctl.exe file deletion?
>

PostgreSQL will never delete pg_ctl.exe.

Check your antivirus software logs for false positives.

This mailing list is for software development on postgres, so I suggest
following up on pgsql-general.

> Joel
>


Re: Two fsync related performance issues?

2020-12-02 Thread Craig Ringer
On Wed, 2 Dec 2020 at 15:41, Michael Paquier  wrote:

> On Tue, Dec 01, 2020 at 07:39:30PM +0800, Craig Ringer wrote:
> > On Wed, 14 Oct 2020, 13:06 Michael Paquier,  wrote:
> >> Yeah, it is safer to assume that it is the responsability of the
> >> backup tool to ensure that because it could be possible that a host is
> >> unplugged just after taking a backup, and having Postgres do this work
> >> at the beginning of archive recovery would not help in most cases.
> >
> > Let's document that assumption in the docs for pg_basebackup and the file
> > system copy based replica creation docs. With a reference to initdb's
> > datadir sync option.
>
> Do you have any suggestion about that, in the shape of a patch perhaps?
>

I'll try to get to that, but I have some other docs patches outstanding
that I'd like to resolve first.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise

