Re: [PATCH] psql: \dn+ to show size of each schema (and \dA+ for AMs)

2024-06-03 Thread David Christensen
On Mon, Jun 3, 2024 at 9:10 AM Justin Pryzby  wrote:
>
> On Thu, May 30, 2024 at 10:59:06AM -0700, David Christensen wrote:
> > Hi Justin,
> >
> > Thanks for the patch and the work on it.  In reviewing the basic
> > feature, I think this is something that has utility and is worthwhile
> > at the high level.
>
> Thanks for looking.
>
> > A few more specific notes:
> >
> > The pg_namespace_size() function can stand on its own, and probably
> > has some utility for the released Postgres versions.
>
> Are you suggesting to add the C function retroactively in back branches?
> I don't think anybody would consider doing that.
>
> It wouldn't be used by anything internally, and any module that wanted
> to use it would have to check the minor version, instead of just the
> major version, which is wrong.

Ah, I meant once released it would be useful going forward, but
re-reading, I can see the ambiguity.  No, definitely not suggesting
adding to back branches.

> > I do think the psql implementation for the \dn+ or \dA+ commands
> > shouldn't need to use this same function; it's a straightforward
> > expansion of the SQL query that can be run in a way that will be
> > backwards-compatible with any connected postgres version, so no reason
> > to exclude this information for these cases.  (This may have been in an
> > earlier revision of the patchset; I didn't check everything.)
>
> I think you're suggesting to write the query in SQL rather than in C.

Yes.

> But I did that in the first version of the patch, and the response was
> that maybe in the future someone would want to add permission checks
> that would compromise the ability to get correct results from SQL, so
> then I presented the functionality written in C.
>
> I recommend that reviewers try to read the existing communication on the
> thread, otherwise we end up going back and forth about the same things.

Yes, I reviewed the whole thread, just not all of the patch contents.
I'd agree that suggestion flapping is not useful, but I'm just presenting
my take on this at this time (along with several others from the Patch
Review workshop at PGConf.dev, which is where this one got revived).

> > I think the \dX++ command versions add code complexity without a real
> > need for it.
>
> If you view this as a way to "show schema sizes", then you're right,
> there's no need.  But I don't want this patch to necessarily further
> embrace the idea that it's okay for "plus commands to be slow and show
> nonportable results".  If there were a consensus that it'd be fine in a
> plus command, I would be okay with that, though.

I think this is a separate discussion about introducing verbosity
levels, and it isn't needed at this point for this feature.
Certainly revamping all of the inconsistencies in the psql command
interfaces is out of scope; I'm just thinking about which one we'd want
to model going forward and offering an opinion there (for what it's
worth... :D)

> > We have precedent with \l(+) to show permissions on the
> > basic display and size on the extended display, and I think this is
> > sufficient in this case here.
>
> You also have the precedent that \db doesn't show the ACL, and you
> can't get it without also computing the sizes.  That's 1) inconsistent
> with \l and 2) pretty inconvenient for someone who wants to show the
> ACL (as mentioned in the first message on this thread).

I can't speak to this one; I just think that adding more command options
adds to the overall complexity, so it's probably to be avoided unless we
have to.

> > 0001-Add-pg_am_size-pg_namespace_size.patch
> > - fine, but needs rebase to work
>
> I suggest reviewers consider sending a rebased patch, optionally with
> any proposed changes in a separate patch.

Sure, if you don't have time to do this; agree with your later remarks
that consensus is important before moving forward, so same applies
here.

> > 0002-psql-add-convenience-commands-dA-and-dn.patch
> > - split into just + variant; remove \l++
> > - make the \dn show permission and \dn+ show size
>
> > 0003-f-convert-the-other-verbose-to-int-too.patch
> > - unneeded
> > 0004-Move-the-double-plus-Size-columns-to-the-right.patch
> > - unneeded
>
> You say they're unneeded, but what I've been hoping for is a committer
> interested enough to at least suggest whether to run with 001, 001+002,
> 001+002+003, or 001+002+003+004.  They're intentionally presented as
> such.
>
> I've also thought about submitting a patch specifically dedicated to
> "moving size out of + and into ++".  I find the idea compelling, for the
> reasons I wrote in the patch description.  That'd be like presenting
> 003+004 first.
>
> I'm open to changing the behavior or the implementation.  But changing
> the patch as I've presented it based on one suggestion I think will lead
> to incoherent code thrashing.  I need to hear wider agreement.

Agreed that some sort of consensus is important.




Re: Proposal: Document ABI Compatibility

2024-06-03 Thread David Christensen
> > I don't think it's common for such new-fields-in-padding to cause problems
> > when using an earlier minor PG version. For that the extension would need to
> > actually rely on the presence of the new field, but typically that'd not be
> > the case when we introduce a new field in a minor version.
>
> That’s what I was thinking, so “compatibility assured only if you don’t use 
> padding yourself” would be super helpful.

I was under the impression that the (a?) concern was related to
compiling the newer version and using that against an older version,
so if you always compiled the extension against the latest point
release, it wouldn't necessarily be backwards-compatible with older
installations.  (While the allocated padding is probably zeroed out in the
old version, I wouldn't guarantee that's the case.)
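
To illustrate the new-field-in-padding case with a made-up struct (two
names are used here only so the example compiles; think of them as the
same struct in successive minor releases):

#include <stdbool.h>
#include <stdint.h>

typedef struct Foo_old          /* as shipped in the .0 release */
{
    int32_t     a;              /* offset 0 */
    bool        b;              /* offset 4; 3 bytes of padding follow */
    int32_t     c;              /* offset 8 */
} Foo_old;

typedef struct Foo_new          /* a later minor release */
{
    int32_t     a;              /* offset 0 */
    bool        b;              /* offset 4 */
    uint8_t     flags;          /* offset 5, carved out of the padding */
    int32_t     c;              /* still offset 8; old code unaffected */
} Foo_new;

An extension compiled against the newer headers that relies on `flags`
would be reading bytes the older server treats as uninitialized padding,
which is exactly the compile-new-run-old hazard above.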

> >> Unless, that is, we can provide a complete list of things not to do (like
> >> make use of padding) to avoid it. Is that feasible?
> >
> > You can't really rely on the contents of padding, in general. So I don't 
> > think
> > this is really something that needs to be called out.
>
> Sure, probably not a problem, but if that’s the sole qualifier for making 
> binary changes, I think it’s worth saying, as opposed to “we don’t make any”. 
> Something like “Only changes to padding, which you never used anyway, right?” 
> :-)

I wonder if what might be more useful is some sort of distribution list
for extension authors to be notified of specific compatibility
changes.  It could be powered by parsing commit messages for conventions
about backwards-incompatible changes; padding usage could be one, and
maybe there are others we could code for.  (For padding, imagine a
tool that is able to look at struct offsets between git revisions and
note any offset differences/changes that extension authors should
be aware of; even diffs of `pahole` output between revisions could be
helpful.)

ABI guarantees for extensions are hard because all of core *is*
potentially the ABI (or at least if you are well-behaved, public
functions and structs), so some sort of "these interfaces won't break
in minor releases but use other internals and you're on your own"
might be a useful distinction. (We discussed trying to classify APIs
with varying levels of support, but that also comes with its own set
of issues/overhead/maintenance for code authors/committers.)




Re: Add memory context type to pg_backend_memory_contexts view

2024-05-31 Thread David Christensen


> On May 30, 2024, at 5:36 PM, David Rowley  wrote:
> 
> On Fri, 31 May 2024 at 07:21, David Christensen  wrote:
>> Giving this a once-through, this seems straightforward and useful.  I
>> have a slight preference for keeping "name" the first field in the
>> view and moving "type" to the second, but that's minor.
> 
> Not having it first makes sense, but I don't think putting it between
> name and ident is a good idea. I think name and ident belong next to
> each other. parent likely should come after those since that also
> relates to the name.
> 
> How about putting it after "parent"?

That works for me. I skimmed the new patch and it seems fine, but I'm on my 
phone so didn't do any build tests. 

>> Just confirming that the allocator types are not extensible without a
>> recompile, since it's using a specific node tag to switch on, so there
>> are no concerns with not properly displaying the output of something
>> else.
> 
> They're not extensible.

Good to confirm. 

>> The "" text placeholder might be more appropriate as "",
>> or perhaps stronger, include a WARNING in the logs, since an unknown
>> tag at this point would be an indication of some sort of memory
>> corruption.
> 
> This follows what we do in other places.  If you look at explain.c,
> you'll see lots of "???"s.
> 
> I think if you're worried about corrupted memory, then garbled output
> in pg_get_backend_memory_contexts wouldn't be that high on the list of
> concerns.

Heh, indeed. +1 for precedent. 

>> Since there are only four possible values, I think there would be
>> utility in including them in the docs for this field.
> 
> I'm not sure about this. We do try not to expose too much internal
> detail in the docs.  I don't know all the reasons for that, but at
> least one reason is that it's easy for things to get outdated as code
> evolves. I'm also unsure how much value there is in listing 4 possible
> values unless we were to document the meaning of each of those values,
> and doing that puts us even further down the path of detailing
> Postgres internals in the documents. I don't think that's a
> maintenance burden that's often worth the trouble.

I can see that, and it's consistent with what we do; I was just thinking as a 
user that it may be useful, but if you're using this view you likely already 
know what it means.

>> I also think it
>> would be useful to have some sort of comments at least in mmgr/README
>> to indicate that if a new type of allocator is introduced, you
>> will also need to add its node tag to this function, since
>> the conversion isn't automatic.
> 
> I don't think it's sustainable to do this.  If we have to maintain
> documentation that lists everything you must do in order to add some
> new node types then I feel it's just going to get outdated as soon as
> someone adds something new that needs to be done.  I'm only one
> developer, but for me, I'd not even bother looking there if I was
> planning to add a new memory context type.
> 
> What I would be doing is searching the entire code base for where
> special handling is done for the existing types and ensuring I
> consider if I need to include a case for the new node type. In this
> case, I'd probably choose to search for "T_AllocSetContext", and I'd
> quickly land on PutMemoryContextsStatsTupleStore() and update it. This
> method doesn't get outdated, provided you do "git pull" occasionally.

Fair. 

>> (For that matter, instead of
>> switching on node type and outputting a given string, is there a
>> generic function that could just give us the string value for node
>> type so we don't need to teach anything else about it anyway?)
> 
> There isn't.  nodeToString() does take some node types as inputs and
> serialise those to something JSON-like, but that includes serialising
> each field of the node type too. The memory context types are not
> handled by those functions. I think it's fine to copy what's done in
> explain.c.  "git grep \"???\" -- *.c | wc -l" gives me 31 occurrences,
> so I'm not doing anything new.
> 
> I've attached an updated patch which changes the position of the new
> column in the view.

+1





Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-05-30 Thread David Christensen
Hello,

I am looking through some outstanding CommitFest entries; I wonder if
this particular patch is already effectively fixed by 5278d0a2, which
is attributed to the original author and takes an extremely
similar approach.  Can this entry
(https://commitfest.postgresql.org/48/4553/) be closed?

Best,

David




Re: Add memory context type to pg_backend_memory_contexts view

2024-05-30 Thread David Christensen
Hi David,

Giving this a once-through, this seems straightforward and useful.  I
have a slight preference for keeping "name" the first field in the
view and moving "type" to the second, but that's minor.

Just confirming that the allocator types are not extensible without a
recompile, since it's using a specific node tag to switch on, so there
are no concerns with not properly displaying the output of something
else.
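
(For reference, the dispatch is presumably a switch along these lines --
a sketch from my read, with the four current allocator tags; the exact
label strings are illustrative rather than quoted from the patch:)

const char *type;

switch (nodeTag(context))
{
    case T_AllocSetContext:
        type = "AllocSet";
        break;
    case T_GenerationContext:
        type = "Generation";
        break;
    case T_SlabContext:
        type = "Slab";
        break;
    case T_BumpContext:
        type = "Bump";
        break;
    default:
        type = "???";       /* unrecognized tag */
        break;
}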

The "" text placeholder might be more appropriate as "",
or perhaps stronger, include a WARNING in the logs, since an unknown
tag at this point would be an indication of some sort of memory
corruption.

Since there are only four possible values, I think there would be
utility in including them in the docs for this field.  I also think it
would be useful to have some sort of comments at least in mmgr/README
to indicate that if a new type of allocator is introduced, you
will also need to add its node tag to this function, since
the conversion isn't automatic. (For that matter, instead of
switching on node type and outputting a given string, is there a
generic function that could just give us the string value for node
type so we don't need to teach anything else about it anyway?)

Thanks,

David




Re: [PATCH] psql: \dn+ to show size of each schema (and \dA+ for AMs)

2024-05-30 Thread David Christensen
Hi Justin,

Thanks for the patch and the work on it.  In reviewing the basic
feature, I think this is something that has utility and is worthwhile
at the high level.

A few more specific notes:

The pg_namespace_size() function can stand on its own, and probably
has some utility for the released Postgres versions.

I do think the psql implementation for the \dn+ or \dA+ commands
shouldn't need to use this same function; it's a straightforward
expansion of the SQL query that can be run in a way that will be
backwards-compatible with any connected postgres version, so no reason
to exclude this information for these cases.  (This may have been in an
earlier revision of the patchset; I didn't check everything.)

I think the \dX++ command versions add code complexity without a real
need for it.  We have precedent with \l(+) to show permissions on the
basic display and size on the extended display, and I think this is
sufficient in this case here.  While moving the permissions to \dn is
a behavior change, it's adding information, not taking it away, and as
an interactive command it is unlikely to introduce significant
breakage in any scripting scenario.

(In reviewing the patch we've also seen a bit of odd behavior/a possible
bug with the existing extended + commands, atop which introducing
significant ++ overloading might be confusing; that's not the
fault/concern of this patch, though.)

Quickie summary:

0001-Add-pg_am_size-pg_namespace_size.patch
- fine, but needs rebase to work
0002-psql-add-convenience-commands-dA-and-dn.patch
- split into just + variant; remove \l++
- make the \dn show permission and \dn+ show size
0003-f-convert-the-other-verbose-to-int-too.patch
- unneeded
0004-Move-the-double-plus-Size-columns-to-the-right.patch
- unneeded

Docs on the first patch seemed fine; I do think we'll need docs
changes for the psql changes.

Best,

David




Re: Adding comments to help understand psql hidden queries

2024-04-04 Thread David Christensen
On Thu, Apr 4, 2024 at 9:32 AM Peter Eisentraut  wrote:
>
> On 03.04.24 19:16, David Christensen wrote:
> > I removed _() in the output of the query/stars since there'd be no
> > sensible existing translations for the constructed string, which
> > included the query string itself.  If we need it for the "QUERY"
> > string, this could be added fairly easily, but the existing piece
> > would have been nonsensical and never used in practice.
>
> "QUERY" is currently translated.  Your patch loses that.

I see; enclosed is v5 which fixes this.

The effective diff from the last one is:

-char *label = "QUERY";
+char *label = _("QUERY");

and

-label = psprintf("QUERY (\\%s)", curcmd);
+label = psprintf(_("QUERY (\\%s)"), curcmd);

Best,

David


v5-0001-Improve-SQL-comments-on-echo-hidden-output.patch
Description: Binary data


Re: Adding comments to help understand psql hidden queries

2024-04-03 Thread David Christensen
I got Greg's blessing on squashing the commits down, and am now including
a v4 with additional improvements on the output formatting front.
Main changes:

- all generated comments are the same width
- width has been bumped to 80
- removed _() functions for consumers of the new output functions

This patch adds two new helper functions, OutputComment() and
OutputCommentStars() to output and wrap the comments to the
appropriate widths.  Everything should continue to work just fine if
we want to adjust the width by just adjusting the MAX_COMMENT_WIDTH
symbol.
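
The helpers are small; the shape is roughly this (a sketch from memory
rather than the patch verbatim -- the exact signature is assumed):

#include <stdio.h>
#include <string.h>

#define MAX_COMMENT_WIDTH 80

/* Emit a star-padded banner comment, always MAX_COMMENT_WIDTH wide,
 * with the label centered; assumes the label fits within the width. */
static void
OutputCommentStars(FILE *fout, const char *label)
{
    int     pad = MAX_COMMENT_WIDTH - (int) strlen(label) - 6;
    int     i;                  /* 6 = four delimiter chars + two spaces */

    fputs("/*", fout);
    for (i = 0; i < pad / 2; i++)
        fputc('*', fout);
    fprintf(fout, " %s ", label);
    for (i = 0; i < pad - pad / 2; i++)
        fputc('*', fout);
    fputs("*/\n", fout);
}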

I removed _() in the output of the query/stars since there'd be no
sensible existing translations for the constructed string, which
included the query string itself.  If we need it for the "QUERY"
string, this could be added fairly easily, but the existing piece
would have been nonsensical and never used in practice.

Thanks,

David


v4-0001-Improve-SQL-comments-on-echo-hidden-output.patch
Description: Binary data


Re: Adding comments to help understand psql hidden queries

2024-03-22 Thread David Christensen
On Fri, Mar 22, 2024 at 9:47 AM Greg Sabino Mullane  wrote:
>
> On Thu, Mar 21, 2024 at 6:20 PM Peter Eisentraut  wrote:
>>
>> lines are supposed to align vertically.  With your patch, the first line
>> would have variable length depending on the command.
>
>
> Yes, that is a good point. Aligning those would be quite tricky; what if we 
> just kept a standard width for the closing query? Probably the 24 stars we 
> currently have to match "QUERY", which it appears nobody has changed for 
> translation purposes yet anyway. (If I am reading the code correctly, it 
> would be up to the translators to maintain the vertical alignment).

I think it's easier to keep the widths balanced than constant (patch
version included here), but if we needed to squeeze the opening string
to a standard width that would be possible without too much trouble.
The internal comment strings seem to be a bit more of a pain if we
wanted all of the comments to be the same width, as we'd need a table
or something so we can compute the longest string width, etc; doesn't
seem worth the convolutions IMHO.

No changes to Greg's patch, just keeping 'em both so cfbot stays happy.

David


v3-0002-Add-output-of-the-command-that-got-us-here-to-the.patch
Description: Binary data


v3-0001-Include-SQL-comments-on-echo-hidden-output.patch
Description: Binary data


Re: Avoiding inadvertent debugging mode for pgbench

2024-03-21 Thread David Christensen
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

Did a quick review of this one; CFbot is now happy, local regression tests all 
pass.

I think the idea here is sane; it's particularly confusing for some tools 
included in the main distribution to have different semantics, and the 
breakage risk here seems low, so worth the tradeoffs IMO.

The new status of this patch is: Ready for Committer


Re: Adding comments to help understand psql hidden queries

2024-03-21 Thread David Christensen
Created the CF entry in commitfest 48 but didn't see it was already in 47; 
closing the CF entry in 48. (There doesn't appear to be a more fitting status 
than "Withdrawn"...)

Re: Adding comments to help understand psql hidden queries

2024-03-21 Thread David Christensen
Hi Jim,

Thanks for the feedback.  Enclosed is a v2 of this series(?) rebased
and with that warning fixed; @Greg Sabino Mullane I just created a
commit on your behalf with the message to hackers.  I'm also creating
a commit-fest entry for this thread.

Best,

David


v2-0002-Add-output-of-the-command-that-got-us-here-to-the.patch
Description: Binary data


v2-0001-Include-SQL-comments-on-echo-hidden-output.patch
Description: Binary data


Re: Constant Splitting/Refactoring

2024-03-13 Thread David Christensen
Here is a version 2 of this patch, rebased atop 97d85be365.

As before, this is a cleanup/prerequisite patch series for the page
features/reserved page size patches[1].  (Said patch series is going
to be updated shortly.)

This splits each of the 4 constants that care about page size into
Cluster-specific and -Limit variants, the first intended to become a
variable in time, and the second being the maximum value such a
variable may take (largely used for static
allocations).
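
Schematically, each split looks something like this (using
MaxHeapTuplesPerPage as the example; the Cluster* spelling here is
illustrative, following the description above rather than the code):

/* the calculation itself, parameterized by available space */
#define CalcMaxHeapTuplesPerPage(freesize) \
    ((int) ((freesize) / \
            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

/* maximum value the variable may take, for static allocations */
#define MaxHeapTuplesPerPageLimit \
    CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)

/* cluster-specific form, intended to become a variable in time;
 * defined to the same value for now, so no functional change */
#define ClusterMaxHeapTuplesPerPage \
    CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)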

Since these patches define these symbols to have the same values they
previously had, there are no changes in functionality.  These were
largely mechanical changes, and as such we should perhaps consider
making the same changes to the back branches so that context lines
and the like stay the same, simplifying maintainers' efforts when
applying code in back branches that touch similar areas.

The script I have to make these changes is simple, and could be run
against the back branches with only the comments surrounding the Calc()
pieces needing to be adjusted once.

Thanks,

David

[1] https://commitfest.postgresql.org/47/3986/


v2-0001-Split-MaxHeapTuplesPerPage-into-runtime-and-max-v.patch
Description: Binary data


v2-0002-Split-MaxHeapTupleSize-into-runtime-and-max-value.patch
Description: Binary data


v2-0004-Split-MaxTIDsPerBTreePage-into-runtime-and-max-va.patch
Description: Binary data


v2-0003-Split-MaxIndexTuplesPerPage-into-runtime-and-max-.patch
Description: Binary data


Re: Constant Splitting/Refactoring

2024-02-14 Thread David Christensen
This should target PG 18, but that status is not available in the CF app yet, 
so just making a note here.

Constant Splitting/Refactoring

2024-02-07 Thread David Christensen
As part of the reserved page space/page features[1] work, there is a need
to convert some of the existing constants into variables, since the size of
those will no longer be fixed at compile time.  At FOSDEM, there was some
interest expressed about seeing what a cleanup patch like that would look
like.

Enclosed is a series of cleanup patches for HEAD.  This splits each of the
4 constants that care about page size into Cluster-specific and -Limit
variants, the first intended to become a variable in time, and the second
being the maximum value such a variable may take (largely used for static
allocations).

Since these patches define these symbols to have the same values
they previously had, there are no changes in functionality.  These were
largely mechanical changes, and as such we should perhaps consider making
the same changes to the back branches so that context lines and the like
stay the same, simplifying maintainers' efforts when applying code in
back branches that touch similar areas.

The script I have to make these changes is simple, and could be run against
the back branches with only the comments surrounding Calc() pieces needing
to be adjusted once.

These were based on HEAD as a few minutes ago, 902900b308f, and pass all
tests.
Thanks,

David

[1] https://commitfest.postgresql.org/47/3986/


v1-0002-Split-MaxHeapTupleSize-into-runtime-and-max-value.patch
Description: Binary data


v1-0001-Split-MaxHeapTuplesPerPage-into-runtime-and-max-v.patch
Description: Binary data


v1-0004-Split-MaxTIDsPerBTreePage-into-runtime-and-max-va.patch
Description: Binary data


v1-0003-Split-MaxIndexTuplesPerPage-into-runtime-and-max-.patch
Description: Binary data


Re: Adding comments to help understand psql hidden queries

2024-02-01 Thread David Christensen
On Thu, Feb 1, 2024 at 4:34 PM Greg Sabino Mullane  wrote:
>
> The use of the --echo-hidden flag in psql is used to show people the way psql 
> performs its magic for its backslash commands. None of them has more magic 
> than "\d relation", but it suffers from needing a lot of separate queries to 
> gather all of the information it needs. Unfortunately, those queries can get 
> overwhelming and hard to figure out which one does what, especially for those 
> not already very familiar with the system catalogs. Attached is a patch to 
> add a small SQL comment to the top of each SELECT query inside 
> describeOneTableDetail. All other functions use a single query, and thus need 
> no additional context. But "\d mytable" has the potential to run over a dozen 
> SQL queries! The new format looks like this:
>
/********* QUERY **********/
> /* Get information about row-level policies */
> SELECT pol.polname, pol.polpermissive,
>   CASE WHEN pol.polroles = '{0}' THEN NULL ELSE 
> pg_catalog.array_to_string(array(select rolname from pg_catalog.pg_roles 
> where oid = any (pol.polroles) order by 1),',') END,
>   pg_catalog.pg_get_expr(pol.polqual, pol.polrelid),
>   pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid),
>   CASE pol.polcmd
> WHEN 'r' THEN 'SELECT'
> WHEN 'a' THEN 'INSERT'
> WHEN 'w' THEN 'UPDATE'
> WHEN 'd' THEN 'DELETE'
> END AS cmd
> FROM pg_catalog.pg_policy pol
> WHERE pol.polrelid = '134384' ORDER BY 1;
> //
>
> Cheers,
> Greg

Thanks, this looks like some helpful information. In the same vein,
I'm including a patch which adds information about the command that
generates the given query as well (atop your commit).  This will
modify the query line to include the command itself:

/********* QUERY (\dRs) **********/

Best,

David


0001-Add-output-of-the-command-that-got-us-here-to-the-QU.patch
Description: Binary data


Re: [PATCHES] Post-special page storage TDE support

2023-12-22 Thread David Christensen
Hi again!

Per some offline discussion with Stephen, I've continued to work on some
modifications here; this particular patchset is intended to facilitate
review by highlighting the mechanical nature of many of these changes.  As
such, I have taken the following approach to this rework:

0001 - Create PageUsableSpace to represent space post-smgr
0002 - Add support for fast, non-division-based div/mod algorithms
0003 - Use fastdiv code in visibility map
0004 - Make PageUsableSpace incorporate variable-sized limit
0005 - Add Calc, Limit and Dynamic forms of all variable constants
0006 - Split MaxHeapTuplesPerPage into Limit and Dynamic variants
0007 - Split MaxIndexTuplesPerPage into Limit and Dynamic variants
0008 - Split MaxHeapTupleSize into Limit and Dynamic variants
0009 - Split MaxTIDsPerBTreePage into Limit and Dynamic variants

0001 - 0003 have appeared in this thread or in other forms on the list
already, though 0001 refactors things slightly more aggressively, adding a
StaticAssert() to ensure that this change is still sane.

0004 adds the ReservedPageSpace variable, and also redefines the previous
BLCKSZ - SizeOfPageHeaderData as PageUsableSpaceMax; there are a few
related fixups.

0005 adds the macros to compute the former constants while leaving their
original definitions to evaluate to the same values (the infamous Calc* and
*Limit, plus we invite *Dynamic to the party as well; the names are
terrible and there must be something better)

0006 - 0009 are all the same approach; we undefine the old constant name
and modify the existing uses of this symbol to be either the *Limit or
*Dynamic, depending on whether the changed available space would impact the
calculations.  Since we are touching every use of this symbol, this
facilitates review of the impact, though I would contend that almost every
piece I've spot-checked seems like it really does need to know about the
runtime limit.  Perhaps there is more we could do here.  I could also see a
variable per constant rather than recalculating this every time, in which
case the *Dynamic would just be the variable and we'd need a hook to
initialize this or otherwise set on first use.
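
That alternative would look something like this (sketch; the variable
and init function names here are invented):

int     MaxHeapTuplesPerPageDynamic;

void
InitDynamicPageLimits(Size reserved_page_space)
{
    MaxHeapTuplesPerPageDynamic =
        CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData -
                                 reserved_page_space);
}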

There are a number of additional things remaining to be done to get this to
fully work, but I did want to get some of this out there for review.

Still to do (almost all in some form in original patch, so just need to
extract the relevant pieces):
- set reserved-page-size via initdb
- load reserved-page-size from pg_control
- apply to the running cluster
- some form of compatibility for these constants in common and ensuring
bin/ works
- some toast-related changes (this requires a patch to support dynamic
relopts, which I can extract, as the existing code is using a constant
lookup table)
- probably some more pieces I'm forgetting

Thanks,
David


v2-0005-Add-Calc-Limit-and-Dynamic-forms-of-all-variable-.patch
Description: Binary data


v2-0001-Create-PageUsableSpace-to-represent-space-post-sm.patch
Description: Binary data


v2-0002-Add-support-for-fast-non-division-based-div-mod-a.patch
Description: Binary data


v2-0003-Use-fastdiv-code-in-visibility-map.patch
Description: Binary data


v2-0004-Make-PageUsableSpace-incorporate-variable-sized-l.patch
Description: Binary data


v2-0007-Split-MaxIndexTuplesPerPage-into-Limit-and-Dynami.patch
Description: Binary data


v2-0008-Split-MaxHeapTupleSize-into-Limit-and-Dynamic-var.patch
Description: Binary data


v2-0006-Split-MaxHeapTuplesPerPage-into-Limit-and-Dynamic.patch
Description: Binary data


v2-0009-Split-MaxTIDsPerBTreePage-into-Limit-and-Dynamic-.patch
Description: Binary data


Re: [PATCHES] Post-special page storage TDE support

2023-11-29 Thread David Christensen
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund  wrote:

> Hi,
>
> - IMO the patch touches many places it shouldn't need to touch, because of
>   essentially renaming a lot of existing macro names to *Limit,
>   necessitating modifying a lot of users. I think instead the few places that
>   care about the runtime limit should be modified.
>
>   As-is the patch would cause a lot of fallout in extensions that just do
>   things like defining an on-stack array of Datums or such - even though all
>   they'd need is to change the define to the *Limit one.
>
>   Even leaving extensions aside, it just makes reviewing (and I'm sure
>   maintaining) the patch very tedious.
>

Hi Andres et al,

So I've been looking at alternate approaches to this issue and considering
how to reduce churn, and I think we still need the *Limit variants.  Let's
take a simple example:

Just looking at MaxHeapTuplesPerPage and breaking down instances in the
code, loosely partitioning into whether it's used as an array index or
other usage (this doesn't discriminate between code and comments, unfortunately),
we get the following breakdown:

$ git grep -hoE [[]?MaxHeapTuplesPerPage | sort | uniq -c
 18 [MaxHeapTuplesPerPage
 51 MaxHeapTuplesPerPage

This gives 18 places where we would need to adjust, in a fairly
mechanical fashion, to use MaxHeapTuplesPerPageLimit instead of
MaxHeapTuplesPerPage, versus some significant fraction of the 51 other
uses--even if you assumed half of those were in comments, there would
presumably need to be some sort of adjustments in verbiage, since we are
going to be changing some of the interpretation.

I am working on a patch to clean up some of the assumptions that smgr
currently makes about its space usage and how the individual access methods
consider it, as they should only be calculating things based on how much
space is available after smgr is done with it.  That has traditionally been
BLCKSZ - SizeOfPageHeaderData, but this patch (included) factors that out
into a single expression that we can now use in access methods, so we can
then reserve additional page space and not need to adjust the access
methods further.

Building on top of this patch, we'd define something like this to handle
the #defines that need to be dynamic:

extern Size reserved_page_space;

#define PageUsableSpace (BLCKSZ - SizeOfPageHeaderData - reserved_page_space)
#define MaxHeapTuplesPerPage CalcMaxHeapTuplesPerPage(PageUsableSpace)
#define MaxHeapTuplesPerPageLimit \
    CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)
#define CalcMaxHeapTuplesPerPage(freesize) \
    ((int) ((freesize) / \
            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

In my view, extensions that expect to need no changes when these
interpretations change are better off only needing to change their static
allocations in a mechanical sense than having to revisit every other use in
their code; this seems more likely to guarantee a correct result than
exceeding the page space and overwriting things you didn't intend to because
you weren't aware you needed to check the dynamic limits on your own.
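
For the static-allocation case, the mechanical change would just be
(array name made up for illustration):

/* before: no longer a compile-time constant once page space is reserved */
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];

/* after: sized to the compile-time maximum */
OffsetNumber deadoffsets[MaxHeapTuplesPerPageLimit];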

Take another thing which would need adjusting for reserving page space,
MaxHeapTupleSize:

$ git grep -ohE '[[]?MaxHeapTupleSize' | sort | uniq -c
  3 [MaxHeapTupleSize
 16 MaxHeapTupleSize

Here there are 3 static arrays which would need to be adjusted vs 16 other
instances. If we kept the MaxHeapTupleSize interpretation the same and didn't
adjust an extension, it would compile just fine, but with too large a
length compared to the smaller PageUsableSpace, so you could conceivably
overwrite into the reserved space depending on what you were doing.

(Since by definition reserved_page_space >= 0, PageUsableSpace will
always be <= BLCKSZ - SizeOfPageHeaderData, so any expression derived
from it will be smaller.)

In short, I think the approach I took originally will actually reduce
errors out of core, and the churn, while unfortunate, is necessary churn.

I can produce a second patch which implements this calc/limit atop this
first one as well.

Thanks,

David


0001-Create-PageUsableSpace-to-represent-space-post-smgr.patch
Description: Binary data


Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread David Christensen
On Mon, Nov 13, 2023 at 2:52 PM Andres Freund  wrote:

>
> > This scheme would open up space per page that would now be available for
> > plenty of other things; the encoding in the header and the corresponding
> > available space in the footer would seem to open up quite a few options
> > now, no?
>
> Sure, if you're willing to rewrite the whole cluster to upgrade and willing
> to permanently sacrifice some data density.  If the stored data is actually
> specific to the page - that is the place to put the data. If not, then the
> tradeoff is much more complicated IMO.
>
> Of course this isn't a new problem - storing the page size on each page was
> just silly, it's never going to change across the cluster and even more
> definitely not going to change within a single relation.
>

Crazy idea: since the stored page size is already a fixed cost that likely
isn't going away, what if, instead of the pd_checksum field, we reinterpret
pd_pagesize_version?  A value of 4 would mean "no page features", but
anything 5 or higher could be looked up as an external page feature set,
with storage semantics outside of the realm of the page itself (other than
what the page features code itself needs to know); i.e., move away from the
on-page bitmap into a more abstract representation of features, which could
be something along the lines of what you were suggesting, including
extension support.

It seems like this could also support adding/removing features on page
read/write as long as there was sufficient space in the reserved page
space: read the old feature set on page read, convert to the new feature
set, and write the page back out with the additional/changed format.
Obviously there would be bookkeeping to be done in terms of making sure all
pages had been converted from one format to another, but at the page level
this would be straightforward.

Just thinking aloud here...

David


Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread David Christensen
On Mon, Nov 13, 2023 at 2:27 PM Andres Freund  wrote:

> Hi,
>
> On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> > On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost  wrote:
> > > In conversations with folks (my memory specifically is a discussion with
> > > Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> > > there was a pretty strong push that a page should be able to 'stand
> > > alone' and not depend on something else (eg: pg_control, or whatever) to
> > > provide info needed to be able to interpret the page.  For my part, I
> > > don't have a particularly strong feeling on that, but that's what led to
> > > this design.
> >
> > The term that I have used in the past is "self-contained". Meaning
> > capable of being decoded more or less as-is, without any metadata, by
> > tools like pg_filedump.
>
> I'm not finding that very convincing - without cluster wide data, like
> keys, a tool like pg_filedump isn't going to be able to do much with
> encrypted pages. Given the need to look at some global data, figuring out
> the offset at which data starts based on a value in pg_control isn't
> meaningfully worse than having the data on each page.
>
> Storing redundant data in each page header, when we've wanted space in the
> page header for plenty other things, just doesn't seem a good use of said
> space.
>

This scheme would open up space per page that would now be available for
plenty of other things; the encoding in the header and the corresponding
available space in the footer would seem to open up quite a few options
now, no?


Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread David Christensen
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund  wrote:

> Hi,
>
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen 
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will
> > be ultimately used for encrypted data, extended checksums, and potentially
> > other things.  This data appears at the end of the Page, after any
> > `pd_special` area, and will be calculated at runtime based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing
> > clusters for `pg_upgrade`, as we will require logical replication to move
> > data into a cluster with different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't be
> supported across this commit, rather than just between different
> settings...
>

Thanks for the review.


> I think as a whole this is not an insane idea. A few comments:
>
> - IMO the patch touches many places it shouldn't need to touch, because of
>   essentially renaming a lot of existing macro names to *Limit,
>   necessitating modifying a lot of users. I think instead the few places that
>   care about the runtime limit should be modified.
>
>   As-is the patch would cause a lot of fallout in extensions that just do
>   things like defining an on-stack array of Datums or such - even though all
>   they'd need is to change the define to the *Limit one.
>
>   Even leaving extensions aside, it just makes reviewing (and I'm sure
>   maintaining) the patch very tedious.
>

You make a good point, and I think you're right that we could teach the
places that care about runtime vs compile time differences about the
changes while leaving other callers alone. The *Limit ones were introduced
since we need constant values here from the Calc...() macros, but we could try
keeping the existing *Limit with the old name and switching things around.
I suspect there will be the same amount of code churn, but less mechanical.


> - I'm a bit worried about how the extra special page will be managed - if
>   there are multiple features that want to use it, who gets to put their
>   data at what offset?
>
>   After writing this I saw that 0002 tries to address this - but I don't
>   like the design. It introduces runtime overhead that seems likely to be
>   visible.
>

Agreed this could be optimized.


> - Checking for features using PageGetFeatureOffset() seems the wrong
>   design to me - instead of a branch for some feature being disabled,
>   perfectly predictable for the CPU, we need to do an external function
>   call every time to figure out that yes, checksums are *still* disabled.
>

This is probably not a supported approach (it felt a little icky), but I'd
played around with const pointers to structs of const elements, where the
initial values of a global var were populated early on (so set once and
never changed post-init), and the compiler didn't complain and things
seemed to work OK; not sure if this approach might help balance the early
mutability and constant lookup needs:

typedef struct PageFeatureOffsets {
  const Size feature0offset;
  const Size feature1offset;
  ...
} PageFeatureOffsets;

PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

void InitOffsets() {
  *((Size *) &offsets.feature0offset) = ...;
  *((Size *) &offsets.feature1offset) = ...;
  ...
}

> - Recomputing offsets every time in PageGetFeatureOffset() seems too
>   expensive. The offsets can't change while running, as
>   PageGetFeatureOffset() doesn't have enough information to distinguish
>   between different kinds of relations
>

Yes, this was a simple approach for ease of implementation; there is
certainly a way to precompute a lookup table from the page feature bitmask
into the offsets themselves or otherwise precompute, turn the function
call into an inline/macro, etc.


>   - so why do we need to recompute offsets on every single page?  I'd
>   instead add a distinct offset variable for each feature.
>

This would work iff there is a single page feature set across all pages in
the cluster; I'm not sure we don't want more flexibility here.


> - Modifying every single PageInit() call doesn't make sense to me. That'll
>   just create a lot of breakage for - as far as I can tell - no win.
>

This was a placeholder to allow different features depending on page type;
to keep things simple for now I just used the same values here, but we
could move this inside PageInit() instead (again, assuming a single feature
set per cluster).


> - Why is it worth sacrificing space on every page to indicate which features
>   were enabled?

Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread David Christensen
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost  wrote:

> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > > From: David Christensen 
> > > Date: Tue, 9 May 2023 16:56:15 -0500
> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> > >
> > > This space is reserved for extended data on the Page structure which
> > > will be ultimately used for encrypted data, extended checksums, and
> > > potentially other things.  This data appears at the end of the Page,
> > > after any `pd_special` area, and will be calculated at runtime based on
> > > specific ControlFile features.
> > >
> > > No effort is made to ensure this is backwards-compatible with existing
> > > clusters for `pg_upgrade`, as we will require logical replication to
> > > move data into a cluster with different settings here.
> >
> > The first part of the last paragraph makes it sound like pg_upgrade won't
> > be supported across this commit, rather than just between different
> > settings...
>

Yeah, that's vague, but you picked up on what I meant.


> > I think as a whole this is not an insane idea. A few comments:
>
> Thanks for all the feedback!
>
> > - Why is it worth sacrificing space on every page to indicate which
> >   features were enabled?  I think there'd need to be some convincing
> >   reasons for introducing such overhead.
>
> In conversations with folks (my memory specifically is a discussion with
> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> there was a pretty strong push that a page should be able to 'stand
> alone' and not depend on something else (eg: pg_control, or whatever) to
> provide info needed to be able to interpret the page.  For my part, I don't
> have a particularly strong feeling on that, but that's what led to this
> design.
>

Unsurprisingly, I agree that it's useful to keep these features on the page
itself; from a forensic standpoint it makes it much easier to interpret what
is happening, and it would also allow different features on
a given page or type of page depending on need.  The initial patch utilizes
pg_control to store the cluster page features, but there's no reason it
couldn't be dependent on fork/page type or stored in pg_tablespace to
utilize different features.

Thanks,

David


Re: Moving forward with TDE [PATCH v3]

2023-11-08 Thread David Christensen
On Tue, Nov 7, 2023 at 5:49 PM Andres Freund  wrote:

> Hi,
>
> On 2023-11-06 09:56:37 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > I still am quite unconvinced that using the LSN as a nonce is a good
> > > design decision.
> >
> > This is a really important part of the overall path to moving this
> > forward, so I wanted to jump to it and have a specific discussion
> > around this.  I agree that there are downsides to using the LSN, some of
> > which we could possibly address (eg: include the timeline ID in the IV),
> > but others that would be harder to deal with.
>
> > The question then is- what's the alternative?
> >
> > One approach would be to change the page format to include space for an
> > explicit nonce.  I don't see the community accepting such a new page
> > format as the only format we support though as that would mean no
> > pg_upgrade support along with wasted space if TDE isn't being used.
>
> Right.
>

Hmm, if we /were/ to introduce some sort of page format change, couldn't
that be a use case for modifying the pd_version field?  Could v4 pages be
read in and written out as v5 pages with different interpretations?


> > Ideally, we'd instead be able to support multiple page formats where
> > users could decide when they create their cluster what features they
> > want- and luckily we even have such an effort underway with patches
> > posted for review [1].
>
> I think there are some details wrong with that patch - IMO the existing
> macros should just continue to work as-is and instead the places that want
> the more narrow definition should be moved to the new macros, and it
> changes places that should continue to use compile time constants - but it
> doesn't seem like a fundamentally bad idea to me.  I certainly like it much
> better than making the page size runtime configurable.
>

There had been some discussion about this WRT renaming macros and the like
(constants, etc); I think a new pass that eliminates the variable blocksize
pieces and sees how much we can minimize churn is worthwhile. I'll take a
look and see what the minimally viable set of changes is here.


> (I'll try to reply with the above points to [1])
>
>
> > Certainly, with the base page-special-feature patch, we could have an
> > option for users to choose that they want a better nonce than the LSN, or
> > we could bundle that assumption in with, say, the authenticated-encryption
> > feature (if you want authenticated encryption, then you want more from the
> > encryption system than the basics, and therefore we presume you also want
> > a better nonce than the LSN).
>
> I don't think we should support using the LSN as a nonce if we have an
> alternative. The cost and complexity overhead is just not worth it.  Yes,
> it'll be harder for users to migrate to encryption, but adding complexity
> elsewhere in the system to get an inferior result isn't worth it.
>

From my read, XTS (which I'd see as inferior to authenticated encryption,
but better than some other options) could use the LSN as an IV without
leakage concerns, perhaps mixing in the BlockNumber as well.  If we are
going to allow multiple encryption types, I think we may need to consider
that IV needs may differ, so this may need to be something that is
selectable per encryption type.

I am unclear how much of a requirement this is, but it seems like a
pluggable design (even if just a static lookup table internally for
encryption type, block length, IV source, etc.) would be the most
future-proof if we ever had to retire an encryption method or prevent
creation of specific methods, say.
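
To sketch what I mean by a static internal table (the field names and
entries here are invented for illustration):

typedef enum IvSource
{
    IV_SOURCE_LSN,              /* derived from LSN (+ block number) */
    IV_SOURCE_STORED            /* explicit nonce stored with the page */
} IvSource;

typedef struct EncryptionMethod
{
    const char *name;
    int         key_len;        /* key length, in bytes */
    int         iv_len;         /* IV length, in bytes */
    int         tag_len;        /* AEAD tag bytes; 0 if unauthenticated */
    IvSource    iv_source;
    bool        creatable;      /* set false once a method is retired */
} EncryptionMethod;

static const EncryptionMethod encryption_methods[] = {
    {"aes-128-xts", 32, 16, 0, IV_SOURCE_LSN, true},
    {"aes-128-gcm", 16, 12, 16, IV_SOURCE_STORED, true},
};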


> > Another approach would be a separate fork, but that then has a number of
> > downsides too- every write has to touch that too, and a single page of
> > nonces would cover a pretty large number of pages also.
>
> Yea, the cost of doing so is nontrivial. If you were trying to implement
> encryption on the smgr level - which I doubt you should but am not certain
> about - my suggestion would be to interleave pages with metadata like
> nonces and AEAD with the data pages. I.e. one metadata page would be
> followed by
>   (BLCKSZ - SizeOfPageHeaderData) / (sizeof(nonce) + sizeof(AEAD))
> pages containing actual relation data.  That way you still get decent
> locality during scans and writes.
>

Hmm, this is actually an interesting idea, I will think about this a bit.
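
To make sure I follow, the bookkeeping would look something like this
(sketch; the per-page metadata sizes are invented and would depend on
the cipher):

#define NONCE_SIZE      16
#define AEAD_TAG_SIZE   16

/* data pages covered by one metadata page */
#define PAGES_PER_META \
    ((BLCKSZ - SizeOfPageHeaderData) / (NONCE_SIZE + AEAD_TAG_SIZE))

/* one metadata page leads each group of PAGES_PER_META data pages */
static inline BlockNumber
physical_blkno(BlockNumber logical)
{
    return logical + logical / PAGES_PER_META + 1;
}

/* the metadata page covering a given logical data block */
static inline BlockNumber
meta_blkno(BlockNumber logical)
{
    return (logical / PAGES_PER_META) * (PAGES_PER_META + 1);
}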


> Relation forks were a mistake, we shouldn't use them in more places.
>
>
> I think it'd be much better if we also encrypted forks, rather than just
> the main fork...
>

I believe the existing code should just work by modifying
the PageNeedsToBeEncrypted macro; I will test that and see if anything
blows up.

David


Re: Moving forward with TDE [PATCH v3]

2023-11-08 Thread David Christensen
On Tue, Nov 7, 2023 at 6:47 PM Andres Freund  wrote:

> Hi,
>
> On 2023-11-06 11:26:44 +0100, Matthias van de Meent wrote:
> > On Sat, 4 Nov 2023 at 03:38, Andres Freund  wrote:
> > > On 2023-11-02 22:09:40 +0100, Matthias van de Meent wrote:
> > > > I'm quite surprised at the significant number of changes being made
> > > > outside the core storage manager files. I thought that changing out
> > > > mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired)
> > > > would be the most obvious change to implement cluster-wide encryption
> > > > with the least code touched, as relations don't need to know whether
> > > > the files they're writing are encrypted, right? Is there a reason to
> > > > not implement this at the smgr level that I overlooked in the
> > > > documentation of these patches?
> > >
> > > You can't really implement encryption transparently inside an smgr
> > > without significant downsides. You need a way to store an initialization
> > > vector associated with the page (or you can store that elsewhere, but
> > > then you've doubled the worst case amount of random reads/writes). The
> > > patch uses the LSN as the IV (which I doubt is a good idea). For
> > > authenticated encryption further additional storage space is required.
> >
> > I am unaware of any user of the smgr API that doesn't also use the
> > buffer cache, and thus implicitly the Page layout with PageHeader
> > [^1]
>
> Everything indeed uses a PageHeader - but there are a number of places
> that do *not* utilize pd_lower/upper/special. E.g. visibilitymap.c just
> assumes that those fields are zero - and changing that wouldn't be
> trivial / free, because we do a lot of bitmasking/shifting with constants
> derived from
>
> #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
>
> which obviously wouldn't be constant anymore if you could reserve space
> on the page.
>

While no longer constants, I was able to get this working with variable
values here in a way that did not have the overhead of the original patch,
specifically for the vismap, by using Montgomery multiplication for the
division/mod.  This was actually the heaviest of the changes from moving to
runtime-calculated, so we might be able to use this approach in this
specific case even if only this change is required for this specific fork.
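
For reference, the core of the trick in its 16-bit form is small (a
Lemire-style sketch of the idea; the actual change may differ in
detail).  We precompute one constant per divisor, and division and
modulus then become multiplies and shifts:

#include <stdint.h>

/* precompute once for a divisor d, 1 < d < 2^16 */
static inline uint32_t
fastdiv16_init(uint16_t d)
{
    return UINT32_C(0xFFFFFFFF) / d + 1;
}

/* n / d, exact for all 16-bit n */
static inline uint16_t
fastdiv16(uint16_t n, uint32_t c)
{
    return (uint16_t) (((uint64_t) c * n) >> 32);
}

/* n % d, exact for all 16-bit n */
static inline uint16_t
fastmod16(uint16_t n, uint32_t c, uint16_t d)
{
    uint32_t    lowbits = c * n;    /* wraps mod 2^32 by design */

    return (uint16_t) (((uint64_t) lowbits * d) >> 32);
}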


> > The API of smgr is also tailored to page-sized quanta of data
> > with mostly relation-level information. I don't see why there would be
> > a veil covering the layout of Page for smgr when all other information
> > already points to the use of PageHeader and Page layouts. In my view,
> > it would even make sense to allow the smgr to get exclusive access to
> > some part of the page in the current Page layout.
> >
> > Yes, I agree that there will be an impact on usable page size if you
> > want authenticated encryption, and that AMs will indeed need to
> > account for storage space now being used by the smgr - inconvenient,
> > but it serves a purpose. That would happen regardless of whether smgr
> > or some higher system decides where to store the data for encryption -
> > as long as it is on the page, the AM effectively can't use those
> > bytes.
> > But I'd say that's best solved by making the Page documentation and
> > PageInit API explicit about the potential use of that space by the
> > chosen storage method (encrypted, plain, ...) instead of requiring the
> > various AMs to manually consider encryption when using Postgres' APIs
> > for writing data to disk without hitting shared buffers; page space
> > management is already a task of AMs, but handling the actual
> > encryption is not.
>
> I don't particularly disagree with any detail here - but to me reserving
> space for nonces etc at PageInit() time pretty much is the opposite of
> handling encryption inside smgr.
>

Originally, I was anticipating that we might want different space amounts
reserved on different classes of pages (apart from encryption), so while
we'd be storing the default page reserved size in pg_control we'd not be
limited to this in the structure of the page calls.  We could presumably
just move the logic into PageInit() itself if every reserved allocation is
the same and individual call sites wouldn't need to know about it.  The
call sites do have more context as to the requirements of the page or the
"type" of page in play, which if we made it dependent on page type would
need to get passed in somehow, which was where the reserved_page_size
parameter came into the current patch.

>
> > Should the AM really care whether the data on disk is encrypted or
> > not? I don't think so. When the disk contains encrypted bytes, but
> > smgrread() and smgrwrite() both produce and accept plaintext data,
> > who's going to complain? Requiring AMs to be mindful about encryption
> > on all common paths only adds pitfalls where encryption would be
> > forgotten by the developer of AMs in one path or another.
>
> I agree with that - I think the way the 

Re: Moving forward with TDE [PATCH v3]

2023-11-06 Thread David Christensen
On Fri, Nov 3, 2023 at 9:53 PM Andres Freund  wrote:

> On 2023-11-02 19:32:28 -0700, Andres Freund wrote:
> > > From 327e86d52be1df8de9c3a324cb06b85ba5db9604 Mon Sep 17 00:00:00 2001
> > > From: David Christensen 
> > > Date: Fri, 29 Sep 2023 15:16:00 -0400
> > > Subject: [PATCH v3 5/5] Add encrypted/authenticated WAL
> > >
> > > When using an encrypted cluster, we need to ensure that the WAL is also
> > > encrypted. While we could go with an page-based approach, we use
> instead a
> > > per-record approach, using GCM for the encryption method and storing
> the AuthTag
> > > in the xl_crc field.
>
> What was the reason for this decision?
>

This was mainly to prevent IV reuse by using per-record encryption rather
than per-page, since partial writes out of the WAL buffer would result in
reuse there.  This was somewhat of an experiment, since the authenticated
data per record is basically equivalent in function to the CRC.

There was a switch here so that normal clusters use the crc field with the
existing CRC implementation; only encrypted clusters use this alternate
approach.
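
In sketch form, the per-record path would look roughly like the following
(OpenSSL EVP; the helper shape, the key size, and keeping a full 16-byte
tag are illustrative choices here, not lifted from the patch, which stores
the tag where xl_crc would otherwise go):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/* Encrypt one WAL record body with AES-GCM, deriving the IV from the
 * record's LSN (unique per record, so no IV reuse); the resulting auth
 * tag takes over the integrity role the CRC used to play. */
static bool
encrypt_wal_record(const uint8_t key[32], uint64_t lsn,
                   const uint8_t *rec, int len,
                   uint8_t *out, uint8_t tag[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    uint8_t     iv[12] = {0};   /* 96-bit IV: the 64-bit LSN, zero-padded */
    int         n;
    bool        ok;

    memcpy(iv, &lsn, sizeof(lsn));
    ok = ctx != NULL
        && EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv) == 1
        && EVP_EncryptUpdate(ctx, out, &n, rec, len) == 1
        && EVP_EncryptFinal_ex(ctx, out + n, &n) == 1   /* GCM: emits nothing */
        && EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag) == 1;
    EVP_CIPHER_CTX_free(ctx);
    return ok;
}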


Re: Moving forward with TDE [PATCH v3]

2023-11-06 Thread David Christensen
Hi, thanks for the detailed feedback here.

I do think it's worth addressing the question Stephen raised as far as what
we use for the IV[1]; whether LSN or something else entirely, and if so
what.  The choice of LSN here is fairly fundamental to the existing
implementation, so if we decide to do something different some of this
might be moot.

Best,

David

[1]
https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg154613.html


Re: Moving forward with TDE [PATCH v3]

2023-10-31 Thread David Christensen
On Tue, Oct 31, 2023 at 4:30 PM Bruce Momjian  wrote:

> On Tue, Oct 31, 2023 at 04:23:17PM -0500, David Christensen wrote:
> > Greetings,
> >
> > I am including an updated version of this patch series; it has been
> > rebased onto 6ec62b7799 and reworked somewhat.
> >
> > The patches are as follows:
> >
> > 0001 - doc updates
> > 0002 - Basic key management and cipher support
> > 0003 - Backend-related changes to support heap encryption
> > 0004 - modifications to bin tools and programs to manage key rotation
> and add
> > other knowledge
> > 0005 - Encrypted/authenticated WAL
> >
> > These are very broad strokes at this point and should be split up a bit
> more to
> > make things more granular and easier to review, but I wanted to get this
> update
> > out.
>
> This lacks temp table file encryption, right?


Temporary /files/ are handled in a different patch set and are not included
here (not sure of the status of integrating it at this point).  I believe
that this patch should handle temporary heap files just fine, since they go
through the Page API.


Re: Initdb-time block size specification

2023-09-05 Thread David Christensen
On Tue, Sep 5, 2023 at 2:52 PM Hannu Krosing  wrote:
>
> Something I also asked at this years Unconference - Do we currently
> have Build Farm animals testing with different page sizes ?
>
> I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be
> done at least before each release if not continuously.
>
> -- Cheers
>
> Hannu

The regression tests currently have a lot of breakage when running
against non-standard block sizes, so I would assume the answer here is
no.  I would expect that we will want to add regression test variants
or otherwise normalize results so they will work with differing block
sizes, but have not done that yet.




Re: Initdb-time block size specification

2023-09-05 Thread David Christensen
On Tue, Sep 5, 2023 at 9:04 AM Robert Haas  wrote:
>
> On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra
>  wrote:
> > Except that no one is forcing you to actually go 130mph or 32mph, right?
> > You make it seem like this patch forces people to use some other page
> > size, but that's clearly not what it's doing - it gives you the option
> > to use smaller or larger block, if you chose to. Just like increasing
> > the speed limit to 130mph doesn't mean you can't keep going 65mph.
> >
> > The thing is - we *already* allow using different block size, except
> > that you have to do custom build. This just makes it easier.
>
> Right. Which is worth doing if it doesn't hurt performance and is
> likely to be useful to a lot of people, and is not worth doing if it
> will hurt performance and be useful to relatively few people. My bet
> is on the latter.

Agreed that this doesn't make sense if there are major performance
regressions; however, the goal here is patch evaluation, to measure this
against other workloads and see if that is the case.  From my localized
testing, things were within acceptable noise levels with the latest
version.

I agree with Tomas' earlier thoughts: we already allow different block
sizes, and if there are baked-in algorithmic assumptions about block
size (which there probably are), then identifying those or places in
the code where we need additional work or test coverage will only
improve things overall for those non-standard block sizes.

Best,

David




Re: Initdb-time block size specification

2023-08-31 Thread David Christensen
> I was definitely hand-waving additional implementation here for
> non-native 128-bit support; the modulus algorithm as presented
> requires operating at four times the width of the divisor, so a uint16
> implementation should work on all 64-bit machines.  Certainly open to
> other ideas or implementations; this was the one I was able to find
> initially.  If the 16-bit approach is all that is needed in practice, we
> can also see about narrowing the domain and not worry about making this
> a general-purpose function.

Here's a patch atop the series which converts to 16-bit uints and
passes regressions, but I don't consider it well-vetted at this point.

David


0001-Switch-to-using-only-16-bit-to-avoid-128-bit-require.patch
Description: Binary data


Re: Initdb-time block size specification

2023-08-31 Thread David Christensen
> + * pg_fastmod - calculates the modulus of a 32-bit number against a constant
> + * divisor without using the division operator
> + */
> +static inline uint32 pg_fastmod(uint32 n, uint32 divisor, uint64 fastinv)
> +{
> +#ifdef HAVE_INT128
> + uint64_t lowbits = fastinv * n;
> + return ((uint128)lowbits * divisor) >> 64;
> +#else
> + return n % divisor;
> +#endif
> +}
>
> Requiring 128-bit arithmetic to avoid serious regression is a non-starter as 
> written. Software that relies on fast 128-bit multiplication has to do 
> backflips to get that working on multiple platforms. But I'm not sure it's 
> necessary -- if the max block number is UINT32_MAX and max block size is 
> UINT16_MAX, can't we just use 64-bit multiplication?

I was definitely hand-waving additional implementation here for
non-native 128-bit support; the modulus algorithm as presented
requires operating at four times the width of the divisor, so a uint16
implementation should work on all 64-bit machines.  Certainly open to
other ideas or implementations; this was the one I was able to find
initially.  If the 16-bit approach is all that is needed in practice, we
can also see about narrowing the domain and not worry about making this
a general-purpose function.
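
For illustration, here is the kind of 64-bit-only variant I have in mind
(my sketch, not the patch itself): with the divisor below 2^16 and block
numbers below 2^32, n * divisor < 2^48, so a 48-bit fixed-point inverse
fits entirely within 64-bit arithmetic.

#include <stdint.h>

#define FASTMOD_SHIFT 48

/* ceil(2^48 / divisor); divisor must be nonzero */
static inline uint64_t
pg_fastinv16(uint16_t divisor)
{
    return ((UINT64_C(1) << FASTMOD_SHIFT) / divisor) + 1;
}

/*
 * Sketch: n % divisor for 32-bit n and 16-bit divisor using only 64-bit
 * multiplies.  Valid because (2^32 - 1) * (2^16 - 1) < 2^48.
 */
static inline uint16_t
pg_fastmod16(uint32_t n, uint16_t divisor, uint64_t fastinv)
{
    uint64_t lowbits = (fastinv * n) & ((UINT64_C(1) << FASTMOD_SHIFT) - 1);

    return (uint16_t) ((lowbits * divisor) >> FASTMOD_SHIFT);
}

The inverse would be computed once (say, at startup for the configured
divisor) and reused in the hot paths in place of the % operator.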

Best,

David




Re: Initdb-time block size specification

2023-08-31 Thread David Christensen
Enclosed are TPC-H results for 1GB shared_buffers, 64MB work_mem on a
64GB laptop with SSD storage; everything else is default settings.

TL;DR: unpatched version: 17.30 seconds, patched version: 17.15 seconds;
there are some slight variations in runtime, but this seems to be within
the noise level at this point.


tpch.unpatched.rst
Description: Binary data


tpch.patched.rst
Description: Binary data


Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 4:17 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-06-30 15:05:54 -0500, David Christensen wrote:
> > > I am fairly certain this is going to be causing substantial performance
> > > regressions.  I think we should reject this even if we don't immediately 
> > > find
> > > them, because it's almost guaranteed to cause some.
> >
> > What would be considered substantial? Some overhead would be expected,
> > but I think having an actual patch to evaluate lets us see what
> > potential there is.
>
> Anything beyond 1-2%, although even that imo is a hard sell.

I'd agree that that threshold seems like a reasonable target, and
anything much above that would be regressive.

> > > Besides this, I've not really heard any convincing justification for 
> > > needing
> > > this in the first place.
> >
> > Doing this would open up experiments in larger block sizes, so we
> > would be able to have larger indexable tuples, say, or be able to
> > store data types that are larger than currently supported for tuple
> > row limits without dropping to toast (native vector data types come to
> > mind as a candidate here).
>
> You can do experiments today with the compile time option. Which does not
> require regressing performance for everyone.

Sure, not arguing that this is more performant than the current approach.

> > We've had 8k blocks for a long time while hardware has improved over 20+
> > years, and it would be interesting to see how tuning things would open up
> > additional avenues for performance without requiring packagers to make a
> > single choice on this regardless of use-case.
>
> I suspect you're going to see more benefits from going to a *lower* setting
> than a higher one. Some practical issues aside, plenty of storage hardware
> these days would allow to get rid of FPIs if you go to 4k blocks (although it
> often requires explicit sysadmin action to reformat the drive into that mode
> etc).  But obviously that's problematic from the "postgres limits" POV.
>
>
> If we really wanted to do this - but I don't think we do - I'd argue for
> working on the buildsystem support to build the postgres binary multiple
> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> exec's the relevant "real" binary based on the pg_control value.  I really
> don't see us ever wanting to make BLCKSZ runtime configurable within one
> postgres binary. There's just too much intrinsic overhead associated with
> that.

You may well be right, but I think if we haven't tried to do that and
measured it, it's hard to say exactly.  There are of course more parts
of the system that depend on BLCKSZ than the backend, plus you'd need
to build other extensions to support each option, so there is a lot
more that would need to change.  (That's neither here nor there, as my
approach also involves changing all those places, so change isn't
inherently bad; I'm just saying that merely iterating over the block
size for binaries is not a trivial solution.)

David




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 4:27 PM Tomas Vondra
 wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> >
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value.  I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
> >
>
> I don't quite understand why we shouldn't do this (or at least try to).
> IMO the benefits of using smaller blocks were substantial (especially
> for 4kB, most likely due matching the internal SSD page size). The other
> benefits (reducing WAL volume) seem rather interesting too.

If it's dynamic, we could also add detection of the best block size at
initdb time, leading to improvements all around, say.
:-)

> Sure, there are challenges (e.g. the overhead due to making it dynamic).
> No doubt about that.

I definitely agree that it seems worth looking into.  If nothing else,
being able to quantify just what/where the overhead comes from may be
instructive for future efforts.

David




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 4:12 PM Tomas Vondra
 wrote:
>
>
>
> On 6/30/23 22:05, David Christensen wrote:
> > On Fri, Jun 30, 2023 at 2:39 PM Andres Freund  wrote:
> >>
> >> ...
> >>
> >> Besides this, I've not really heard any convincing justification for 
> >> needing
> >> this in the first place.
> >
> > Doing this would open up experiments in larger block sizes, so we
> > would be able to have larger indexable tuples, say, or be able to
> > store data types that are larger than currently supported for tuple
> > row limits without dropping to toast (native vector data types come to
> > mind as a candidate here).  We've had 8k blocks for a long time while
> > hardware has improved over 20+ years, and it would be interesting to
> > see how tuning things would open up additional avenues for performance
> > without requiring packagers to make a single choice on this regardless
> > of use-case.  (The fact that we allow compiling this at a different
> > value suggests there is thought to be some utility having this be
> > something other than the default value.)
> >
> > I just think it's one of those things that is hard to evaluate without
> > actually having something specific, which is why we have this patch
> > now.
> >
>
> But it's possible to evaluate that - you just need to rebuild with a
> different configuration option. Yes, allowing doing that at initdb is
> simpler and allows testing this on systems where rebuilding is not
> convenient. And having a binary that can deal with any block size would
> be nice too.
>
> In fact, I did exactly that a year ago for a conference, and I spoke
> about it at the 2022 unconference too. Not sure if there's recording
> from pgcon, but there is one from the other conference [1][2].
>
> The short story is that the possible gains are significant (say +50%)
> for data sets that don't fit into RAM. But that was with block size set
> at compile time, the question is what's the impact of making it a
> variable instead of a macro 
>
> [1] https://www.youtube.com/watch?v=mVKpoQxtCXk
> [2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf

Cool, thanks for the video links; will review.

David




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 3:29 PM Andres Freund  wrote:
>> And indeed. Comparing e.g. TPC-H, I see *massive* regressions.  Some queries
> are the same, sobut others regress by up to 70% (although more commonly around
> 10-20%).

Hmm, that is definitely not good.

> That's larger than I thought, which makes me suspect that there's some bug in
> the new code.

Will do a little profiling here to see if I can figure out the
regression. Which build optimization settings are you seeing this
under?

> Interestingly, repeating the benchmark with a larger work_mem setting, the
> regressions are still quite present, but smaller. I suspect the planner
> chooses smarter plans which move bottlenecks more towards hashjoin code etc,
> which won't be affected by this change.

Interesting.

> IOW, you seriously need to evaluate analytics queries before this is worth
> looking at further.

Makes sense, thanks for reviewing.

David




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 2:39 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-06-30 12:35:09 -0500, David Christensen wrote:
> > As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
> > variable block sizes; basically instead of BLCKSZ being a compile-time
> > constant, a single set of binaries can support all of the block sizes
> > Pg can support, using the value stored in pg_control as the basis.
> > (Possible future plans would be to make this something even more
> > dynamic, such as configured per tablespace, but this is out of scope;
> > this just sets up the infrastructure for this.)
>
> I am extremely doubtful this is a good idea. For one it causes a lot of churn,
> but more importantly it turns currently cheap code paths into more expensive
> ones.
>
> Changes like
>
> > -#define BufHdrGetBlock(bufHdr)   ((Block) (BufferBlocks + ((Size) 
> > (bufHdr)->buf_id) * BLCKSZ))
> > +#define BufHdrGetBlock(bufHdr)   ((Block) (BufferBlocks + ((Size) 
> > (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))
>
> Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is
> actually variable.

Correct; that is mainly a notational device which would be easy enough
to change (and presumably would follow along the lines of the commit
Tomas pointed out above).

> I am fairly certain this is going to be causing substantial performance
> regressions.  I think we should reject this even if we don't immediately find
> them, because it's almost guaranteed to cause some.

What would be considered substantial? Some overhead would be expected,
but I think having an actual patch to evaluate lets us see what
potential there is. It seems like this will likely be optimized as an
offset stored in a register, so I wouldn't expect huge changes here.
(There may be other approaches I haven't thought of in terms of
getting this.)

> Besides this, I've not really heard any convincing justification for needing
> this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here).  We've had 8k blocks for a long time while
hardware has improved over 20+ years, and it would be interesting to
see how tuning things would open up additional avenues for performance
without requiring packagers to make a single choice on this regardless
of use-case.  (The fact that we allow compiling this at a different
value suggests there is thought to be some utility having this be
something other than the default value.)

I just think it's one of those things that is hard to evaluate without
actually having something specific, which is why we have this patch
now.

Thanks,

David




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra
 wrote:

> Do we really want to prefix the option with CLUSTER_? That seems to just
> add a lot of noise into the patch, and I don't see much value in this
> rename. I'd prefer keeping BLCKSZ and tweak just the couple places that
> need "default" to use BLCKSZ_DEFAULT or something like that.
>
> But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time
> values, so after making this initdb-time parameter we should abandon
> that (just like fc49e24fa69a did for segment sizes). Perhaps something
> like cluste_block_size would work ... (yes, that touches all the places
> too).

Yes, I can see that being an equivalent change; thanks for the pointer
there. Definitely "cluster_block_size" could be an approach, though
since it's currently just a #define for GetBlockSize(), maybe we should
just replace it with the equivalent instead. I was mainly trying to
make it something that was conceptually similar and easy to reason
about without getting bogged down in the details locally, but I can see
that ALL_CAPS does have a specific meaning.  Also, eliminating the
BLCKSZ symbol meant it was easier to catch anything which depended on
that value.  If we wanted to keep BLCKSZ rather than the
CLUSTER_BLOCK_SIZE approach, I'd be okay with that at this point; it
could help make the patch smaller.

> > Initial (basic) performance testing shows only minor changes with the 
> > pgbench -S
> > benchmark, though this is obviously an area that will need considerable
> > testing/verification across multiple workloads.
> >
>
> I wonder how to best evaluate/benchmark this. AFAICS what we want to
> measure is the extra cost of making the values dynamic (which means the
> compiler can't just optimize them out). I'd say a "pgbench -S" seems
> like a good test.

Yep, I tested 100 runs apiece with pgbench -S at scale factor 100,
default settings, for optimized builds of the same base commit with and
without the patch; I saw a TPS reduction of around 1% in that case, but
I do think we'd want to look at different workloads; I presume the
largest impact would be seen when it's all in shared_buffers and no IO
is required.

Thanks,

David




Re: Add TLI number to name of files generated by pg_waldump --save-fullpage

2023-06-27 Thread David Christensen
> Adjusted as per the v2 attached.

+1





Re: Add TLI number to name of files generated by pg_waldump --save-fullpage

2023-06-27 Thread David Christensen
On Tue, Jun 27, 2023 at 1:12 AM Michael Paquier  wrote:
>
> Hi all,
> (Fujii-san and David in CC.)
>
> Fujii-san has reported on Twitter that we had better add the TLI
> number to what pg_waldump --save-fullpage generates for the file names
> of the blocks, as it could be possible that we overwrite some blocks.
> This information can be added thanks to ws_tli, that tracks the TLI of
> the opened segment.
>
> Attached is a patch to fix this issue, adding an open item assigned to
> me.  The file format is documented in the TAP test and the docs, the
> two only places that would need a refresh.
>
> Thoughts or comments?

Patch looks good, but agreed that that comment should also be fixed.

Thanks!

David




Re: [PATCHES] Post-special page storage TDE support

2023-05-30 Thread David Christensen
On Fri, May 12, 2023 at 7:48 PM Stephen Frost  wrote:
>
> Greetings,
>
> * David Christensen (david.christen...@crunchydata.com) wrote:
> > Refreshing this with HEAD as of today, v4.
>
> Thanks for updating this!

Thanks for the patience in my response here.

> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will 
> > be ultimately used for
> > encrypted data, extended checksums, and potentially other things.  This 
> > data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime 
> > based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing 
> > clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with 
> > different settings here.
>
> This initial patch, at least, does maintain pg_upgrade as the
> reserved_page_size (maybe not a great name?) is set to 0, right?
> Basically this is just introducing the concept of a reserved_page_size
> and adjusting all of the code that currently uses BLCKSZ or
> PageGetPageSize() to account for this extra space.

Correct; a reserved_page_size of 0 would be the same page format as
currently exists, so you could use pg_upgrade with no page features
and be binary compatible with existing clusters.

> Looking at the changes to bufpage.h, in particular ...
>
> > diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
>
> > @@ -19,6 +19,14 @@
> >  #include "storage/item.h"
> >  #include "storage/off.h"
> >
> > +extern PGDLLIMPORT int reserved_page_size;
> > +
> > +#define SizeOfPageReservedSpace() reserved_page_size
> > +#define MaxSizeOfPageReservedSpace 0
> > +
> > +/* strict upper bound on the amount of space occupied we have reserved on
> > + * pages in this cluster */
>
> This will eventually be calculated based on what features are supported
> concurrently?

Correct; these are fleshed out in later patches.

> > @@ -36,10 +44,10 @@
> >   * |              v pd_upper                              |
> >   * +--------------+---------------------------------------+
> >   * |              | tupleN ...                            |
> > - * +--------------+-----------------+---------------------+
> > - * |    ... tuple3 tuple2 tuple1    |   "special space"   |
> > - * +--------------------------------+---------------------+
> > - *                                  ^ pd_special
> > + * +-------------------+-----------------+----------------+
> > + * | ... tuple2 tuple1 | "special space" |   "reserved"   |
> > + * +-------------------+-----------------+----------------+
> > + *                     ^ pd_special      ^ reserved_page_space
>
> Right, adds a dynamic amount of space 'post-special area'.

Dynamic as in "fixed at initdb time" instead of at compile time. However,
things are coded in such a way that the page feature bitmap is stored
on a given page, so different pages could have different
reserved_page_size values depending on use case/code path. (Basically
preserving future flexibility while minimizing code changes here.)  We
could utilize different features depending on what type of page it is,
say, or have different relations or tablespaces with different page
feature defaults.
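
As a concrete illustration of that flexibility (all names here are my
assumptions, not the patch's API), the reserved amount could simply be
derived from whatever feature bits a page carries:

#include <stdint.h>

typedef enum PageFeature
{
    PF_EXTENDED_CHECKSUM = 1 << 0,  /* say, an 8-byte checksum */
    PF_AUTH_TAG          = 1 << 1   /* say, a 16-byte GCM authentication tag */
} PageFeature;

/* Sketch: reserved space is the sum of the sizes of the enabled features */
static uint16_t
reserved_space_for(uint32_t feature_bits)
{
    uint16_t    size = 0;

    if (feature_bits & PF_EXTENDED_CHECKSUM)
        size += 8;
    if (feature_bits & PF_AUTH_TAG)
        size += 16;

    return size;
}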

> > @@ -73,6 +81,8 @@
> >   * stored as the page trailer.  an access method should always
> >   * initialize its pages with PageInit and then set its own opaque
> >   * fields.
> > + *
> > + * XXX - update more comments here about reserved_page_space
> >   */
>
> Would be good to do. ;)

Next revision... :D

> > @@ -325,7 +335,7 @@ static inline void
> >  PageValidateSpecialPointer(Page page)
> >  {
> >   Assert(page);
> > - Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > + Assert((((PageHeader) page)->pd_special + reserved_page_size) <=
> > BLCKSZ);
> >   Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
> >  }
>
> This is just one usage ... but seems like maybe we should be using
> PageGetPageSize() here instead of BLCKSZ, and that more-or-less
> throughout?  Nearly everywhere we're using BLCKSZ today to give us that
> compile-time advantage of a fixed block size is going to lose that
> advantage anyway thanks to reserved_page_size being run-time.  Now, one
> up-side to this is that it'd also get us closer to being able to support
>

Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-07 Thread David Christensen
Ok, based on the interdiff there, I'm happy with that last change.  Marking
as Ready For Committer.

Best,

David


Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-07 Thread David Christensen
Reviewed v8; largely looking good, though I notice this hunk, which may
arguably be a bug fix, but doesn't appear to be relevant to this specific
patch, so could probably be debated independently (and if a bug, should
probably be backpatched):

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 4229d2048c..11d41979c6 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -288,6 +288,9 @@ InitPgFdwOptions(void)
  {"sslcert", UserMappingRelationId, true},
  {"sslkey", UserMappingRelationId, true},

+ /* gssencmode is also libpq option, same to above. */
+ {"gssencmode", UserMappingRelationId, true},
+
  {NULL, InvalidOid, false}
  };

That said, should "gssdeleg" be exposed as a user mapping?  (This shows up
in postgres_fdw; not sure if there are other places that would be
relevant, like in dblink somewhere as well; just a thought.)

Best,

David


Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-05 Thread David Christensen
On Wed, Apr 5, 2023 at 3:30 PM Stephen Frost  wrote:

> Greetings,
>
> * David Christensen (david...@pgguru.net) wrote:
> > Did a code review pass here; here is some feedback.
>
> Thanks!
>
> > + /* If password was used to connect, make sure it was one provided
> */
> > + if (PQconnectionUsedPassword(conn) &&
> dblink_connstr_has_pw(connstr))
> > + return;
> >
> >   Do we need to consider whether these passwords are the same?  Is there
> a different vector where a different password could be acquired from a
> different source (PGPASSWORD, say) while both of these criteria are true?
> Seems like it probably doesn't matter that much considering we only checked
> Password alone in previous version of this code.
>
> Note that this patch isn't really changing how these checks are being
> done but more moving them around and allowing a GSSAPI-based approach
> with credential delegation to also be allowed.
>
> That said, as noted in the comments above dblink_connstr_check():
>
>  * For non-superusers, insist that the connstr specify a password, except
>  * if GSSAPI credentials have been proxied (and we check that they are used
>  * for the connection in dblink_security_check later).  This prevents a
>  * password or GSSAPI credentials from being picked up from .pgpass, a
>  * service file, the environment, etc.  We don't want the postgres user's
>  * passwords or Kerberos credentials to be accessible to non-superusers.
>
> The point of these checks is, indeed, to ensure that environmental
> values such as a .pgpass or variable don't end up getting picked up and
> used (or, if they do, we realize it post-connection and then throw away
> the connection).
>
> libpq does explicitly prefer to use the password passed in as part of
> the connection string and won't attempt to look up passwords in a
> .pgpass file or similar if a password has been included in the
> connection string.
>

The case I was thinking of was (manufactured) when we connected to a
backend with one password but the dblink or postgres_fdw connection string
includes an explicit password to a different server.  But now I'm thinking
that this PQconnectionUsedPassword() is checking the outgoing connection for
dblink itself, not the connection of the backend that connected to the main
server, so I think this objection is moot, like you say.
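
For concreteness, the shape of the check being discussed is roughly this
(a simplified sketch loosely modeled on dblink's dblink_security_check;
the function name and error wording are mine, not the patch's):

/*
 * Sketch only: reject a non-superuser's outbound connection unless its
 * secret came from the connstr itself or from delegated GSSAPI creds,
 * rather than being picked up from the environment.
 */
static void
security_check_sketch(PGconn *conn, const char *connstr)
{
    if (superuser())
        return;

    /* delegated GSSAPI credentials are the newly-allowed case */
    if (PQconnectionUsedGSSAPI(conn) && be_gssapi_get_proxy(MyProcPort))
        return;

    /* otherwise a password must have been used, from the connstr itself */
    if (PQconnectionUsedPassword(conn) && dblink_connstr_has_pw(connstr))
        return;

    PQfinish(conn);
    elog(ERROR, "password or delegated GSSAPI credentials required");
}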

> Looks like the pg_gssinfo struct hides the `proxy_creds` def behind:
> >
> > #if defined(ENABLE_GSS) | defined(ENABLE_SSPI)
> > typedef struct
> > {
> > gss_buffer_desc outbuf;   /* GSSAPI output token
> buffer */
> > #ifdef ENABLE_GSS
> > ...
> > bool  proxy_creds;/* GSSAPI Delegated/proxy
> credentials */
> > #endif
> > } pg_gssinfo;
> > #endif
>
> ... right, proxy_creds only exists (today anyway) if ENABLE_GSS is set.
>
> > Which means that in the later check, `be_gssapi_get_proxy()`, we have:


 [analysis snipped]

> Fairly confident the analysis here is wrong; further, the cfbot seems to
> agree that there isn't a compile failure here:
>
> https://cirrus-ci.com/task/6589717672624128
>
> [20:19:15.985] gss: NO
>
> (we always build with SSPI on Windows, per
> src/include/port/win32_port.h).
>

Cool; since we have coverage for that case, it seems my concern was
unwarranted.

[snip]


> > 
> > 
> >  Only superusers may connect to foreign servers without password
> > -authentication, so always specify the password
> option
> > -for user mappings belonging to non-superusers.
> > +authentication or using gssapi proxied credentials, so specify the
> > +password option for user mappings belonging to
> > +non-superusers who are not able to proxy GSSAPI credentials.
> > 
> > 
> >
> > s/gssapi/GSSAPI/; this is kind of confusing, as this makes it sound like
> only superuser may use GSSAPI proxied credentials, which I disbelieve to be
> true.  Additionally, it sounds like you're wanting to explicitly maintain a
> denylist for users to not be allowed proxying; is that correct?
>
> Updated to GSSAPI and reworded in the updated patch (attached).
> Certainly open to suggestions on how to improve the documentation here.
> There is no 'denylist' for users when it comes to GSSAPI proxied
> credentials.  If there's a use-case for that then it could be added in
> the future.
>

Okay, I think your revisions here seem more clear, thanks.


>
> > ---
> >
> > libpq/auth.c:
> >
> >   if (proxy != NULL)
> >   {
> >   pg_store_proxy_creden

Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-03 Thread David Christensen
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Did a code review pass here; here is some feedback.


+   /* If password was used to connect, make sure it was one provided */
+   if (PQconnectionUsedPassword(conn) && dblink_connstr_has_pw(connstr))
+   return;

  Do we need to consider whether these passwords are the same?  Is there a 
different vector where a different password could be acquired from a different 
source (PGPASSWORD, say) while both of these criteria are true?  Seems like it 
probably doesn't matter that much considering we only checked Password alone in 
previous version of this code.

---

Looks like the pg_gssinfo struct hides the `proxy_creds` def behind:

#if defined(ENABLE_GSS) | defined(ENABLE_SSPI)
typedef struct
{
gss_buffer_desc outbuf; /* GSSAPI output token buffer */
#ifdef ENABLE_GSS
...
boolproxy_creds;/* GSSAPI Delegated/proxy credentials */
#endif
} pg_gssinfo;
#endif

Which means that in the later check, `be_gssapi_get_proxy()`, we have:

/*
 * Return if GSSAPI delegated/proxy credentials were included on this
 * connection.
 */
bool
be_gssapi_get_proxy(Port *port)
{
if (!port || !port->gss)
return NULL;

return port->gss->proxy_creds;
}

So in theory it'd be possible to have SSPI enabled but GSS disabled, and we'd
fail to compile in that case.  (It may be that this routine is never *actually*
called in that case; I'm just noting compile-time considerations.)  I'm not
seeing guards in the actual PQ* routines, but I don't think I've done an
exhaustive search.
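
One way to sidestep the hazard (a sketch of a possible guard, not something
from the patch) would be to compile the GSS-specific body only under
ENABLE_GSS, which would also fix the NULL-return-from-a-bool nit:

#include "postgres.h"
#include "libpq/libpq-be.h"

bool
be_gssapi_get_proxy(Port *port)
{
#ifdef ENABLE_GSS
    if (!port || !port->gss)
        return false;           /* bool function: false, not NULL */
    return port->gss->proxy_creds;
#else
    /* SSPI-only (or neither) build: no delegated GSS credentials */
    return false;
#endif
}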

---

gss_accept_deleg


+   
+Forward (delegate) GSS credentials to the server.  The default is
+disable which means credentials will not be 
forwarded
+to the server.  Set this to enable to have
+credentials forwarded when possible.

When is this not possible?  Server policy?  External factors?

---



 Only superusers may connect to foreign servers without password
-authentication, so always specify the password option
-for user mappings belonging to non-superusers.
+authentication or using gssapi proxied credentials, so specify the
+password option for user mappings belonging to
+non-superusers who are not able to proxy GSSAPI credentials.



s/gssapi/GSSAPI/; this is kind of confusing, as this makes it sound like only 
superuser may use GSSAPI proxied credentials, which I disbelieve to be true.  
Additionally, it sounds like you're wanting to explicitly maintain a denylist 
for users to not be allowed proxying; is that correct?

---

libpq/auth.c:

if (proxy != NULL)
{
pg_store_proxy_credential(proxy);
port->gss->proxy_creds = true;
}

Per GSS docs, seems like we should be comparing to GSS_C_NO_CREDENTIAL and 
validating that the gflags has the `deleg_flag` bit set before considering 
whether there are valid credentials; in practice this might be the same effect 
(haven't looked at what that symbol actually resolves to, but NULL would be 
sensible).

Are there other cases we might need to consider here, like valid credentials, 
but they are expired? (GSS_S_CREDENTIALS_EXPIRED)
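
For illustration, the stricter acceptance test I'm suggesting could look
something like this (GSS_C_NO_CREDENTIAL and GSS_C_DELEG_FLAG are standard
GSSAPI symbols; the helper name is hypothetical):

#include <stdbool.h>
#include <gssapi/gssapi.h>

/* gflags as returned by gss_accept_sec_context(); helper is hypothetical */
static bool
should_store_proxy_creds(gss_cred_id_t delegated, OM_uint32 gflags)
{
    return delegated != GSS_C_NO_CREDENTIAL &&
        (gflags & GSS_C_DELEG_FLAG) != 0;
}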

---

+   /*
+* Set KRB5CCNAME for this backend, so that later calls to 
gss_acquire_cred
+* will find the proxied credentials we stored.
+*/

So I'm not seeing this used elsewhere in the code; I assume this is just used
by the krb5 libs?

Similar questions for the other places where pg_gss_accept_deleg is used.

---

+int
+PQconnectionUsedGSSAPI(const PGconn *conn)
+{
+   if (!conn)
+   return false;
+   if (conn->gssapi_used)
+   return true;
+   else
+   return false;
+}

Micro-gripe: this routine seems like it could be simpler, though the compiler
probably generates the same thing either way, so maybe code clarity is better
as written:

int
PQconnectionUsedGSSAPI(const PGconn *conn)
{
return conn && conn->gssapi_used;
}

---

Anything required for adding meson support? I notice src/test/kerberos has 
Makefile updated, but no meson.build files are changed.

---

Two tests in src/test/kerberos/t/001_auth.pl at :535 and :545 have the same 
test description:

+   'succeeds with GSS-encrypted access required and hostgssenc hba and 
credentials not forwarded',

Since the first test has only `gssencmode` defined (so implicit `gssdeleg` 
value) and the second has `gssdeleg=disable` I'd suggest that the test on :545 
should have its description updated to add the word "explicitly":

'succeeds with GSS-encrypted access required and hostgssenc hba and 

Re: Improving btree performance through specializing by key shape, take 2

2023-01-12 Thread David Christensen
Hi Matthias,

I'm going to look at this patch series if you're still interested.  What was 
the status of your final performance testing for the 0008 patch alone vs the 
specialization series?  Last I saw on the thread you were going to see if the 
specialization was required or not.

Best,

David

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-26 Thread David Christensen
> On Dec 26, 2022, at 1:29 AM, Michael Paquier  wrote:
> 
> On Sat, Dec 24, 2022 at 06:23:29PM +0530, Bharath Rupireddy wrote:
>> Thanks for the patch. I've made the above change as well as renamed
>> the test file name to be save_fpi.pl, everything else remains the same
>> as v11. Here's the v12 patch which LGTM. I'll mark it as RfC -
>> https://commitfest.postgresql.org/41/3628/.
> 
> I have done a review of that, and here are my notes:
> - The variable names were a bit inconsistent, so I have switched most
> of the new code to use "fullpage".
> - The code was not able to handle the case of a target directory
> existing but empty, so I have added a wrapper on pg_check_dir().
> - XLogRecordHasFPW() could be checked directly in the function saving
> the blocks.  Still, there is no need for it as we apply the same
> checks again in the inner loop of the routine.
> - The new test has been renamed.
> - RestoreBlockImage() would report a failure and the code would just
> skip it and continue its work.  This could point out to a compression
> failure for example, so like any code paths calling this routine I
> think that we'd better do a pg_fatal() and fail hard.
> - I did not understand why there is a reason to make this option
> conditional on the record prints or even the stats, so I have moved
> the FPW save routine into a separate code path.  The other two could
> be silenced (or not) using --quiet for example, for the same result as
> v12 without impacting the usability of this feature.
> - Few tweaks to the docs, the --help output, the comments and the
> tests.
> - Indentation applied.
> 
> Being able to filter the blocks saved using start/end LSNs or just
> --relation is really cool, especially as the file names use the same
> order as what's needed for this option.

Sounds good; definitely along the lines of what I’d originally envisioned.

Thanks,

David





Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-23 Thread David Christensen
On Fri, Dec 23, 2022 at 12:57 PM David Christensen
 wrote:
>
> On Wed, Dec 21, 2022 at 5:47 AM Bharath Rupireddy
>  wrote:

[snip]

> > 2. +$node->init(extra => ['-k'], allows_streaming => 1);
> > When enabled with allows_streaming, there are a bunch of things that
> > happen to the node while initializing, I don't think we need all of
> > them for this.
>
> I think the "allows_streaming" was required to ensure the WAL files
> were preserved properly, and was the approach we ended up taking
> rather than trying to fail the archive_command or other approaches I'd
> taken earlier. I'd rather keep this if we can, unless you can propose
> a different approach that would continue to work in the same way.

Confirmed that we needed this in order to create the replication slot,
so this /is/ required for the test to work.

Enclosing v11 with yours and Michael's latest feedback.

Best,

David


v11-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-st.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-23 Thread David Christensen
On Mon, Dec 19, 2022 at 12:23 AM Michael Paquier  wrote:
>
> On Thu, Dec 15, 2022 at 05:17:46PM -0600, David Christensen wrote:
> > On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier  
> > wrote:
> > This v10 should incorporate your feedback as well as Bharath's.
>
> Thanks for the new version.  I have minor comments.
>
> >> It seems to me that you could allow things to work even if the
> >> directory exists and is empty.  See for example
> >> verify_dir_is_empty_or_create() in pg_basebackup.c.
> >
> > The `pg_mkdir_p()` supports an existing directory (and I don't think
> > we want to require it to be empty first), so this only errors when it
> > can't create a directory for some reason.
>
> Sure, but things can also be made so as we don't fail if the directory
> exists and is empty?  This would be more consistent with the base
> directories created by pg_basebackup and initdb.

I guess I'm feeling a little dense here; how is this failing if there
is an existing empty directory?

> >> +$node->safe_psql('postgres', <<EOF
> >> +SELECT 'init' FROM pg_create_physical_replication_slot('regress_pg_waldump_slot', true, false);
> >> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> >> +CHECKPOINT; -- required to force FPI for next writes
> >> +UPDATE test_table SET a = a + 1;
> >> Using an EOF to execute a multi-line query would be a first.  Couldn't
> >> you use the same thing as anywhere else?  009_twophase.pl just to
> >> mention one.  (Mentioned by Bharath upthread, where he asked for an
> >> extra opinion so here it is.)
> >
> > Fair enough; while it's idiomatic Perl to me, it's not a hill to die on.
> > Converted to a standard multiline string.
>
> By the way, knowing that we have an option called --fullpage, could it
> be better to use --save-fullpage for the option name?

Works for me.  I think I'd just wanted to avoid reformatting the
entire usage message which is why I'd gone with the shorter version.

> +   OPF = fopen(filename, PG_BINARY_W);
> +   if (!OPF)
> +   pg_fatal("couldn't open file for output: %s", filename);
> [..]
> +   if (fwrite(page, BLCKSZ, 1, OPF) != 1)
> +   pg_fatal("couldn't write out complete full page image to file: 
> %s", filename);
> These should be more generic, as in "could not open file \"%s\"" and
> "could not write file \"%s\"" as the file name provides all the
> information about what this writes.

Sure, will update.

Best,

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-23 Thread David Christensen
On Wed, Dec 21, 2022 at 5:47 AM Bharath Rupireddy
 wrote:
>
> On Fri, Dec 16, 2022 at 4:47 AM David Christensen
>  wrote:
> >
> > On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier  
> > wrote:
> > >
> > > On Wed, Dec 14, 2022 at 04:44:34PM -0600, David Christensen wrote:
> > > > I can get one sent in tomorrow.
> >
> > This v10 should incorporate your feedback as well as Bharath's.
>
> Thanks for the patch. Here're some minor comments:
>
> 1. +my $node =  PostgreSQL::Test::Cluster->new('primary');
> Can the name be other than 'primary' because we don't create a standby
> for this test? Something like - 'node_a' or 'node_extract_fpi' or some
> other.

Sure, no issues.

> 2. +$node->init(extra => ['-k'], allows_streaming => 1);
> When enabled with allows_streaming, there are a bunch of things that
> happen to the node while initializing, I don't think we need all of
> them for this.

I think the "allows_streaming" was required to ensure the WAL files
were preserved properly, and was the approach we ended up taking
rather than trying to fail the archive_command or other approaches I'd
taken earlier. I'd rather keep this if we can, unless you can propose
a different approach that would continue to work in the same way.

> 3. +$node->init(extra => ['-k'], allows_streaming => 1);
> Can we use --data-checksums instead of -k for more readability?
> Perhaps, a comment on why we need that option helps greatly.

Yeah, can spell out; don't recall exactly why we needed it offhand,
but will confirm or remove if insignificant.

> 4.
> +page = (Page) buf.data;
> +
> +if (!XLogRecHasBlockRef(record, block_id))
> +continue;
> +
> +if (!XLogRecHasBlockImage(record, block_id))
> +continue;
> +
> +if (!RestoreBlockImage(record, block_id, page))
> +continue;
> Can you shift  page = (Page) buf.data; just before the last if
> condition RestoreBlockImage() so that it doesn't get executed for the
> other two continue statements?

Sure; since it was just setting a pointer value I didn't consider it
to be a hotspot for optimization.

Best,

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-15 Thread David Christensen
On Thu, Dec 15, 2022 at 12:36 AM Michael Paquier  wrote:
>
> On Wed, Dec 14, 2022 at 04:44:34PM -0600, David Christensen wrote:
> > I can get one sent in tomorrow.

This v10 should incorporate your feedback as well as Bharath's.

> -XLogRecordHasFPW(XLogReaderState *record)
> +XLogRecordHasFPI(XLogReaderState *record)
> This still refers to a FPW, so let's leave that out as well as any
> renamings of this kind..

Reverted those changes.

> +   if (config.save_fpi_path != NULL)
> +   {
> +   /* Create the dir if it doesn't exist */
> +   if (pg_mkdir_p(config.save_fpi_path, pg_dir_create_mode) < 0)
> +   {
> +   pg_log_error("could not create output directory \"%s\": %m",
> +config.save_fpi_path);
> +   goto bad_argument;
> +   }
> +   }
> It seems to me that you could allow things to work even if the
> directory exists and is empty.  See for example
> verify_dir_is_empty_or_create() in pg_basebackup.c.

The `pg_mkdir_p()` supports an existing directory (and I don't think
we want to require it to be empty first), so this only errors when it
can't create a directory for some reason.

> +my $file_re =
> +  
> qr/^([0-9A-F]{8})-([0-9A-F]{8})[.][0-9]+[.][0-9]+[.][0-9]+[.][0-9]+(?:_vm|_init|_fsm|_main)?$/;
> This is artistic to parse for people not used to regexps (I do, a
> little).  Perhaps this could use a small comment with an example of
> name or a reference describing this format?

Added a description of what this is looking for.

> +# Set umask so test directories and files are created with default 
> permissions
> +umask(0077);
> Incorrect copy-paste coming from elsewhere like the TAP tests of
> pg_basebackup with group permissions?  Doesn't
> PostgreSQL::Test::Utils::tempdir give already enough protection in
> terms of umask() and permissions?

I'd expect that's where that came from. Removed.

> +   if (config.save_fpi_path != NULL)
> +   {
> +   /* We fsync our output directory only; since these files are not part
> +* of the production database we do not require the performance hit
> +* that fsyncing every FPI would entail, so are doing this as a
> +* compromise. */
> +
> +   fsync_fname(config.save_fpi_path, true);
> +   }
> Why is it necessary to flush anything at all, then?

I personally don't think it is, but added it per Bharath's request.
Removed in this revision.

> +my $relation = $node->safe_psql('postgres',
> +   q{SELECT format('%s/%s/%s', CASE WHEN reltablespace = 0 THEN
> dattablespace ELSE reltablespace END, pg_database.oid,
> pg_relation_filenode(pg_class.oid)) FROM pg_class, pg_database WHERE
> relname = 'test_table' AND datname = current_database()}
> Could you rewrite that to be on multiple lines?

Sure, reformated.

> +diag "using walfile: $walfile";
> You should avoid the use of "diag", as this would cause extra output
> noise with a simple make check.

Had been using it for debugging and didn't realize it'd cause issues.
Removed both instances.

> +$node->safe_psql('postgres', "SELECT 
> pg_drop_replication_slot('regress_pg_waldump_slot')")
> That's not really necessary, the nodes are wiped out at the end of the
> test.

Removed.

> +$node->safe_psql('postgres', <<EOF
> +SELECT 'init' FROM pg_create_physical_replication_slot('regress_pg_waldump_slot', true, false);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT; -- required to force FPI for next writes
> +UPDATE test_table SET a = a + 1;
> Using an EOF to execute a multi-line query would be a first.  Couldn't
> you use the same thing as anywhere else?  009_twophase.pl just to
> mention one.  (Mentioned by Bharath upthread, where he asked for an
> extra opinion so here it is.)

Fair enough; while it's idiomatic Perl to me, it's not a hill to die on.
Converted to a standard multiline string.

Best,

David


v10-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-st.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-14 Thread David Christensen
Hi Bharath,

I can get one sent in tomorrow.

Thanks,

David




Re: Moving forward with TDE

2022-11-17 Thread David Christensen
Hi Dilip,

Thanks for the feedback here. I will review the docs changes and add to my tree.

Best,

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-17 Thread David Christensen
On Wed, Nov 16, 2022 at 3:30 AM Bharath Rupireddy
 wrote:
> 1.
> -if (config.filter_by_fpw && !XLogRecordHasFPW(xlogreader_state))
> +if (config.filter_by_fpw && !XLogRecordHasFPI(xlogreader_state))
> These changes are not related to this feature, hence renaming those
> variables/function names must be dealt with separately. If required,
> proposing another patch can be submitted to change filter_by_fpw to
> filter_by_fpi and XLogRecordHasFPW() to XLogRecordHasFPI().

Not required; can revert the changes unrelated to this specific patch.
(I'd written the original ones for it, so didn't really think anything
of it... :-))

> 2.
> +/* We fsync our output directory only; since these files are not part
> + * of the production database we do not require the performance hit
> + * that fsyncing every FPI would entail, so are doing this as a
> + * compromise. */
> The commenting style doesn't match the standard that we follow
> elsewhere in postgres, please refer to other multi-line comments.

Will fix.

> 3.
> +fsync_fname(config.save_fpi_path, true);
> +}
> It looks like fsync_fname()/fsync() in general isn't recursive, in the
> sense that it doesn't fsync the files under the directory, but the
> directory only. So, the idea of directory fsync doesn't seem worth it.
> We either 1) get rid of fsync entirely or 2) fsync all the files after
> they are created and the directory at the end or 3) do option (2) with
> --no-sync option similar to its friends. Since option (2) is a no go,
> we can either choose option (1) or option (3). My vote at this point
> is for option (1).

Agree to remove.

> 4.
> +($walfile_name, $blocksize) = split '\|' =>
> $node->safe_psql('postgres',"SELECT pg_walfile_name(pg_switch_wal()),
> current_setting('block_size')");
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> I think there's something wrong with this, no? pg_switch_wal() can, at
> times, return end+1 of the prior segment (see below snippet) and I'm
> not sure if such a case can happen here.
>
>  * The return value is either the end+1 address of the switch record,
>  * or the end+1 address of the prior segment if we did not need to
>  * write a switch record because we are already at segment start.
>  */
> XLogRecPtr
> RequestXLogSwitch(bool mark_unimportant)

I think this approach is pretty common for getting the walfile name, no?
While there might be an edge case here, since the rest of the test is in
a controlled environment I'm inclined to just not worry about it; this
would require the preceding changes to exactly fill a WAL segment,
which strikes me as extremely unlikely to the point of impossible in
this specific scenario.

> 5.
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> +ok(-f $walfile, "Got a WAL file");
> Is this checking if the WAL file is present or not in PGDATA/pg_wal?
> If yes, I think this isn't required as pg_switch_wal() ensures that
> the WAL is written and flushed to disk.

You are correct; probably another artifact of the earlier version.
That said, I'm not sure I see the harm in keeping it as a sanity check.

> 6.
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> Isn't "pgdata" hardcoded here? I think you might need to do the following:
> $node->data_dir . '/pg_wal/' . $walfile_name;;

Can fix.

> 7.
> +# save filename for later verification
> +$files{$file}++;
>
> +# validate that we ended up with some FPIs saved
> +ok(keys %files > 0, 'verify we processed some files');
> Why do we need to store filenames in an array when we later just check
> the size of the array? Can't we use a boolean (file_found) or an int
> variable (file_count) to verify that we found the file.

Another artifact; we had been comparing the output files between two
separate lists of arbitrary numbers of pages being written out, and
verifying that the raw/fixup versions had the same lists.

> 8.
> +$node->safe_psql('postgres', <<EOF
> +SELECT 'init' FROM pg_create_physical_replication_slot('regress_pg_waldump_slot', true, false);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT; -- required to force FPI for next writes
> +UPDATE test_table SET a = a + 1;
> +EOF
> The EOF with append_conf() is being used in 4 files and elsewhere in
> the TAP test files (more than 100?) qq[] or quotes is being used. I
> have no strong opinion here, I'll leave it to the other reviewers or
> committer.

I'm inclined to leave it just for (personal) readability, but can
change if there's a strong consensus against.

> > > 11.
> > > +# verify filename formats matches w/--save-fpi
> > > +for my $fullpath (glob "$tmp_folder/raw/*")
> > > Do we need to look for the exact match of the file that gets created
> > > in the save-fpi path? While checking for this is great, it makes the
> > > test code non-portable (may not work on Windows or other platforms,
> > > no?) and complex? This way, you can get rid of 

Re: Moving forward with TDE

2022-11-17 Thread David Christensen
Hi Jacob,

Thanks, I've added this patch in my tree [1].  (For now, just adding
fixes and the like atop the original separate patches, but I will
eventually get things winnowed down into probably the same 12 parts
the originals were reviewed in.)

Best,

David

[1] https://github.com/pgguru/postgres/tree/tde




ScanSourceDatabasePgClass

2022-11-16 Thread David Christensen
Hi Robert,

In 9c08aea6a you introduce the block-by-block strategy for creating a
copy of the database.  In the main loop, this utilizes this call:

buf = ReadBufferWithoutRelcache(rlocator, MAIN_FORKNUM, blkno,
RBM_NORMAL, bstrategy, false);

Here, the last parameter is "false" for the permanence factor of this
relation.  Since we know that pg_class is in fact a permanent
relation, this ends up causing issues for the TDE patches that I am
working on updating, due to using the opposite value when calculating
the page's IV and thus failing decryption when trying to create a
database based on template0.

Is there a reason why this needs to be "false" here?  I recognize that
this routine is accessing the table outside of a formal connection, so
there might be more subtle effects that I am not aware of.  If so, this
should be documented.  If it's an oversight, I think we should change it
to "true" to match the actual permanence state of the relation.

I did test changing it to true and didn't notice any adverse effects
in `make installcheck-world`, but let me know if there is more to this
story than meets the eye.

(I did review the original discussion at
https://www.postgresql.org/message-id/flat/CA%2BTgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T%3DCYCvRyFHywSBUQ%40mail.gmail.com
and did not notice any discussion of this specific parameter choice.)

Thanks,

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-15 Thread David Christensen
Enclosed is v9.

- code style consistency (FPI instead of FPW) internally.
- cleanup of no-longer needed checksum-related pieces from code and tests.
- test cleanup/simplification.
- other comment cleanup.

Passes all CI checks.

Best,

David


v9-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: Moving forward with TDE

2022-11-15 Thread David Christensen


> On Nov 15, 2022, at 1:08 PM, Jacob Champion  wrote:
> 
> On Mon, Oct 24, 2022 at 9:29 AM David Christensen
>  wrote:
>> I would love to open a discussion about how to move forward and get
>> some of these features built out.  The historical threads here are
>> quite long and complicated; is there a "current state" other than the
>> wiki that reflects the general thinking on this feature?  Any major
>> developments in direction that would not be reflected in the code from
>> May 2021?
> 
> I don't think the patchset here has incorporated the results of the
> discussion [1] that happened at the end of 2021. For example, it looks
> like AES-CTR is still in use for the pages, which I thought was
> already determined to be insufficient.

Good to know about the next steps, thanks. 

> The following next steps were proposed in that thread:
> 
>> 1. modify temporary file I/O to use a more centralized API
>> 2. modify the existing cluster file encryption patch to use XTS with a
>>   IV that uses more than the LSN
>> 3. add XTS regression test code like CTR
>> 4. create WAL encryption code using CTR
> 
> Does this patchset need review before those steps are taken (or was
> there additional conversation/work that I missed)?

This was just a refresh of the old patches on the wiki to work as written on
HEAD. If there are known TODOs here, then that work still needs to be
done.

I was going to take 2) and Stephen was going to work on 3); I am not sure about 
the other two but will review the thread you pointed to. Thanks for pointing 
that out. 

David





Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-15 Thread David Christensen
On Tue, Nov 15, 2022 at 4:41 AM Bharath Rupireddy
 wrote:
>
> On Tue, Nov 15, 2022 at 1:29 AM David Christensen
>  wrote:
> >
> > Enclosed is v8, which uses the replication slot method to retain WAL
> > as well as fsync'ing the output directory when everything is done.
>
> Thanks. It mostly is in good shape. However, few more comments:
>
> 1.
> +if it does not exist.  The images saved will be subject to the same
> +filtering and limiting criteria as display records, but in this
> +mode pg_waldump will not output any other
> +information.
> May I know what's the intention of the statement 'The images saved
> '? If it's not necessary and convey anything useful to the user,
> can we remove it?

Basically, I mean that if you're limiting to a specific relation or rmgr
type, etc., it only saves those FPIs.  (So filtering is applied first,
before considering whether to save the FPI or not.)

> 2.
> +#include "storage/checksum.h"
> +#include "storage/checksum_impl.h"
> I think we don't need the above includes as we got rid of verifying
> page checksums. The patch compiles without them for me.

Good catch.

> 3.
> +char   *save_fpw_path;
> Can we rename the above variable to save_fpi_path, just to be in sync
> with what we expose to the user, the option name 'save-fpi'?

Sure.

> 4.
> +if (config.save_fpw_path != NULL)
> +{
> +/* Fsync our output directory */
> +fsync_fname(config.save_fpw_path, true);
> +}
> I suggest adding a comment there as to why we aren't fsyncing every
> file that gets created, but only the directory once at the end. That'd
> help clarify doubts that other members might get while looking at the
> code.

Can do.

> 5.
> +if (config.save_fpw_path != NULL)
> +{
> +    /* Fsync our output directory */
> +    fsync_fname(config.save_fpw_path, true);
> +}
> So, are we sure that we don't want to fsync for the time_to_stop exit(0)
> cases, say when CTRL+C'ed? It looks like we handle time_to_stop safely,
> meaning we exit with return code 0; shouldn't we fsync the directory?

We can. Like I've said before, since these aren't production parts of
the cluster I don't personally have much of an opinion on whether fsync()
is appropriate at all, so I don't have strong feelings here.

> 6.
> +else if (config.save_fpw_path)
> Let's use the same convention to check non-NULLness,
> config.save_fpw_path != NULL.

Good catch.

> 7.
> +CHECKPOINT;
> +SELECT pg_switch_wal();
> +UPDATE test_table SET a = a + 1;
> +SELECT pg_switch_wal();
> I don't think switching WAL after checkpoint is necessary here,
> because the checkpoint ensures all the WAL gets flushed to disk.
> Please remove it.

The point is to ensure we have a clean WAL segment that we know will
contain the relation we are filtering by.  Will test if this still
holds without the extra pg_switch_wal(), but that's the rationale.

> PS: I've seen the following code:
> +my $walfile = [sort { $a <=> $b } glob("$waldir/00*")]->[1]; # we
> want the second WAL file, which will be a complete WAL file with
> full-page writes for our specific relation.

I don't understand the question.

> 8.
> +$node->safe_psql('postgres', <<EOF
> +EOF
> Why is EOF used here? Can't we do something like below to execute
> multiple statements?
> $node->safe_psql(
> 'postgres', qq[
> SELECT data FROM pg_logical_slot_get_changes('regression_slot1', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> SELECT data FROM pg_logical_slot_get_changes('regression_slot3', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> SELECT data FROM pg_logical_slot_get_changes('regression_slot4', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> ]);
>
> Same here:
> +$node->safe_psql('postgres', <<EOQ
> +SELECT pg_drop_replication_slot('regress_pg_waldump_slot');
> +EOQ

As a long-time Perl programmer, heredocs seem more natural and easier
to read than a string that accomplishes the same function.  If
there is an established project style I'll stick with it, but it just
rolled out that way. :-)

> 9.
> +my $walfile = [sort { $a <=> $b } glob("$waldir/00*")]->[1]; # we
> want the second WAL file, which will be a complete WAL file with
> full-page writes for our specific relation.
> Is it guaranteed that just looking at the second WAL file in the
> pg_wal directory assures a WAL file with FPIs? I think we have to save
> the WAL file that contains FPIs, that is the file after, CHECK

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-14 Thread David Christensen
Enclosed is v8, which uses the replication slot method to retain WAL
as well as fsync'ing the output directory when everything is done.


v8-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-14 Thread David Christensen
On Fri, Nov 11, 2022 at 4:57 AM Bharath Rupireddy
 wrote:
> > Well if it's not the same output then I guess you're right and there's
> > not a use for the `--fixup` mode.  By the same token, I'd say
> > calculating/setting the checksum also wouldn't need to be done, we
> > should just include the page as included in the WAL stream.
>
> Let's hear from others, we may be missing something here. I recommend
> keeping the --fixup patch as 0002, in case if we decide to discard
> it's easier, however I'll leave that to you.

I've whacked it in `git` for now; I can resurrect it if people consider it useful.

> > > > > 5.
> > > > > +if (!RestoreBlockImage(record, block_id, page))
> > > > > +    continue;
> > > > > +
> > > > > +/* we have our extracted FPI, let's save it now */
> > > > > After extracting the page from the WAL record, do we need to perform a
> > > > > checksum on it?
> > >
> > > I think you just need to do the following, this will ensure the
> > > authenticity of the page that pg_waldump returns.
> > > if (((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page,
> > > blk))
> > > {
> > > pg_fatal("page checksum failed");
> > > }
> >
> > The WAL already has a checksum, so not certain this makes sense on its
> > own. Also I'm inclined to make it a warning if it doesn't match rather
> > than a fatal. (I'd also have to verify that the checksum is properly
> > set on the page prior to copying the FPI into WAL, which I'm pretty
> > sure it is but not certain.)
>
> How about having it as an Assert()?

Based on empirical testing, the checksums don't match, so
asserting/alerting on each block extracted seems next to useless, so
going to just remove that.

> > > 5.
> > > +fsync(fileno(OPF));
> > > +fclose(OPF);
> > > I think just the fsync() isn't enough, you still need fsync_fname()
> > > and/or fsync_parent_path(), perhaps after for (block_id = 0; block_id
> > > <= XLogRecMaxBlockId(record); block_id++) loop.
> >
> > I'm not sure I get the value of the fsyncs here; if you are using this
> > tool at this capacity you're by definition doing some sort of
> > transient investigative steps.  Since the WAL was fsync'd, you could
> > always rerun/recreate as needed in the unlikely event of an OS crash
> > in the middle of this investigation.  Since this is outside the
> > purview of the database operations proper (unlike, say, initdb) seems
> > like it's unnecessary (or definitely shouldn't need to be selectable).
> > My thoughts are that if we're going to fsync, just do the fsyncs
> > unconditionally rather than complicate the interface further.
>
> -1 for fsync() per file created as it can create a lot of sync load on
> production servers impacting performance. How about just syncing the
> directory at the end assuming it doesn't cost as much as fsync() per
> FPI file created would?

I can fsync the dir if that's a useful compromise.

> > > 4.
> > > +$primary->append_conf('postgresql.conf', "wal_level='replica'");
> > > +$primary->append_conf('postgresql.conf', 'archive_mode=on');
> > > +$primary->append_conf('postgresql.conf', "archive_command='/bin/false'");
> > > Why do you need to set wal_level to replica, out of the box your
> > > cluster comes with replica only no?
> > > And why do you need archive_mode on and set the command to do nothing?
> > > Why archiving is needed for testing your feature firstly?
> >
> > I think it had shown "minimal" in my testing; I was purposefully
> > failing archives so the WAL would stick around.  Maybe a custom
> > archive command that just copied a single WAL file into a known
> > location so I could use that instead of the current approach would
> > work, though not sure how Windows support would work with that.  Open
> > to other ideas to more cleanly get a single WAL file that isn't the
> > last one.  (Earlier versions of this test were using /all/ of the
> > generated WAL files rather than a single one, so maybe I am
> > overcomplicating things for a single WAL file case.)
>
> Typically we create a physical replication slot at the beginning so
> that the server keeps the WAL required for you in pg_wal itself, for
> instance, please see pg_walinspect:
>
> -- Make sure checkpoints don't interfere with the test.
> SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_walinspect_slot',
> true, false);

Will see if I can get something like this to work; I'm currently
stopping the server before running the file-based tests, but I suppose
there's no reason to do so, so a temporary slot that holds it around
until the test is complete is probably fine.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-11 Thread David Christensen
On Fri, Nov 11, 2022 at 8:15 AM Justin Pryzby  wrote:
>
> On Wed, Nov 09, 2022 at 02:37:29PM -0600, David Christensen wrote:
> > On Wed, Nov 9, 2022 at 2:08 PM David Christensen 
> >  wrote:
> > > Justin sez:
> > > > I was wondering if there's any reason to do "CREATE DATABASE".  The vast
> > > > majority of TAP tests don't.
> > > >
> > > > $ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head
> > > >1435 safe_psql('postgres',
> > > > 335 safe_psql(
> > > >  23 safe_psql($connect_db,
> > >
> > > If there was a reason, I don't recall offhand; I will test removing it
> > > and if things still work will consider it good enough.
> >
> > Things blew up when I did that; rather than hunt it down, I just left it 
> > in. :-)
>
> > +$primary->safe_psql('db1',  <
> It worked for me when I removed the 3 references to db1.
> That's good for efficiency of the test.

I did figure that out later; fixed in git.

> > +my $blocksize = 8192;
>
> I think this should be just "my $blocksize;" rather than setting a value
> which is later overwritten.

Yep.  Fixed in git.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-10 Thread David Christensen
On Thu, Nov 10, 2022 at 2:33 AM Bharath Rupireddy
 wrote:
>
> On Thu, Nov 10, 2022 at 1:31 AM David Christensen
>  wrote:
> >
>
> Thanks for providing the v7 patch, please see my comments and responses below.

Hi Bharath, Thanks for the feedback.

> > > 2. I'm unable to understand the use-case for --fixup-fpi option.
> > > pg_waldump is supposed to be just WAL reader, and must not return any
> > > modified information, with --fixup-fpi option, the patch violates this
> > > principle i.e. it sets page LSN and returns. Without actually
> > > replaying the WAL record on the page, how is it correct to just set
> > > the LSN? How will it be useful? ISTM, we must ignore this option
> > > unless there's a strong use-case.
> >
> > How I was envisioning this was for cases like extreme surgery for
> > corrupted pages, where you extract the page from WAL but it has lsn
> > and checksum set so you could do something like `dd if=fixup-block
> > of=relation ...`, so it *simulates* the replay of said fullpage blocks
> > in cases where for some reason you can't play the intermediate
> > records; since this is always a fullpage block, it's capturing what
> > would be the snapshot so you could manually insert somewhere as needed
> > without needing to replay (say if dealing with an incomplete or
> > corrupted WAL stream).
>
> Recovery sets the page LSN after it has replayed the WAL record on the
> page, right? Recovery does this - base_page/FPI +
> apply_WAL_record_and_then_set_applied_WAL_record's_LSN =
> new_version_of_page. Essentially, in your patch, you are just setting
> the WAL record LSN with the page contents being the base page's. I'm
> still not sure what's the real use-case here. We don't have an
> independent function in postgres, given a base page and a WAL record,
> that just replays the WAL record and outputs the new version of the
> page, so I think what you do in the patch with the fixup option seems
> wrong to me.

Well if it's not the same output then I guess you're right and there's
not a use for the `--fixup` mode.  By the same token, I'd say
calculating/setting the checksum also wouldn't need to be done, we
should just include the page as included in the WAL stream.

> > > 5.
> > > +if (!RestoreBlockImage(record, block_id, page))
> > > +    continue;
> > > +
> > > +/* we have our extracted FPI, let's save it now */
> > > After extracting the page from the WAL record, do we need to perform a
> > > checksum on it?
>
> I think you just need to do the following, this will ensure the
> authenticity of the page that pg_waldump returns.
> if (((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
> {
> pg_fatal("page checksum failed");
> }

The WAL already has a checksum, so not certain this makes sense on its
own. Also I'm inclined to make it a warning if it doesn't match rather
than a fatal. (I'd also have to verify that the checksum is properly
set on the page prior to copying the FPI into WAL, which I'm pretty
sure it is but not certain.)

> > > case 'W':
> > >     config.save_fpw_path = pg_strdup(optarg);
> > > case 'X':
> > >     config.fixup_fpw = true;
> > >     config.save_fpw_path = pg_strdup(optarg);
> >
> > Like separate opt processing with their own `break` statement?
> > Probably a bit more readable/conventional.
>
> Yes.

Moot with the removal of the --fixup mode.

> Some more comments:
>
> 1.
> +PGAlignedBlock  zerobuf;
> Essentially, it's not a zero buffer, please rename the variable to
> something like 'buf' or 'page_buf' or someother?

Sure.

> 2.
> +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Replace pg_pwrite() with fwrite() and avoid the fileno() system calls;
> that should suffice here. AFAICS, we don't need pg_pwrite().

Sure.

> 3.
> +if (config.save_fpw_path != NULL)
> +{
> +    /* Create the dir if it doesn't exist */
> +    if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
> I think you still need pg_check_dir() here, how about something like below?

I was assuming pg_mkdir_p() acted just like mkdir -p, where it's just
an idempotent action, so an existing dir is treated the same.
What's the benefit here? I would assume that if a non-dir file existed
at that path, or other permission issues arose, we'd just get an error
from pg_mkdir_p().  (Will review the code there and confirm.)
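
For reference, something like this pg_check_dir()-based shape may be what
you have in mind (a sketch only; return-value meanings per pg_check_dir()
in src/common/pgcheckdir.c):

int ret = pg_check_dir(config.save_fpw_path);

if (ret == 0)
{
    /* does not exist: create it (and any parents) */
    if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
        pg_fatal("could not create directory \"%s\": %m",
                 config.save_fpw_path);
}
else if (ret == -1)
    pg_fatal("could not access directory \"%s\": %m",
             config.save_fpw_path);
/* otherwise it exists (empty or not) and we accept it as-is */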

> 4.
> +/* Create the dir if it doesn't exist */
> +if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
> +{
> +pg_log_error("could not creat

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread David Christensen
On Wed, Nov 9, 2022 at 2:08 PM David Christensen
 wrote:
> Justin sez:
> > I was wondering if there's any reason to do "CREATE DATABASE".  The vast
> > majority of TAP tests don't.
> >
> > $ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head
> >1435 safe_psql('postgres',
> > 335 safe_psql(
> >  23 safe_psql($connect_db,
>
> If there was a reason, I don't recall offhand; I will test removing it
> and if things still work will consider it good enough.

Things blew up when I did that; rather than hunt it down, I just left it in. :-)

Enclosed is v7, with the changes suggested thus far.


v7-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread David Christensen
> > 6.
> > +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> > Use pg_dir_create_mode instead of hard-coded 0700?
>
> I think I thought of that when I first looked at the patch ...  but, I'm
> not sure, since it says:
>
> src/include/common/file_perm.h-/* Modes for creating directories and files IN 
> THE DATA DIRECTORY */
> src/include/common/file_perm.h:extern PGDLLIMPORT int pg_dir_create_mode;

Looks like it's pretty evenly split in src/bin:

$ git grep -o -E -w '(pg_mkdir_p|mkdir)' '**.c' | sort | uniq -c
  3 initdb/initdb.c:mkdir
  3 initdb/initdb.c:pg_mkdir_p
  1 pg_basebackup/bbstreamer_file.c:mkdir
  2 pg_basebackup/pg_basebackup.c:pg_mkdir_p
  1 pg_dump/pg_backup_directory.c:mkdir
  1 pg_rewind/file_ops.c:mkdir
  4 pg_upgrade/pg_upgrade.c:mkdir

So if that is the preferred approach I'll go ahead and use it.

> I was wondering if there's any reason to do "CREATE DATABASE".  The vast
> majority of TAP tests don't.
>
> $ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head
>1435 safe_psql('postgres',
> 335 safe_psql(
>  23 safe_psql($connect_db,

If there was a reason, I don't recall offhand; I will test removing it
and if things still work will consider it good enough.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread David Christensen
On Wed, Nov 9, 2022 at 6:30 AM Bharath Rupireddy
 wrote:
>
> On Wed, Nov 9, 2022 at 5:08 AM David Christensen
>  wrote:
> >
> > Enclosed is v6, which squashes your refactor and adds the additional
> > recent suggestions.
>
> Thanks for working on this feature. Here are some comments for now. I
> haven't looked at the tests, I will take another look at the code and
> tests after these and all other comments are addressed.
>
> 1. For ease of review, please split the test patch to 0002.

Per later discussion, it seems like new feature tests are fine in the same
patch, yes?

> 2. I'm unable to understand the use-case for --fixup-fpi option.
> pg_waldump is supposed to be just WAL reader, and must not return any
> modified information, with --fixup-fpi option, the patch violates this
> principle i.e. it sets page LSN and returns. Without actually
> replaying the WAL record on the page, how is it correct to just set
> the LSN? How will it be useful? ISTM, we must ignore this option
> unless there's a strong use-case.

How I was envisioning this was for cases like extreme surgery for
corrupted pages, where you extract the page from WAL but it has lsn
and checksum set so you could do something like `dd if=fixup-block
of=relation ...`, so it *simulates* the replay of said fullpage blocks
in cases where for some reason you can't play the intermediate
records; since this is always a fullpage block, it's capturing what
would be the snapshot so you could manually insert somewhere as needed
without needing to replay (say if dealing with an incomplete or
corrupted WAL stream).
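
To make that concrete, here is a purely illustrative sketch of what that
dd invocation amounts to (the paths and block number are made up, and it
assumes the default 8kB BLCKSZ):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 8192                /* assumes default BLCKSZ */

int
main(void)
{
    char        page[BLOCK_SIZE];
    unsigned    blkno = 0;                                 /* target block */
    int         in = open("fixup-block", O_RDONLY);        /* saved FPI */
    int         out = open("base/16384/16385", O_WRONLY);  /* relation file */

    if (in < 0 || out < 0 ||
        read(in, page, BLOCK_SIZE) != BLOCK_SIZE ||
        pwrite(out, page, BLOCK_SIZE, (off_t) blkno * BLOCK_SIZE) != BLOCK_SIZE)
    {
        perror("block overwrite failed");
        return EXIT_FAILURE;
    }
    close(in);
    close(out);
    return EXIT_SUCCESS;
}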

> 3.
> +if (fork >= 0 && fork <= MAX_FORKNUM)
> +{
> +    if (fork)
> +        sprintf(forkname, "_%s", forkNames[fork]);
> +    else
> +        forkname[0] = 0;
> +}
> +else
> +    pg_fatal("found invalid fork number: %u", fork);
>
> Isn't the above complex? What's the problem with something like below?
> Why do we need if (fork) - else block?
>
> if (fork >= 0  && fork <= MAX_FORKNUM)
>     sprintf(forkname, "_%s", forkNames[fork]);
> else
>     pg_fatal("found invalid fork number: %u", fork);

This was to suppress any suffix for the main fork, but yes, we could
simplify and include the `_` in the suffix name.  Will include such a
change.
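
Roughly like this, with a hypothetical local suffix table (sketch):

/* fold the '_' into the suffix so no special case is needed;
 * fork_suffixes is a made-up local table, indexed by ForkNumber */
static const char *const fork_suffixes[] = {"", "_fsm", "_vm", "_init"};

if (fork >= 0 && fork <= MAX_FORKNUM)
    snprintf(forkname, sizeof(forkname), "%s", fork_suffixes[fork]);
else
    pg_fatal("found invalid fork number: %u", fork);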

> 3.
> +char        page[BLCKSZ] = {0};
> I think when writing to a file, we need PGAlignedBlock rather than a
> simple char array of bytes, see the description around PGAlignedBlock
> for why it is so.

Easy enough change, and makes sense.

> 4.
> +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Why pg_pwrite(), why not just fwrite()? If fwrite() is used, you can
> avoid fileno() system calls, no? Do you need the file position to
> remain the same after writing, hence pg_pwrite()?

I don't recall the original motivation, TBH.

> 5.
> +if (!RestoreBlockImage(record, block_id, page))
> +    continue;
> +
> +/* we have our extracted FPI, let's save it now */
> After extracting the page from the WAL record, do we need to perform a
> checksum on it?

That is there in fixup mode (or should be).  Are you thinking this
should also be set if not in fixup mode?  That defeats the purpose of
the raw page extract, which is to see *exactly* what the WAL stream
has.

> 6.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Use pg_dir_create_mode instead of hard-coded 0700?

Sure.

> 7.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> +fsync(fileno(OPF));
> +fclose(OPF);
> Since you're creating the directory in case it's not available, you
> need to fsync the directory too?

Sure.

> 8.
> +case 'W':
> +case 'X':
> +    config.fixup_fpw = (option == 'X');
> +    config.save_fpw_path = pg_strdup(optarg);
> +    break;
> Just set config.fixup_fpw = false before the switch block starts,
> like the other variables, and then perhaps something like below is
> more readable:
> case 'W':
>     config.save_fpw_path = pg_strdup(optarg);
> case 'X':
>     config.fixup_fpw = true;
>     config.save_fpw_path = pg_strdup(optarg);

Like separate opt processing with their own `break` statement?
Probably a bit more readable/conventional.

> 9.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Should we use pg_mkdir_p() instead of mkdir()?

Sure.

Thanks,

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-08 Thread David Christensen
Enclosed is v6, which squashes your refactor and adds the additional
recent suggestions.

Thanks!


v6-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-08 Thread David Christensen
On Tue, Nov 8, 2022 at 4:45 PM Justin Pryzby  wrote:
>
> On Mon, Nov 07, 2022 at 05:01:01PM -0600, David Christensen wrote:
> > Hi Justin et al,
> >
> > Enclosed is v5 of this patch which now passes the CirrusCI checks for
> > all supported OSes. I went ahead and reworked the test a bit so it's a
> > little more amenable to the OS-agnostic approach for testing.
>
> Great, thanks.
>
> This includes the changes that I'd started a few months ago.
> Plus adding the test which was missing for meson.

Cool, will review, thanks.

> +format: LSN.TSOID.DBOID.RELNODE.BLKNO
>
> I'd prefer if the abbreviations were "reltablespace" and "datoid"

Sure, no issues there.

> Also, should the test case call pg_relation_filenode() rather than using
> relfilenode directly ?  Is it a problem that the test code assumes
> pagesize=8192 ?

Both good points. Is pagesize just exposed via
`current_setting('block_size')` or is there a different approach?

David




Re: [PATCHES] Post-special page storage TDE support

2022-11-08 Thread David Christensen
Looking into some CF bot failures which didn't show up locally.  Will
send a v3 when resolved.




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-07 Thread David Christensen
Hi Justin et al,

Enclosed is v5 of this patch which now passes the CirrusCI checks for
all supported OSes. I went ahead and reworked the test a bit so it's a
little more amenable to the OS-agnostic approach for testing.

Best,

David


v5-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-04 Thread David Christensen
On Fri, Nov 4, 2022 at 1:38 PM Justin Pryzby  wrote:
> > > As I recall, that's due to relying on "cp".  (And "rsync", which
> > > shouldn't be assumed to exist by regression tests.)

Will poke around other TAP tests to see if there's a more consistent
interface, what Perl version we can assume, available modules, etc.
If there's not some trivial wrapper at this point that all TAP tests
could use regardless of OS, it would definitely be good to introduce
such a method.

> > I will work on supporting Windows compatibility here. Is there some
> > list of guidelines for what you can and can’t use? I don’t have a
> > Windows machine available to develop on.
>
> I think a lot (most?) developers here don't have a windows environment
> available, so now have been using cirrusci's tests to verify.  If you
> haven't used cirrusci directly (not via cfbot) before, start at:
> src/tools/ci/README

Thanks, good starting point.

> > Was it failing on Windows? I was attempting to skip it as I recall.
>
> I don't see anything about skipping, and cirrus's logs from 2
> commitfests ago were pruned.  I looked at this patch earlier this year,
> but never got around to replacing the calls to rsync and cp.

Ah, it's skipped (not fixed) in my git repo, but I never got around to
submitting that version through email.  That explains it.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-04 Thread David Christensen
On Nov 4, 2022, at 9:02 AM, Justin Pryzby  wrote:
> 
> On Fri, Nov 04, 2022 at 11:52:59AM +0900, Ian Lawrence Barwick wrote:
>> 2022年5月3日(火) 8:45 David Christensen :
>>> 
>>> ...and pushing a couple fixups pointed out by cfbot, so here's v4.
>> 
>> cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
>> currently underway, this would be an excellent time to update the patch.
> 
> More important than needing to be rebased, the patch has never passed
> its current tests on windows.
> 
> As I recall, that's due to relying on "cp".  (And "rsync", which
> shouldn't be assumed to exist by regression tests.)

I will work on supporting Windows compatibility here. Is there some list of
guidelines for what you can and can’t use? I don’t have a Windows machine
available to develop on.

Was it failing on Windows? I was attempting to skip it as I recall.

Best,

David





Re: [PATCHES] Post-special page storage TDE support

2022-10-28 Thread David Christensen
Hi Matthias,

> Did you read the related thread with related discussion from last June, "Re: 
> better page-level checksums" [0]? In that I argued that space at the end of a 
> page is already allocated for the AM, and that reserving variable space at 
> the end of the page for non-AM usage is wasting the AM's performance 
> potential.

Yes, I had read parts of that thread among others, but have given it a
re-read.  I can see the point you're making here, and agree that if we
can allocate between pd_special and pd_upper that could make sense.  I
am a little unclear as to what performance impact there would be for
the AM if this additional space were ahead of or behind the page
special area; it seems like if this is something that needs to live on
the page *somewhere* just being aligned correctly would be sufficient
from the AM's standpoint.  Considering that I am trying to make this
have zero storage impact if these features are not active, the impact
on a cluster with no additional features would be moot from a storage
perspective, no?

> Apart from that: Is this variable-sized 'metadata' associated with smgr 
> infrastructure only, or is it also available for AM features? If not, then 
> this is a strong -1. The amount of tasks smgr needs to do on a page is 
> generally much less than the amount of tasks an AM needs to do; so in my view 
> the AM has priority in prime page real estate, not smgr or related 
> infrastructure.

I will confess to a slightly wobbly understanding of the delineation
of responsibility here. I was under the impression that by modifying
any consumer of PageHeaderData this would be sufficient to cover all
AMs for the types of cluster-wide options we'd be concerned about (say
extended checksums, multiple page encryption schemes, or other
per-page information we haven't yet anticipated).  Reading smgr/README
and the various access/*/README has not made the distinction clear to
me yet.

> re: PageFeatures
> I'm not sure I understand the goal, nor the reasoning. Shouldn't this be part 
> of the storage manager (smgr) implementation / can't this be part of the smgr 
> of the relation?

For at least the feature cases I'm anticipating, this would apply to
any disk page that may have user data, set (at least initially) at
initdb time, so should apply to any pages in the cluster, regardless
of AM.

> re: use of pd_checksum
> I mentioned this in the above-mentioned thread too, in [1], that we could use 
> pd_checksum as an extra area marker for this storage-specific data, which 
> would be located between pd_upper and pd_special.

I do think that we could indeed use this as an additional in-page
pointer, but at least for this version I was keeping things
backwards-compatible.  Peter G (I think) also made some good points
about how to include the various status bits on the page somehow in
terms of making a page completely self-contained.

> Re: patch contents
>
> 0001:
> >+ specialSize = MAXALIGN(specialSize) + reserved_page_size;
>
> This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or 
> an assertion that reserved_page_size is MAXALIGNED, would be better.

It is currently aligned via the space calculation return value, but I
agree that folding it into an assert or reworking it explicitly is
clearer.
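
Concretely, either form would do (sketch):

/* fold the addition into the alignment ... */
specialSize = MAXALIGN(specialSize + reserved_page_size);

/* ... or keep the addition but assert the invariant, as you suggest */
Assert(reserved_page_size == MAXALIGN(reserved_page_size));
specialSize = MAXALIGN(specialSize) + reserved_page_size;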

> >  PageValidateSpecialPointer(Page page)
> >  {
> >  Assert(page);
> > -Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > +Assert((((PageHeader) page)->pd_special - reserved_page_size) <=
> > BLCKSZ);
>
> This check is incorrect. With your code it would allow pd_special past the 
> end of the block. If you want to put the reserved_space_size effectively 
> inside the special area, this check should instead be:
>
> +Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));
>
> Or, equally valid
>
> +Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

Yup, I think I inverted my logic there; thanks.

> > + * +-------------------+-----------------+------------+
> > + * | ... tuple2 tuple1 | "special space" | "reserved" |
> > + * +---------------+-----------------+------------+
>
> Could you fix the table display if / when you revise the patchset? It seems 
> to me that the corners don't line up with the column borders.

Sure thing.

> 0002:
> > Make the output of "select_views" test stable
> > Changing the reserved_page_size has resulted in non-stable results for this 
> > test.
>
> This makes sense, what kind of instability are we talking about? Are there 
> different results for runs with the same binary, or is this across 
> compilations?

When running with the same compilation/initdb settings, the test
results are stable, but differ depending on what options you chose, so
`make installcheck` output will fail when testing a cluster built with
different options vs upstream HEAD without these patches, etc.

> 0003 and up were not yet reviewed in depth.

Thanks, I appreciate the feedback so far.




Re: [PATCHES] Post-special page storage TDE support

2022-10-25 Thread David Christensen
> > Explicitly
> > locking (assuming you stay in your lane) should only need to guard
> > against access from other
> > backends of this type if using shared buffers, so will be use-case 
> > dependent.
>
> I'm not sure what you mean here?

I'm mainly pointing out that the specific code that manages this
feature is the only code that has to worry about modifying said page
region.

> > This does have a runtime overhead due to moving some offset
> > calculations from compile time to
> > runtime. It is thought that the utility of this feature will outweigh
> > the costs here.
>
> Have you done some benchmarking to give an idea of how much overhead we're
> talking about?

Not yet, but I am going to work on this.  I suspect the current code
could be improved, but will try to get some sort of measurement of the
additional overhead.
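
To illustrate the flavor of the overhead in question (a hypothetical
sketch, not the patch's actual code): offsets the compiler could
previously fold into constants become arithmetic on a value that is
fixed at initdb time but only known to the binary at runtime:

/* before: BLCKSZ and SizeOfPageHeaderData are compile-time constants,
 * so this whole expression folds to a literal */
Size usable = BLCKSZ - SizeOfPageHeaderData;

/* after (sketch): reserved_page_size is a runtime variable, so each
 * use costs a load plus arithmetic */
extern int reserved_page_size;

static inline Size
usable_page_space(void)
{
    return BLCKSZ - SizeOfPageHeaderData - reserved_page_size;
}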

> > Candidates for page features include 32-bit or 64-bit checksums,
> > encryption tags, or additional
> > per-page metadata.
> >
> > While we are not currently getting rid of the pd_checksum field, this
> > mechanism could be used to
> > free up that 16 bits for some other purpose.
>
> IIUC there's a hard requirement of initdb-time initialization, as there's
> otherwise no guarantee that you will find enough free space in each page at
> runtime.  It seems like a very hard requirement for a full replacement of the
> current checksum approach (even if I agree that the current implementation
> limitations are far from ideal), especially since there's no technical reason
> that would prevent us from dynamically enabling data-checksums without doing
> all the work when the cluster is down.

As implemented, that is correct; we are currently assuming this
specific feature mechanism is set at initdb time only.  Checksums are
not the primary motivation here, but were something that I could use
for an immediate illustration of the feature.  That said, presumably
you could define a way to set the features per-relation (say with a
template field in pg_class) which would propagate to a relation on
rewrite, so there could be ways to handle things incrementally, were
this an overall goal.

Thanks for looking,

David




[PATCH] comment fixes for delayChkptFlags

2022-10-14 Thread David Christensen
Enclosed is a trivial fix for a typo and misnamed field I noted when doing
some code review.

Best,

David


0001-fix-comment-typos-for-delayChkptFlags.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-05-02 Thread David Christensen
...and pushing a couple fixups pointed out by cfbot, so here's v4.


On Mon, May 2, 2022 at 8:42 AM David Christensen
 wrote:
>
> Enclosed is v3 of this patch; this adds two modes for this feature,
> one with the raw page `--save-fullpage/-W` and one with the
> LSN+checksum fixups `--save-fullpage-fixup/-X`.
>
> I've added at least some basic sanity-checking of the underlying
> feature, as well as run the test file and the changes to pg_waldump.c
> through pgindent/perltidy to make them adhere to project standards.
> Threw in a rebase as well.
>
> Would appreciate any additional feedback here.
>
> Best,
>
> David


v4-pg_waldump-save-fpi.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-05-02 Thread David Christensen
Enclosed is v3 of this patch; this adds two modes for this feature,
one with the raw page `--save-fullpage/-W` and one with the
LSN+checksum fixups `--save-fullpage-fixup/-X`.

I've added at least some basic sanity-checking of the underlying
feature, as well as run the test file and the changes to pg_waldump.c
through pgindent/perltidy to make them adhere to project standards.
Threw in a rebase as well.

Would appreciate any additional feedback here.

Best,

David


v3-pg_waldump-save-fpi.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-26 Thread David Christensen
On Mon, Apr 25, 2022 at 9:42 PM Michael Paquier  wrote:
> On Mon, Apr 25, 2022 at 10:24:52AM -0500, David Christensen wrote:
> > On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy
> >  wrote:
> >> Thanks for working on this. I'm just thinking if we can use these FPIs
> >> to repair the corrupted pages? I would like to understand more
> >> detailed usages of the FPIs other than inspecting with pageinspect.
> >
> > My main use case was for being able to look at potential corruption,
> > either in the WAL stream, on heap, or in tools associated with the WAL
> > stream.  I suppose you could use the page images to replace corrupted
> > on-disk pages (and in fact I think I've heard of a tool or two that
> > try to do that), though don't know that I consider this the primary
> > purpose (and having toast tables and the like, as well as clog, would
> > make it potentially hard to just drop in a page version without
> > issues).  Might help in extreme situations though.
>
> You could do a bunch of things with those images, even make things
> worse if you are not careful enough.

True. :-) This does seem like a tool geared towards "expert mode", so
maybe we just assume if you need it you know what you're doing?

> >> 10) Along with pg_pwrite(), can we also fsync the files (of course
> >> users can choose it optionally) so that the writes will be durable for
> >> the OS crashes?
> >
> > Can add; you thinking a separate flag to disable this with default true?
>
> We expect data generated by tools like pg_dump, pg_receivewal
> (depending on the use --synchronous) or pg_basebackup to be consistent
> when we exit from the call.  FWIW, flushing this data does not seem
> like a strong requirement for something aimed at being used page-level
> chirurgy or lookups, because the WAL segments should still be around
> even if the host holding the archives is unplugged.

I have added the fsync to the latest patch (forthcoming), but I have no
strong preferences here as to the correct/expected behavior.

Best,

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-26 Thread David Christensen
On Mon, Apr 25, 2022 at 9:54 PM Michael Paquier  wrote:
> On Mon, Apr 25, 2022 at 10:11:10AM -0500, David Christensen wrote:
> > On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier  wrote:
> >> I don't think that there is any need to rely on a new logic if there
> >> is already some code in place able to do the same work.  See
> >> verify_dir_is_empty_or_create() in pg_basebackup.c, as one example,
> >> that relies on pg_check_dir().  I think that you'd better rely at
> >> least on what pgcheckdir.c offers.
> >
> > Yeah, though I am tending towards what another user had suggested and
> > just accepting an existing directory rather than requiring it to be
> > empty, so thinking I might just skip this one. (Will review the file
> > you've pointed out to see if there is a relevant function though.)
>
> OK.  FWIW, pg_check_dir() is used in initdb and pg_basebackup because
> these care about the behavior to use when specifying a target path.
> You could reuse it, but use a different policy depending on its
> returned result for the needs of what you see fit in this case.

I have a new version of the patch (pending tests) that uses
pg_check_dir's return value to handle things appropriately, so at
least some code reuse now.  It did end up simplifying a lot.

> >> +   PageSetLSN(page, record->ReadRecPtr);
> >> +   /* if checksum field is non-zero then we have 
> >> checksums enabled,
> >> +* so recalculate the checksum with new LSN (yes, this 
> >> is a hack)
> >> +*/
> >> Yeah, that looks like a hack, but putting in place a page on a cluster
> >> that has checksums enabled would be more annoying with
> >> zero_damaged_pages enabled if we don't do that, so that's fine by me
> >> as-is.  Perhaps you should mention that FPWs don't have their
> >> pd_checksum updated when written.
> >
> > Can make a mention; you thinking just a comment in the code is
> > sufficient, or want there to be user-visible docs as well?
>
> I was thinking about a comment, at least.

New patch version has significantly more comments.

> > This was to make the page as extracted equivalent to the effect of
> > applying the WAL record block on replay (the LSN and checksum both);
> > since recovery calls this function to mark the LSN where the page came
> > from, I did this as well.  (I do see perhaps a case for
> > --raw output that doesn't munge the page whatsoever, just made
> > comparisons easier.)
>
> Hm.  Okay.  The argument goes both ways, I guess, depending on what we
> want to do with those raw pages.  Still you should not need pd_lsn if
> the point is to be able to stick the pages back in place to attempt to
> get back as much data as possible when loading it back to the shared
> buffers?

Yeah, I can see that too; I think there's at least enough of an
argument for a flag to apply the fixups or just extract only the raw
page pre-modification.  Not sure which should be the "default"
behavior; either `--raw` or `--fixup-metadata` or something could
work. (Naming suggestions welcomed.)

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread David Christensen
On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy
 wrote:
> Thanks for working on this. I'm just thinking if we can use these FPIs
> to repair the corrupted pages? I would like to understand more
> detailed usages of the FPIs other than inspecting with pageinspect.

My main use case was for being able to look at potential corruption,
either in the WAL stream, on heap, or in tools associated with the WAL
stream.  I suppose you could use the page images to replace corrupted
on-disk pages (and in fact I think I've heard of a tool or two that
try to do that), though don't know that I consider this the primary
purpose (and having toast tables and the like, as well as clog, would
make it potentially hard to just drop in a page version without
issues).  Might help in extreme situations though.

> Given that others have realistic use-cases (of course I would like to
> know more about those), +1 for the idea. However, I would suggest
> adding a function to extract raw FPI data to the pg_walinspect
> extension that got recently committed in PG 15, the out of which can
> directly be fed to pageinspect functions or

Yeah, makes sense to have some overlap here; will review what is there
and see if there is some shared code base we can utilize.  (ISTR some
work towards getting these two tools using more of the same code, and
this seems like another such instance.)

> Few comments:
> 1) I think it's good to mention the stored file name format.
> + printf(_("  -W, --raw-fpi=path save found full page images to
> given path\n"));

+1, though I've also thought there could be uses to have multiple
possible output formats here (most immediately, there may be cases
where we want *each* FPI for a block vs the *latest*, so file names
with/without the LSN component seem the easiest way forward here).
That would introduce some additional complexity though, so might need
to see if others think that makes any sense.

> 2)
> + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> + {
> + /* we will now extract the fullpage image from the XLogRecord and save
> + * it to a calculated filename */
> +
> + if (XLogRecHasBlockImage(record, block_id))
>
> I think we need XLogRecHasBlockRef to be true to check
> XLogRecHasBlockImage otherwise, we will see some build farms failing,
> recently I've seen this failure for pg_walinspect..
>
> for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
>     if (!XLogRecHasBlockRef(record, block_id))
>         continue;
>
>     if (XLogRecHasBlockImage(record, block_id))
>         *fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
> }

Good point; my previous patch that got committed here (127aea2a65)
probably also needed this treatment.

> 3) Please correct the commenting format:
> + /* we will now extract the fullpage image from the XLogRecord and save
> + * it to a calculated filename */

Ack.

> 4) Usually we start errors with lower case letters "could not ."
> + pg_fatal("Couldn't open file for output: %s", filename);
> + pg_fatal("Couldn't write out complete FPI to file: %s", filename);
> And the variable name too:
> + FILE *OPF;

Ack.

> 5) Not sure how the FPIs of TOASTed tables get stored, but it would be
> good to check.

What would be different here? Are there issues you can think of, or
just more from the pageinspect side of things?

> 6) Good to specify the known usages of FPIs in the documentation.

Ack. Prob good to get additional info/use cases from others, as mine
is fairly short. :-)

> 7) Isn't it good to emit an error if RestoreBlockImage returns false?
> + if (RestoreBlockImage(record, block_id, page))
> + {

Ack.

> 8) I think I don't mind if a non-empty directory is specified - IMO
> better usability is this - if the directory is non-empty, just go add
> the FPI files if FPI file exists just replace it, if the directory
> isn't existing, create and write the FPI files.
> + /* we accept an empty existing directory */
> + if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode))
> + {

Agreed; was mainly trying to prevent accidental expansion inside
`pg_wal` when an earlier version of the patch implied `.` as the
current dir with an optional path, but I've since made the path
non-optional and agree that this is unnecessarily restrictive.

>  9) Instead of following:
> + if (XLogRecordHasFPW(xlogreader_state))
> + XLogRecordSaveFPWs(xlogreader_state, config.save_fpw_path);
> I will just do this in XLogRecordSaveFPWs:
> for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
>     if (!XLogRecHasBlockRef(record, block_id))
>         continue;
>
>     if (XLogRecHasBlockImage(record, block_id))
>     {
>
>     }
> }

Yeah, a little redundant.

> 10) Along with pg_pwrite(), can we also fsync the files (of course
> users can choose it optionally) so that the writes will be durable for
> the OS crashes?

Can add; you thinking a separate flag to disable this 

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread David Christensen
On Mon, Apr 25, 2022 at 2:00 AM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 4/25/22 8:11 AM, Michael Paquier wrote:
> > On Sat, Apr 23, 2022 at 01:43:36PM -0500, David Christensen wrote:
> >> Hi Matthias, great point.  Enclosed is a revised version of the patch
> >> that adds the fork identifier to the end if it's a non-main fork.
> > Like Alvaro, I have seen cases where this would have been really
> > handy.  So +1 from me, as well, to have more tooling like what you are
> > proposing.
>
> +1 on the idea.
>
> FWIW, there is an extension doing this [1] but having the feature
> included in pg_waldump would be great.

Cool, glad to see there is some interest; definitely some overlap in
forensics inside and outside the database both, as there are different
use cases for both.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread David Christensen
On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier  wrote:
>
> On Sat, Apr 23, 2022 at 01:43:36PM -0500, David Christensen wrote:
> > Hi Matthias, great point.  Enclosed is a revised version of the patch
> > that adds the fork identifier to the end if it's a non-main fork.
>
> Like Alvaro, I have seen cases where this would have been really
> handy.  So +1 from me, as well, to have more tooling like what you are
> proposing.  Fine for me to use one file for each block with a name
> like what you are suggesting for each one of them.
>
> +   /* we accept an empty existing directory */
> +   if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode))
> +   {
> I don't think that there is any need to rely on a new logic if there
> is already some code in place able to do the same work.  See
> verify_dir_is_empty_or_create() in pg_basebackup.c, as one example,
> that relies on pg_check_dir().  I think that you'd better rely at
> least on what pgcheckdir.c offers.

Yeah, though I am tending towards what another user had suggested and
just accepting an existing directory rather than requiring it to be
empty, so thinking I might just skip this one. (Will review the file
you've pointed out to see if there is a relevant function though.)

> +   {"raw-fpi", required_argument, NULL, 'W'},
> I think that we'd better rename this option.  "fpi", that is not used
> much in the user-facing docs, is additionally not adapted when we have
> an other option called -w/--fullpage.  I can think of
> --save-fullpage.

Makes sense.

> +   PageSetLSN(page, record->ReadRecPtr);
> +   /* if checksum field is non-zero then we have checksums 
> enabled,
> +* so recalculate the checksum with new LSN (yes, this is 
> a hack)
> +*/
> Yeah, that looks like a hack, but putting in place a page on a cluster
> that has checksums enabled would be more annoying with
> zero_damaged_pages enabled if we don't do that, so that's fine by me
> as-is.  Perhaps you should mention that FPWs don't have their
> pd_checksum updated when written.

Can make a mention; you thinking just a comment in the code is
sufficient, or want there to be user-visible docs as well?

> +   /* we will now extract the fullpage image from the XLogRecord and save
> +* it to a calculated filename */
> The format of this comment is incorrect.
>
> +The LSN of the record with this block, formatted
> +as %08x-%08X instead of the
> +conventional %X/%X due to filesystem naming
> +limits
> The last part of the sentence about %X/%X could just be removed.  That
> could be confusing, at worse.

Makes sense.

> +   PageSetLSN(page, record->ReadRecPtr);
> Why is pd_lsn set?

This was to make the page as extracted equivalent to the effect of
applying the WAL record block on replay (the LSN and checksum both);
since recovery calls this function to mark the LSN where the page came
from, I did this as well.  (I do see perhaps a case for
--raw output that doesn't munge the page whatsoever, just made
comparisons easier.)

> git diff --check complains a bit.

Can look into this.

> This stuff should include some tests.  With --end, the tests can
> be cheap.

Yeah, makes sense, will include some in the next version.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-23 Thread David Christensen
On Sat, Apr 23, 2022 at 9:49 AM Matthias van de Meent
 wrote:

> Regardless of my (lack of) opinion on the inclusion of this patch in
> PG (I did not significantly review this patch); I noticed that you do
> not yet identify the 'fork' of the FPI in the file name.
>
> A lack of fork identifier in the exported file names would make
> debugging much more difficult due to the relatively difficult to
> identify data contained in !main forks, so I think this oversight
> should be fixed, be it through `_forkname` postfix like normal fork
> segments, or be it through `.` numerical in- or postfix in
> the filename.
>
> -Matthias

Hi Matthias, great point.  Enclosed is a revised version of the patch
that adds the fork identifier to the end if it's a non-main fork.

Best,

David


v2-pg_waldump-save-fpi.patch
Description: Binary data


[PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-22 Thread David Christensen
Hi -hackers,

Enclosed is a patch to allow extraction/saving of FPI from the WAL
stream via pg_waldump.

Description from the commit:

Extracts full-page images from the WAL stream into a target directory,
which must be empty or not exist.  These images are subject to the same
filtering rules as normal display in pg_waldump, which means that you
can isolate the full page writes to a target relation, among other
things.

Files are saved with the filename <LSN>.<TSOID>.<DBOID>.<RELNODE>.<BLKNO>,
with formatting to make things somewhat sortable; for instance:

-01C0.1663.1.6117.0
-01000150.1664.0.6115.0
-010001E0.1664.0.6114.0
-01000270.1663.1.6116.0
-01000300.1663.1.6113.0
-01000390.1663.1.6112.0
-01000420.1663.1.8903.0
-010004B0.1663.1.8902.0
-01000540.1663.1.6111.0
-010005D0.1663.1.6110.0
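
For reference, a name of that shape could be built roughly like this
(a sketch; the variable names here are assumptions):

char        fname[MAXPGPATH];

/* the LSN is split into two %08X halves so the name stays
 * filesystem-safe and sorts lexically */
snprintf(fname, sizeof(fname), "%08X-%08X.%u.%u.%u.%u",
         (uint32) (lsn >> 32), (uint32) lsn,
         tsOid, dbOid, relNode, blkno);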

It's noteworthy that the raw images do not have the current LSN stored
with them in the WAL stream (as would be true for on-heap versions of
the blocks), nor would the checksum be valid in them (though WAL itself
has checksums, so there is some protection there).  This patch chooses
to place the LSN and calculate the proper checksum (if non-zero in the
source image) in the outputted block.  (This could perhaps be a targeted
flag if we decide we don't always want this.)
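
In code terms, the fixup amounts to roughly the following (sketch):

/* stamp the image with the LSN of the record it came from, as replay would */
PageSetLSN(page, record->ReadRecPtr);

/* a non-zero pd_checksum in the source image implies checksums are
 * enabled, so recompute it for the new LSN */
if (((PageHeader) page)->pd_checksum != 0)
    ((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);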

These images could be loaded/inspected via `pg_read_binary_file()` and
used in the `pageinspect` suite of tools to perform detailed analysis on
the pages in question, based on historical information, and may come in
handy for forensics work.

Best,

David


v1-pg_waldump-save-fpi.patch
Description: Binary data


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-03-24 Thread David Christensen

> On Mar 24, 2022, at 4:12 PM, Thomas Munro  wrote:
> 
> On Fri, Mar 25, 2022 at 1:43 AM David Christensen
>  wrote:
>>>> On Mar 24, 2022, at 6:43 AM, Thomas Munro  wrote:
>>> On Fri, Mar 25, 2022 at 12:26 AM Thomas Munro  
>>> wrote:
>>>>> On Fri, Mar 25, 2022 at 12:01 AM Peter Eisentraut
>>>>>  wrote:
>>>>> Or even:  Why are we exposing fork *numbers* in the user interface?
>>>>> Even low-level tools such as pageinspect use fork *names* in their
>>>>> interface.
>>>> 
>>>> I wondered about that but thought it seemed OK for such a low level
>>>> tool.  It's a fair point though, especially if other low level tools
>>>> are doing that.  Here's a patch to change it.
>>> 
>>> Oh, and there's already a name lookup function to use for this.
>> 
>> +1 on the semantic names.
> 
> Cool.
> 
> I had another thought while changing that (and also re-alphabetising):
> Why don't we switch to  -B for --block and -R for --relation?  I
> gather you used -k and -l because -b and -r were already taken, but
> since we already started using upper case for -F, it seems consistent
> this way.  Or were they chosen for consistency with something else?

Works here; I was just trying to get semi-memorable ones from the available
lowercase ones, but I like your idea here, and it kind of puts them in the
same mental space for remembering.

> It's also slightly more helpful to a user if the help says
> --relation=T/D/R instead of N/N/N (TS/DB/REL would be nicer but
> doesn't fit in the space).


0001-Improve-command-line-switches-in-pg_waldump-option.patch
Description: Binary data


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-03-24 Thread David Christensen


> On Mar 24, 2022, at 6:43 AM, Thomas Munro  wrote:
> 
> On Fri, Mar 25, 2022 at 12:26 AM Thomas Munro  wrote:
>>> On Fri, Mar 25, 2022 at 12:01 AM Peter Eisentraut
>>>  wrote:
>>> Or even:  Why are we exposing fork *numbers* in the user interface?
>>> Even low-level tools such as pageinspect use fork *names* in their
>>> interface.
>> 
>> I wondered about that but thought it seemed OK for such a low level
>> tool.  It's a fair point though, especially if other low level tools
>> are doing that.  Here's a patch to change it.
> 
> Oh, and there's already a name lookup function to use for this.

+1 on the semantic names. 

David



Re: [PATCH] Proof of concept for GUC improvements

2022-03-22 Thread David Christensen


> On Mar 21, 2022, at 7:53 PM, Tom Lane  wrote:
> 
> Andres Freund  writes:
>> My impression is that there's not a lot of enthusiasm for the concept? If
>> that's true we maybe ought to mark the CF entry as rejected?
> 
> Yeah, I'm kind of leaning that way too.  I don't see how we can
> incorporate the symbolic values into any existing display paths
> without breaking applications that expect the old output.
> That being the case, it seems like we'd have "two ways to do it"
> indefinitely, which would add enough confusion that I'm not
> sure there's a net gain.  In particular, I foresee novice questions
> along the lines of "I set foo to disabled, why is it showing
> as zero?"

Yeah, my main motivation here was trying to have fewer special values in
the config files, but I guess it would effectively be making everything
an enum while still relying on knowing just what the specific magic
values are, so not really a net gain in this department.

For the record, I thought this would have a fairly low chance of getting in;
I was mainly curious what level of effort it would take to get something
like this going and wanted some feedback on the actual idea.

> If we'd done it like this from the beginning, it'd have been
> great, but retrofitting it now is a lot less appealing.

Yeah, agreed on this. As far as I’m concerned we can reject. 

Thanks,

David



Re: [PATCH] pgbench: add multiconnect option

2022-03-22 Thread David Christensen
On Sat, Mar 19, 2022 at 11:43 AM Fabien COELHO  wrote:

>
> Hi Sami,
>
> > Pgbench is a simple benchmark tool by design, and I wonder if adding
> > a multiconnect feature will cause pgbench to be used incorrectly.
>
> Maybe, but I do not see how it would be worse that what pgbench already
> allows.
>

I agree that pgbench is simple; perhaps really too simple when it comes to
being able to measure much more than basic query flows.  What pgbench does
have in its favor is being distributed with the core distribution.

I think there is definitely space for a more complicated benchmarking tool
that exercises more scenarios and more realistic query patterns and
scenarios.  Whether that is distributed with the core is another question.

David


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-03-21 Thread David Christensen
On Mon, Mar 21, 2022 at 4:39 PM Thomas Munro  wrote:

[snip]

> I guess you did this because init fork references aren't really
> expected in the WAL, but I think it's more consistent to allow up to
> MAX_FORKNUM, not least because your documentation mentions 3 as a
> valid value.  So I adjust this to allow MAX_FORKNUM.  Make sense?
>

Makes sense, but I think I'd actually thought it was +1 of the max forks,
so you give me more credit than I deserve in this case... :-)


> Here are some more details I noticed, as a likely future user of this
> very handy feature, which I haven't changed, because they seem more
> debatable and you might disagree...
>
> 1.  I think it'd be less surprising if the default value for --fork
> wasn't 0... why not show all forks?
>

Agreed; made it default to all, with the ability to filter down if desired.


> 2.  I think it'd be less surprising if --fork without --relation
> either raised an error (like --block without --relation), or were
> allowed, with the meaning "show me this fork of all relations".
>

Agreed; reworked to support the use case of only showing target forks.


> 3.  It seems funny to have no short switch for --fork when everything
> else has one... what about -F?
>

Good idea; I hadn't seen capitals in the getopt list so didn't consider
them, but I like this.

Enclosed is v6, incorporating these fixes and docs tweaks.

Best,

David


v6-0001-Add-additional-filtering-options-to-pg_waldump.patch
Description: Binary data


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-03-21 Thread David Christensen
Updated to include the V3 fixes as well as the unsigned int/enum fix.

>


v4-0001-Add-additional-filtering-options-to-pg_waldump.patch
Description: Binary data


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-03-21 Thread David Christensen
On Sun, Mar 20, 2022 at 11:56 PM Thomas Munro 
wrote:

> On Mon, Mar 21, 2022 at 4:36 PM Thomas Munro 
> wrote:
> > On Sat, Feb 26, 2022 at 7:58 AM David Christensen
> >  wrote:
> > > Attached is V2 with additional feedback from this email, as well as
> the specification of the
> > > ForkNumber and FPW as specifiable options.
> >
> > Trivial fixup needed after commit 3f1ce973.
>
> [04:30:50.630] pg_waldump.c:963:26: error: format ‘%u’ expects
> argument of type ‘unsigned int *’, but argument 3 has type ‘ForkNumber
> *’ [-Werror=format=]
> [04:30:50.630] 963 | if (sscanf(optarg, "%u",
> &filter_by_relation_forknum) != 1 ||
> [04:30:50.630] | ~^ ~~
> [04:30:50.630] | | |
> [04:30:50.630] | | ForkNumber *
> [04:30:50.630] | unsigned int *
>
> And now that this gets to the CompilerWarnings CI task, it looks like
> GCC doesn't like an enum as a scanf %u destination (I didn't see that
> warning locally when I compiled the above fixup because clearly Clang
> is cool with it...).  Probably needs a temporary unsigned int to
> sscanf into first.
>

Do you need me to fix this, or are you incorporating that into a V4 of this
patch? (Similar to your fixup prior in this thread?)
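
For reference, the shape of fix I'd have in mind is just a scratch variable
(a sketch only; the temporary's name is mine):

	unsigned int	tmp_fork;

	/* parse into a plain unsigned int, then assign into the enum */
	if (sscanf(optarg, "%u", &tmp_fork) != 1 || tmp_fork > MAX_FORKNUM)
	{
		pg_log_error("could not parse valid fork number (0..%d) \"%s\"",
					 MAX_FORKNUM, optarg);
		goto bad_argument;
	}
	config.filter_by_relation_forknum = (ForkNumber) tmp_fork;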


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread David Christensen
Bharath Rupireddy  writes:

> On Fri, Feb 25, 2022 at 12:36 AM David Christensen
>  wrote:
>>
>> Greetings,
>>
>> This patch adds the ability to specify a RelFileNode and optional BlockNum 
>> to limit output of
>> pg_waldump records to only those which match the given criteria.  This 
>> should be more performant
>> than `pg_waldump | grep` as well as more reliable given specific variations 
>> in output style
>> depending on how the blocks are specified.
>>
>> This currently affects only the main fork, but we could presumably add the 
>> option to filter by fork
>> as well, if that is considered useful.
>
> Thanks for the patch. This is not adding something that users can't do
> right now, but definitely improves the usability of the pg_waldump as
> it avoids external filterings. Also, it can give the stats/info at
> table and block level. So, +1 from my side.

Attached is V2 with additional feedback from this email, as well as the 
specification of the
ForkNumber and FPW as specifiable options.

Best,

David

From 1b04f04317d364006371bdab0db9086f79138b25 Mon Sep 17 00:00:00 2001
From: David Christensen 
Date: Fri, 25 Feb 2022 12:52:56 -0600
Subject: [PATCH] Add additional filtering options to pg_waldump

This feature allows you to output only records targeting a specific RelFileNode and, optionally,
a BlockNumber within that relation, while specifying which ForkNum you want to filter to.

We also add the independent ability to filter records containing full-page writes.
---
 doc/src/sgml/ref/pg_waldump.sgml |  48 ++++++++
 src/bin/pg_waldump/pg_waldump.c  | 128 +++++++++++++++-
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 5735a161ce..f157175764 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -100,6 +100,44 @@ PostgreSQL documentation
   </listitem>
  </varlistentry>
 
+ <varlistentry>
+  <term><option>-k <replaceable>block</replaceable></option></term>
+  <term><option>--block=<replaceable>block</replaceable></option></term>
+  <listitem>
+   <para>
+    Display only records touching the given block. (Requires also
+    providing the relation via <option>--relation</option>.)
+   </para>
+  </listitem>
+ </varlistentry>
+
+ <varlistentry>
+  <term><option>--fork=<replaceable>fork</replaceable></option></term>
+  <listitem>
+   <para>
+    When using the <option>--relation</option> filter, output only records
+    from the given fork.  The valid values here are <literal>0</literal>
+    for the main fork, <literal>1</literal> for the Free Space
+    Map, <literal>2</literal> for the Visibility Map,
+    and <literal>3</literal> for the Init fork.  If unspecified, defaults
+    to the main fork.
+   </para>
+  </listitem>
+ </varlistentry>
+
+ <varlistentry>
+  <term><option>-l <replaceable>tbl</replaceable>/<replaceable>db</replaceable>/<replaceable>rel</replaceable></option></term>
+  <term><option>--relation=<replaceable>tbl</replaceable>/<replaceable>db</replaceable>/<replaceable>rel</replaceable></option></term>
+  <listitem>
+   <para>
+    Display only records touching the given relation.  The relation is
+    specified via tablespace OID, database OID, and relfilenode separated
+    by slashes, for instance, <literal>1234/12345/12345</literal>.  This
+    is the same format used for relations in the WAL dump output.
+   </para>
+  </listitem>
+ </varlistentry>
+
  <varlistentry>
   <term><option>-n <replaceable>limit</replaceable></option></term>
   <term><option>--limit=<replaceable>limit</replaceable></option></term>
@@ -183,6 +221,16 @@ PostgreSQL documentation
   </listitem>
  </varlistentry>
 
+ <varlistentry>
+   <term><option>-w</option></term>
+   <term><option>--fullpage</option></term>
+   <listitem>
+   <para>
+    Filter records to only those which have full page writes.
+   </para>
+   </listitem>
+ </varlistentry>
+
  <varlistentry>
   <term><option>-x <replaceable>xid</replaceable></option></term>
   <term><option>--xid=<replaceable>xid</replaceable></option></term>
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..a527cd4dc6 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -55,6 +55,12 @@ typedef struct XLogDumpConfig
 	bool		filter_by_rmgr_enabled;
 	TransactionId filter_by_xid;
 	bool		filter_by_xid_enabled;
+	RelFileNode filter_by_relation;
+	bool		filter_by_relation_enabled;
+	BlockNumber	filter_by_relation_block;
+	bool		filter_by_relation_block_enabled;
+	ForkNumber	filter_by_relation_forknum;
+	bool		filter_by_fpw;
 } XLogDumpConfig;
 
 typedef struct Stats
@@ -394,6 +400,56 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	return count;
 }
 
+/*
+ * Boolean to return whether the given WAL record matches a specific relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode matchRnode, BlockNumber matchBlock, ForkNumber matchFork)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		RelFileNode rnode;
+		ForkNumber	forknum;
+		BlockNumber blk;
+
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);
+
+		if (forknum == matchFork &&
+			matchRnode.spcNode == rnode.spcNode &&
+			matchRnode.dbNode == rnode.dbNode &&
+			matchRnode.relNode == rnode.relNode &&
+			(matchBlock == InvalidBlockNumber || matchBlock == blk))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Boolean to return whether the given WAL record contains a full page write
+ */
+static bool
+XLogRecordHasFPW(XLogReaderState *record)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		if (XLogRecHasBlockImage(record, block_id))
+			return true;
+	}
+
+	return false;
+}

Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread David Christensen
On Fri, Feb 25, 2022 at 7:08 AM Japin Li  wrote:

>
> On Fri, 25 Feb 2022 at 20:48, David Christensen <
> david.christen...@crunchydata.com> wrote:
> >> Cool.  I think we can report an error instead of reading wal files,
> >> if the tablespace, database, or relation is invalid.  Does there any
> >> WAL record that has invalid tablespace, database, or relation OID?
> >
> > The only sort of validity check we could do here is range checking for
> the underlying data types
> > (which we certainly could/should add if it’s known to never be valid for
> the underlying types);
>
> The invalid OID I said here is such as negative number and zero, for those
> parameters, we do not need to read the WAL files, since it always invalid.
>

Agreed.  Can add some additional range validation to the parsed values.
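
Something along these lines, say (a sketch; note that a database OID of 0 is
legitimate for shared relations, so only the relfilenode really lends itself
to an unconditional nonzero check):

	unsigned int	tblspc,
				db,
				rel;

	if (sscanf(optarg, "%u/%u/%u", &tblspc, &db, &rel) != 3)
	{
		pg_log_error("invalid relation specification: \"%s\"", optarg);
		goto bad_argument;
	}
	/* OID 0 can never identify a relation, so fail fast instead of scanning */
	if (rel == 0)
	{
		pg_log_error("relation OID must be nonzero");
		goto bad_argument;
	}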

David


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread David Christensen
On Fri, Feb 25, 2022 at 7:33 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Thanks for the patch. This is not adding something that users can't do
> right now, but definitely improves the usability of the pg_waldump as
> it avoids external filterings. Also, it can give the stats/info at
> table and block level. So, +1 from my side.
>

Thanks for the feedback; I will be incorporating most of this into a new
version, with a couple of responses below.


> 3) Crossing 80 char limit
>

This is neither here nor there, but have we as a project considered
increasing that to something more modern?  I know a lot of current projects
consider 132 to be a more reasonable limit.  (Will reduce it down to that
for now, but consider this a vote towards increasing that limit.)


> 5) I would also see a need for "filter by FPW" i.e. list all WAL
> records with "FPW".
>

Yes, that wouldn't be too hard to add; I can include it in the next
version.  We probably ought to also make the fork number specifiable as
well.  Other ideas could be some wildcard value in the relation part, so
`1234/23456/*` could filter WAL to a specific database only, say, or some
other multiple-value specifier, like `--block=1234,123456,121234`.  (I
honestly consider this more advanced than we'd need to support in this
patch, but it probably wouldn't be too hard to add.)

Thanks,

David


Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-25 Thread David Christensen
> Cool.  I think we can report an error instead of reading wal files,
> if the tablespace, database, or relation is invalid.  Does there any
> WAL record that has invalid tablespace, database, or relation OID?

The only sort of validity check we could do here is range checking for the
underlying data types (which we certainly could/should add for values that
are never valid for those types); non-existence of objects is a no-go, since
that depends purely on the WAL range you are looking at, and you'd have to,
you know, scan it to see whether the object existed before marking it
invalid. :)

Thanks,

David





Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-24 Thread David Christensen
Added to commitfest as:

https://commitfest.postgresql.org/37/3565/


[PATCH] add relation and block-level filtering to pg_waldump

2022-02-24 Thread David Christensen
Greetings,

This patch adds the ability to specify a RelFileNode and optional BlockNum to 
limit output of
pg_waldump records to only those which match the given criteria.  This should 
be more performant
than `pg_waldump | grep` as well as more reliable given specific variations in 
output style
depending on how the blocks are specified.

This currently affects only the main fork, but we could presumably add the 
option to filter by fork
as well, if that is considered useful.

Best,

David

From 9194b2cb07172e636030b9b4e979b7f2caf7cbc0 Mon Sep 17 00:00:00 2001
From: David Christensen 
Date: Thu, 24 Feb 2022 11:00:46 -0600
Subject: [PATCH] Add relation/block filtering to pg_waldump

This feature allows you to only output records that are targeting a specific RelFileNode and optional
BlockNumber within this relation.  Currently only applies this filter to the relation's main fork.
---
 doc/src/sgml/ref/pg_waldump.sgml | 23 +++++
 src/bin/pg_waldump/pg_waldump.c  | 74 ++++++++++-
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 5735a161ce..c953703bc8 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -100,6 +100,29 @@ PostgreSQL documentation
   </listitem>
  </varlistentry>
 
+ <varlistentry>
+  <term><option>-k <replaceable>block</replaceable></option></term>
+  <term><option>--block=<replaceable>block</replaceable></option></term>
+  <listitem>
+   <para>
+    Display only records touching the given block. (Requires also
+    providing the relation via <option>--relation</option>.)
+   </para>
+  </listitem>
+ </varlistentry>
+
+ <varlistentry>
+  <term><option>-l <replaceable>tbl</replaceable>/<replaceable>db</replaceable>/<replaceable>rel</replaceable></option></term>
+  <term><option>--relation=<replaceable>tbl</replaceable>/<replaceable>db</replaceable>/<replaceable>rel</replaceable></option></term>
+  <listitem>
+   <para>
+    Display only records touching the given relation.  The relation is
+    specified via tablespace oid, database oid, and relfilenode separated
+    by slashes.
+   </para>
+  </listitem>
+ </varlistentry>
+
  <varlistentry>
   <term><option>-n <replaceable>limit</replaceable></option></term>
   <term><option>--limit=<replaceable>limit</replaceable></option></term>
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..faae547a5c 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -55,6 +55,10 @@ typedef struct XLogDumpConfig
 	bool		filter_by_rmgr_enabled;
 	TransactionId filter_by_xid;
 	bool		filter_by_xid_enabled;
+	RelFileNode filter_by_relation;
+	bool		filter_by_relation_enabled;
+	BlockNumber	filter_by_relation_block;
+	bool		filter_by_relation_block_enabled;
 } XLogDumpConfig;
 
 typedef struct Stats
@@ -394,6 +398,34 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	return count;
 }
 
+/*
+ * Boolean to return whether the given WAL record matches a specific relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode matchRnode, BlockNumber matchBlock)
+{
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blk;
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);
+
+		if (forknum == MAIN_FORKNUM &&
+			matchRnode.spcNode == rnode.spcNode &&
+			matchRnode.dbNode == rnode.dbNode &&
+			matchRnode.relNode == rnode.relNode &&
+			(matchBlock == InvalidBlockNumber || matchBlock == blk))
+			return true;
+	}
+	return false;
+}
+
 /*
  * Calculate the size of a record, split into !FPI and FPI parts.
  */
@@ -767,6 +799,8 @@ usage(void)
 	printf(_("  -b, --bkp-details  output detailed information about backup blocks\n"));
 	printf(_("  -e, --end=RECPTR   stop reading at WAL location RECPTR\n"));
 	printf(_("  -f, --follow   keep retrying after reaching end of WAL\n"));
+	printf(_("  -k, --block=N  only show records matching a given relation block (requires -l)\n"));
+	printf(_("  -l, --relation=N/N/N   only show records that touch a specific relation\n"));
 	printf(_("  -n, --limit=N  number of records to display\n"));
 	printf(_("  -p, --path=PATHdirectory in which to find log segment files or a\n"
 			 " directory with a ./pg_wal that contains such files\n"
@@ -802,12 +836,14 @@ main(int argc, char **argv)
 
 	static struct option long_options[] = {
 		{"bkp-details", no_argument, NULL, 'b'},
+		{"block", required_argument, NULL, 'k'},
 		{"end", required_argument, NULL, 'e'},
 		{"follow", no_argument, NULL, 'f'},
 		{"help", no_argument, NULL, '?'},
 		{"limit", required_argument, NULL, 'n'},
 		{"path", required_argument, NULL, 'p'},
 		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'l'},
 		{"rmgr", required_argument, NULL, 'r'},
 		{"start", required_argument, NULL, 's'},
 		{"timeline", required_argument, NULL, 't'},
@@ -860,6 +896,8 @@ main(int argc, char **argv)
 	config.filter_by_rmgr_enabled = false;

Re: DELETE CASCADE

2022-01-31 Thread David Christensen
On Sun, Jan 30, 2022 at 9:47 PM Julien Rouhaud  wrote:

> Hi,
>
> On Tue, Jan 25, 2022 at 10:26:53PM +0800, Julien Rouhaud wrote:
> >
> > It's been almost 4 months since your last email, and almost 2 weeks
> since the
> > notice that this patch doesn't apply anymore.  Without update in the next
> > couple of days this patch will be closed as Returned with Feedback per
> the
> > commitfest rules.
>
> Closed.
>

Sounds good; when I get time to look at this again I will resubmit (if
people think the base functionality is worth it, which is still a topic of
discussion).
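
(For context, the base functionality proposed was roughly this, letting a
single statement cascade through FKs declared without ON DELETE CASCADE;
syntax illustrative:)

    DELETE FROM parent WHERE id = 42 CASCADE;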

David


Re: [PATCH] Proof of concept for GUC improvements

2022-01-12 Thread David Christensen
> Hi,
> 
> According to the cfbot, the patch doesn't apply anymore and needs a
> rebase: http://cfbot.cputube.org/patch_36_3290.log

V4 rebased attached. 



special-guc-values-v4.patch
Description: Binary data


Re: CREATE ROLE IF NOT EXISTS

2021-11-22 Thread David Christensen
On Mon, Nov 22, 2021 at 6:49 AM Daniel Gustafsson  wrote:

> > On 10 Nov 2021, at 18:14, David Christensen <
> david.christen...@crunchydata.com> wrote:
>
> > Modulo other issues/discussions, here is a version of this patch..
>
> This patch fails to compile since you renamed the if_not_exists member in
> CreateRoleStmt but still set it in the parser.
>

D'oh! Enclosed is a fixed/rebased version.
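
For anyone skimming the thread, the behavior under discussion is along these
lines (illustrative syntax, not final):

    -- no-op instead of an error when the role already exists
    CREATE ROLE IF NOT EXISTS app_reader;

    -- the v2 variant: (re)set the role's attributes if it exists
    CREATE OR REPLACE ROLE app_reader LOGIN;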

Best,

David


CREATE-OR-REPLACE-ROLE-v2.patch
Description: Binary data

