Re: function for testing that causes the backend to terminate

2021-04-30 Thread Bharath Rupireddy
On Thu, Apr 29, 2021 at 4:36 PM Bharath Rupireddy
 wrote:
>
> On Thu, Apr 29, 2021 at 4:27 PM Dave Cramer  wrote:
> > For testing unusual situations I'd like to be able to cause a backend to 
> > terminate due to something like a segfault. Do we currently have this in 
> > testing?
>
> Well, you could use pg_terminate_backend, which sends SIGTERM to the
> backend. However, we don't have a function that sends SIGSEGV yet; you
> could signal the backend with SIGSEGV directly, if possible.

Also, I came across an extension called pg_crash [1]; see if that helps.

[1] 
https://www.cybertec-postgresql.com/en/pg_crash-crashing-postgresql-automatically/
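
If a dedicated SQL-callable crash function is ever wanted in a test
extension, a minimal sketch could look like the following. This is
illustrative only: the function name is made up, and the usual extension
packaging (control file, CREATE FUNCTION ... LANGUAGE C) is assumed.

/*
 * Sketch of a SQL-callable function that crashes the calling backend
 * with SIGSEGV, for testing only.  "crash_backend" is a hypothetical
 * name, not an existing function.
 */
#include "postgres.h"
#include "fmgr.h"
#include <signal.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(crash_backend);

Datum
crash_backend(PG_FUNCTION_ARGS)
{
    raise(SIGSEGV);         /* terminate this backend with a segfault */
    PG_RETURN_VOID();       /* not reached */
}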

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Enhanced error message to include hint messages for redundant options error

2021-04-30 Thread Dilip Kumar
On Sat, May 1, 2021 at 10:43 AM Bharath Rupireddy
 wrote:
>
> On Fri, Apr 30, 2021 at 2:49 PM Dilip Kumar  wrote:
> > Looking into this again, why not as shown below?  IMHO, this way the
> > code will be logically the same as it was before the patch, basically
> > why to process an extra statement ( *volatility_item = defel;) if we
> > have already decided to error.
>
> I changed my mind given the concerns raised on removing the goto
> statements. We could just do as below:

Okay, that makes sense.

> diff --git a/src/backend/commands/functioncmds.c
> b/src/backend/commands/functioncmds.c
> index 9548287217..1f1c74c379 100644
> --- a/src/backend/commands/functioncmds.c
> +++ b/src/backend/commands/functioncmds.c
> @@ -575,7 +575,7 @@ compute_common_attribute(ParseState *pstate,
>  duplicate_error:
>  ereport(ERROR,
>  (errcode(ERRCODE_SYNTAX_ERROR),
> - errmsg("conflicting or redundant options"),
> + errmsg("option \"%s\" specified more than once", 
> defel->defname),
>   parser_errposition(pstate, defel->location)));
>  return false;/* keep compiler quiet */
>
> I'm not attaching above one line change as a patch, maybe Vignesh can
> merge this into the main patch.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Enhanced error message to include hint messages for redundant options error

2021-04-30 Thread Bharath Rupireddy
On Fri, Apr 30, 2021 at 2:49 PM Dilip Kumar  wrote:
> Looking into this again, why not as shown below?  IMHO, this way the
> code will be logically the same as it was before the patch, basically
> why to process an extra statement ( *volatility_item = defel;) if we
> have already decided to error.

I changed my mind given the concerns raised on removing the goto
statements. We could just do as below:

diff --git a/src/backend/commands/functioncmds.c
b/src/backend/commands/functioncmds.c
index 9548287217..1f1c74c379 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -575,7 +575,7 @@ compute_common_attribute(ParseState *pstate,
 duplicate_error:
 ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("conflicting or redundant options"),
+ errmsg("option \"%s\" specified more than once", defel->defname),
  parser_errposition(pstate, defel->location)));
 return false;/* keep compiler quiet */

I'm not attaching above one line change as a patch, maybe Vignesh can
merge this into the main patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Log enhancement for aclcheck permissions failures

2021-04-30 Thread Bharath Rupireddy
On Sat, May 1, 2021 at 5:26 AM Bingyu Shen  wrote:
> Hi hackers,
>
> I was wondering if we can improve the error messages for acl permission 
> failures.
> The current implementation for reporting errors is in "backend/catalog/aclchk.c":
>  void aclcheck_error(AclResult aclerr, ObjectType objtype, const char 
> *objectname);
>
> based on the AclResult type, it prints log messages like
> "permission denied for schema %s"
> which tells admins the scope of the permission denial,
> such as the table name or schema name.
>
> However, I find that the log messages *lack* important details, i.e., the
> *exact permission* that caused the denial. Novice users
> may end up over-granting permissions to fix the issue,
> causing a security vulnerability in the database.
>
> I think the log messages could be better if we added some diagnostic
> information, like which *role* is denied and what *permission* it lacks.
> This way, users know exactly which permission to grant
> without trial and error.

I think it's easy for users (even novice ones) to know exactly
what permission they lack just by looking at the query. The
permissions we have in parsenodes.h with the ACL_ prefix are quite
clear and can be inferred from the type of query. So, I don't think
printing that obvious information in the log message is an improvement
we would want to make.
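
For reference, the relevant bits in parsenodes.h look like this
(abridged from the PostgreSQL sources; comments as in the header):

#define ACL_INSERT      (1<<0)  /* for relations */
#define ACL_SELECT      (1<<1)
#define ACL_UPDATE      (1<<2)
#define ACL_DELETE      (1<<3)
#define ACL_TRUNCATE    (1<<4)
#define ACL_REFERENCES  (1<<5)
#define ACL_TRIGGER     (1<<6)
#define ACL_EXECUTE     (1<<7)  /* for functions */
#define ACL_CREATE      (1<<9)  /* for namespaces and databases */

An INSERT obviously needs ACL_INSERT on the target table, a CREATE needs
ACL_CREATE on the namespace, and so on.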

To know the current role with which the query is run, users can use
CURRENT_ROLE or pg_roles.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: New assertion failed in lazy_scan_heap() on mereswine

2021-04-30 Thread Justin Pryzby
On Sat, May 01, 2021 at 04:43:21PM +1200, Thomas Munro wrote:
> TRAP: FailedAssertion("!all_visible_according_to_vm || 
> prunestate.all_visible", File: 
> "/home/pgsql/build-farm/buildroot/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
>  Line: 1347, PID: 16926)
> 2021-04-30 04:15:50.317 PDT [10349:18] DETAIL:  Failed process was
> running: autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2021-04-30%2006:43:27

The same assertion was reported here
https://www.postgresql.org/message-id/flat/OS0PR01MB611340CBD300A7C4FD6B6101FB5F9%40OS0PR01MB6113.jpnprd01.prod.outlook.com
|TRAP: FailedAssertion("!all_visible_according_to_vm || 
prunestate.all_visible", File: 
"/home/pgsql/build-farm/buildroot/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
 Line: 1347, PID: 16926)
|2021-04-30 04:15:50.317 PDT [10349:18] DETAIL:  Failed process was running: 
autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute

And Michael added
https://wiki.postgresql.org/index.php?title=PostgreSQL_14_Open_Items&type=revision&diff=35954&oldid=35952

-- 
Justin




Re: New assertion failed in lazy_scan_heap() on mereswine

2021-04-30 Thread Dilip Kumar
On Sat, May 1, 2021 at 10:14 AM Thomas Munro  wrote:
>
> Hi,
>
> TRAP: FailedAssertion("!all_visible_according_to_vm ||
> prunestate.all_visible", File:
> "/home/pgsql/build-farm/buildroot/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
> Line: 1347, PID: 16926)
>
> 2021-04-30 04:15:50.317 PDT [10349:18] DETAIL:  Failed process was
> running: autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute
>

There is another thread [1] that reported the same issue and also
provided a script to reproduce it.

[1] 
https://www.postgresql.org/message-id/os0pr01mb611340cbd300a7c4fd6b6101fb...@os0pr01mb6113.jpnprd01.prod.outlook.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




New assertion failed in lazy_scan_heap() on mereswine

2021-04-30 Thread Thomas Munro
Hi,

TRAP: FailedAssertion("!all_visible_according_to_vm ||
prunestate.all_visible", File:
"/home/pgsql/build-farm/buildroot/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
Line: 1347, PID: 16926)

2021-04-30 04:15:50.317 PDT [10349:18] DETAIL:  Failed process was
running: autovacuum: VACUUM ANALYZE pg_catalog.pg_attribute

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2021-04-30%2006:43:27




Re: Transactions involving multiple postgres foreign servers, take 2

2021-04-30 Thread Masahiko Sawada
On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu  wrote:
>
> Hi,
> For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :

Thank you for reviewing the patch!

>
> With this commit, the foreign server modified within the transaction marked 
> as 'modified'.
>
> transaction marked -> transaction is marked

Will fix.

>
> +#define IsForeignTwophaseCommitRequested() \
> +(foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>
> Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro 
> should be named: IsForeignTwophaseCommitRequired.

But even if foreign_twophase_commit is
FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
there is only one modified server, right? It seems the name
IsForeignTwophaseCommitRequested is fine.

>
> +static bool
> +checkForeignTwophaseCommitRequired(bool local_modified)
>
> +   if (!ServerSupportTwophaseCommit(fdw_part))
> +   have_no_twophase = true;
> ...
> +   if (have_no_twophase)
> +   ereport(ERROR,
>
> It seems the error case should be reported within the loop. This way, we
> don't need to iterate over the other participant(s).
> Accordingly, nserverswritten should be incremented for the local server prior
> to the loop. The condition in the loop would become if
> (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
> have_no_twophase is no longer needed.

Hmm, I think if we process a 2PC-incapable server first and then process
a 2PC-capable server, we should raise an error, but this approach cannot
detect that case.
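
To make the ordering hazard concrete, here is a small standalone sketch
(not the patch's code) that simulates the suggested in-loop check with
the 2PC-incapable server visited first:

#include <stdbool.h>
#include <stdio.h>

int
main(void)
{
    bool supports_2pc[] = {false, true};    /* 2PC-incapable server first */
    int  nserverswritten = 0;               /* local server not modified */
    bool errored = false;

    for (int i = 0; i < 2; i++)
    {
        nserverswritten++;
        if (!supports_2pc[i] && nserverswritten > 1)
            errored = true;                 /* the suggested in-loop check */
    }

    /*
     * Two servers were modified and one of them cannot do 2PC, so an
     * error is required, but the check never fired because the incapable
     * server was visited while nserverswritten was still 1.
     */
    printf("errored = %s (expected: true)\n", errored ? "true" : "false");
    return 0;
}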

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Granting control of SUSET gucs to non-superusers

2021-04-30 Thread Isaac Morland
On Fri, 30 Apr 2021 at 22:00, Mark Dilger 
wrote:


> Viewing all of this in terms of which controls allow the tenant to escape
> a hypothetical sandbox seems like the wrong approach.  Shouldn't we let
> service providers decide which controls would allow the tenant to escape
> the specific sandbox the provider has designed?
>

I’m not even sure I should be mentioning this possibility, but what if we
made each GUC parameter a grantable privilege? I’m honestly not sure if
this is insane or not. I mean numerically it’s a lot of privileges, but
conceptually it’s relatively simple.

What I like the least about it is actually the idea of giving up entirely
on the notion of grouping privileges into reasonable packages: some of
these privileges would be quite safe to grant in many or even most
circumstances, while others would usually not be reasonable to grant.


Re: Granting control of SUSET gucs to non-superusers

2021-04-30 Thread Mark Dilger



> On Apr 30, 2021, at 4:28 PM, Stephen Frost  wrote:
> 
> “Can’t be used to gain superuser” may be a sufficiently clear grouping, as 
> was more or less contemplated by the “admin” approach.  If that doesn’t work 
> though then we need an understanding of what the limits on these groups are, 
> so we can competently fit new GUCs into these groups (or invent new ones if a 
> new GUC truly falls outside all existing but I would expect that to be a 
> rather rare case..). 

When I first heard that providers want to build sandboxes around PostgreSQL, I 
thought the idea was a little silly, because providers can just spin up a 
virtual machine per tenant and give each tenant superuser privileges on their 
respective VM.  Who cares if they mess it up after that?

The problem with that idea turns out to be that the providers want to take 
responsibility for some of the database maintenance, possibly including 
backups, replication, etc.  I think the set of controls the provider hands over 
to the tenant will depend very much on the division of responsibility.  If the 
provider is managing replication, then control over session_replication_role 
and wal_compression is unlikely to be handed to the tenant, but if the tenant 
is responsible for their own replication scheme, it might be.

Viewing all of this in terms of which controls allow the tenant to escape a 
hypothetical sandbox seems like the wrong approach.  Shouldn't we let service 
providers decide which controls would allow the tenant to escape the specific 
sandbox the provider has designed?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Transactions involving multiple postgres foreign servers, take 2

2021-04-30 Thread Masahiko Sawada
On Tue, Apr 27, 2021 at 10:03 AM Masahiro Ikeda
 wrote:
>
>
>
> On 2021/03/17 12:03, Masahiko Sawada wrote:
> > I've attached the updated version patch set.
>
> Thanks for updating the patches! I'm now restarting my review of 2PC because
> I'd like to use this feature in PG15.

Thank you for reviewing the patch! Much appreciated.

>
>
> I think the following logic of resolving and removing the fdwxact entries
> by the transaction resolver needs to be fixed.
>
> 1. check if pending fdwxact entries exist
>
> HoldInDoubtFdwXacts() checks whether there are entries that meet conditions
> such as having InvalidBackendId. After that, it collects the indexes of those
> entries in the fdwxacts array. The fdwXactLock is released at the end of this
> phase.
>
> 2. resolve and remove the entries held in the first phase.
>
> ResolveFdwXacts() resolves the status of each fdwxact entry using those
> indexes. At the end of resolving, the transaction resolver removes the entry
> from the fdwxacts array via remove_fdwact().
>
> The entry is removed as follows. Since entries are accessed by index, the
> indexes collected in the first phase become meaningless after this.
>
> /* Remove the entry from active array */
> FdwXactCtl->num_fdwxacts--;
> FdwXactCtl->fdwxacts[i] = FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts];
>
> This can lead to resolving unexpected fdwxacts, and it can trigger the
> following assertion failure, which is how I noticed. For example, a backend
> may insert a new fdwxact entry into the slot that the resolver freed just
> before, and the resolver then accesses that new entry, which doesn't need to
> be resolved yet, because it uses the indexes collected in the first phase.
>
> Assert(fdwxact->locking_backend == MyBackendId);

Good point. I agree with your analysis.
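
To illustrate the race concretely, here is a small standalone sketch
(not the patch's code) of how the swap-with-last removal can leave
first-phase indexes pointing at the wrong entry:

#include <stdio.h>

int
main(void)
{
    int entries[4] = {100, 101, 102, 103};  /* stand-ins for fdwxacts */
    int num = 4;
    int held[2] = {1, 3};   /* indexes collected in the first phase */

    /* Second phase: resolve and remove entries[held[0]], mirroring
     * remove_fdwact(): swap the last entry into the freed slot. */
    num--;
    entries[held[0]] = entries[num];

    /* Meanwhile, a backend inserts a brand-new entry into the freed slot. */
    entries[num++] = 200;

    /* held[1] == 3 now refers to the new entry (200), not the entry the
     * first phase decided to resolve (103): the Assert failure scenario. */
    printf("held[1] now points at entry %d\n", entries[held[1]]);
    return 0;
}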

>
>
>
> The simple solution is to hold fdwXactLock exclusively all the way from the
> beginning of the first phase to the end of the second phase. But I worry
> that the performance impact would be too big...
>
> I came up with two solutions although there may be better solutions.
>
> A. remove resolved entries all at once, after resolution of all held entries
> is finished
>
> This way, we don't need to hold the exclusive lock for a long time. But it
> has other problems: pg_remove_foreign_xact() can still remove entries, and
> we need to handle resolution failures.
>
> I wonder whether we can solve the first problem by introducing a new lock,
> like a "removing lock", where only processes holding the lock can remove
> entries. The performance impact is limited since insertion of fdwxact
> entries is not blocked by this lock. The second problem can be solved using
> a PG_TRY/PG_CATCH block.
>
>
> B. merge the first and second phases
>
> Currently, the resolver resolves the entries together. That's the reason why
> it's difficult to remove them. So, executing checking, resolving, and
> removing per entry seems to solve the problem. I think it's better since
> this is simpler than A. If I'm missing something, please let me know.

It seems to me that solution B would be simpler and better. I'll try
to fix this issue by using solution B and rebase the patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: RFE: Make statistics robust for unplanned events

2021-04-30 Thread Peter Geoghegan
On Thu, Apr 22, 2021 at 3:35 PM Tom Lane  wrote:
> Peter Geoghegan  writes:
> > We already *almost* pay the full cost of durably storing the
> > information used by autovacuum.c's relation_needs_vacanalyze() to
> > determine if a VACUUM is required -- we're only missing
> > new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
> > further to fix this issue?
>
> Definitely worth thinking about, but I'm a little confused about how
> you see this working.  Those pg_class fields are updated by vacuum
> (or analyze) itself.  How could they usefully serve as input to
> autovacuum's decisions?

Honestly, the details weren't very well thought out. My basic point
still stands, which is that it shouldn't be *that* expensive to make
the relevant information crash-safe, which would protect the system
from certain pathological cases. Maybe it doesn't even have to be
crash-safe in the way that pg_class.reltuples is -- something
approximate might work quite well. Assuming that there are no dead
tuples after a crash seems rather naive.

I seem to recall that certain init scripts I saw years ago used
Immediate Shutdown mode very frequently. Stuff like that is bound to
happen in some installations, and so we should protect users from
hard-to-foresee extreme consequences. Sure, using immediate shutdown
like that isn't recommended practice, but that's no reason to allow a
nasty failure mode.

--
Peter Geoghegan




Re: Granting control of SUSET gucs to non-superusers

2021-04-30 Thread Tom Lane
Stephen Frost  writes:
> On Fri, Apr 30, 2021 at 19:19 Mark Dilger 
> wrote:
>> PostgreSQL defines a number of GUCs that can only be set by superusers.  I
>> would like to support granting privileges on subsets of these to
>> non-superuser roles, inspired by Stephen Frost's recent work on
>> pg_read_all_data and pg_write_all_data roles.

> New predefined roles are relatively inexpensive. That said, whatever sets
> we define need to have some meaning to them- one which is reasonably
> future-proofed so that we have some idea what category a new GUC would fit
> into.
> “Can’t be used to gain superuser” may be a sufficiently clear grouping, as
> was more or less contemplated by the “admin” approach.  If that doesn’t
> work though then we need an understanding of what the limits on these
> groups are, so we can competently fit new GUCs into these groups (or invent
> new ones if a new GUC truly falls outside all existing but I would expect
> that to be a rather rare case..).
> We may also wish to keep some GUCs superuser only when they really only
> make sense to be used in a developer context...

Hmm, is there really any point in that?  We already have roles
like "pg_write_server_files" and "pg_execute_server_program",
which allow trivial escalation to superuser if one wishes,
but are still useful as being roles you're a bit less likely
to break your database with accidentally than running as full
superuser.

So ISTM that "pg_set_superuser_parameters" could be treated as
being one of those same sorts of roles that you don't give out
to untrusted people, and then we don't have to worry about
exactly which GUCs might be exactly how dangerous to whom.

If we try to define it as being some lesser level of
privilege than that, I'm afraid we will spend lots of
not-very-productive time trying to classify the security
threats from different GUCs ... and they all have *some*
security issue involved, or they wouldn't be restricted in
the first place.  Plus, I'm not looking forward to having
to issue CVEs when we realize we misclassified something.

regards, tom lane




Re: Granting control of SUSET gucs to non-superusers

2021-04-30 Thread Chapman Flack
On 04/30/21 19:19, Mark Dilger wrote:

> We could certainly debate which GUCs could be used to escape the sandbox
> vs. which ones could not, but I would prefer a design that allows the
> provider to make that determination.

I find myself wondering how many GUCs flagged SUSET are not flagged that way
because of a determination already made that they could be used to escape.
(Maybe some of the logging ones, only usable to conceal your escape.)

But there might be ways for a provider, scrutinizing each of those
individually, to conclude "this will not allow escape from the sandbox
/I/ have set up, provided the value being set satisfies constraints
x and y" ... a generalization of the LOAD from $libdir/plugins idea.

So that suggests to me some mechanism where a provider could grant
setting foo to role bar using validator baz().
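
To sketch what such a validator could look like: existing GUCs already
have check hooks with a fixed signature, so a per-grant validator might
reuse that shape. The grant-time wiring here is imagined; only the hook
signature and GUC_check_errmsg() are existing facilities:

/*
 * Hypothetical per-grant validator, modeled on the existing GUC
 * check-hook signature.  Allows the grantee to raise work_mem only
 * up to 256MB.
 */
static bool
baz_check_work_mem(int *newval, void **extra, GucSource source)
{
    if (*newval > 256 * 1024)   /* work_mem is measured in kB */
    {
        GUC_check_errmsg("work_mem may not exceed 256MB for this role");
        return false;
    }
    return true;
}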

Can SUSET GUCs be set from SECURITY DEFINER functions? Maybe there are
already the pieces to do that, minus some syntax sugar.

Regards,
-Chap




Log enhancement for aclcheck permissions failures

2021-04-30 Thread Bingyu Shen
Hi hackers,

I was wondering if we can improve the error messages for acl permission
failures.
The current implementation for reporting errors is in "backend/catalog/aclchk.c":
 void aclcheck_error(AclResult aclerr, ObjectType objtype, const char
*objectname);

based on the AclResult type, it prints log messages like
"permission denied for schema %s"
which tells admins the scope of the permission denial,
such as the table name or schema name.

However, I find that the log messages *lack* important details, i.e., the
*exact permission* that caused the denial. Novice users
may end up over-granting permissions to fix the issue,
causing a security vulnerability in the database.

I think the log messages could be better if we added some diagnostic
information, like which *role* is denied and what *permission* it lacks.
This way, users know exactly which permission to grant
without trial and error.

It is not hard to improve the log messages after looking into the code.
Most places call the function aclcheck_error() immediately after the
permission check, e.g., pg_type_aclcheck(), pg_tablespace_aclcheck().
For example, in backend/commands/dbcommands.c, it checks whether
the user has CREATE permission.

aclresult = pg_tablespace_aclcheck(dst_deftablespace, GetUserId(),
ACL_CREATE);
if (aclresult != ACLCHECK_OK)
aclcheck_error(aclresult, OBJECT_TABLESPACE, tablespacename);

We can simply change the aclcheck_error() function signature a bit,
so that we can pass the exact permission to the function and tell users
exactly why permission was denied. It would be something like

void aclcheck_error(AclResult aclerr, ObjectType objtype,
  const char *objectname,
  const char *privilegename)
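
To make that concrete, here is a minimal sketch of what the reporting
inside the new function might look like. It assumes the usual backend
context; the errdetail wording is an assumption for illustration, not
existing PostgreSQL code:

void
aclcheck_error(AclResult aclerr, ObjectType objtype,
               const char *objectname, const char *privilegename)
{
    if (aclerr == ACLCHECK_NO_PRIV)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied for %s", objectname),
                 errdetail("Role \"%s\" lacks the %s privilege.",
                           GetUserNameFromId(GetUserId(), false),
                           privilegename)));
    /* ... other AclResult cases unchanged ... */
}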

Any thoughts would be appreciated. Thanks!

Best regards,
Bingyu


Re: Granting control of SUSET gucs to non-superusers

2021-04-30 Thread Stephen Frost
Greetings,

On Fri, Apr 30, 2021 at 19:19 Mark Dilger 
wrote:

> PostgreSQL defines a number of GUCs that can only be set by superusers.  I
> would like to support granting privileges on subsets of these to
> non-superuser roles, inspired by Stephen Frost's recent work on
> pg_read_all_data and pg_write_all_data roles.


There’s been some effort started in this direction which I was working on
(see the patches about an “admin” role and set of GUCs).  I have been
meaning to get back to that but the specific concern I had was about coming
up with how to define the proper set of GUCs.

The specific use case motivating this work is that of a PostgreSQL service
> provider.  The provider guarantees certain aspects of the service, such as
> periodic backups, replication, uptime, availability, etc., while making no
> guarantees of other aspects, such as performance associated with the design
> of the schema or the queries executed.  The provider should be able to
> grant to the tenant privileges to set any GUC which cannot be used to
> "escape the sandbox" and interfere with the handful of metrics being
> guaranteed.  Given that the guarantees made by one provider may differ from
> those made by another, the exact set of GUCs which the provider allows the
> tenant to control may differ.
>
> By my count, there are currently 50 such GUCs, already broken down into 15
> config groups.  Creating a single new role pg_set_all_gucs seems much too
> coarse a control, but creating 50 new groups may be excessive.  We could
> certainly debate which GUCs could be used to escape the sandbox vs. which
> ones could not, but I would prefer a design that allows the provider to
> make that determination.  The patch I would like to submit would only give
> the provider the mechanism for controlling these things, but would not make
> the security choices for them.
>
> Do folks think it would make sense to create a role per config group?
> Adding an extra 15 default roles seems high to me, but organizing the
> feature this way would make the roles easier to document, because there
> would be a one-to-one correlation between the roles and the config groups.


New predefined roles are relatively inexpensive. That said, whatever sets
we define need to have some meaning to them- one which is reasonably
future-proofed so that we have some idea what category a new GUC would fit
into.

“Can’t be used to gain superuser” may be a sufficiently clear grouping, as
was more or less contemplated by the “admin” approach.  If that doesn’t
work though then we need an understanding of what the limits on these
groups are, so we can competently fit new GUCs into these groups (or invent
new ones if a new GUC truly falls outside all existing but I would expect
that to be a rather rare case..).

We may also wish to keep some GUCs superuser only when they really only
make sense to be used in a developer context...

Thanks,

Stephen


Granting control of SUSET gucs to non-superusers

2021-04-30 Thread Mark Dilger
Hackers,

PostgreSQL defines a number of GUCs that can only be set by superusers.  I 
would like to support granting privileges on subsets of these to non-superuser 
roles, inspired by Stephen Frost's recent work on pg_read_all_data and 
pg_write_all_data roles.

The specific use case motivating this work is that of a PostgreSQL service 
provider.  The provider guarantees certain aspects of the service, such as 
periodic backups, replication, uptime, availability, etc., while making no 
guarantees of other aspects, such as performance associated with the design of 
the schema or the queries executed.  The provider should be able to grant to 
the tenant privileges to set any GUC which cannot be used to "escape the 
sandbox" and interfere with the handful of metrics being guaranteed.  Given 
that the guarantees made by one provider may differ from those made by another, 
the exact set of GUCs which the provider allows the tenant to control may 
differ.

By my count, there are currently 50 such GUCs, already broken down into 15 
config groups.  Creating a single new role pg_set_all_gucs seems much too 
coarse a control, but creating 50 new groups may be excessive.  We could 
certainly debate which GUCs could be used to escape the sandbox vs. which ones 
could not, but I would prefer a design that allows the provider to make that 
determination.  The patch I would like to submit would only give the provider 
the mechanism for controlling these things, but would not make the security 
choices for them.

Do folks think it would make sense to create a role per config group?  Adding 
an extra 15 default roles seems high to me, but organizing the feature this way 
would make the roles easier to document, because there would be a one-to-one 
correlation between the roles and the config groups.

I have a WIP patch that I'm not attaching, but if I get any feedback, I might 
be able to adjust the patch before the first version posted.  The basic idea is 
that it allows things like:

GRANT pg_set_stats_monitoring TO tenant_role;

And then tenant_role could, for example

SET log_parser_stats TO off;

Thanks

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 2:07 PM Robert Haas  wrote:
> OK. I thought about this in regards to zheap, which has this exact
> problem, because it wants to do so-called "in place" updates where the
> new version of the row goes right on top of the old one in the table
> page, and the old version of the row gets written into the undo log.
> Just to keep things simple, we said that initially we'd only use this
> in-place update strategy when no indexed columns were changed, so that
> there's only ever one set of index entries for a given TID. In that
> model, the index AMs don't really need to care that there are actually
> multiple tuples for the same TID, because those tuples differ only in
> columns that the index doesn't care about anyway. An index scan has to
> be careful to fetch the correct version of the tuple, but it has a
> Snapshot available, so it can do that.

Right. So zheap (in the current prototype implementation) is like
heapam with its HOT optimization, except that it isn't subject to the
same limitations with regard to fitting heap tuples on the same heap
page to keep the same HOT chain going over time. You kind of have the
moral equivalent of a HOT chain that can largely live in UNDO. That
seems like a very useful thing on its own. A lot of the problems with
HOT are in this area -- maybe the vast majority, even.

A remaining problem is that we must generate a new round of index
tuples for each and every index when only one indexed column is
logically modified by an UPDATE statement. I think that this is much
less of a problem now due to bottom-up index deletion. Sure, it sucks
that we still have to dirty the page at all. But it's nevertheless
true that it all but eliminates version-driven page splits, which are
where almost all of the remaining downside is. It's very reasonable to
now wonder if this particular all-indexes problem is worth solving at
all in light of that. (Modern hardware characteristics also make a
comprehensive fix less valuable in practice.)

> However, there's no easy and
> efficient way to handle updates and deletes. Suppose for example that
> a tuple has been updated 5 times, creating versions t1..t5. t5 is now
> in the zheap page, and the other versions are in the undo. t5 points
> to t4 which points to t3 and so forth. Now an updater comes along and
> let's say that the updater's snapshot sees t2. It may be that t3..t5
> are *uncommitted* updates in which case the attempt to update t2 may
> succeed if the transactions that performed them abort, or it may be
> that the updating transactions have committed, in which case we're
> going to have to fail. But that decision isn't made by the scan that
> sees t2; it happens when the TID reaches the ModifyTable node. So what
> zheap ends up doing is finding the right tuple version during the
> scan, by making use of the snapshot, and then having to go repeat that
> work when it's time to try to perform the update. It would be nice to
> avoid this.

I believe that this is another consequence of the fact that Postgres
versions tuples, not pages. This is not a minor theoretical point.
It's very different to what Oracle does. It's almost a necessary
consequence of our basic approach to extensibility, because you can
have things like index tuples whose values are equal but visibly
distinct (e.g., the numeric value '5.0' is equal to but distinct from
'5'). It also has a lot to do with how crash recovery works.

> If we could feed system columns from the scan through to
> the update, we could pass along an undo pointer and avoid the extra
> overhead. So it seems to me, so far anyway, that there's no very
> fundamental problem here, but there is an efficiency issue which we
> could address if we had a bit more planner and executor infrastructure
> to help us out.

FWIW you don't necessarily have to do the EPQ stuff. You could in
theory do a statement-level rollback, and repeat. The EPQ mechanism is
unique to Postgres. Maybe it doesn't matter, but I don't think that
it's essential to follow this in other table AMs.

> Now in the long run the vision for zheap was that we'd eventually want
> to do in-place updates even when indexed columns have been modified,
> and this gets a whole lot trickier, because now there can be multiple
> sets of index entries pointing at the same TID which don't agree on
> the values of the indexed columns.

It's much easier when you have a very simple type system that doesn't
allow differences like my "numeric '5.0' vs '5'" example -- a system
that is built for this from the ground up. If there are meaningful
semantic differences among opclass-equal index tuples, then we can
never assume that index tuples will always be locatable after an
update affecting indexed columns (if only because we need to preserve
the '5.0' and '5' variants in an index on a numeric column).

If we could at least be sure that two index tuples that point to the
same stable/logical zheap TID (in a world where TIDs were stable

Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 2:13 PM Jeff Davis  wrote:
> FWIW, this is not a problem in my table AM. I am fine having different
> TIDs for each version, just like heapam.

This means that we are largely in agreement about the general nature
of the problem. That seems like a good basis to redefine TID-like
identifiers so that they can accommodate what you want to do.

> For index-organized tables it does seem like an interesting problem.

I strongly suspect that index-organized tables (or indirect indexes,
or anything else that assumes that TID-like identifiers map directly
to logical rows as opposed to physical versions) are going to break
too many assumptions to ever be tractable. Assuming I have that right,
it would advance the discussion if we could all agree on that being a
non-goal for the tableam interface in general. This would allow us to
clearly discuss how to solve the remaining problem of accommodating
column stores and suchlike. That seems hard, but much more tractable.

The fact that the tableam has almost no non-goals has always bothered
me a bit. Especially on this particular point about purely logical
TID-like identifiers.

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis
On Fri, 2021-04-30 at 12:29 -0700, Peter Geoghegan wrote:
> > Is the problem you're worried about here that, with something like
> > an
> > index-organized table, you can have multiple row versions that have
> > the same logical tuple ID, i.e. primary key value? 
> 
> That's what I'm talking about. I'd like to hear what you think about
> it.

FWIW, this is not a problem in my table AM. I am fine having different
TIDs for each version, just like heapam.

For index-organized tables it does seem like an interesting problem.

Regards,
Jeff Davis






Re: pg_amcheck contrib application

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 4:26 PM Mark Dilger
 wrote:
> After further reflection, no other verbiage seems any better.  I'd say go 
> ahead and commit it this way.

OK. I'll plan to do that on Monday, barring objections.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 3:30 PM Peter Geoghegan  wrote:
> > Is the problem you're worried about here that, with something like an
> > index-organized table, you can have multiple row versions that have
> > the same logical tuple ID, i.e. primary key value? And that the
> > interfaces aren't well-suited to that? Because that's a problem I have
> > thought about and can comment on, even though I think the question of
> > having multiple versions with the same TID is distinguishable from the
> > question of how *wide* TIDs should be. But maybe that's not what you
> > are talking about here, in which case I guess I need a clearer
> > explanation of the concern.
>
> That's what I'm talking about. I'd like to hear what you think about it.

OK. I thought about this in regards to zheap, which has this exact
problem, because it wants to do so-called "in place" updates where the
new version of the row goes right on top of the old one in the table
page, and the old version of the row gets written into the undo log.
Just to keep things simple, we said that initially we'd only use this
in-place update strategy when no indexed columns were changed, so that
there's only ever one set of index entries for a given TID. In that
model, the index AMs don't really need to care that there are actually
multiple tuples for the same TID, because those tuples differ only in
columns that the index doesn't care about anyway. An index scan has to
be careful to fetch the correct version of the tuple, but it has a
Snapshot available, so it can do that. However, there's no easy and
efficient way to handle updates and deletes. Suppose for example that
a tuple has been updated 5 times, creating versions t1..t5. t5 is now
in the zheap page, and the other versions are in the undo. t5 points
to t4 which points to t3 and so forth. Now an updater comes along and
let's say that the updater's snapshot sees t2. It may be that t3..t5
are *uncommitted* updates in which case the attempt to update t2 may
succeed if the transactions that performed them abort, or it may be
that the updating transactions have committed, in which case we're
going to have to fail. But that decision isn't made by the scan that
sees t2; it happens when the TID reaches the ModifyTable node. So what
zheap ends up doing is finding the right tuple version during the
scan, by making use of the snapshot, and then having to go repeat that
work when it's time to try to perform the update. It would be nice to
avoid this. If we could feed system columns from the scan through to
the update, we could pass along an undo pointer and avoid the extra
overhead. So it seems to me, so far anyway, that there's no very
fundamental problem here, but there is an efficiency issue which we
could address if we had a bit more planner and executor infrastructure
to help us out.

Now in the long run the vision for zheap was that we'd eventually want
to do in-place updates even when indexed columns have been modified,
and this gets a whole lot trickier, because now there can be multiple
sets of index entries pointing at the same TID which don't agree on
the values of the indexed columns. As old row versions die off, some
of those pointers need to be cleaned out, and others do not. I thought
we might solve this problem by something akin to retail index
deletion: have an update or delete on a zheap tuple go re-find the
associated index entries and mark them for possible cleanup, and then
vacuum can ignore all unmarked tuples. There might be some efficiency
problems with this idea I hadn't considered, based on your remarks
today. But regardless of the wisdom or folly of this approach, the
broader point is that we can't assume that all heap types are going to
have the same maintenance requirements. I think most of them are going
to have some kind of maintenance operation that need to or at least
can optionally be performed from time to time, but it might be
triggered by completely different criteria than vacuum. New table AMs
might well choose to use 64-bit XIDs, avoiding the need for wraparound
processing altogether. Maybe they have such good opportunistic cleanup
mechanisms that periodic vacuum for bloat isn't even really needed.
Maybe they bloat when updates and deletes commit but not when inserts
and updates abort, because those cases are handled via some other
mechanism. Who knows, really? It's hard to predict what
not-yet-written AMs might care about, and even if we knew, it seems
crazy to try to rewrite the way vacuum works to cater to those needs
before we actually have some working AMs to use as a testbed.

It strikes me that certain interesting cases might not really need
anything very in-depth here. For example, consider indirect indexes,
where the index references the primary key value rather than the TID.
Well, the indirect index should probably be vacuumed periodically to
prevent bloat, but it doesn't need to be vacuumed to recycle TIDs
because it doesn't contain TIDs. BRIN indexes, BTW, also 

Re: pg_amcheck contrib application

2021-04-30 Thread Mark Dilger



> On Apr 30, 2021, at 1:04 PM, Mark Dilger  wrote:
> 
>> toast value %u was expected to end at chunk %d, but ended while
>> expecting chunk %d
>> 
>> i.e. same as the currently-committed code, except for changing "ended
>> at" to "ended while expecting."
> 
> I find the grammar of this new formulation anomalous, for hard-to-articulate
> reasons not quite the same as, but akin to, mismatched verb aspect.

After further reflection, no other verbiage seems any better.  I'd say go ahead 
and commit it this way.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_amcheck contrib application

2021-04-30 Thread Mark Dilger



> On Apr 30, 2021, at 12:47 PM, Robert Haas  wrote:
> 
> Hmm, I think that might need adjustment, actually. What I was trying
> to do is compensate for the fact that what we now have is the next
> chunk_seq value we expect, not the last one we saw, nor the total
> number of chunks we've seen regardless of what chunk_seq they had. But
> I thought it would be too confusing to just give the chunk number we
> were expecting and not say anything about how many chunks we thought
> there would be in total. So maybe what I should do is change it to
> something like this:
> 
> toast value %u was expected to end at chunk %d, but ended while
> expecting chunk %d
> 
> i.e. same as the currently-committed code, except for changing "ended
> at" to "ended while expecting."

I find the grammar of this new formulation anomalous, for hard-to-articulate
reasons not quite the same as, but akin to, mismatched verb aspect.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_amcheck contrib application

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 3:41 PM Mark Dilger
 wrote:
> I think that's committable.
>
> The only nitpick might be
>
> -   psprintf("toast value %u was expected to end 
> at chunk %d, but ended at chunk %d",
> +   psprintf("toast value %u index scan ended 
> early while expecting chunk %d of %d",
>
> When reporting to users about positions within a zero-based indexing scheme, 
> what does "while expecting chunk 3 of 4" mean?  Is it talking about the last 
> chunk from the set [0..3] which has cardinality 4, or does it mean the 
> next-to-last chunk from [0..4] which ends with chunk 4, or what?  The prior 
> language isn't any more clear than what you have here, so I have no objection 
> to committing this, but the prior language was probably as goofy as it was 
> because it was trying to deal with this issue.

Hmm, I think that might need adjustment, actually. What I was trying
to do is compensate for the fact that what we now have is the next
chunk_seq value we expect, not the last one we saw, nor the total
number of chunks we've seen regardless of what chunk_seq they had. But
I thought it would be too confusing to just give the chunk number we
were expecting and not say anything about how many chunks we thought
there would be in total. So maybe what I should do is change it to
something like this:

toast value %u was expected to end at chunk %d, but ended while
expecting chunk %d

i.e. same as the currently-committed code, except for changing "ended
at" to "ended while expecting."

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_amcheck contrib application

2021-04-30 Thread Mark Dilger



> On Apr 30, 2021, at 12:29 PM, Robert Haas  wrote:
> 
> OK, how about this version?

I think that's committable.

The only nitpick might be

-   psprintf("toast value %u was expected to end at 
chunk %d, but ended at chunk %d",
+   psprintf("toast value %u index scan ended early 
while expecting chunk %d of %d",

When reporting to users about positions within a zero-based indexing scheme, 
what does "while expecting chunk 3 of 4" mean?  Is it talking about the last 
chunk from the set [0..3] which has cardinality 4, or does it mean the 
next-to-last chunk from [0..4] which ends with chunk 4, or what?  The prior 
language isn't any more clear than what you have here, so I have no objection 
to committing this, but the prior language was probably as goofy as it was 
because it was trying to deal with this issue.

Thoughts?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 12:20 PM Robert Haas  wrote:
> Why can't it do what it does already? It's not broken for heap, so why
> should it be broken for anything else? And why are non-HOT updates
> specifically a problem?

No reason.

> > You obviously cannot have the equivalent of
> > duplicate TIDs when your new table AM runs into these scenarios. So
> > what do you do instead? How do you make your clustered index/IoT style
> > identifiers (i.e. your strictly logical TID-like identifiers) deal
> > with that case?
>
> Is the problem you're worried about here that, with something like an
> index-organized table, you can have multiple row versions that have
> the same logical tuple ID, i.e. primary key value? And that the
> interfaces aren't well-suited to that? Because that's a problem I have
> thought about and can comment on, even though I think the question of
> having multiple versions with the same TID is distinguishable from the
> question of how *wide* TIDs should be. But maybe that's not what you
> are talking about here, in which case I guess I need a clearer
> explanation of the concern.

That's what I'm talking about. I'd like to hear what you think about it.

It's not exactly a narrow concern. For one thing, it is enough to
totally validate my suggestion about how we might widen TIDs and still
have nbtree deduplication.

-- 
Peter Geoghegan




Re: pg_amcheck contrib application

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 3:26 PM Mark Dilger
 wrote:
> It looks mostly good to me.  There is an off-by-one error introduced with:
>
> -   else if (chunkno != (endchunk + 1))
> +   else if (expected_chunk_seq < last_chunk_seq)
>
> I think that needs to be
>
> +  else if (expected_chunk_seq <= last_chunk_seq)
>
> because otherwise it won't complain if the only missing chunk is the very 
> last one.

OK, how about this version?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


simply-remove-chunkno-concept-v4.patch
Description: Binary data


Re: pg_amcheck contrib application

2021-04-30 Thread Mark Dilger



> On Apr 30, 2021, at 11:56 AM, Robert Haas  wrote:
> 
> Can you review this version?

It looks mostly good to me.  There is an off-by-one error introduced with:

-   else if (chunkno != (endchunk + 1))
+   else if (expected_chunk_seq < last_chunk_seq)

I think that needs to be

+  else if (expected_chunk_seq <= last_chunk_seq)

because otherwise it won't complain if the only missing chunk is the very last 
one.
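
A standalone sketch of the boundary case (not pg_amcheck code): with
chunks 0..3 expected and only chunk 3 missing, the scan ends with
expected_chunk_seq equal to last_chunk_seq, so "<" stays silent while
"<=" correctly complains:

#include <stdio.h>

int
main(void)
{
    int last_chunk_seq = 3;         /* chunks 0..3 expected */
    int expected_chunk_seq = 0;

    /* The scan sees chunks 0, 1 and 2; chunk 3 is missing. */
    for (int seq = 0; seq <= 2; seq++)
        expected_chunk_seq = seq + 1;   /* next chunk we expect to see */

    printf("'<'  complains: %s\n",
           expected_chunk_seq < last_chunk_seq ? "yes" : "no (bug)");
    printf("'<=' complains: %s\n",
           expected_chunk_seq <= last_chunk_seq ? "yes" : "no");
    return 0;
}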

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_amcheck contrib application

2021-04-30 Thread Mark Dilger


> On Apr 30, 2021, at 11:56 AM, Robert Haas  wrote:
> 
> On Fri, Apr 30, 2021 at 2:31 PM Mark Dilger
>  wrote:
>> Just to be clear, I did not use your patch v1 as the starting point.
> 
> I thought that might be the case, but I was trying to understand what
> you didn't like about my version, and comparing them seemed like a way
> to figure that out.
> 
>> I took the code as committed to master as the starting point, used your 
>> corruption report verbiage changes and at least some of your variable naming 
>> choices, but did not use the rest, in large part because it didn't work.  It 
>> caused corruption messages to be reported against tables that have no 
>> corruption.  For that matter, your v2 patch doesn't work either, and in the 
>> same way.  To wit:
>> 
>>  heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, 
>> attribute 7:
>> toast value 13461 chunk 0 has size 1995, but expected size 1996
>> 
>> I think there is something wrong with the way you are trying to calculate 
>> and use extsize, because I'm not corrupting pg_catalog.pg_rewrite.  You can 
>> get these same results by applying your patch to master, building, and 
>> running 'make check' from src/bin/pg_amcheck/
> 
> Argh, OK, I didn't realize. Should be fixed in this version.
> 
>>> 4. You don't return if chunk_seq > last_chunk_seq. That seems wrong,
>>> because we cannot compute a sensible expected size in that case. I
>>> think your code will subtract a larger value from a smaller one and,
>>> this being unsigned arithmetic, say that the expected chunk size is
>>> something gigantic.
>> 
>> Your conclusion is probably right, but I think your analysis is based on a 
>> misreading of what "last_chunk_seq" means.  It's not the last one seen, but 
>> the last one expected.  (Should we rename the variable to avoid confusion?)  
>> It won't compute a gigantic size.  Rather, it will expect *every* chunk with 
>> chunk_seq >= last_chunk_seq to have whatever size is appropriate for the 
>> last chunk.
> 
> I realize it's the last one expected. That's the point: we don't have
> any expectation for the sizes of chunks higher than the last one we
> expected to see. If the value is 2000 bytes and the chunk size is 1996
> bytes, we expect chunk 0 to be 1996 bytes and chunk 1 to be 4 bytes.
> If not, we can complain. But it makes no sense to complain about chunk
> 2 being of a size we don't expect. We don't expect it to exist in the
> first place, so we have no notion of what size it ought to be.
> 
>> If we have seen any chunks, the variable is holding the expected next chunk 
>> seq, which is one greater than the last chunk seq we saw.
>> 
>> If we expect chunks 0..3 and see chunk 0 but not chunk 1, it will complain 
>> ..."expected to end at chunk 4, but ended at chunk 1".  This is clearly by 
>> design and not merely a bug, though I tend to agree with you that this is a 
>> strange wording choice.  I can't remember exactly when and how we decided to 
>> word the message this way, but it has annoyed me for a while, and I assumed 
>> it was something you suggested a while back, because I don't recall doing 
>> it.  Either way, since you seem to also be bothered by this, I agree we 
>> should change it.
> 
> Can you review this version?
> 
> -- 
> Robert Haas
> EDB: http://www.enterprisedb.com
> 

As requested off-list, here are NOT FOR COMMIT, WIP patches for testing only.

The first patch allows toast tables to be updated and adds regression tests of 
corrupted toasted attributes.  I never quite got deletes from toast tables to 
work, and there are probably other gotchas still lurking even with inserts and 
updates, but it limps along well enough for testing pg_amcheck.

The second patch updates the expected output of pg_amcheck to match the 
verbiage that you suggested upthread.



v1-0001-Adding-modify-toast-and-test-pg_amcheck.patch.WIP
Description: Binary data


v1-0002-Modifying-toast-corruption-test-expected-output.patch.WIP
Description: Binary data


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 2:23 PM Peter Geoghegan  wrote:
> I don't know how it's possible to do any of this without first
> addressing what the table AM does in cases where heapam currently does
> a non-HOT update.

Why can't it do what it does already? It's not broken for heap, so why
should it be broken for anything else? And why are non-HOT updates
specifically a problem?

> You obviously cannot have the equivalent of
> duplicate TIDs when your new table AM runs into these scenarios. So
> what do you do instead? How do you make your clustered index/IoT style
> identifiers (i.e. your strictly logical TID-like identifiers) deal
> with that case?

Is the problem you're worried about here that, with something like an
index-organized table, you can have multiple row versions that have
the same logical tuple ID, i.e. primary key value? And that the
interfaces aren't well-suited to that? Because that's a problem I have
thought about and can comment on, even though I think the question of
having multiple versions with the same TID is distinguishable from the
question of how *wide* TIDs should be. But maybe that's not what you
are talking about here, in which case I guess I need a clearer
explanation of the concern.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_amcheck contrib application

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 2:31 PM Mark Dilger
 wrote:
> Just to be clear, I did not use your patch v1 as the starting point.

I thought that might be the case, but I was trying to understand what
you didn't like about my version, and comparing them seemed like a way
to figure that out.

> I took the code as committed to master as the starting point, used your 
> corruption report verbiage changes and at least some of your variable naming 
> choices, but did not use the rest, in large part because it didn't work.  It 
> caused corruption messages to be reported against tables that have no 
> corruption.  For that matter, your v2 patch doesn't work either, and in the 
> same way.  To wit:
>
>   heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, 
> attribute 7:
>  toast value 13461 chunk 0 has size 1995, but expected size 1996
>
> I think there is something wrong with the way you are trying to calculate and 
> use extsize, because I'm not corrupting pg_catalog.pg_rewrite.  You can get 
> these same results by applying your patch to master, building, and running 
> 'make check' from src/bin/pg_amcheck/

Argh, OK, I didn't realize. Should be fixed in this version.

> > 4. You don't return if chunk_seq > last_chunk_seq. That seems wrong,
> > because we cannot compute a sensible expected size in that case. I
> > think your code will subtract a larger value from a smaller one and,
> > this being unsigned arithmetic, say that the expected chunk size is
> > something gigantic.
>
> Your conclusion is probably right, but I think your analysis is based on a 
> misreading of what "last_chunk_seq" means.  It's not the last one seen, but 
> the last one expected.  (Should we rename the variable to avoid confusion?)  
> It won't compute a gigantic size.  Rather, it will expect *every* chunk with 
> chunk_seq >= last_chunk_seq to have whatever size is appropriate for the last 
> chunk.

I realize it's the last one expected. That's the point: we don't have
any expectation for the sizes of chunks higher than the last one we
expected to see. If the value is 2000 bytes and the chunk size is 1996
bytes, we expect chunk 0 to be 1996 bytes and chunk 1 to be 4 bytes.
If not, we can complain. But it makes no sense to complain about chunk
2 being of a size we don't expect. We don't expect it to exist in the
first place, so we have no notion of what size it ought to be.
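
The expected-size arithmetic from that example, as a standalone sketch
(illustrative only, not pg_amcheck code):

#include <stdio.h>

int
main(void)
{
    int extsize = 2000;         /* total size of the toasted value */
    int chunk_size = 1996;      /* stand-in for TOAST_MAX_CHUNK_SIZE */
    int last_chunk_seq = (extsize - 1) / chunk_size;    /* == 1 */

    for (int seq = 0; seq <= last_chunk_seq; seq++)
    {
        int expected = (seq < last_chunk_seq)
            ? chunk_size
            : extsize - seq * chunk_size;

        printf("chunk %d: expected %d bytes\n", seq, expected);
    }
    /* Chunks beyond last_chunk_seq have no expected size at all. */
    return 0;
}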

> If we have seen any chunks, the variable is holding the expected next chunk 
> seq, which is one greater than the last chunk seq we saw.
>
> If we expect chunks 0..3 and see chunk 0 but not chunk 1, it will complain 
> ..."expected to end at chunk 4, but ended at chunk 1".  This is clearly by 
> design and not merely a bug, though I tend to agree with you that this is a 
> strange wording choice.  I can't remember exactly when and how we decided to 
> word the message this way, but it has annoyed me for a while, and I assumed 
> it was something you suggested a while back, because I don't recall doing it. 
>  Either way, since you seem to also be bothered by this, I agree we should 
> change it.

Can you review this version?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


simply-remove-chunkno-concept-v3.patch
Description: Binary data


Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 11:23 AM Robert Haas  wrote:
> On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan  wrote:
> > I agree in principle, but making that work well is very hard in
> > practice because of the format of IndexTuple -- which bleeds into
> > everything. That TID is special is probably a natural consequence of
> > the fact that we don't have an offset-based format of the kind you see
> > in other DB systems -- systems that don't emphasize extensibility. We
> > cannot jump to a hypothetical TID attribute inexpensively inside code
> > like _bt_compare() because we don't have a cheap way to jump straight
> > to the datum for any attribute. So we just store TID in IndexTuple
> > directly instead. Imagine how much more expensive VACUUM would be if
> > it had to grovel through the IndexTuple format.
>
> I can't imagine that, so maybe you want to enlighten me? I see that
> there's a potential problem there, and I'm glad you pointed it out
> because I hadn't thought about it previously ... but if you always put
> the column or columns that VACUUM would need first, it's not obvious
> to me that it would be all that expensive.

Maybe. The point is that it is a problem that needs to be solved.

> Deforming the tuple to a
> sufficient degree to extract the first column, which would even be
> fixed-width, shouldn't take much work.

I think that it's reasonable to impose some cost on index AMs here,
but that needs to be bounded sensibly and unambiguously. For example,
it would probably be okay if you had either 6 byte or 8 byte TIDs, but
no other variations. You could require index AMs (the subset of index
AMs that are ever able to store 8 byte TIDs) to directly encode which
width they're dealing with at the level of each IndexTuple. That would
create some problems for nbtree deduplication, especially in boundary
cases, but ISTM that you can manage the complexity by sensibly
restricting how the TIDs work across the board. For example, the TIDs
should always work like unsigned integers -- the table AM must be
willing to work with that restriction.

You'd then have posting lists tuples in nbtree whose TIDs were all
either 6 bytes or 8 bytes wide, with a mix of each possible (though
not particularly likely) on the same leaf page. Say when you have a
table that exceeds the current MaxBlockNumber restrictions. It would
be relatively straightforward for nbtree deduplication to simply
refuse to mix 6 byte and 8 byte datums together to avoid complexity in
boundary cases. The deduplication pass logic has the flexibility that
this requires already.

> > I wonder how the same useful performance characteristics can be
> > maintained with a variable-width TID design. If you solve the problem
> > by changing IndexTuple, then you are kind of obligated to not use
> > varlena headers to keep the on-disk size manageable. Who knows where
> > it all ends?
>
> What's wrong with varlena headers? It would end up being a 1-byte
> header in practically every case, and no variable-width representation
> can do without a length word of some sort. I'm not saying varlena is
> as efficient as some new design could hypothetically be, but it
> doesn't seem like it'd be a big enough problem to stress about. If you
> used a variable-width representation for integers, you might actually
> save bytes in a lot of cases. An awful lot of the TIDs people store in
> practice probably contain several zero bytes, and if we make them
> wider, that's going to be even more true.

Maybe all of this is true, and maybe it works out to be the best path
forward in the long term, all things considered. But whether or not
that's true is crucially dependent on what real practical table AMs
(of which there will only ever be a tiny number) actually need to do.
Why should we assume that the table AM cannot accept some
restrictions? What good does it do to legalistically define the
problem as a problem for index AMs to solve?

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis
On Fri, 2021-04-30 at 13:56 -0400, Robert Haas wrote:
> I think that would be the best long-term plan.

We should still have *some* answer in the short term for table AM
authors like me. If I use offset numbers as high as MaxOffsetNumber,
then itemptr_to_uint64 will fail. If I base my calculations for the TID
to row number mapping on MaxOffsetNumber at all, then it will break if
we change MaxOffsetNumber (as was suggested[1]).

My takeaway so far is that the only safe thing to do is hard code it to
2000. I suppose I can do that until we settle on something better (at
which point I can force a reindex, I suppose).

[1] 
https://postgr.es/m/caeze2wit1ethhbhj+cyvbpthrwuzu2vqc-bmzu3apk3ioth...@mail.gmail.com
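
To illustrate what hard-coding it looks like, here is a minimal sketch;
TIDS_PER_BLOCK and both helper names are made up for illustration, not
existing API:

#define TIDS_PER_BLOCK  2000    /* hard-coded, per the above */

static inline void
rownum_to_tid(uint64 rownum, ItemPointer tid)
{
    /* caller must keep rownum below TIDS_PER_BLOCK * 2^32 */
    ItemPointerSet(tid,
                   (BlockNumber) (rownum / TIDS_PER_BLOCK),
                   (OffsetNumber) ((rownum % TIDS_PER_BLOCK) + 1));
}

static inline uint64
tid_to_rownum(ItemPointer tid)
{
    return (uint64) ItemPointerGetBlockNumber(tid) * TIDS_PER_BLOCK +
        (ItemPointerGetOffsetNumber(tid) - 1);
}

Offsets stay in 1..2000 no matter what MaxOffsetNumber turns out to
mean, at the cost of wasting most of each block's offset space.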

> Even though these are distinguishable concerns, they basically point
> in the same direction as far as index layout is concerned. The
> implications for the table AM layer are somewhat different in the two
> cases, but both argue that some places that are now talking about
> TIDs
> should be changed to talk about Datums or something of that sort.

Logically, that makes a lot of sense to me. Peter seems to have quite a
few practical implementation concerns though, so it could be a long
road.

Regards,
Jeff Davis






Re: pg_upgrade test for binary compatibility of core data types

2021-04-30 Thread Justin Pryzby
On Sat, Mar 06, 2021 at 03:01:43PM -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
> > On 2021-01-12 22:44, Andrew Dunstan wrote:
> >> Cross version pg_upgrade is tested regularly in the buildfarm, but not
> >> using test.sh. Instead it uses the saved data repository from a previous
> >> run of the buildfarm client for the source branch, and tries to upgrade
> >> that to the target branch.
> 
> > Does it maintain a set of fixups similar to what is in test.sh?  Are 
> > those two sets the same?
> 
> Responding to Peter: the first answer is yes, the second is I didn't
> check, but certainly Justin's patch makes them closer.

Right - I had meant to send this.

https://github.com/PGBuildFarm/client-code/blob/master/PGBuild/Modules/TestUpgradeXversion.pm

$opsql = 'drop operator if exists public.=> (bigint, NONE)';
..
my $missing_funcs = q{drop function if exists public.boxarea(box);
  drop function if exists public.funny_dup17();
..
my $prstmt = join(';',
'drop operator if exists #@# (bigint,NONE)',
'drop operator if exists #%# (bigint,NONE)',
'drop operator if exists !=- (bigint,NONE)',
..
$prstmt = join(';',
'drop operator @#@ (NONE, bigint)',
..
'drop aggregate if exists public.array_cat_accum(anyarray)',

> I spent some time poking through this set of patches.  I agree that
> there's problem(s) here that we need to solve, but it feels like this
> isn't a great way to solve them.  What I see in the patchset is:

For starters, is there a "release beta checklist"?
Testing test.sh should be on it.
So should fuzz testing.

> v4-0001 mostly teaches test.sh about specific changes that have to be
> made to historic versions of the regression database to allow them
> to be reloaded into current servers.  As already discussed, this is
> really duplicative of knowledge that's been embedded into the buildfarm
> client over time.  It'd be better if we could refactor that so that
> the buildfarm shares a common database of these actions with test.sh.
> And said database ought to be in our git tree, so committers could
> fix problems without having to get Andrew involved every time.
> I think this could be represented as a psql script, at least in
> versions that have psql \if (but that came in in v10, so maybe
> we're there already).

I started this.  I don't know if it's compatible with the buildfarm client, but
I think any issues can probably be avoided by using "IF EXISTS".

> v4-0002 is a bunch of random changes that mostly seem to revert hacky
> adjustments previously made to improve test coverage.  I don't really
> agree with any of these, nor see why they're necessary.  If they
> are necessary then we need to restore the coverage somewhere else.
> Admittedly, the previous changes were a bit hacky, but deleting them
> (without even bothering to adjust the relevant comments) isn't the
> answer.

It was necessary to avoid --wal-segsize and -g to allow testing upgrades from
versions which don't support those options.  I think test.sh should be portable
back to all supported versions.

When those options were added, it broke test.sh upgrading from old versions.
I changed this to a shell conditional for the "new" features:
| "$1" -N -A trust ${oldsrc:+--wal-segsize 1 -g}
Ideally it would check the version.

> v4-0003 is really the heart of the matter: it adds a table with some
> previously-not-covered datatypes plus a query that purports to make sure
> that we are covering all types of interest.

Actually the 'manytypes' table intends to include *all* core datatypes itself,
not just those that aren't included somewhere else.  I think "included
somewhere else" depends on the order of the regression tests, and type_sanity
runs early, so the table might need to include many types that are created
later, to avoid "false positives" in the associated test.

> But I'm not sure I believe
> that query.  It's got hard-wired assumptions about which typtype values
> need to be covered.  Why is it okay to exclude range and multirange?
> Are we sure that all composites are okay to exclude?  Likewise, the
> restriction to pg_catalog and information_schema schemas seems likely to
> bite us someday.  There are some very random exclusions based on name
> patterns, which seem unsafe (let's list the specific type OIDs), and
> again the nearby comments don't match the code.  But the biggest issue
> is that this can only cover core datatypes, not any contrib stuff.

I changed to use regtype/OIDs, included range/multirange and stopped including
only pg_catalog/information_schema.  But didn't yet handle composites.

> I don't know what we could do about contrib types.  Maybe we should
> figure that covering core types is already a step forward, and be
> happy with getting that done.

Right .. this is meant 

Re: pg_amcheck contrib application

2021-04-30 Thread Mark Dilger



> On Apr 30, 2021, at 9:39 AM, Robert Haas  wrote:
> 
> On Mon, Apr 26, 2021 at 1:52 PM Mark Dilger
>  wrote:
>> The attached patch changes amcheck corruption reports as discussed upthread. 
>>  This patch is submitted for the v14 development cycle as a bug fix, per 
>> your complaint that the committed code generates reports sufficiently 
>> confusing to a user as to constitute a bug.
>> 
>> All other code refactoring and additional checks discussed upthread are 
>> reserved for the v15 development cycle and are not included here.
>> 
>> The minimal patch (not attached) that does not rename any variables is 135 
>> lines.  Your patch was 159 lines.  The patch (attached) which includes your 
>> variable renaming is 174 lines.
> 
> Hi,
> 
> I have compared this against my version. I found the following differences:

Just to be clear, I did not use your patch v1 as the starting point.  I took 
the code as committed to master as the starting point, used your corruption 
report verbiage changes and at least some of your variable naming choices, but 
did not use the rest, in large part because it didn't work.  It caused 
corruption messages to be reported against tables that have no corruption.  For 
that matter, your v2 patch doesn't work either, and in the same way.  To wit:

  heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, attribute 
7:
 toast value 13461 chunk 0 has size 1995, but expected size 1996

I think there is something wrong with the way you are trying to calculate and 
use extsize, because I'm not corrupting pg_catalog.pg_rewrite.  You can get 
these same results by applying your patch to master, building, and running 
'make check' from src/bin/pg_amcheck/


> 1. This version passes last_chunk_seq rather than extsize to
> check_toast_tuple(). But this results in having to call
> VARATT_EXTERNAL_GET_EXTSIZE() inside that function. I thought it was
> nicer to do that in the caller, so that we don't do it twice.

I don't see that VARATT_EXTERNAL_GET_EXTSIZE() is worth too much concern, given 
that it is just a struct access and a bit mask.  You are avoiding calculating that 
twice, but at the expense of calculating last_chunk_seq twice, which involves 
division.  I don't think the division can be optimized as a mere bit shift, 
since TOAST_MAX_CHUNK_SIZE is not in general a power of two.  (For example, on 
my laptop it is 1996.)
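
For concreteness, the computation under discussion amounts to roughly
this (a sketch using the names from this thread, assuming extsize > 0;
not a verbatim excerpt from either patch):

    int32       extsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
    int32       last_chunk_seq = (extsize - 1) / TOAST_MAX_CHUNK_SIZE;
    int32       expected_size;

    /* every chunk before the last is full-sized; the last gets the rest */
    if (chunk_seq < last_chunk_seq)
        expected_size = TOAST_MAX_CHUNK_SIZE;
    else
        expected_size = extsize - last_chunk_seq * TOAST_MAX_CHUNK_SIZE;

With extsize = 2000 and TOAST_MAX_CHUNK_SIZE = 1996, that expects 1996
bytes for chunk 0 and 4 bytes for chunk 1.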

I don't say this to nitpick at the performance one way vs. the other.  I doubt 
it makes any real difference.  I'm just confused why you want to change this 
particular thing right now, given that it is not a bug.

> 2. You fixed some out-of-date comments.

Yes, because they were wrong.  That's on me.  I failed to update them in a 
prior patch.

> 3. You move the test for an unexpected chunk sequence further down in
> the function. I don't see the point;

Relative to your patch, perhaps.  Relative to master, no tests have been moved.

> I had put it by the related null
> check, and still think that's better. You also deleted my comment /*
> Either the TOAST index is corrupt, or we don't have all chunks. */
> which I would have preferred to keep.

That's fine.  I didn't mean to remove it.  I was just taking a minimalist 
approach to constructing the patch.

> 4. You don't return if chunk_seq > last_chunk_seq. That seems wrong,
> because we cannot compute a sensible expected size in that case. I
> think your code will subtract a larger value from a smaller one and,
> this being unsigned arithmetic, say that the expected chunk size is
> something gigantic.

Your conclusion is probably right, but I think your analysis is based on a 
misreading of what "last_chunk_seq" means.  It's not the last one seen, but the 
last one expected.  (Should we rename the variable to avoid confusion?)  It 
won't compute a gigantic size.  Rather, it will expect *every* chunk with 
chunk_seq >= last_chunk_seq to have whatever size is appropriate for the last 
chunk. 

> Returning and not issuing that complaint at all
> seems better.

That might be best.  I had been resisting that because I don't want the 
extraneous chunks to be reported without chunk size information.  When 
debugging corrupted toast, it may be interesting to know the size of the 
extraneous chunks.  If there are 1000 extra chunks, somebody might want to see 
the sizes of them.

> 5. You fixed the incorrect formula I had introduced for the expected
> size of the last chunk.

Not really.  I just didn't introduce any change in that area.

> 6. You changed the variable name in check_toasted_attribute() from
> expected_chunkno to chunkno, and initialized it later in the function
> instead of at declaration time. I don't find this to be an
> improvement;

I think I just left the variable name and its initialization unchanged.

> including the word "expected" seems to me to be
> substantially clearer. But I think I should have gone with
> expected_chunk_seq for better consistency.

I agree that is a better name.


Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis
On Fri, 2021-04-30 at 10:50 -0700, Peter Geoghegan wrote:
> I don't know. This conversation is still too abstract for me to be
> able to take a firm position. ISTM that we tend to talk about the
> table AM in excessively abstract terms. It would be a lot easier if
> we
> had clear fixed goals for a small number of additional table AMs.

https://github.com/citusdata/citus/tree/master/src/backend/columnar

My colleagues and I have been working on a "columnar" table AM. It
doesn't currently support indexes, but it would be useful to support
them.

The basic idea is we have "stripes" of ~150000 tuples that are
rearranged and compressed, and stored in an smgr-controlled file that
goes through the buffer cache and uses generic WAL.

To support indexes, we could do our own lookups from a "row number" to
a particular offset where we can find and decompress the stripe that
holds that row number, and then scan forward in the stripe to find the
particular row. This will be terrible for random access, but [*waves
hands*] we will keep state and use a few optimizations so that this is
not terribly slow for clustered access.
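
A minimal sketch of that lookup, assuming stripe metadata kept sorted
by first row number (every name below is illustrative, not our actual
code):

typedef struct StripeEntry
{
    uint64      first_row;      /* row number of the stripe's first tuple */
    uint64      row_count;
    int64       file_offset;    /* where the compressed stripe starts */
} StripeEntry;

/*
 * Binary-search the stripe covering rownum; the caller decompresses the
 * stripe at file_offset and scans forward to the exact row.
 */
static StripeEntry *
find_stripe(StripeEntry *stripes, int nstripes, uint64 rownum)
{
    int         lo = 0;
    int         hi = nstripes - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (rownum < stripes[mid].first_row)
            hi = mid - 1;
        else if (rownum >= stripes[mid].first_row + stripes[mid].row_count)
            lo = mid + 1;
        else
            return &stripes[mid];
    }
    return NULL;                /* no stripe holds that row number */
}

The "keep state" part is mostly caching the last stripe found, along
with its decompressed contents, so clustered access doesn't repeat the
search and decompression for every fetch.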

Granted, TID lookup on columnar will be much slower than for a heap
(and we can use a CustomScan so that the costs reflect that). But it
will satisfy important use cases:

  1. Indexes defined on partition parent tables. Even if the index is
never used for queries, we don't want to throw an error when defining
the partitioned parent index.
  2. Unique indexes and exclusion constraints.
  3. Clustered index scans can still be reasonably fast.
  4. Could be used for UPDATE/DELETE as well.

> More generally, it seems like a good idea to try to make new table
> AMs
> reasonably close to heapam insofar as possible. The reality is that
> everything evolved around heapam, and that that's likely to matter in
> all kinds of ways that nobody fully understands just yet.

Agreed. I think of this as an evolving situation where we take steps
toward a better abstraction.

One (hopefully reasonable) step I'd like to take is a well-specified
TID.

Regards,
Jeff Davis






Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan  wrote:
> I agree in principle, but making that work well is very hard in
> practice because of the format of IndexTuple -- which bleeds into
> everything. That TID is special is probably a natural consequence of
> the fact that we don't have an offset-based format of the kind you see
> in other DB systems -- systems that don't emphasize extensibility. We
> cannot jump to a hypothetical TID attribute inexpensively inside code
> like _bt_compare() because we don't have a cheap way to jump straight
> to the datum for any attribute. So we just store TID in IndexTuple
> directly instead. Imagine how much more expensive VACUUM would be if
> it had to grovel through the IndexTuple format.

I can't imagine that, so maybe you want to enlighten me? I see that
there's a potential problem there, and I'm glad you pointed it out
because I hadn't thought about it previously ... but if you always put
the column or columns that VACUUM would need first, it's not obvious
to me that it would be all that expensive. Deforming the tuple to a
sufficient degree to extract the first column, which would even be
fixed-width, shouldn't take much work.

> I wonder how the same useful performance characteristics can be
> maintained with a variable-width TID design. If you solve the problem
> by changing IndexTuple, then you are kind of obligated to not use
> varlena headers to keep the on-disk size manageable. Who knows where
> it all ends?

What's wrong with varlena headers? It would end up being a 1-byte
header in practically every case, and no variable-width representation
can do without a length word of some sort. I'm not saying varlena is
as efficient as some new design could hypothetically be, but it
doesn't seem like it'd be a big enough problem to stress about. If you
used a variable-width representation for integers, you might actually
save bytes in a lot of cases. An awful lot of the TIDs people store in
practice probably contain several zero bytes, and if we make them
wider, that's going to be even more true.
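
To put a number on it, here is a sketch only, using the existing
short-varlena macros from postgres.h and assuming tid is an
ItemPointerData:

    /* 1-byte header + 6 payload bytes = 7 bytes on disk */
    char        buf[VARHDRSZ_SHORT + sizeof(ItemPointerData)];

    SET_VARSIZE_SHORT(buf, VARHDRSZ_SHORT + sizeof(ItemPointerData));
    memcpy(VARDATA_SHORT(buf), &tid, sizeof(ItemPointerData));

So a heap-style TID stored this way costs one extra byte over today's
fixed six.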

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 10:56 AM Robert Haas  wrote:
> I think that would be the best long-term plan. I guess I have two
> distinguishable concerns. One is that I want to be able to have
> indexes with a payload that's not just a 6-byte TID; e.g. adding a
> partition identifier to support global indexes, or replacing the
> 6-byte TID with a primary key reference to support indirect indexes.
> The other concern is to be able to have table AMs that use arbitrary
> methods to identify a tuple. For example, if somebody implemented an
> index-organized table, the "TID" would really be the primary key.
>
> Even though these are distinguishable concerns, they basically point
> in the same direction as far as index layout is concerned. The
> implications for the table AM layer are somewhat different in the two
> cases, but both argue that some places that are now talking about TIDs
> should be changed to talk about Datums or something of that sort.

I don't know how it's possible to do any of this without first
addressing what the table AM does in cases where heapam currently does
a non-HOT update. You obviously cannot have the equivalent of
duplicate TIDs when your new table AM runs into these scenarios. So
what do you do instead? How do you make your clustered index/IoT style
identifiers (i.e. your strictly logical TID-like identifiers) deal
with that case?

ISTM that this is by far the biggest issue with generalizing the table
AM interface for use by a table AM that isn't already very much like
heapam. I am always surprised to be the one that has to point it out
during these discussions. It's a huge issue.

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 10:39 AM Robert Haas  wrote:
> I agree up to a point but ... are you imagining that the TID continues
> to have its own special place in the page, while the partition
> identifier is stored more like a regular tuple column? Because it
> seems to me that it would be better to try to eliminate the
> special-storage case, just like we did for OIDs.

I agree in principle, but making that work well is very hard in
practice because of the format of IndexTuple -- which bleeds into
everything. That TID is special is probably a natural consequence of
the fact that we don't have an offset-based format of the kind you see
in other DB systems -- systems that don't emphasize extensibility. We
cannot jump to a hypothetical TID attribute inexpensively inside code
like _bt_compare() because we don't have a cheap way to jump straight
to the datum for any attribute. So we just store TID in IndexTuple
directly instead. Imagine how much more expensive VACUUM would be if
it had to grovel through the IndexTuple format.

I wonder how the same useful performance characteristics can be
maintained with a variable-width TID design. If you solve the problem
by changing IndexTuple, then you are kind of obligated to not use
varlena headers to keep the on-disk size manageable. Who knows where
it all ends?

-- 
Peter Geoghegan




Re: Procedures versus the "fastpath" API

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 12:57 PM Tom Lane  wrote:
> By my count, we have three votes for forbidding procedure calls via
> fastpath in all branches (me, Joe, Michael), and two for doing
> something laxer (Noah, Laurenz).  The former is surely the safer
> choice, so I'm going to go do that.

FWIW, I'm also for the stricter approach.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 1:37 PM Jeff Davis  wrote:
> The particular problem I have now is that Table AMs seem to support
> indexes just fine, but TIDs are under-specified so I don't know what I
> really have to work with. BlockNumber seems well-specified as
0..0xFFFFFFFE (inclusive), but I don't know what the valid range of
> OffsetNumber is for the purposes of a TableAM.

I agree that this is a problem.

> Part of changing to uint64 would be specifying the TIDs in a way that I
> could rely on in the future.

I mean, from my perspective, the problem here is that the abstraction
layer is leaky and things outside of the table AM layer know what heap
is doing under the hood, and rely on it. If we could refactor the
abstraction to be less leaky, it would be clearer what assumptions
table AM authors can make. If we can't, any specification doesn't seem
worth much.

> In the future we may support primary unique indexes at the table AM
> layer, which would get more interesting. I can see an argument for a
> TID being an arbitrary datum in that case, but I haven't really
> considered the design implications. Is this what you are suggesting?

I think that would be the best long-term plan. I guess I have two
distinguishable concerns. One is that I want to be able to have
indexes with a payload that's not just a 6-byte TID; e.g. adding a
partition identifier to support global indexes, or replacing the
6-byte TID with a primary key reference to support indirect indexes.
The other concern is to be able to have table AMs that use arbitrary
methods to identify a tuple. For example, if somebody implemented an
index-organized table, the "TID" would really be the primary key.

Even though these are distinguishable concerns, they basically point
in the same direction as far as index layout is concerned. The
implications for the table AM layer are somewhat different in the two
cases, but both argue that some places that are now talking about TIDs
should be changed to talk about Datums or something of that sort.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis
On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote:
> ISTM that would be up to the index AM.  We'd need some interlocks on
> which index AMs could be used with which table AMs in any case, I
> think.

I'm not sure why? It seems like we should be able to come up with
something that's generic enough.

> I think the hard part may really be in places like tidbitmap.c, which
> one would wish to be AM-independent, but right now it's quite
> specific
> to heap-style TIDs.  Maybe we can think of a way to parameterize it.

For my particular AM, being able to have a parameterized granularity
might be nice, but not required.

Regards,
Jeff Davis






Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 10:04 AM Jeff Davis  wrote:
> On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote:
> > Compatibility with index AMs is more than a matter of switching out
> > the tuple identifier -- if you invent something that has totally
> > different performance characteristics for index AMs, then it's likely
> > to break tacit assumptions about the cost of certain things.
>
> I think it would be reasonable to document and expect that table AMs
> offer some locality of access for tuples with similar IDs. Do you think
> we need something stronger than that?

I don't know. This conversation is still too abstract for me to be
able to take a firm position. ISTM that we tend to talk about the
table AM in excessively abstract terms. It would be a lot easier if we
had clear fixed goals for a small number of additional table AMs.

> > Plus deduplication ameliorates problems with version churn
> > in
> > indexes -- presumably the same problems will exist when any new table
> > AM is used, and so it'll be useful to address the same problems in
> > the
> > same way.
>
> I got lost after "presumably the same problems", can you explain?

Well, there are now two things in nbtree that specifically help with
version churn caused by non-HOT updates: deduplication, and bottom-up
index deletion (especially the latter). Presumably any new table AM
will have something like non-HOT updates. Though they may rarely be a
problem (say because the new table AM isn't really for OLTP), whenever
they are a problem they'll be a very big problem. It seems like a good
idea to have the same index AM level protections against accumulating
version-churn index tuples in an unbounded way.

More generally, it seems like a good idea to try to make new table AMs
reasonably close to heapam insofar as possible. The reality is that
everything evolved around heapam, and that that's likely to matter in
all kinds of ways that nobody fully understands just yet. We have a
somewhat abstract table AM interface, which is good, but that doesn't
mean that table AMs can be designed as if it was a green field
situation.

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 1:28 PM Peter Geoghegan  wrote:
> Global indexes should work by adding an extra column that is somewhat
> like a TID, that may even have its own pg_attribute entry. It's much
> more natural to make the partition number a separate column IMV --
> nbtree suffix truncation and deduplication can work in about the same
> way as before. Plus you'll need to do predicate pushdown using the
> partition identifier in some scenarios anyway. You can make the
> partition identifier variable-width without imposing the cost and
> complexity of variable-width TIDs on index AMs.

I agree up to a point but ... are you imagining that the TID continues
to have its own special place in the page, while the partition
identifier is stored more like a regular tuple column? Because it
seems to me that it would be better to try to eliminate the
special-storage case, just like we did for OIDs. If you want a 6-byte
TID, put a 6-byte column in the tuple for it. If you also want a
partition identifier, put an extra column in the tuple for that. If
you want a wider TID or a varlena TID, well, put columns for that into
the tuple instead of the 6-byte column you'd normally put. This seems
extremely flexible and a lot more aesthetically appealing than what we
have today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis
On Fri, 2021-04-30 at 12:51 -0400, Robert Haas wrote:
> There are two major reasons why I want variable-width tuple IDs. One
> is global indexes, where you need as many bits as the AMs
> implementing
> the partitions need, plus some extra bits to identify which partition
> is relevant for a particular tuple. No fixed number of bits that you
> make available can ever be sufficient here, because a global index
> always needs to have extra bits compared to a partition-local index;
> if you let the partition-local index use more bits, the global index
> now needs even more space. The other is indirect indexes, work Álvaro
> proposed a few years ago, where the index entry points to the primary
> key value rather than a TID. The space needs for that are based on
> the
> type of the primary key column. This proposal solves neither of those
> problems.

The particular problem I have now is that Table AMs seem to support
indexes just fine, but TIDs are under-specified so I don't know what I
really have to work with. BlockNumber seems well-specified as
0..0xFFFFFFFE (inclusive), but I don't know what the valid range of
OffsetNumber is for the purposes of a TableAM.

Part of changing to uint64 would be specifying the TIDs in a way that I
could rely on in the future.

The problems you mention above are above the table AM layer, so they
seem orthogonal. There would still need to be an ID that table AMs can
use to fetch a tuple from a particular physical table.

In the future we may support primary unique indexes at the table AM
layer, which would get more interesting. I can see an argument for a
TID being an arbitrary datum in that case, but I haven't really
considered the design implications. Is this what you are suggesting?

Regards,
Jeff Davis






Re: Use simplehash.h instead of dynahash in SMgr

2021-04-30 Thread Alvaro Herrera
Hi David,

You're probably aware of this, but just to make it explicit: Jakub
Wartak was testing performance of recovery, and one of the bottlenecks
he found in some of his cases was dynahash as used by SMgr.  It seems
quite possible that this work would benefit some of his test workloads.
He last posted about it here:

https://postgr.es/m/vi1pr0701mb69608cbce44d80857e59572ef6...@vi1pr0701mb6960.eurprd07.prod.outlook.com

though the fraction of dynahash-from-SMgr is not as high there as in
some of the other reports I saw.

-- 
Álvaro Herrera   Valdivia, Chile




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 1:10 PM Tom Lane  wrote:
> I agree that global indexes need more bits, but it doesn't necessarily
> follow that we must have variable-width TIDs.  We could for example
> say that "real" TIDs are only 48 bits and index AMs that want to be
> usable as global indexes must be capable of handling 64-bit TIDs,
> leaving 16 bits for partition ID.  A more forward-looking definition
> would require global index AMs to store 96 bits (partition OID plus
> 64-bit TID).  Either way would be far simpler for every moving part
> involved than going over to full varlena TIDs.

16 bits is not much for a partition identifier. We've already had
complaints about INNER_VAR being too small, so apparently there are
people who want to use really large numbers of partitions. But even if
we imagine a hypothetical world where nobody uses more than a couple
thousand partitions at once, it's very reasonable to want to avoid
recycling partition identifiers so that detaching a partition can be
O(1), and there's no way that's going to be viable if the whole
address space is only 16 bits, because with time series data people
are going to be continually creating new partitions and dropping old
ones. I would guess that it probably is viable with 32 bits, but we'd
have to have a mapping layer rather than using the OID directly to
avoid wraparound collisions.

Now this problem can be avoided by just requiring the AM to store more
bits, exactly as you say. I suspect 96 bits is large enough for all of
the practical use cases people have, or at least within spitting
distance. But it strikes me as relatively inefficient to say that
we're always going to store 96 bits for every TID. I certainly don't
think we want to break on-disk compatibility and widen every existing
btree index by changing all the 6-byte TIDs they're storing now to
store 12-byte TIDs that are at least half zero bytes, so I think
we're bound to end up with at least two options: 6 and 12. But
variable-width would be a lot nicer. You could store small TIDs and
small partition identifiers very compactly, and only use the full
number of bytes when the situation demands it.

> > What problem do you think this proposal does solve?
>
> Accommodating table AMs that want more than 48 bits for a TID.
> We're already starting to run up against the fact that that's not
> enough bits for plausible use-cases.  64 bits may someday in the far
> future not be enough either, but I think that's a very long way off.

Do people actually want to store more than 2^48 rows in a table, or is
this more about the division of a TID into a block number and an item
number?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 10:10 AM Tom Lane  wrote:
> > There are two major reasons why I want variable-width tuple IDs. One
> > is global indexes, where you need as many bits as the AMs implementing
> > the partitions need, plus some extra bits to identify which partition
> > is relevant for a particular tuple. No fixed number of bits that you
> > make available can ever be sufficient here,
>
> I agree that global indexes need more bits, but it doesn't necessarily
> follow that we must have variable-width TIDs.  We could for example
> say that "real" TIDs are only 48 bits and index AMs that want to be
> usable as global indexes must be capable of handling 64-bit TIDs,
> leaving 16 bits for partition ID.  A more forward-looking definition
> would require global index AMs to store 96 bits (partition OID plus
> 64-bit TID).  Either way would be far simpler for every moving part
> involved than going over to full varlena TIDs.

The question of how the on-disk format of indexes needs to be changed
to accommodate global indexes seems like an entirely separate question
from how we go about expanding or redefining TIDs.

Global indexes should work by adding an extra column that is somewhat
like a TID, that may even have its own pg_attribute entry. It's much
more natural to make the partition number a separate column IMV --
nbtree suffix truncation and deduplication can work in about the same
way as before. Plus you'll need to do predicate pushdown using the
partition identifier in some scenarios anyway. You can make the
partition identifier variable-width without imposing the cost and
complexity of variable-width TIDs on index AMs.

I believe that the main reason why there have been so few problems
with any of the nbtree work in the past few releases is that it
avoided certain kinds of special cases. Any special cases in the
on-disk format and in the space accounting used when choosing a split
point ought to be avoided at all costs. We can probably afford to add
a lot of complexity to make global indexes work, but it ought to be
contained to cases that actually use global indexes in an obvious way.

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Tom Lane
Robert Haas  writes:
> On Fri, Apr 30, 2021 at 11:06 AM Tom Lane  wrote:
>> Andres seems to feel that we should try to allow variable-width
>> tupleids, but I'm afraid that the cost/benefit ratio for that
>> would be pretty poor.

> There are two major reasons why I want variable-width tuple IDs. One
> is global indexes, where you need as many bits as the AMs implementing
> the partitions need, plus some extra bits to identify which partition
> is relevant for a particular tuple. No fixed number of bits that you
> make available can ever be sufficient here,

I agree that global indexes need more bits, but it doesn't necessarily
follow that we must have variable-width TIDs.  We could for example
say that "real" TIDs are only 48 bits and index AMs that want to be
usable as global indexes must be capable of handling 64-bit TIDs,
leaving 16 bits for partition ID.  A more forward-looking definition
would require global index AMs to store 96 bits (partition OID plus
64-bit TID).  Either way would be far simpler for every moving part
involved than going over to full varlena TIDs.
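
For illustration, the packing an index AM would have to handle under
the 48+16 scheme is trivial (a sketch, not proposed code):

static inline uint64
pack_global_tid(uint16 partid, uint64 tid48)
{
    Assert((tid48 >> 48) == 0);
    return ((uint64) partid << 48) | tid48;
}

static inline uint16
global_tid_get_partid(uint64 packed)
{
    return (uint16) (packed >> 48);
}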

> Another problem in this general area is that there is quite a bit of
> code that thinks a TID is specifically a block number and an offset,
> like the Bitmap Index/Heap Scan code, for example. But making tuple
> IDs wider doesn't help with that problem either.

Agreed, that's an area that will need a lot of thought for anything that
we do here.  But varlena TIDs surely do not make that easier to fix.

> What problem do you think this proposal does solve?

Accommodating table AMs that want more than 48 bits for a TID.
We're already starting to run up against the fact that that's not
enough bits for plausible use-cases.  64 bits may someday in the far
future not be enough either, but I think that's a very long way off.

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis
On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote:
> Compatibility with index AMs is more than a matter of switching out
> the tuple identifier -- if you invent something that has totally
> different performance characteristics for index AMs, then it's likely
> to break tacit assumptions about the cost of certain things.

I think it would be reasonable to document and expect that table AMs
offer some locality of access for tuples with similar IDs. Do you think
we need something stronger than that?

> Plus deduplication ameliorates problems with version churn
> in
> indexes -- presumably the same problems will exist when any new table
> AM is used, and so it'll be useful to address the same problems in
> the
> same way.

I got lost after "presumably the same problems", can you explain?

> I agree that it might well be useful to make TIDs fully logical (not
> "physiological" -- physical across blocks, logical within blocks) for
> some new table AM. Even then, it would still definitely be a good
> idea
> to make these logical TID values correlate with the physical table
> structure in some way.

Agreed.

Regards,
Jeff Davis





Re: Procedures versus the "fastpath" API

2021-04-30 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Mar 09, 2021 at 02:33:47PM -0500, Joe Conway wrote:
>> My vote would be reject using fastpath for procedures in all relevant 
>> branches.
>> If someday someone cares enough to make it work, it is a new feature for a 
>> new
>> major release.

> FWIW, my vote would go for issuing an error if attempting to use a
> procedure in the fast path for all the branches.  The lack of
> complaint about the error you are mentioning sounds like a pretty good
> argument to fail properly on existing branches, and work on this as a
> new feature in the future if there is anybody willing to make a case
> for it.

I let this thread grow cold because I was hoping for some more votes,
but with the quarterly releases approaching, it's time to close out
the issue one way or the other.

By my count, we have three votes for forbidding procedure calls via
fastpath in all branches (me, Joe, Michael), and two for doing
something laxer (Noah, Laurenz).  The former is surely the safer
choice, so I'm going to go do that.

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Robert Haas
On Fri, Apr 30, 2021 at 11:06 AM Tom Lane  wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers.  heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.
>
> Andres seems to feel that we should try to allow variable-width
> tupleids, but I'm afraid that the cost/benefit ratio for that
> would be pretty poor.

There are two major reasons why I want variable-width tuple IDs. One
is global indexes, where you need as many bits as the AMs implementing
the partitions need, plus some extra bits to identify which partition
is relevant for a particular tuple. No fixed number of bits that you
make available can ever be sufficient here, because a global index
always needs to have extra bits compared to a partition-local index;
if you let the partition-local index use more bits, the global index
now needs even more space. The other is indirect indexes, work Álvaro
proposed a few years ago, where the index entry points to the primary
key value rather than a TID. The space needs for that are based on the
type of the primary key column. This proposal solves neither of those
problems.

Another problem in this general area is that there is quite a bit of
code that thinks a TID is specifically a block number and an offset,
like the Bitmap Index/Heap Scan code, for example. But making tuple
IDs wider doesn't help with that problem either.

What problem do you think this proposal does solve?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis
On Fri, 2021-04-30 at 12:04 +0200, Matthias van de Meent wrote:
> Other than that, I believe you've also missed the special offset
> numbers specified in itemptr.h (SpecTokenOffsetNumber and
> MovedPartitionsOffsetNumber). I am not well enough aware of the usage
> of these OffsetNumber values, but those might also be limiting the
> values any tableAM can use for their TIDs.

Yes, thank you. If it is treated specially in a heap tuple, it can't be
a regular TID.

> > It just seems like it's used when scanning heap or index pages for
> > stack-allocated arrays. For a table AM it appears to waste 5 bits.
> 
> MaxOffsetNumber is used for postgres' Page layout; it is defined as
> how many item pointers could exist on a page, and AFAIK it should be
> used for postgres' Page layout only. Nothing can or should change
> that. If any code asserts limitations on the ip_posid of table tuples
> that might come from a non-heap table AM, then I believe that is
> probably a mistake in postgres, and that should be fixed.

A name that would better fit your definition would be something like
"MaxItemsPerPage".

The name "MaxOffsetNumber" implies that any number past that must be
either invalid or special. But it seems like you are saying that if I
use an offset number of 5000 in my table AM, then that's fine and
should be treated like a normal TID.

> No. The documentation for that function explicitly mentions that
> these item pointers are optimized for storage when using the heap
> tableam, and that that code will be changed once there exist tableAMs
> with different TID / ip_posid constraints (see the comment on lines
> 32 and 33 of that file). 

Thank you.

I'm a table AM author, and I'd like to use whatever the real range of
TIDs is. Does that mean it's time to change that code in
ginpostinglist.c now?
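
For reference, the encoding at issue looks roughly like this
(paraphrased from ginpostinglist.c, so check the source for the exact
form):

static uint64
itemptr_to_uint64(const ItemPointer iptr)
{
    uint64      val;

    Assert(ItemPointerIsValid(iptr));
    Assert(GinItemPointerGetOffsetNumber(iptr) <
           (1 << MaxHeapTuplesPerPageBits));

    val = GinItemPointerGetBlockNumber(iptr);
    val <<= MaxHeapTuplesPerPageBits;
    val |= GinItemPointerGetOffsetNumber(iptr);
    return val;
}

With MaxHeapTuplesPerPageBits = 11 there, any offset at or above 2048
trips the assertion, which is exactly why offsets as high as
MaxOffsetNumber fail today.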

> >   1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
> 
> I believe that this is the right way when there exist tableAMs that
> use those upper 5 bits.

Does that mean we should declare the valid range of offsets to be
between 1 and 0xfffc (inclusive)?

I'm trying to use some mapping now that's somewhat stable so that I
don't have to worry that something will break later, and then require
reindexing all tables with my table AM.

Regards,
Jeff Davis






Re: pg_amcheck contrib application

2021-04-30 Thread Robert Haas
On Mon, Apr 26, 2021 at 1:52 PM Mark Dilger
 wrote:
> The attached patch changes amcheck corruption reports as discussed upthread.  
> This patch is submitted for the v14 development cycle as a bug fix, per your 
> complaint that the committed code generates reports sufficiently confusing to 
> a user as to constitute a bug.
>
> All other code refactoring and additional checks discussed upthread are 
> reserved for the v15 development cycle and are not included here.
>
> The minimal patch (not attached) that does not rename any variables is 135 
> lines.  Your patch was 159 lines.  The patch (attached) which includes your 
> variable renaming is 174 lines.

Hi,

I have compared this against my version. I found the following differences:

1. This version passes last_chunk_seq rather than extsize to
check_toast_tuple(). But this results in having to call
VARATT_EXTERNAL_GET_EXTSIZE() inside that function. I thought it was
nicer to do that in the caller, so that we don't do it twice.

2. You fixed some out-of-date comments.

3. You move the test for an unexpected chunk sequence further down in
the function. I don't see the point; I had put it by the related null
check, and still think that's better. You also deleted my comment /*
Either the TOAST index is corrupt, or we don't have all chunks. */
which I would have preferred to keep.

4. You don't return if chunk_seq > last_chunk_seq. That seems wrong,
because we cannot compute a sensible expected size in that case. I
think your code will subtract a larger value from a smaller one and,
this being unsigned arithmetic, say that the expected chunk size is
something gigantic. Returning and not issuing that complaint at all
seems better.

5. You fixed the incorrect formula I had introduced for the expected
size of the last chunk.

6. You changed the variable name in check_toasted_attribute() from
expected_chunkno to chunkno, and initialized it later in the function
instead of at declaration time. I don't find this to be an
improvement; including the word "expected" seems to me to be
substantially clearer. But I think I should have gone with
expected_chunk_seq for better consistency.

7. You restored the message "toast value %u was expected to end at
chunk %d, but ended at chunk %d" which my version deleted. I deleted
that message because I thought it was redundant, but I guess it's not:
there's nothing else to complain about if the sequence of chunks ends early.
I think we should change the test from != to < though, because if it's
>, then we must have already complained about unexpected chunks. Also,
I think the message is actually wrong, because even though you renamed
the variable, it still ends up being the expected next chunkno rather
than the last chunkno we actually saw.

PFA my counter-proposal based on the above analysis.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


simply-remove-chunkno-concept-v2.patch
Description: Binary data


Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Tom Lane
Jeff Davis  writes:
> On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote:
>> My thought at the moment is that all APIs above the AM level ought
>> to be redefined to use uint64 for tuple identifiers.

> Do you mean that indexes would be expected to hold a uint64, a 48-bit
> int (that directly maps to a uint64), or still hold an ItemPointerData?

ISTM that would be up to the index AM.  We'd need some interlocks on
which index AMs could be used with which table AMs in any case, I think.

It'd likely not be hard for existing index AMs to be repurposed to store
"any 48-bit TID", but extending them to full 64-bit TIDs may be
impractical.

I think the hard part may really be in places like tidbitmap.c, which
one would wish to be AM-independent, but right now it's quite specific
to heap-style TIDs.  Maybe we can think of a way to parameterize it.

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis
On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers.

One challenge might be reliance on InvalidOffsetNumber as a special
value in a number of places (e.g. bitmap index scans). That doesn't
seem like a major problem though.

> heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.

Do you mean that indexes would be expected to hold a uint64, a 48-bit
int (that directly maps to a uint64), or still hold an ItemPointerData?

Regards,
Jeff Davis






Re: Race condition in InvalidateObsoleteReplicationSlots()

2021-04-30 Thread Andres Freund
Hi,

On 2021-04-29 13:28:20 -0400, Álvaro Herrera wrote:
> On 2021-Apr-07, Andres Freund wrote:
> 
> > I'm also confused by the use of ConditionVariableTimedSleep(timeout =
> > 10). Why do we need a timed sleep here in the first place? And why with
> > such a short sleep?
> 
> I was scared of the possibility that a process would not set the CV for
> whatever reason, causing checkpointing to become stuck.  Maybe that's
> misguided thinking if CVs are reliable enough.

They better be, or we have bigger problems. And if it's an escape hatch
we surely ought not to use 10ms as the timeout. That's an appropriate
time for something *not* using condition variables...


> I attach a couple of changes to your 0001.  It's all cosmetic; what
looks not so cosmetic is the change of "continue" to "break" in the helper
routine; if !s->in_use, we'd loop infinitely.  The other routine
> already checks that before calling the helper; since you hold
> ReplicationSlotControlLock at that point, it should not be possible to
> drop it in between.  Anyway, it's a trivial change to make, so it should
> be correct.

> I also added a "continue" at the bottom of one block; currently that
> doesn't change any behavior, but if we add code at the other block, it
> might not be what's intended.

Seems sane.


> Are you getting this set pushed, or would you like me to handle it?
> (There seems to be some minor conflict in 13)

I'd be quite happy for you to handle it...

Greetings,

Andres Freund




Re: Race condition in InvalidateObsoleteReplicationSlots()

2021-04-30 Thread Andres Freund
Hi,

On 2021-04-08 17:03:41 +0530, Amit Kapila wrote:
> I haven't tested the patch but I couldn't spot any problems while
> reading it. A minor point, don't we need to use
> ConditionVariableCancelSleep() at some point after doing
> ConditionVariableTimedSleep?

It's not really necessary - unless the CV could get deallocated as part
of dynamic shared memory or such.

Greetings,

Andres Freund




Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Peter Geoghegan
On Fri, Apr 30, 2021 at 8:06 AM Tom Lane  wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers.  heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.
>
> Andres seems to feel that we should try to allow variable-width
> tupleids, but I'm afraid that the cost/benefit ratio for that
> would be pretty poor.

I agree. It'll be easier for a new table AM to be developed with that
constraint than it will be to retrofit it to every index AM. It
probably wouldn't be that hard to make nbtree deduplication and GIN
posting list compression work with uint64 TIDs. But variable-width
TIDs are a very different matter.

Compatibility with index AMs is more than a matter of switching out
the tuple identifier -- if you invent something that has totally
different performance characteristics for index AMs, then it's likely
to break tacit assumptions about the cost of certain things. For
example, index tuple deletion probably relies on the fact that there
just isn't that many table blocks to visit (to get an XID for recovery
conflict purposes) in practice due to various locality-related
effects. Plus deduplication ameliorates problems with version churn in
indexes -- presumably the same problems will exist when any new table
AM is used, and so it'll be useful to address the same problems in the
same way.

I agree that it might well be useful to make TIDs fully logical (not
"physiological" -- physical across blocks, logical within blocks) for
some new table AM. Even then, it would still definitely be a good idea
to make these logical TID values correlate with the physical table
structure in some way. Plus TIDs should still be fixed size. If a new
table AM can't do it that way then that certainly needs to be
justified -- it's unreasonable to imagine that it simply isn't the
table AM's problem to solve.

-- 
Peter Geoghegan




Re: Help needed with a reproducer for CVE-2020-25695 not based on REFRESH MATERIALIZED VIEW

2021-04-30 Thread Patrik Novotny
We've figured it out. Please ignore.


Regards.

On Fri, Apr 30, 2021 at 3:13 PM Patrik Novotny  wrote:

> Hi,
>
> I need to reproduce the CVE-2020-25695 on PostgreSQL 9.2.24. I know this
> is not a supported version, however, it is important for us to have a
> reproducer for this version as well.
>
> The reproducer for supported versions[1] is based on REFRESH MATERIALIZED
> VIEW which is not implemented until version 9.3.
>
> I was trying to reproduce this using ANALYZE as you can see in this
> poc.sql file[2]. However, it doesn't reproduce the issue.
>
> It would be really appreciated if someone could take a look at it and help.
>
>
> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/test/regress/sql/privileges.sql;h=013bc95c74bd20e5ab7f1826ea7e676da2a0e85b;hb=HEAD#l896
> [2] https://pastebin.com/6hgziYRD
>
>
> Regards,
>
> --
> Patrik Novotný
> Associate Software Engineer
> Red Hat
> panov...@redhat.com
>


-- 
Patrik Novotný
Associate Software Engineer
Red Hat
panov...@redhat.com


Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Tom Lane
Jeff Davis  writes:
> The notion of TID is based on pages and line pointers, which makes
> sense for heapam, but that's not likely to make sense for a custom
> table AM.
> The obvious answer is to make a simple mapping between a TID and
> whatever makes sense to the AM (for the sake of discussion, let's say a
> plain row number).

I'm inclined to think that when we get around to doing something
about this, we need to make a much bigger change than just poking
at the margins of type tid.

My thought at the moment is that all APIs above the AM level ought
to be redefined to use uint64 for tuple identifiers.  heapam and
related index AMs could map block + offset into that in some
convenient way, and other AMs could do what they like.
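
For heapam the mapping could be as simple as this sketch (with the
16-bit offset in the low bits, giving 48 significant bits in total):

static inline uint64
heap_tid_to_uint64(ItemPointer tid)
{
    return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
        ItemPointerGetOffsetNumber(tid);
}

static inline void
uint64_to_heap_tid(uint64 val, ItemPointer tid)
{
    ItemPointerSet(tid, (BlockNumber) (val >> 16),
                   (OffsetNumber) (val & 0xFFFF));
}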

Andres seems to feel that we should try to allow variable-width
tupleids, but I'm afraid that the cost/benefit ratio for that
would be pretty poor.

regards, tom lane




Re: pg_hba.conf.sample wording improvement

2021-04-30 Thread Bruce Momjian
On Wed, Apr 28, 2021 at 07:51:43AM +0200, Peter Eisentraut wrote:
> I propose the attached patch to shake up the wording in the connection type
> section of pg_hba.conf.sample a bit.  After the hostgssenc part was added
> on, the whole thing became a bit wordy, and it's also a bit inaccurate for
> example in that the current wording for "host" appears to say that it does
> not apply to GSS-encrypted connections.

Yes, much better.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2021-04-30 Thread Dilip Kumar
On Fri, Apr 30, 2021 at 3:01 PM Amit Kapila  wrote:
>
> On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar  wrote:
> >
> > On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila  
> > wrote:
> > >
> >
> >  The idea I have is to additionally check that we are decoding
> > > streaming or prepared transaction (the same check as we have for
> > > setting curtxn) or we can check if CheckXidAlive is a valid
> > > transaction id. What do you think?
> >
> > I think a check based on CheckXidAlive looks good to me.  This will
> > protect against if a similar error is raised from any other path as
> > you mentioned above.
> >
>
> We can't use CheckXidAlive because it is reset by that time.

Right.

> So, I
> used the other approach which led to the attached.

The patch looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-30 Thread Amit Langote
(Thanks for committing the fix.)

On Thu, Apr 29, 2021 at 1:11 AM Álvaro Herrera  wrote:
> On Wed, Apr 28, 2021, at 10:21 AM, Amit Langote wrote:
> I noticed that rd_partdesc_nodetached_xmin can sometimes end up with
> value 0. While you seem to be already aware of that, because otherwise
> you wouldn't have added TransactionIdIsValid(...) in condition in
> RelationGetPartitionDesc(), the comments nearby don't mention why such
> a thing might happen.  Also, I guess it can't be helped that the
> partdesc_nodetached will have to be leaked when the xmin is 0, but
> that shouldn't be as problematic as the case we discussed earlier.
>
>
> The only case I am aware of where that can happen is if the pg_inherits tuple is 
> frozen. (That's exactly what the affected test case was testing, note the 
> "VACUUM FREEZE pg_inherits" there).  So that test case blew up immediately; 
> but I think the real-world chances that people are going to be doing that are 
> pretty low, so I'm not really concerned about the leak.

The case I was looking at is when a partition detach appears as
in-progress to a serializable transaction.  If the caller wants to
omit detached partitions, such a partition ends up in
rd_partdesc_nodetached, with the corresponding xmin being set to 0 due
to the way find_inheritance_children_extended() sets *detached_xmin.
The next query in the transaction that wants to omit detached
partitions, seeing rd_partdesc_nodetached_xmin being invalid, rebuilds
the partdesc, again including that partition because the snapshot
wouldn't have changed, and so on until the transaction ends.  Now,
this can perhaps be "fixed" by making
find_inheritance_children_extended() set the xmin outside the
snapshot-checking block, but maybe there's no need to address this on
priority.

Rather, a point that bothers me a bit is that we're including a
detached partition in the partdesc labeled "nodetached" in this
particular case.  Maybe we should avoid that by considering in this
scenario that no detached partitions exist for this transaction and
so initialize rd_partdesc, instead of rd_partdesc_nodetached.  That
will let us avoid the situations where the xmin is left in invalid
state.  Maybe like the attached (it also fixes a couple of
typos/thinkos in the previous commit).

Note that we still end up in the same situation as before where each
query in the serializable transaction that sees the detach as
in-progress to have to rebuild the partition descriptor omitting the
detached partitions, even when it's clear that the detach-in-progress
partition will be included every time.

--
Amit Langote
EDB: http://www.enterprisedb.com


partdesc_nodetached-only-if-detached-visible.patch
Description: Binary data


Help needed with a reproducer for CVE-2020-25695 not based on REFRESH MATERIALIZED VIEW

2021-04-30 Thread Patrik Novotny
Hi,

I need to reproduce the CVE-2020-25695 on PostgreSQL 9.2.24. I know this is
not a supported version, however, it is important for us to have a
reproducer for this version as well.

The reproducer for supported versions[1] is based on REFRESH MATERIALIZED
VIEW which is not implemented until version 9.3.

I was trying to reproduce this using ANALYZE as you can see in this poc.sql
file[2]. However, it doesn't reproduce the issue.

It would be really appreciated if someone could take a look at it and help.


[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/test/regress/sql/privileges.sql;h=013bc95c74bd20e5ab7f1826ea7e676da2a0e85b;hb=HEAD#l896
[2] https://pastebin.com/6hgziYRD


Regards,

-- 
Patrik Novotný
Associate Software Engineer
Red Hat
panov...@redhat.com


Re: MaxOffsetNumber for Table AMs

2021-04-30 Thread Matthias van de Meent
On Fri, 30 Apr 2021, 09:46 Jeff Davis,  wrote:

>
> The notion of TID is based on pages and line pointers, which makes
> sense for heapam, but that's not likely to make sense for a custom
> table AM.
>
> The obvious answer is to make a simple mapping between a TID and
> whatever makes sense to the AM (for the sake of discussion, let's say a
> plain row number).
>
> The most natural thing would be to say that we have 48 bits, so it can
> just be a 48-bit number. Of course, there are some restrictions on
> valid values that complicate this:
>
>   * InvalidBlockNumber of 0xFFFFFFFF. Not a problem.
>   * InvalidOffsetNumber of 0. Not a problem.
>   * MaxOffsetNumber of 2048. Does this limit really apply to table AMs?
>

MaxOffsetNumber is not per se 2048. It is BLCKSZ / sizeof(ItemIdData),
which is only 2048 for an 8kiB BLCKSZ. As we support BLCKSZ up to 32kiB,
MaxOffsetNumber can be as large as 8192.
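
For reference, the definition in src/include/storage/off.h reads roughly
as follows (quoted from memory, so verify against your tree):

#define MaxOffsetNumber ((OffsetNumber) (BLCKSZ / sizeof(ItemIdData)))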

Other than that, I believe you've also missed the special offset numbers
specified in itemptr.h (SpecTokenOffsetNumber and MovedPartitionsOffsetNumber).
I am not well enough aware of the usage of these OffsetNumber values, but
those might also be limiting the values any tableAM can use for their TIDs.

> It just seems like it's used when scanning heap or index pages for
> stack-allocated arrays. For a table AM it appears to waste 5 bits.
>

MaxOffsetNumber is defined for postgres' Page layout as the maximum
number of item pointers that could exist on a page, and AFAIK it should
be used for postgres' Page layout only. Nothing can or should change
that. If any code asserts limitations on the ip_posid of tuples that
might not be heapam tuples, then I believe that is probably a mistake in
postgres, and it should be fixed.

>   * ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max
> offset number. Is this a bug?


No. The documentation for that function explicitly mentions that these item
pointers are optimized for storage when using the heap tableam, and that
that code will be changed once there exist tableAMs with different TID /
ip_posid constraints (see the comment on lines 32 and 33 of that file).

Note that the limiting number that itemptr_to_uint64 should mention for bit
calculations is actually MaxHeapTuplesPerPage, which is about one seventh
of MaxOffsetNumber. The resulting number of bits reserved is not a
miscalculation though, because MaxHeapTuplesPerPage (for 32kiB BLCKSZ)
requires the mentioned 11 bits, and adapting bit swizzling for multiple
page sizes was apparently not considered worth the effort.
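
A simplified, self-contained model of that packing (the real code lives
in src/backend/access/gin/ginpostinglist.c; this sketch only shows the
shift-by-11 scheme described above):

#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 11  /* enough for MaxHeapTuplesPerPage even at the
                         * largest supported (32kiB) block size */

static uint64_t
itemptr_to_uint64_model(uint32_t block, uint16_t offset)
{
    assert(offset > 0 && offset < (1 << OFFSET_BITS));
    return ((uint64_t) block << OFFSET_BITS) | offset;
}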

> As a table AM author, I'd like to know what the real limits are so that
> I can use whatever bits are available to map from TID to row number and
> back, without worrying that something will break in the future. A few
> possibilities:
>
>   1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
>

I believe that this is the right way when there exist tableAMs that use
those upper 5 bits.


>   2. Change MaxOffsetNumber to 2047. This seems likely to break
> extensions that rely on it.
>

If you're going to change MaxOffsetNumber, I believe that it's better to
change it to ((BLCKSZ - sizeof(PageHeaderData)) / sizeof(ItemIdData)), as
that is the maximum number of ItemIds you could put on a Page that has no
page opaque.
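
As a back-of-the-envelope check for the default 8kiB BLCKSZ (assuming
the usual 24-byte page header and 4-byte line pointers):

#include <stdio.h>

#define BLCKSZ 8192
#define SIZEOF_PAGE_HEADER 24   /* offsetof(PageHeaderData, pd_linp) */
#define SIZEOF_ITEMID 4

int
main(void)
{
    /* prints 2042: the line pointers that fit on a page with no special space */
    printf("%d\n", (BLCKSZ - SIZEOF_PAGE_HEADER) / SIZEOF_ITEMID);
    return 0;
}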

>   3. Define MaxOffsetNumber as 65536 and introduce a new
> MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need
> to fix itemptr_to_uint64().


I believe that that breaks more things than otherwise required. ip_posid is
already limited to uint16, so I see no reason to add a constant that would
assert that the value of any uint16 is less than its max value plus one.


With regards,

Matthias van de Meent


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2021-04-30 Thread Amit Kapila
On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar  wrote:
>
> On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila  wrote:
> >
>
> > The idea I have is to additionally check that we are decoding
> > streaming or prepared transaction (the same check as we have for
> > setting curtxn) or we can check if CheckXidAlive is a valid
> > transaction id. What do you think?
>
> I think a check based on CheckXidAlive looks good to me.  This will
> protect against if a similar error is raised from any other path as
> you mentioned above.
>

We can't use CheckXidAlive because it is reset by that time. So, I
used the other approach which led to the attached.

-- 
With Regards,
Amit Kapila.


v1-0001-Tighten-the-concurrent-abort-check-during-decodin.patch
Description: Binary data


Re: Enhanced error message to include hint messages for redundant options error

2021-04-30 Thread Dilip Kumar
On Fri, Apr 30, 2021 at 12:36 PM Bharath Rupireddy
 wrote:
> >
> > other comments
> >
> >  if (strcmp(defel->defname, "volatility") == 0)
> >   {
> >   if (is_procedure)
> > - goto procedure_error;
> > + is_procedure_error =  true;
> >   if (*volatility_item)
> > - goto duplicate_error;
> > + is_duplicate_error = true;
> >
> > Another side effect I see is, in the above check earlier if
> > is_procedure was true it was directly goto procedure_error, but now it
> > will also check the if (*volatility_item) and it may set
> > is_duplicate_error also true, which seems wrong to me.  I think you
> > can change it to
> >
> > if (is_procedure)
> >is_procedure_error =  true;
> > else if (*volatility_item)
> >   is_duplicate_error = true;
>
> Thanks. Done that way, see the attached v3. Let's see what others have to say.
>
> Also attaching Vignesh's v3 patch as-is, just for completion.

Looking into this again, why not as shown below?  IMHO, this way the
code will be logically the same as it was before the patch; basically,
why process an extra statement (*volatility_item = defel;) if we
have already decided to error?

 if (is_procedure)
is_procedure_error =  true;
else if (*volatility_item)
  is_duplicate_error = true;
else
  *volatility_item = defel;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Enhanced error message to include hint messages for redundant options error

2021-04-30 Thread Peter Smith
On Fri, Apr 30, 2021 at 5:06 PM Bharath Rupireddy
 wrote:
>
> On Fri, Apr 30, 2021 at 11:23 AM Dilip Kumar  wrote:
> >
> > On Fri, Apr 30, 2021 at 11:09 AM Bharath Rupireddy
> >  wrote:
> > >
> > > On Fri, Apr 30, 2021 at 10:51 AM Dilip Kumar  
> > > wrote:
> > > >
> > > > On Fri, Apr 30, 2021 at 10:43 AM Bharath Rupireddy
> > > >  wrote:
> > > > >
> > > > > On Fri, Apr 30, 2021 at 10:17 AM Dilip Kumar  
> > > > > wrote:
> > > > > > In this function, we already have the "defel" variable, so I do not
> > > > > > understand why you are using one extra variable and assigning defel
> > > > > > to that?
> > > > > > If the goal is just to improve the error message, then you can simply
> > > > > > use defel->defname?
> > > > >
> > > > > Yeah. I can do that. Thanks for the comment.
> > > > >
> > > > > While on this, I also removed  the duplicate_error and procedure_error
> > > > > goto statements, because IMHO, using goto statements is not an elegant
> > > > > way. I used boolean flags to do the job instead. See the attached and
> > > > > let me know what you think.
> > > >
> > > > Okay, but I see one side effect of this, basically earlier on
> > > > procedure_error and duplicate_error we were not assigning anything to
> > > > output parameters, e.g. volatility_item,  but now those values will be
> > > > assigned with defel even if there is an error.
> > >
> > > Yes, but on ereport(ERROR, we don't come back right? The txn gets
> > > aborted and the control is not returned to the caller instead it will
> > > go to sigjmp_buf of the backend.
> > >
> > > > So I think we should
> > > > better avoid such change.  But even if you want to do then better
> > > > check for any impacts on the caller.
> > >
> > > AFAICS, there will not be any impact on the caller, as the control
> > > doesn't return to the caller on error.
> >
> > I see.
> >
> > other comments
> >
> >  if (strcmp(defel->defname, "volatility") == 0)
> >   {
> >   if (is_procedure)
> > - goto procedure_error;
> > + is_procedure_error =  true;
> >   if (*volatility_item)
> > - goto duplicate_error;
> > + is_duplicate_error = true;
> >
> > Another side effect I see is, in the above check earlier if
> > is_procedure was true it was directly goto procedure_error, but now it
> > will also check the if (*volatility_item) and it may set
> > is_duplicate_error also true, which seems wrong to me.  I think you
> > can change it to
> >
> > if (is_procedure)
> >is_procedure_error =  true;
> > else if (*volatility_item)
> >   is_duplicate_error = true;
>
> Thanks. Done that way, see the attached v3. Let's see what others have to say.
>

Hmmm - I am not so sure about those goto replacements. I think the
poor goto has a bad reputation, but not all gotos are bad. I've met
some very nice gotos.

Each goto here was doing exactly what it looked like it was doing,
whereas all these boolean replacements have now introduced subtle
differences. e.g. now the *volatility_item = defel; assignment (and
all similar assignments) will happen which previously did not happen
at all. It leaves the reader wondering if assigning to those
references could have any side-effects at the caller. Probably there
are no problems at all, but can you be sure?  Meanwhile, those
"inelegant" gotos did not give any cause for such doubts.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Replication slot used in logical decoding of documental database give error: got sequence entry 258 for toast chunk 538757697 instead of seq 0

2021-04-30 Thread silvio brandani
Hi,


Our setup:
  The Postgres server is running on a CentOS release 6.10 (Final) instance.
  The server version is PostgreSQL 9.5.9 on x86_64-pc-linux-gnu, compiled by
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18), 64-bit

With the following parameters set:

wal_level = 'logical' # replica < logical
max_replication_slots = 10
max_wal_senders = 10
track_commit_timestamp = on



We are using the decoder_raw module for retrieving/reading the WAL data
through the logical decoding mechanism.


Application setup:

The actual TCapture engine is a Java application which runs as a separate
program outside Postgres, and which must be started explicitly.
When TCapture is running, it will scan the transaction log (with the TCapt
module) of all primary databases and pick up transactions which must be
replicated.
Transactions which have been picked up are stored in the “Replication
Database”, a PG user database exclusively used by TCapture.
In the Replication Database, each transaction is ‘copied’ to all replicate
databases which have a subscription for it.
The transaction is then applied to the replicate tables by the dedicated
Java application module.


We run the TCapt module in a loop, reading from a primary database which is
a documental database (with binary columns).


Behavior reported (Bug)
  We have had the TCapture Replication Server running successfully for weeks,
but recently we encountered the following error:

cat log/TCapture_enodp_2021-04-12-11\:30\:16_err.log
org.postgresql.util.PSQLException: ERROR: got sequence entry 258 for
toast chunk 538757697 instead of seq 0
at
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2553)
at
org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1212)
at
org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1112)
at
org.postgresql.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:44)
at
org.postgresql.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:160)
at
org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:125)
at
org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:82)
at
com.edslab.TCapt.receiveChangesOccursBeforTCapt(TCapt.java:421)
at com.edslab.TCapt.run(TCapt.java:182)
at java.lang.Thread.run(Thread.java:745)


After restarting our TCapt module (see https://www.tcapture.net/ for a
better understanding of the TCapture project), the error went away. But this
error causes the producer module (TCapt) to shut down.

Please note that we run TCapture with other Postgres versions (9.6, 10,
11, etc.) without problems!

Is there any resolution for this issue, or is it resolved in a later
version of Postgres?


Regards,
Silvio


Re: Replication slot stats misgivings

2021-04-30 Thread Amit Kapila
On Fri, Apr 30, 2021 at 5:55 AM Masahiko Sawada  wrote:
>
> After more thought, it seems to me that we should use txn->size here
> regardless of the top transaction or subtransaction since we're
> iterating changes associated with a transaction that is either the top
> transaction or a subtransaction. Otherwise, I think if some
> subtransactions are not serialized, we will end up adding bytes
> including those subtransactions during iterating other serialized
> subtransactions. Whereas in ReorderBufferProcessTXN() we should use
> txn->total_size since txn is always the top transaction. I've attached
> another patch to do this.
>

LGTM. I have slightly edited the comments in the attached. I'll push
this early next week unless there are more comments.

-- 
With Regards,
Amit Kapila.


use_total_size_v6.patch
Description: Binary data


MaxOffsetNumber for Table AMs

2021-04-30 Thread Jeff Davis


The notion of TID is based on pages and line pointers, which makes
sense for heapam, but that's not likely to make sense for a custom
table AM.

The obvious answer is to make a simple mapping between a TID and
whatever makes sense to the AM (for the sake of discussion, let's say a
plain row number).

The most natural thing would be to say that we have 48 bits, so it can
just be a 48-bit number. Of course, there are some restrictions on
valid values that complicate this:

  * InvalidBlockNumber of 0xFFFFFFFF. Not a problem.
  * InvalidOffsetNumber of 0. Not a problem.
  * MaxOffsetNumber of 2048. Does this limit really apply to table AMs?
It just seems like it's used when scanning heap or index pages for
stack-allocated arrays. For a table AM it appears to waste 5 bits.
  * ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max
offset number. Is this a bug?

As a table AM author, I'd like to know what the real limits are so that
I can use whatever bits are available to map from TID to row number and
back, without worrying that something will break in the future. A few
possibilities:

  1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
  2. Change MaxOffsetNumber to 2047. This seems likely to break
extensions that rely on it.
  3. Define MaxOffsetNumber as 65536 and introduce a new
MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need
to fix itemptr_to_uint64().
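
For concreteness, one hedged way a table AM could map a plain row number
onto a TID while avoiding the invalid values above, assuming (per
possibility 3) that offsets up to 65535 turn out to be usable
(hypothetical helpers, not a proposal for core):

#include <stdint.h>

/* Row numbers map to offsets 1..65535, so InvalidOffsetNumber (0) is
 * never produced, and the block number stays clear of
 * InvalidBlockNumber (0xFFFFFFFF) for any row number below
 * 0xFFFFFFFF * 65535. */
static inline void
rownum_to_tid(uint64_t rownum, uint32_t *block, uint16_t *offset)
{
    *block = (uint32_t) (rownum / 65535);
    *offset = (uint16_t) ((rownum % 65535) + 1);
}

static inline uint64_t
tid_to_rownum(uint32_t block, uint16_t offset)
{
    return (uint64_t) block * 65535 + (offset - 1);
}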

Thoughts?

Regards,
Jeff Davis






Re: Enhanced error message to include hint messages for redundant options error

2021-04-30 Thread Bharath Rupireddy
On Fri, Apr 30, 2021 at 11:23 AM Dilip Kumar  wrote:
>
> On Fri, Apr 30, 2021 at 11:09 AM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Apr 30, 2021 at 10:51 AM Dilip Kumar  wrote:
> > >
> > > On Fri, Apr 30, 2021 at 10:43 AM Bharath Rupireddy
> > >  wrote:
> > > >
> > > > On Fri, Apr 30, 2021 at 10:17 AM Dilip Kumar  
> > > > wrote:
> > > > > In this function, we already have the "defel" variable, so I do not
> > > > > understand why you are using one extra variable and assigning defel
> > > > > to that?
> > > > > If the goal is just to improve the error message, then you can simply
> > > > > use defel->defname?
> > > >
> > > > Yeah. I can do that. Thanks for the comment.
> > > >
> > > > While on this, I also removed  the duplicate_error and procedure_error
> > > > goto statements, because IMHO, using goto statements is not an elegant
> > > > way. I used boolean flags to do the job instead. See the attached and
> > > > let me know what you think.
> > >
> > > Okay, but I see one side effect of this, basically earlier on
> > > procedure_error and duplicate_error we were not assigning anything to
> > > output parameters, e.g. volatility_item,  but now those values will be
> > > assigned with defel even if there is an error.
> >
> > Yes, but on ereport(ERROR, we don't come back right? The txn gets
> > aborted and the control is not returned to the caller instead it will
> > go to sigjmp_buf of the backend.
> >
> > > So I think we should
> > > better avoid such change.  But even if you want to do then better
> > > check for any impacts on the caller.
> >
> > AFAICS, there will not be any impact on the caller, as the control
> > doesn't return to the caller on error.
>
> I see.
>
> other comments
>
>  if (strcmp(defel->defname, "volatility") == 0)
>   {
>   if (is_procedure)
> - goto procedure_error;
> + is_procedure_error =  true;
>   if (*volatility_item)
> - goto duplicate_error;
> + is_duplicate_error = true;
>
> Another side effect I see is, in the above check earlier if
> is_procedure was true it was directly goto procedure_error, but now it
> will also check the if (*volatility_item) and it may set
> is_duplicate_error also true, which seems wrong to me.  I think you
> can change it to
>
> if (is_procedure)
>is_procedure_error =  true;
> else if (*volatility_item)
>   is_duplicate_error = true;

Thanks. Done that way, see the attached v3. Let's see what others have to say.

Also attaching Vignesh's v3 patch as-is, just for completion.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 6ed153cdb45a0d41d3889dc7d80b3a743eb66725 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Mon, 26 Apr 2021 18:40:36 +0530
Subject: [PATCH v3] Enhance error message.

Enhanced error message, so that the user can easily identify the error.
---
 contrib/file_fdw/file_fdw.c                 |  6 +--
 src/backend/catalog/aclchk.c                |  4 +-
 src/backend/commands/copy.c                 | 23 +-
 src/backend/commands/dbcommands.c           | 28 ++--
 src/backend/commands/extension.c            |  8 ++--
 src/backend/commands/foreigncmds.c          |  4 +-
 src/backend/commands/functioncmds.c         | 12 +++---
 src/backend/commands/publicationcmds.c      |  4 +-
 src/backend/commands/sequence.c             | 18 
 src/backend/commands/subscriptioncmds.c     | 18 
 src/backend/commands/tablecmds.c            |  2 +-
 src/backend/commands/typecmds.c             | 14 +++---
 src/backend/commands/user.c                 | 48 ++---
 src/backend/parser/parse_utilcmd.c          |  2 +-
 src/backend/replication/pgoutput/pgoutput.c | 10 ++---
 src/backend/replication/walsender.c         |  6 +--
 src/test/regress/expected/copy2.out         | 24 ++-
 src/test/regress/expected/foreign_data.out  |  4 +-
 src/test/regress/expected/publication.out   |  2 +-
 19 files changed, 119 insertions(+), 118 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 2c2f149fb0..10aa2fca28 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -292,8 +292,7 @@ file_fdw_validator(PG_FUNCTION_ARGS)
 			if (force_not_null)
 ereport(ERROR,
 		(errcode(ERRCODE_SYNTAX_ERROR),
-		 errmsg("conflicting or redundant options"),
-		 errhint("Option \"force_not_null\" supplied more than once for a column.")));
+		 errmsg("option \"%s\" specified more than once", def->defname)));
 			force_not_null = def;
 			/* Don't care what the value is, as long as it's a legal boolean */
 			(void) defGetBoolean(def);
@@ -304,8 +303,7 @@ file_fdw_validator(PG_FUNCTION_ARGS)
 			if (force_null)
 ereport(ERROR,
 		(errcode(ERRCODE_SYNTAX_ERROR),
-		 errmsg("conflicting or redundant options"),
-		 errhint("Option \"force_null\" supplied more than once for a column.")));
+		 errmsg("option \"%s\" specified more than once", def->defname)));
 			force_null = def;
 			(void) defGetBoolean(def);
 		}

Re: Included xid in restoring reorder buffer changes from disk log message

2021-04-30 Thread Dilip Kumar
On Thu, Apr 29, 2021 at 9:45 PM vignesh C  wrote:
>
> Hi,
>
> While debugging one of the logical decoding issues, I found that the xid
> was not included in the "restoring reorder buffer changes from disk" log
> message. Attached is a patch for it. I felt including the XID would be
> helpful in debugging. Thoughts?
>

It makes sense to include the xid in the debug message. Earlier I
wondered whether it would be a good idea to also print the toplevel_xid,
but since this is for debug purposes and, once we have the xid, we can
fetch the other parameters, maybe adding the xid alone is good enough.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Replication slot stats misgivings

2021-04-30 Thread vignesh C
On Fri, Apr 30, 2021 at 5:55 AM Masahiko Sawada  wrote:
>
> On Thu, Apr 29, 2021 at 9:44 PM vignesh C  wrote:
> >
> >
> >
> > On Thu, Apr 29, 2021 at 3:06 PM Amit Kapila  wrote:
> > >
> > > On Wed, Apr 28, 2021 at 5:01 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 4:51 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Wed, Apr 28, 2021 at 6:39 PM Amit Kapila  
> > > > > wrote:
> > > >
> > > > @@ -1369,7 +1369,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb,
> > > > ReorderBufferIterTXNState *state)
> > > >   * Update the total bytes processed before releasing the current set
> > > >   * of changes and restoring the new set of changes.
> > > >   */
> > > > - rb->totalBytes += rb->size;
> > > > + rb->totalBytes += entry->txn->total_size;
> > > >   if (ReorderBufferRestoreChanges(rb, entry->txn, >file,
> > > >   >entries[off].segno))
> > > >
> > > > I have not tested this, but in the above change won't you need to
> > > > check txn->toptxn for subtxns?
> > > >
> > >
> > > Now, I am able to reproduce this issue:
> > > Create table t1(c1 int);
> > > select pg_create_logical_replication_slot('s1', 'test_decoding');
> > > Begin;
> > > insert into t1 values(1);
> > > savepoint s1;
> > > insert into t1 select generate_series(1, 10);
> > > commit;
> > >
> > > postgres=# select count(*) from pg_logical_slot_peek_changes('s1', NULL, 
> > > NULL);
> > >  count
> > > -------
> > >  15
> > > (1 row)
> > >
> > > postgres=# select * from pg_stat_replication_slots;
> > >  slot_name | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes |           stats_reset
> > > -----------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+----------------------------------
> > >  s1        |          0 |           0 |           0 |           0 |            0 |            0 |          2 |    13200672 | 2021-04-29 14:33:55.156566+05:30
> > > (1 row)
> > >
> > > select * from pg_stat_reset_replication_slot('s1');
> > >
> > > Now reduce the logical decoding work mem to allow spilling.
> > > postgres=# set logical_decoding_work_mem='64kB';
> > > SET
> > > postgres=# select count(*) from pg_logical_slot_peek_changes('s1', NULL, 
> > > NULL);
> > >  count
> > > -------
> > >  15
> > > (1 row)
> > >
> > > postgres=# select * from pg_stat_replication_slots;
> > >  slot_name | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes |           stats_reset
> > > -----------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+----------------------------------
> > >  s1        |          1 |         202 |        1320 |           0 |            0 |            0 |          2 |         672 | 2021-04-29 14:35:21.836613+05:30
> > > (1 row)
> > >
> > > You can notice that after we have allowed spilling the 'total_bytes'
> > > stats is showing a different value. The attached patch fixes the issue
> > > for me. Let me know what do you think about this?
> >
> > I found one issue with the following scenario when testing with 
> > logical_decoding_work_mem as 64kB:
> >
> > BEGIN;
> > INSERT INTO t1 values(generate_series(1,1));
> > SAVEPOINT s1;
> > INSERT INTO t1 values(generate_series(1,1));
> > COMMIT;
> > SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot1', NULL,
> > NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> > select * from pg_stat_replication_slots;
> >     slot_name     | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes |           stats_reset
> > ------------------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+----------------------------------
> >  regression_slot1 |          6 |         154 |     9130176 |           0 |            0 |            0 |          1 |     4262016 | 2021-04-29 17:50:00.080663+05:30
> > (1 row)
> >
> > Same thing works fine with logical_decoding_work_mem as 64MB:
> > select * from pg_stat_replication_slots;
> >     slot_name     | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes |           stats_reset
> > ------------------+------------+-------------+-------------+-------------+--------------+--------------+------------+-------------+----------------------------------
> >  regression_slot1 |          6 |         154 |     9130176 |           0 |            0 |            0 |          1 |         264 | 2021-04-29 17:50:00.080663+05:30
> > (1 row)
> >
> > The patch required one change:
> > - rb->totalBytes += rb->size;
> > + if (entry->txn->toptxn)
> > + rb->totalBytes += entry->txn->toptxn->total_size;
> > + else
> > + rb->totalBytes += entry->txn->total_size;
>