Logging the feature of SQL-level read/write commits

2019-05-03 Thread Ronny Ko



Hi,

I am trying to efficiently roll back a manually selected subset of committed SQL transactions by scanning an SQL transaction log. This feature is useful when a database administrator wants to roll back not the entire database system, but only particular SQL statements that affect a certain set of SQL tables. Unfortunately, this is impossible in the current PostgreSQL setup, because PostgreSQL's WAL (Write-Ahead Log) file doesn't provide any SQL statement-level redo records, only physical block-level redo records.

To this end, my goal is to improve PostgreSQL to produce augmented transaction logs. In particular, every committed transaction ID in the augmented transaction log will carry an additional section called "rollback SQL statements": a minimal series of DELETE & INSERT SQL statements that effectively rolls back the transaction to the state of the immediately previous transaction. For example, suppose that we have the following SQL table:

=====================
Table1
 column1 | column2
---------+---------
       1 |      20
       2 |      30
       3 |      40
       4 |      50
=====================

And suppose that the following transaction (transaction ID 100) was committed:

UPDATE Table1
    SET column1 = 10, column2 = 20
    WHERE column2 > 20;



Then, the augmented transaction log file will generate the following log entry for the above committed transaction: 


Committed Transaction ID = 100
Rollback SQL Statements  = 
- DELETE FROM Table1 WHERE column1 = 2 AND column2 = 30
- INSERT INTO Table1 (column1, column2) VALUES (10, 20)
- DELETE FROM Table1 WHERE column1 = 3 AND column2 = 40
- INSERT INTO Table1 (column1, column2) VALUES (10, 20)
- DELETE FROM Table1 WHERE column1 = 4 AND column2 = 50
- INSERT INTO Table1 (column1, column2) VALUES (10, 20)

Note that the above Rollback SQL statements are in the simplest form, without involving any complex SQL operations such as JOINs or sub-queries. Also note that we cannot create the above Rollback SQL statements purely from the original sequence of SQL transactions, because we don't know which rows of Table1 need to be DELETEd without actually scanning the entire Table1 and evaluating Transaction #100's WHERE clause (i.e., column2 > 20) on every single row of Table1. Therefore, to generate a list of simple Rollback SQL statements like the above, we have no choice but to embed this logging feature in the PostgreSQL source code, at the point where the WAL (Write-Ahead Log) file is being updated.
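For illustration, a rough user-level approximation of the same idea (capturing per-row old/new images at write time with a row trigger) could look like the sketch below. This is of course not the WAL-level feature I am proposing, and the log table and function names are made up:

    CREATE TABLE rollback_log (txid bigint, stmt text);

    CREATE FUNCTION capture_rollback_stmts() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- to undo this row's UPDATE: delete the new image, re-insert the old one
        INSERT INTO rollback_log VALUES
            (txid_current(),
             format('DELETE FROM %I WHERE column1 = %L AND column2 = %L',
                    TG_TABLE_NAME, NEW.column1, NEW.column2)),
            (txid_current(),
             format('INSERT INTO %I (column1, column2) VALUES (%L, %L)',
                    TG_TABLE_NAME, OLD.column1, OLD.column2));
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER capture_rollback
        AFTER UPDATE ON Table1
        FOR EACH ROW EXECUTE FUNCTION capture_rollback_stmts();

Doing the same thing inside the server at WAL-write time is what I am after, since a trigger has to be created per table and adds overhead to the foreground transaction.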

Since current PostgreSQL doesn't support this feature, I plan to implement it in the source code. But I have never worked on the PostgreSQL source code before, and I wonder if anybody could give me a hint on which source files (and functions) are responsible for recording redo records in the WAL file. In particular, when the server records the updated block locations & values into the WAL file for each SQL statement that modifies any relation, we could additionally make it write the list of the simplest INSERT & DELETE SQL statements that effectively enforce those table write operations. If such SQL-level enforcement information is available in the WAL file, one can easily derive the corresponding Rollback (i.e., inverse) SQL statements from it.
Thanks for any comments.

Ronny






Re: First-draft release notes for back branches are up

2019-05-03 Thread Thomas Munro
On Sat, May 4, 2019 at 1:29 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Sat, May 4, 2019 at 10:29 AM Tom Lane  wrote:
> > +  Tolerate EINVAL and ENOSYS
> > +  error results, where appropriate, for fsync calls (Thomas Munro,
> > +  James Sewell)
>
> > Nit-picking: ENOSYS is for sync_file_range, EINVAL is for fsync.
>
> Yeah, I didn't really think it was worth distinguishing.  If there
> is some more general term that covers both calls, maybe we should
> use that?

I would just do s/fsync/fsync and sync_file_range/.  And I guess also
wrap them in ?

-- 
Thomas Munro
https://enterprisedb.com




Re: improving wraparound behavior

2019-05-03 Thread Euler Taveira
On Fri, May 3, 2019 at 17:27, Robert Haas wrote:
>
> HINT:  Commit or roll back old prepared transactions, drop stale
> replication slots, or kill long-running sessions.
> Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.
>
First of all, +1 for this patch. But after reading this email, I
couldn't resist sharing an idea about the stale XID horizon. [A cup of
tea...] I realized that there is no automatic way to recover from old
prepared transactions, stale replication slots, or even long-running
sessions when we have a wraparound situation. Wouldn't it make sense to
add a parameter that recovers from a stale XID horizon? I mean, if we
reach a critical situation (xidStopLimit), free the resources that are
preventing the XID from advancing (and hope autovacuum has enough time
to prevent a stop-assigning-xids situation). The situation is analogous
to the OOM killer, which kills processes that are starving the system
of resources.
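
Something along these lines, purely as a sketch (no such setting exists
today; the name and values are hypothetical):

    # hypothetical GUC, not an existing parameter
    xid_stop_limit_action = 'none'   # none | drop_slots | rollback_prepared | kill_sessions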


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: improving wraparound behavior

2019-05-03 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-05-03 22:41:11 -0400, Stephen Frost wrote:
> > I suppose it is a pretty big change in the base autovacuum launcher to
> > be something that's run per database instead and then deal with the
> > coordination between the two...  but I can't help but feel like it
> > wouldn't be that much *work*.  I'm not against doing something smaller
> > but was something smaller actually proposed for this specific issue..?
> 
> I think it'd be fairly significant. And that we should redo it from
> scratch if we go there - because what we have isn't worth using as a
> basis.

Alright, what I'm hearing here is that we should probably have a
dedicated thread for this discussion, if someone has the cycles to spend
on it.  I'm not against that.

> > > I'm thinking that we'd do something roughly like (in actual code) for
> > > GetNewTransactionId():
> > > 
> > > TransactionId dat_limit = ShmemVariableCache->oldestXid;
> > > TransactionId slot_limit = Min(replication_slot_xmin, 
> > > replication_slot_catalog_xmin);
> > > Transactionid walsender_limit;
> > > Transactionid prepared_xact_limit;
> > > Transactionid backend_limit;
> > > 
> > > ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit,
> > > &backend_limit);
> > > 
> > > if (IsOldest(dat_limit))
> > >ereport(elevel,
> > >errmsg("close to xid wraparound, held back by database 
> > > %s"),
> > >errdetail("current xid %u, horizon for database %u, 
> > > shutting down at %u"),
> > >errhint("..."));
> > > else if (IsOldest(slot_limit))
> > >   ereport(elevel, errmsg("close to xid wraparound, held back by 
> > > replication slot %s"),
> > >   ...);
> > > 
> > > where IsOldest wouldn't actually compare plainly numerically, but would
> > > actually prefer showing the slot, backend, walsender, prepared_xact, as
> > > long as they are pretty close to the dat_limit - as in those cases
> > > vacuuming wouldn't actually solve the issue, unless the other problems
> > > are addressed first (as autovacuum won't compute a cutoff horizon that's
> > > newer than any of those).
> > 
> > Where the errhint() above includes a recommendation to run the SRF
> > described below, I take it?
> 
> Not necessarily. I feel conciseness is important too, and this would be
> the most important thing to tackle.

I'm imagining a relatively rare scenario, just to be clear, where
"pretty close to the dat_limit" would apply to more than just one thing.

> > Also, should this really be an 'else if', or should it be just a set of
> > 'if()'s, thereby giving users more info right up-front?
> 
> Possibly? But it'd also make it even harder to read the log / for the
> system to keep up with the logging, because we already log *so* much when
> close to wraparound.

Yes, we definitely log a *lot*, and probably too much since other
critical messages might get lost in the noise.

> If we didn't order it, it'd be hard for users to figure out which to
> address first. If we ordered it, people have to look further up in the log to
> figure out which is the most urgent one (unless we reverse the order,
> which is odd too).

This makes me think we should both order it and combine it into one
message... but that'd then be pretty difficult to deal with,
potentially, from a translation standpoint and just from a "wow, that's
a huge log message" standpoint, which is kind of the idea behind the
SRF: to give you all that info in a more easily digestible manner.

Not sure I've got any great ideas on how to improve on this.  I do think
that if we know that there are multiple different things that are within a
small number of xids of the oldest xmin then we should notify the user
about all of them, either directly in the error messages or by referring
them to the SRF, so they have the opportunity to address them all, or
at least know about them all.  As mentioned though, it's likely to be
quite a rare thing to run into, so you'd have to be extra unlucky to even
hit this case and perhaps the extra code complication just isn't worth
it.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: improving wraparound behavior

2019-05-03 Thread Andres Freund
Hi,

On 2019-05-03 22:41:11 -0400, Stephen Frost wrote:
> I suppose it is a pretty big change in the base autovacuum launcher to
> be something that's run per database instead and then deal with the
> coordination between the two...  but I can't help but feel like it
> wouldn't be that much *work*.  I'm not against doing something smaller
> but was something smaller actually proposed for this specific issue..?

I think it'd be fairly significant. And that we should redo it from
scratch if we go there - because what we have isn't worth using as a
basis.


> > I'm thinking that we'd do something roughly like (in actual code) for
> > GetNewTransactionId():
> > 
> > TransactionId dat_limit = ShmemVariableCache->oldestXid;
> > TransactionId slot_limit = Min(replication_slot_xmin, 
> > replication_slot_catalog_xmin);
> > Transactionid walsender_limit;
> > Transactionid prepared_xact_limit;
> > Transactionid backend_limit;
> > 
> > ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit,
> > &backend_limit);
> > 
> > if (IsOldest(dat_limit))
> >ereport(elevel,
> >errmsg("close to xid wraparound, held back by database %s"),
> >errdetail("current xid %u, horizon for database %u, shutting 
> > down at %u"),
> >errhint("..."));
> > else if (IsOldest(slot_limit))
> >   ereport(elevel, errmsg("close to xid wraparound, held back by 
> > replication slot %s"),
> >   ...);
> > 
> > where IsOldest wouldn't actually compare plainly numerically, but would
> > actually prefer showing the slot, backend, walsender, prepared_xact, as
> > long as they are pretty close to the dat_limit - as in those cases
> > vacuuming wouldn't actually solve the issue, unless the other problems
> > are addressed first (as autovacuum won't compute a cutoff horizon that's
> > newer than any of those).
> 
> Where the errhint() above includes a recommendation to run the SRF
> described below, I take it?

Not necessarily. I feel conciseness is important too, and this would be
the most important thing to tackle.


> Also, should this really be an 'else if', or should it be just a set of
> 'if()'s, thereby giving users more info right up-front?

Possibly? But it'd also make it even harder to read the log / for the
system to keep up with the logging, because we already log *so* much when
close to wraparound.

If we didn't order it, it'd be hard for users to figure out which to
address first. If we ordered it, people have to look further up in the log to
figure out which is the most urgent one (unless we reverse the order,
which is odd too).


Greetings,

Andres Freund




Re: improving wraparound behavior

2019-05-03 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-05-03 22:03:18 -0400, Stephen Frost wrote:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> > > I am not sure exactly how to fix this,
> > > because the calculation we use to determine the XID that can be used
> > > to vacuum a specific table is pretty complex; how can the postmaster
> > > know whether it's going to be able to make any progress in *any* table
> > > in some database to which it's not even connected?  But it's surely
> > > crazy to just keep doing something over and over that can't possibly
> > > work.
> >
> > I definitely agree that it's foolish to keep doing something that isn't
> > going to work, and it seems like a pretty large part of the issue here
> > is that we don't have enough information to be more intelligent because
> > we aren't connected to the database that needs the work to be done.
> >
> > Now, presuming we're talking about 'new feature work' here to try and
> > address this, and not something that we think we can back-patch, I had
> > another thought.
> >
> > I certainly get that having lots of extra processes around can be a
> > waste of resources... but I don't recall a lot of people complaining
> > about the autovacuum launcher process using up lots of resources
> > unnecessarily.
> >
> > Perhaps we should consider having the 'global' autovacuum launcher, when
> > it's decided that a database needs work to be done, launch a 'per-DB'
> > launcher which manages launching autovacuum processes for that database?
> > If the launcher is still running then it's because there's still work to
> > be done on that database and the 'global' autovacuum launcher can skip
> > it.  If the 'per-DB' launcher runs out of things to do, and the database
> > it's working on is no longer in a danger zone, then it exits.
> >
> > There are certainly some other variations on this idea and I don't know
> > that it's really better than keeping more information in shared
> > memory or something else, but it seems like part of the issue is that
> > the thing firing off the processes hasn't got enough info to do so
> > intelligently and maybe we could fix that by having per-DB launchers
> > that are actually connected to a DB.
> 
> This sounds a lot more like a wholesale redesign than some small
> incremental work.  Which I think we should do, but we probably ought to
> do something more minimal before the resources for something like this
> are there. Perfect being the enemy of the good, and all that.

I suppose it is a pretty big change in the base autovacuum launcher to
be something that's run per database instead and then deal with the
coordination between the two...  but I can't help but feel like it
wouldn't be that much *work*.  I'm not against doing something smaller
but was something smaller actually proposed for this specific issue..?

> > I might have misunderstood it, but it sounded like Andres was
> > proposing a new function which would basically tell you what's holding
> > back the xid horizon and that sounds fantastic and would be great to
> > include in this message, if possible.
> >
> > As in:
> >
> > HINT:  Run the function pg_what_is_holding_xmin_back() to identify what
> > is preventing autovacuum from progressing and address it.
> 
> I was basically thinking of doing *both*, amending the message, *and*
> having a new UDF.
> 
> Basically, instead of the current:
> 
> char   *oldest_datname = get_database_name(oldest_datoid);
> 
> /* complain even if that DB has disappeared */
> if (oldest_datname)
> ereport(WARNING,
> (errmsg("database \"%s\" must be vacuumed within %u 
> transactions",
> oldest_datname,
> xidWrapLimit - xid),
>  errhint("To avoid a database shutdown, execute a 
> database-wide VACUUM in that database.\n"
>  "You might also need to commit or roll back 
> old prepared transactions, or drop stale replication slots.")));
> else
> ereport(WARNING,
> (errmsg("database with OID %u must be vacuumed within 
> %u transactions",
> oldest_datoid,
> xidWrapLimit - xid),
>  errhint("To avoid a database shutdown, execute a 
> database-wide VACUUM in that database.\n"
>  "You might also need to commit or roll back 
> old prepared transactions, or drop stale replication slots.")));
> }
> 
> which is dramatically unhelpful, because often vacuuming won't do squat
> (because of old [prepared] transaction, replication connection, or slot).
> 
> I'm thinking that we'd do something roughly like (in actual code) for
> GetNewTransactionId():
> 
> TransactionId dat_limit = ShmemVariableCache->oldestXid;
> TransactionId slot_limit = 

Re: improving wraparound behavior

2019-05-03 Thread Andres Freund
Hi,

On 2019-05-03 22:03:18 -0400, Stephen Frost wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
> > I am not sure exactly how to fix this,
> > because the calculation we use to determine the XID that can be used
> > to vacuum a specific table is pretty complex; how can the postmaster
> > know whether it's going to be able to make any progress in *any* table
> > in some database to which it's not even connected?  But it's surely
> > crazy to just keep doing something over and over that can't possibly
> > work.
>
> I definitely agree that it's foolish to keep doing something that isn't
> going to work, and it seems like a pretty large part of the issue here
> is that we don't have enough information to be more intelligent because
> we aren't connected to the database that needs the work to be done.
>
> Now, presuming we're talking about 'new feature work' here to try and
> address this, and not something that we think we can back-patch, I had
> another thought.
>
> I certainly get that having lots of extra processes around can be a
> waste of resources... but I don't recall a lot of people complaining
> about the autovacuum launcher process using up lots of resources
> unnecessarily.
>
> Perhaps we should consider having the 'global' autovacuum launcher, when
> it's decided that a database needs work to be done, launch a 'per-DB'
> launcher which manages launching autovacuum processes for that database?
> If the launcher is still running then it's because there's still work to
> be done on that database and the 'global' autovacuum launcher can skip
> it.  If the 'per-DB' launcher runs out of things to do, and the database
> it's working on is no longer in a danger zone, then it exits.
>
> There are certainly some other variations on this idea and I don't know
> that it's really better than keeping more information in shared
> memory or something else, but it seems like part of the issue is that
> the thing firing off the processes hasn't got enough info to do so
> intelligently and maybe we could fix that by having per-DB launchers
> that are actually connected to a DB.

This sounds a lot more like a wholesale redesign than some small
incremental work.  Which I think we should do, but we probably ought to
do something more minimal before the resources for something like this
are there. Perfect being the enemy of the good, and all that.


> I might have misunderstood it, but it sounded like Andres was
> proposing a new function which would basically tell you what's holding
> back the xid horizon and that sounds fantastic and would be great to
> include in this message, if possible.
>
> As in:
>
> HINT:  Run the function pg_what_is_holding_xmin_back() to identify what
> is preventing autovacuum from progressing and address it.

I was basically thinking of doing *both*, amending the message, *and*
having a new UDF.

Basically, instead of the current:

char   *oldest_datname = get_database_name(oldest_datoid);

/* complain even if that DB has disappeared */
if (oldest_datname)
ereport(WARNING,
(errmsg("database \"%s\" must be vacuumed within %u 
transactions",
oldest_datname,
xidWrapLimit - xid),
 errhint("To avoid a database shutdown, execute a 
database-wide VACUUM in that database.\n"
 "You might also need to commit or roll back 
old prepared transactions, or drop stale replication slots.")));
else
ereport(WARNING,
(errmsg("database with OID %u must be vacuumed within 
%u transactions",
oldest_datoid,
xidWrapLimit - xid),
 errhint("To avoid a database shutdown, execute a 
database-wide VACUUM in that database.\n"
 "You might also need to commit or roll back 
old prepared transactions, or drop stale replication slots.")));
}

which is dramatically unhelpful, because often vacuuming won't do squat
(because of old [prepared] transaction, replication connection, or slot).

I'm thinking that we'd do something roughly like (in actual code) for
GetNewTransactionId():

TransactionId dat_limit = ShmemVariableCache->oldestXid;
TransactionId slot_limit = Min(replication_slot_xmin, 
replication_slot_catalog_xmin);
Transactionid walsender_limit;
Transactionid prepared_xact_limit;
Transactionid backend_limit;

ComputeOldestXminFromProcarray(&walsender_limit, &prepared_xact_limit,
&backend_limit);

if (IsOldest(dat_limit))
   ereport(elevel,
   errmsg("close to xid wraparound, held back by database %s"),
   errdetail("current xid %u, horizon for database %u, shutting 
down at %u"),
   errhint("..."));
else if (IsOldest(slot_limit))
  ereport(elevel, errmsg("close to xid wraparound, held back by 

Re: improving wraparound behavior

2019-05-03 Thread Andres Freund
Hi,

On 2019-05-03 22:11:08 -0400, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > I don't think we necessarily need a new WAL record for what I'm
> > describing above (as XLOG_SMGR_TRUNCATE already carries information
> > about which forks are truncated, we could just have it acquire the
> > exclusive lock), and I don't think we'd need a ton of code for eliding
> > the WAL logged lock either.  Think the issue with backpatching would be
> > that we can't remove the logged lock, without creating hazards for
> > standbys running older versions of postgres.
> 
> While it's pretty rare, I don't believe this would be the only case of
> "you need to upgrade your replicas before your primary" due to changes
> in WAL.  Of course, we need to make sure that we actually figure out
> that the WAL being sent is something that the replica doesn't know how
> to properly handle because it's from a newer primary; we can't simply do
> the wrong thing in that case.

Don't think this is severe enough to warrant a step like this.  I think
for the back-patch case we could live with
a) move truncation to after the rest of vacuum
b) don't truncate if it'd error out anyway, but log an error

Greetings,

Andres Freund




Re: improving wraparound behavior

2019-05-03 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> I don't think we necessarily need a new WAL record for what I'm
> describing above (as XLOG_SMGR_TRUNCATE already carries information
> about which forks are truncated, we could just have it acquire the
> exclusive lock), and I don't think we'd need a ton of code for eliding
> the WAL logged lock either.  Think the issue with backpatching would be
> that we can't remove the logged lock, without creating hazards for
> standbys running older versions of postgres.

While it's pretty rare, I don't believe this would be the only case of
"you need to upgrade your replicas before your primary" due to changes
in WAL.  Of course, we need to make sure that we actually figure out
that the WAL being sent is something that the replica doesn't know how
to properly handle because it's from a newer primary; we can't simply do
the wrong thing in that case.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: improving wraparound behavior

2019-05-03 Thread Andres Freund
Hi,

On 2019-05-03 21:36:24 -0400, Robert Haas wrote:
> On Fri, May 3, 2019 at 8:45 PM Andres Freund  wrote:
> > Part of my opposition to just disabling it when close to a wraparound,
> > is that it still allows to get close to wraparound because of truncation
> > issues.
> 
> Sure ... it would definitely be better if vacuum didn't consume XIDs
> when it truncates.  On the other hand, only a minority of VACUUM
> operations will truncate, so I don't think there's really a big
> problem in practice here.

I've seen a number of out-of-xid shutdowns precisely because of
truncations. Initially because autovacuum commits suicide because
somebody else wants a conflicting lock, later because there's so much
dead space that people kill (auto)vacuum to get rid of the exclusive
locks.


> > IMO preventing getting closer to wraparound is more important
> > than making it more "comfortable" to be in a wraparound situation.
> 
> I think that's a false dichotomy.  It's impossible to create a
> situation where no user ever gets into a wraparound situation, unless
> we're prepared to do things like automatically drop replication slots
> and automatically roll back (or commit?) prepared transactions.  So,
> while it is good to prevent a user from getting into a wraparound
> situation where we can, it is ALSO good to make it easy to recover
> from those situations as painlessly as possible when they do happen.

Sure, but I've seen a number of real-world cases of xid wraparound
shutdowns related to truncations, and no real world problem due to
truncations assigning an xid.


> > The second problem I see is that even somebody close to a wraparound
> > might have an urgent need to free up some space. So I'm a bit wary of
> > just disabling it.
> 
> I would find that ... really surprising.  If you have < 11 million
> XIDs left before your data gets eaten by a grue, and you file a bug
> report complaining that vacuum won't truncate your tables until you
> catch up on vacuuming a bit, I am prepared to offer you no sympathy at
> all.

I've seen wraparound issues triggered by auto-vacuum generating so much
WAL that the system ran out of space, crash-restart, repeat. And being
unable to reclaim space could make that even harder to tackle.


> > Wonder if there's a reasonable way that'd allow to do the WAL logging
> > for the truncation without using an xid. One way would be to just get
> > rid of the lock on the primary as previously discussed. But we could
> > also drive the locking through the WAL records that do the actual
> > truncation - then there'd not be a need for an xid.  It's probably not an
> > entirely trivial change, but I don't think it'd be too bad?
> 
> Beats me.  For me, this is just a bug, not an excuse to redesign
> vacuum truncation.  Before Hot Standby, when you got into severe
> wraparound trouble, you could vacuum all your tables without consuming
> any XIDs.  Now you can't.  That's bad, and I think we should come up
> with some kind of back-patchable solution to that problem.

I agree we need to do at least a minimal version that can be
backpatched.

I don't think we necessarily need a new WAL record for what I'm
describing above (as XLOG_SMGR_TRUNCATE already carries information
about which forks are truncated, we could just have it acquire the
exclusive lock), and I don't think we'd need a ton of code for eliding
the WAL logged lock either.  Think the issue with backpatching would be
that we can't remove the logged lock, without creating hazards for
standbys running older versions of postgres.

Greetings,

Andres Freund




Re: improving wraparound behavior

2019-05-03 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> I am not sure exactly how to fix this,
> because the calculation we use to determine the XID that can be used
> to vacuum a specific table is pretty complex; how can the postmaster
> know whether it's going to be able to make any progress in *any* table
> in some database to which it's not even connected?  But it's surely
> crazy to just keep doing something over and over that can't possibly
> work.

I definitely agree that it's foolish to keep doing something that isn't
going to work, and it seems like a pretty large part of the issue here
is that we don't have enough information to be more intelligent because
we aren't connected to the database that needs the work to be done.

Now, presuming we're talking about 'new feature work' here to try and
address this, and not something that we think we can back-patch, I had
another thought.

I certainly get that having lots of extra processes around can be a
waste of resources... but I don't recall a lot of people complaining
about the autovacuum launcher process using up lots of resources
unnecessarily.

Perhaps we should consider having the 'global' autovacuum launcher, when
it's decided that a database needs work to be done, launch a 'per-DB'
launcher which manages launching autovacuum processes for that database?
If the launcher is still running then it's because there's still work to
be done on that database and the 'global' autovacuum launcher can skip
it.  If the 'per-DB' launcher runs out of things to do, and the database
it's working on is no longer in a danger zone, then it exits.

There are certainly some other variations on this idea and I don't know
that it's really better than keeping more information in shared
memory or something else, but it seems like part of the issue is that
the thing firing off the processes hasn't got enough info to do so
intelligently and maybe we could fix that by having per-DB launchers
that are actually connected to a DB.

> 2. Once you get to the point where you start to emit errors when
> attempting to assign an XID, you can still run plain old VACUUM
> because it doesn't consume an XID ... except that if it tries to
> truncate the relation, then it will take AccessExclusiveLock, which
> has to be logged, which forces an XID assignment, which makes VACUUM
> fail.  So if you burn through XIDs until the system gets to this
> point, and then you roll back the prepared transaction that caused the
> problem in the first place, autovacuum sits there trying to vacuum
> tables in a tight loop and fails over and over again as soon as hits a
> table that it thinks needs to be truncated.  This seems really lame,
> and also easily fixed.
> 
> Attached is a patch that disables vacuum truncation if xidWarnLimit
> has been reached.  With this patch, in my testing, autovacuum is able
> to recover the system once the prepared transaction has been rolled
> back.  Without this patch, not only does that not happen, but if you
> had a database with enough relations that need truncation, you could
> conceivably cause XID wraparound just from running a database-wide
> VACUUM, the one tool you have available to avoid XID wraparound.  I
> think that this amounts to a back-patchable bug fix.

Sounds reasonable to me but I've not looked at the patch at all.

> 3. The message you get when you hit xidStopLimit seems like bad advice to me:
> 
> ERROR:  database is not accepting commands to avoid wraparound data
> loss in database "%s"
> HINT:  Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions,
> or drop stale replication slots.
> 
> Why do we want people to stop the postmaster and vacuum that database
> in single user mode?  Why not just run VACUUM in multi-user mode, or
> let autovacuum take care of the problem?  Granted, if VACUUM is going
> to fail in multi-user mode, and if switching to single-user mode is
> going to make it succeed, then it's a good suggestion.  But it seems
> that it doesn't fail in multi-user mode, unless it tries to truncate
> something, which is a bug we should fix.  Telling people to go to
> single-user mode where they can continue to assign XIDs even though
> they have almost no XIDs left seems extremely dangerous, actually.
> 
> Also, I think that old prepared transactions and stale replication
> slots should be emphasized more prominently.  Maybe something like:
> 
> HINT:  Commit or roll back old prepared transactions, drop stale
> replication slots, or kill long-running sessions.
> Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.

I agree that a better message would definitely be good and that
recommending single-user isn't a terribly useful thing to do.  I might
have misunderstood it, but it sounded like Andres was proposing a new
function which would basically tell you what's holding back the xid
horizon and that sounds fantastic and would be 

Re: range_agg

2019-05-03 Thread David Fetter
On Fri, May 03, 2019 at 03:56:41PM -0700, Paul Jungwirth wrote:
> Hello,
> 
> I wrote an extension to add a range_agg function with similar behavior to
> existing *_agg functions, and I'm wondering if folks would like to have it
> in core? Here is the repo: https://github.com/pjungwir/range_agg
> 
> I'm also working on a patch for temporal foreign keys, and having range_agg
> would make the FK check easier and faster, which is why I'd like to get it
> added. But also it just seems useful, like array_agg, json_agg, etc.
> 
> One question is how to aggregate ranges that would leave gaps and/or
> overlaps.

This suggests a few different ways to extend ranges over aggregation:
one is a union of (in general) disjoint intervals; two others are
unions of intervals, each of which carries a weight.  Please pardon
the ASCII art.

The aggregation of:

[1, 4)
  [2, 5)
 [8, 10)

could turn into:

{[1,5), [8, 10)} (union without weight)

{{[1,2),1}, {[2,4),2}, {[4,5),1}, {[8,10),1}} (strictly positive weights which 
don't (in general) cover the space)

{{[1,2),1}, {[2,4),2}, {[4,5),1}, {[5,8),0}, {[8,10),1}} (non-negative weights 
which guarantee the space is covered)

There is no principled reason to choose one over the others.
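
FWIW, the weighted decompositions can already be computed with plain SQL
by sweeping over the range boundaries; a sketch, assuming the inputs are
int4range values in a table t(r):

    WITH bounds AS (
        SELECT DISTINCT b
        FROM t, LATERAL (VALUES (lower(r)), (upper(r))) AS v(b)
    ), segs AS (
        SELECT b AS lo, lead(b) OVER (ORDER BY b) AS hi FROM bounds
    )
    SELECT int4range(lo, hi) AS seg,
           (SELECT count(*) FROM t WHERE r && int4range(lo, hi)) AS weight
    FROM segs
    WHERE hi IS NOT NULL;

Dropping the weight-0 rows gives the strictly positive variant, and merging
adjacent positive-weight segments gives the plain union; a built-in
aggregate would mostly be about making this convenient and fast.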

> What do people think? I plan to work on a patch regardless, so that I can
> use it for temporal FKs, but I'd appreciate some feedback on the "user
> interface".

I think the cases above, or at least the first two of them, should be
available. They could be called range_agg, weighted_range_agg, and
covering_range_agg.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: improving wraparound behavior

2019-05-03 Thread Robert Haas
On Fri, May 3, 2019 at 8:45 PM Andres Freund  wrote:
> Part of my opposition to just disabling it when close to a wraparound,
> is that it still allows to get close to wraparound because of truncation
> issues.

Sure ... it would definitely be better if vacuum didn't consume XIDs
when it truncates.  On the other hand, only a minority of VACUUM
operations will truncate, so I don't think there's really a big
problem in practice here.

> IMO preventing getting closer to wraparound is more important
> than making it more "comfortable" to be in a wraparound situation.

I think that's a false dichotomy.  It's impossible to create a
situation where no user ever gets into a wraparound situation, unless
we're prepared to do things like automatically drop replication slots
and automatically roll back (or commit?) prepared transactions.  So,
while it is good to prevent a user from getting into a wraparound
situation where we can, it is ALSO good to make it easy to recover
from those situations as painlessly as possible when they do happen.

> The second problem I see is that even somebody close to a wraparound
> might have an urgent need to free up some space. So I'm a bit wary of
> just disabling it.

I would find that ... really surprising.  If you have < 11 million
XIDs left before your data gets eaten by a grue, and you file a bug
report complaining that vacuum won't truncate your tables until you
catch up on vacuuming a bit, I am prepared to offer you no sympathy at
all.  I mean, I'm not going to say that we couldn't invent more
complicated behavior, at least on master, like making the new VACUUM
(TRUNCATE) object ternary-valued: default is on when you have more
than 11 million XIDs remaining and off otherwise, but you can override
either value by saying VACUUM (TRUNCATE { ON | OFF }).  But is that
really a thing?  People have less than 11 million XIDs left and
they're like "forget anti-wraparound, I want to truncate away some
empty pages"?

> Wonder if there's a reasonable way that'd allow to do the WAL logging
> for the truncation without using an xid. One way would be to just get
> rid of the lock on the primary as previously discussed. But we could
> also drive the locking through the WAL records that do the actual
> truncation - then there'd not be a need for an xid.  It's probably not an
> entirely trivial change, but I don't think it'd be too bad?

Beats me.  For me, this is just a bug, not an excuse to redesign
vacuum truncation.  Before Hot Standby, when you got into severe
wraparound trouble, you could vacuum all your tables without consuming
any XIDs.  Now you can't.  That's bad, and I think we should come up
with some kind of back-patchable solution to that problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: First-draft release notes for back branches are up

2019-05-03 Thread Tom Lane
Thomas Munro  writes:
> On Sat, May 4, 2019 at 10:29 AM Tom Lane  wrote:
> +  Tolerate EINVAL and ENOSYS
> +  error results, where appropriate, for fsync calls (Thomas Munro,
> +  James Sewell)

> Nit-picking: ENOSYS is for sync_file_range, EINVAL is for fsync.

Yeah, I didn't really think it was worth distinguishing.  If there
is some more general term that covers both calls, maybe we should
use that?

regards, tom lane




Re: First-draft release notes for back branches are up

2019-05-03 Thread Thomas Munro
On Sat, May 4, 2019 at 10:29 AM Tom Lane  wrote:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8b3bce2017b15e05f000c3c5947653a3e2c5a29f
>
> Please send any corrections by Sunday.

+  Tolerate EINVAL and ENOSYS
+  error results, where appropriate, for fsync calls (Thomas Munro,
+  James Sewell)

Nit-picking: ENOSYS is for sync_file_range, EINVAL is for fsync.

-- 
Thomas Munro
https://enterprisedb.com




Re: improving wraparound behavior

2019-05-03 Thread Andres Freund
Hi,

On 2019-05-03 18:42:35 -0400, Robert Haas wrote:
> On Fri, May 3, 2019 at 4:47 PM Andres Freund  wrote:
> > I'd actually say the proper fix would be to instead move the truncation
> > to *after* finishing updating relfrozenxid etc.  If we truncate, the
> > additional cost of another in-place pg_class update, to update relpages,
> > is basically insignificant.  And the risk of errors, or being cancelled,
> > during truncation is much higher than before (due to the AEL).
> 
> That would prevent the ERROR from impeding relfrozenxid advancement,
> but it does not prevent the error itself, nor the XID consumption.  If
> autovacuum is hitting that ERROR, it will spew errors in the log but
> succeed in advancing relfrozenxid anyway.  I don't think that's as
> nice as the behavior I proposed, but it's certainly better than the
> status quo.  If you are hitting that error due to a manual VACUUM,
> under your proposal, you'll stop the manual VACUUM as soon as you hit
> the first table where this happens, which is not what you want.
> You'll also keep consuming XIDs, which is not what you want either,
> especially if you are in single-user mode because the number of
> remaining XIDs is less than a million.

Part of my opposition to just disabling it when close to a wraparound,
is that it still allows to get close to wraparound because of truncation
issues. IMO preventing getting closer to wraparound is more important
than making it more "comfortable" to be in a wraparound situation.

The second problem I see is that even somebody close to a wraparound
might have an urgent need to free up some space. So I'm a bit wary of
just disabling it.

Wonder if there's a reasonable way that'd allow to do the WAL logging
for the truncation without using an xid. One way would be to just get
rid of the lock on the primary as previously discussed. But we could
also drive the locking through the WAL records that do the actual
truncation - then there'd not be a need for an xid.  It's probably not an
entirely trivial change, but I don't think it'd be too bad?


> > > Also, I think that old prepared transactions and stale replication
> > > slots should be emphasized more prominently.  Maybe something like:
> > >
> > > HINT:  Commit or roll back old prepared transactions, drop stale
> > > replication slots, or kill long-running sessions.
> > > Ensure that autovacuum is progressing, or run a manual database-wide 
> > > VACUUM.
> >
> > I think it'd be good to instead compute what the actual problem is. It'd
> > not be particularly hard to show some these in the errdetail:
> >
> > 1) the xid horizon (xid + age) of the problematic database; potentially,
> >if connected to that database, additionally compute what the oldest
> >xid is (although that's computationally potentially too expensive)

s/oldest xid/oldest relfrozenxid/


> > 2) the xid horizon (xid + age) due to prepared transactions, and the
> >oldest transaction's name
> > 3) the xid horizon (xid + age) due to replication slot, and the "oldest"
> >slot's name
> > 4) the xid horizon (xid + age) and pid for the connection with the
> >oldest snapshot.
> >
> > I think that'd allow users much much easier to pinpoint what's going on.
> >
> > In fact, I think we probably should additionally add a function that can
> > display the above. That'd make it much easier to write monitoring
> > queries.
> 
> I think that the error and hint that you get from
> GetNewTransactionId() has to be something that we can generate very
> quickly, without doing anything that might hang on cluster with lots
> of databases or lots of relations; but if there's useful detail we can
> display there, that's good.  With a view, it's more OK if it takes a
> long time on a big cluster.

Yea, I agree it has to be reasonably fast. But all of the above, with
the exception of the optional "oldest table", should be cheap enough to
compute. Sure, a scan through PGXACT isn't cheap, but in comparison to
an ereport() and an impending shutdown it's peanuts. In contrast to
scanning a pg_class that could be many many gigabytes.

Greetings,

Andres Freund




accounting for memory used for BufFile during hash joins

2019-05-03 Thread Tomas Vondra

Hi,

I'm starting this thread mostly to keep track of patches developed in
response to issue [1] reported on pgsql-performance. The symptoms are
very simple - query performing a hash join ends up using much more
memory than expected (pretty much ignoring work_mem), and possibly
ending up with OOM.

The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB as it includes PGAlignedBuffer.

This is not ideal even if we happen to estimate everything correctly,
because for example with work_mem=4MB and nbatch=1024, it means we'll
use about 16MB (2*8kB*1024) for the BufFile structures alone, plus the
work_mem for hash table itself.

But it can easily explode when we under-estimate the hash side. In the
pgsql-performance message, the hash side (with the patches applied,
allowing the query to complete) looks like this:

 Hash (cost=2823846.37..2823846.37 rows=34619 width=930)
  (actual time=252946.367..252946.367 rows=113478127 loops=1)

So it's 3277x under-estimated. It starts with 16 batches, and ends up
adding more and more batches until it fails with 524288 of them (it gets
to that many batches because some of the values are very common and we
don't disable the growth earlier).

The OOM is not very surprising, because with 524288 batches it'd need
about 8GB of memory, and the system only has 8GB RAM installed.
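
As a quick sanity check of those numbers (two BufFiles per batch at ~8kB
each), runnable as plain SQL:

    SELECT nbatch, pg_size_pretty(nbatch * 2 * 8192::bigint) AS buffile_memory
    FROM (VALUES (1024), (524288)) AS v(nbatch);
    -- 1024 batches   -> ~16MB of BufFile buffers
    -- 524288 batches -> ~8GB, i.e. the OOM described above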

The two attached patches both account for the BufFile memory, but then
use very different strategies when the work_mem limit is reached.

The first patch realizes it's impossible to keep adding batches without
breaking the work_mem limit, because at some point the BufFile will need
more memory than that. But it does not make sense to stop adding batches
entirely, because then the hash table could grow indefinitely.

So the patch abandons the idea of enforcing work_mem in this situation,
and instead attempts to minimize memory usage over time - it increases
the spaceAllowed in a way that ensures doubling the number of batches
actually reduces memory usage in the long run.

The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra slices can't be counted into work_mem, but we should need
just very few of them. For example with work_mem=4MB the slice is 128
batches, so we need 128x fewer overflow files (compared to per-batch).


Neither of those patches tweaks ExecChooseHashTableSize() to consider
memory needed for BufFiles while deciding how many batches will be
needed. That's something that probably needs to happen, but it would not
help with the underestimate issue.

I'm not entirely sure which of those approaches is the right one. The
first one is clearly just "damage control" for cases where the hash
side turned out to be much larger than we expected. With good estimates
we probably would not have picked a hash join for those (that is, we
should have realized we can't stay within work_mem and prohibited the
hash join).

The second patch, however, makes hash join viable for some of those cases,
and it seems to work pretty well (there are some numbers in the message
posted to the pgsql-performance thread). So I kinda like this second one.

It's all just PoC quality, at this point, far from committable state.


[1] 
https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6ffaa751f2..4d5a6872cc 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -80,6 +80,7 @@ static bool ExecParallelHashTuplePrealloc(HashJoinTable 
hashtable,
 static void ExecParallelHashMergeCounters(HashJoinTable hashtable);
 static void ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable);
 
+static void ExecHashUpdateSpacePeak(HashJoinTable hashtable);
 
 /* 
  * ExecHash
@@ -193,10 +194,8 @@ MultiExecPrivateHash(HashState *node)
if (hashtable->nbuckets != hashtable->nbuckets_optimal)
ExecHashIncreaseNumBuckets(hashtable);
 
-   /* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
-   hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
-   if (hashtable->spaceUsed > hashtable->spacePeak)
-   hashtable->spacePeak = hashtable->spaceUsed;
+   /* refresh info about peak used memory */
+   ExecHashUpdateSpacePeak(hashtable);
 
hashtable->partialTuples = hashtable->totalTuples;
 }
@@ -1647,12 +1646,56 @@ ExecHashTableInsert(HashJoinTable 

Re: using index or check in ALTER TABLE SET NOT NULL

2019-05-03 Thread David Rowley
On Thu, 2 May 2019 at 14:22, David Rowley  wrote:
>
> On Thu, 2 May 2019 at 13:08, Tom Lane  wrote:
> >
> > Not a blocker perhaps, but it's better if we can get new behavior to
> > be more or less right the first time.
>
> It's not really new behaviour though. The code in question is for
> ATTACH PARTITION and was added in c31e9d4bafd (back in 2017).
> bbb96c3704c is the commit for using constraints to speed up SET NOT
> NULL. The mention of using the constraint for proof was made DEBUG1 in
> the initial commit.  What we need to decide is if we want to make
> ATTACH PARTITION's INFO message a DEBUG1 in pg12 or not.

FWIW, I'm in favour of changing ATTACH PARTITION's INFO messages to
DEBUG1 for PG12 to align it to what bbb96c3704c did.

Anyone else?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




range_agg

2019-05-03 Thread Paul Jungwirth

Hello,

I wrote an extension to add a range_agg function with similar behavior 
to existing *_agg functions, and I'm wondering if folks would like to 
have it in core? Here is the repo: https://github.com/pjungwir/range_agg


I'm also working on a patch for temporal foreign keys, and having 
range_agg would make the FK check easier and faster, which is why I'd 
like to get it added. But also it just seems useful, like array_agg, 
json_agg, etc.


One question is how to aggregate ranges that would leave gaps and/or 
overlaps. So in my extension there is a one-param version that forbids 
gaps & overlaps, but I let you permit them by passing extra parameters, 
so the signature is:


range_agg(r anyrange, permit_gaps boolean, permit_overlaps boolean)
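
A usage sketch (the exact output representation is whatever the aggregate
returns, so I'm not showing it here):

    SELECT range_agg(r, true, false)      -- permit gaps, forbid overlaps
    FROM (VALUES ('[1,4)'::int4range),
                 ('[5,8)'::int4range)) AS t(r);

With permit_overlaps = false, overlapping inputs would raise an error instead.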

Perhaps another useful choice would be to return NULL if a gap/overlap 
is found, so that each param would have three choices instead of just 
two: accept the inputs, raise an error, return a NULL.


What do people think? I plan to work on a patch regardless, so that I 
can use it for temporal FKs, but I'd appreciate some feedback on the 
"user interface".


Thanks,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: improving wraparound behavior

2019-05-03 Thread Tom Lane
Robert Haas  writes:
> I spent a significant chunk of today burning through roughly 2^31 XIDs
> just to see what would happen. ...

> 2. Once you get to the point where you start to emit errors when
> attempting to assign an XID, you can still run plain old VACUUM
> because it doesn't consume an XID ... except that if it tries to
> truncate the relation, then it will take AccessExclusiveLock, which
> has to be logged, which forces an XID assignment, which makes VACUUM
> fail.

Yeah.  I tripped over that earlier this week in connection with the
REINDEX business: taking an AEL only forces XID assignment when
wal_level is above "minimal", so it's easy to come to the wrong
conclusions depending on your test environment.  I suspect that
previous testing of wraparound behavior (yes there has been some)
didn't see this effect because the default wal_level didn't use to
cause it to happen.  But anyway, it's there now and I agree we'd
better do something about it.

My brain is too fried from release-note-writing to have any trustworthy
opinion right now about whether your patch is the best way.

regards, tom lane




Re: improving wraparound behavior

2019-05-03 Thread Robert Haas
On Fri, May 3, 2019 at 4:47 PM Andres Freund  wrote:
> I'd actually say the proper fix would be to instead move the truncation
> to *after* finishing updating relfrozenxid etc.  If we truncate, the
> additional cost of another in-place pg_class update, to update relpages,
> is basically insignificant.  And the risk of errors, or being cancelled,
> during truncation is much higher than before (due to the AEL).

That would prevent the ERROR from impeding relfrozenxid advancement,
but it does not prevent the error itself, nor the XID consumption.  If
autovacuum is hitting that ERROR, it will spew errors in the log but
succeed in advancing relfrozenxid anyway.  I don't think that's as
nice as the behavior I proposed, but it's certainly better than the
status quo.  If you are hitting that error due to a manual VACUUM,
under your proposal, you'll stop the manual VACUUM as soon as you hit
the first table where this happens, which is not what you want.
You'll also keep consuming XIDs, which is not what you want either,
especially if you are in single-user mode because the number of
remaining XIDs is less than a million.

> > Also, I think that old prepared transactions and stale replication
> > slots should be emphasized more prominently.  Maybe something like:
> >
> > HINT:  Commit or roll back old prepared transactions, drop stale
> > replication slots, or kill long-running sessions.
> > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.
>
> I think it'd be good to instead compute what the actual problem is. It'd
> not be particularly hard to show some these in the errdetail:
>
> 1) the xid horizon (xid + age) of the problematic database; potentially,
>if connected to that database, additionally compute what the oldest
>xid is (although that's computationally potentially too expensive)
> 2) the xid horizon (xid + age) due to prepared transactions, and the
>oldest transaction's name
> 3) the xid horizon (xid + age) due to replication slot, and the "oldest"
>slot's name
> 4) the xid horizon (xid + age) and pid for the connection with the
>oldest snapshot.
>
> I think that'd allow users much much easier to pinpoint what's going on.
>
> In fact, I think we probably should additionally add a function that can
> display the above. That'd make it much easier to write monitoring
> queries.

I think that the error and hint that you get from
GetNewTransactionId() has to be something that we can generate very
quickly, without doing anything that might hang on cluster with lots
of databases or lots of relations; but if there's useful detail we can
display there, that's good.  With a view, it's more OK if it takes a
long time on a big cluster.
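
Today the closest thing people have is hand-rolled monitoring along these
lines, and the SRF/view discussed here would go well beyond it:

    SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;
    SELECT relname, age(relfrozenxid) FROM pg_class
    WHERE relkind IN ('r', 't', 'm') ORDER BY 2 DESC LIMIT 10;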

> IMO we also ought to compute the *actual* relfrozenxid/relminmxid for a
> table. I.e. the oldest xid actually present. It's pretty common for most
> tables to have effective horizons that are much newer than what
> GetOldestXmin()/vacuum_set_xid_limits() can return. Obviously we can do
> so only when scanning all non-frozen pages. But being able to record
> "more aggressive" horizons would often save unnecessary work.  And it
> ought to not be hard.  I think especially for regular non-freeze,
> non-wraparound vacuums that'll often result in a much newer relfrozenxid
> (as we'll otherwise just have GetOldestXmin() - vacuum_freeze_min_age).

Sure, that would make sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: GROUP BY optimization

2019-05-03 Thread Tomas Vondra

On Fri, May 03, 2019 at 10:28:21PM +0200, Dmitry Dolgov wrote:

On Tue, Apr 9, 2019 at 5:21 PM Tomas Vondra  
wrote:

So I personally would suggest to treat those patches as independent until
the very last moment, develop the costing improvements needed by each
of them, and then decide which of them are committable / in what order.


I had the same idea, but judging from the questions raised in this thread,
it's quite hard to go with reordering based only on the frequency of values. I
hoped that the cost_sort improvement patch would be simple enough to
incorporate here, but of course it wasn't. Assuming that the amount of work
required for sorting depends only on the number of distinct groups and how
costly it is to compare values of this data type, I've ended up extracting
the get_width_multiplier and get_func_cost parts from the cost_sort patch and
including them in 0003-Reorder-by-values-distribution. This allows taking into
account situations where we compare e.g. long strings or a custom data type
with a high procost for comparison (but I've used these values directly,
without any adjusting coefficients yet).


On Wed, Jun 13, 2018 at 6:41 PM Teodor Sigaev  wrote:

> So that's a nice improvement, although I think we should also consider
> non-uniform distributions (using the per-column MCV lists).

Could you clarify how to do that?


Since I'm not familiar with this topic, I would like to ask the same question,
how to do that and what are the advantages?



I don't recall the exact details of the discussion, since most of it
happened almost a year ago, but the main concern about the original
costing approach is that it very much assumes uniform distribution.

For example if you have N tuples with M groups (which are not computed
using estimate_num_groups IIRC, and we can hardly do much better), then
the patch assumed each group is ~N/M rows and used that for computing
the number of comparisons for a particular sequence of ORDER BY clauses.

But that may result in pretty significant errors in cases with a couple
of very large groups. But hey - we can get some of that information from
MCV lists, so maybe we can compute a safer, less-optimistic estimate.

I mean, we can't compute a "perfect" estimate of how many comparisons
we'll have to do, because we only have per-column MCV lists and maybe
some multi-column MCV lists (but definitely not on all combinations of
columns in the ORDER BY clause).

But what I think we could do is use the largest possible group instead of
the average one. So for example when estimating the number of comparisons
for columns (c1,..,cN), you could look at MCV lists for these columns
and compute

   m(i) = Max(largest group in MCV list for i-th column)

and then use Min(m(1), ..., m(k)) when estimating the number of
comparisons.
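
The per-column ingredient for m(i) is already available in the stats; for
example, for a hypothetical table tbl with grouping columns c1 and c2, the
largest-group fraction per column is just the first entry of
most_common_freqs (which is ordered by descending frequency):

    SELECT attname, most_common_freqs[1] AS largest_group_fraction
    FROM pg_stats
    WHERE tablename = 'tbl' AND attname IN ('c1', 'c2');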

Of course, this is likely to be a worst-case estimate, and it's not
quite obvious that comparing worst-case estimates is necessarily safer
than comparing the regular ones. So I'm not sure about it.

It might be better to just compute the 'average group' in a different
way, not by arithmetic mean, but say as geometric mean from each MCV
list. Not sure.

I guess this needs some research and experimentation - constructing
cases that are likely to cause problems (non-uniform distributions,
columns with a small number of exceptionally large groups, ...) and then
showing how the costing deals with those.

FWIW I've mentioned MCV lists in the incrementalsort thread too, in
which case it was much clearer how to use them to improve the startup
cost estimate - it's enough to look at the first group in per-column MCV
lists (first in the ORDER BY sense, not by frequency).
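
To make the "largest group per column" idea above concrete, here is a rough
sketch of the statistics it would consult, expressed as a catalog query (the
table and column names are only illustrative, and the planner would of course
read the in-memory stats rather than pg_stats):

SELECT s.attname,
       s.most_common_freqs[1] * c.reltuples AS largest_group_rows
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 't1'
  AND s.attname IN ('c1', 'c2');

The MCV frequencies are stored in descending order, so [1] is the largest
group; the per-column maxima could then be combined with Min() as described
above.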


On Sat, Jun 16, 2018 at 5:59 PM Tomas Vondra  
wrote:

I still think we need to be careful when introducing new optimizations
in this area - reordering the grouping keys by ndistinct, ORDER BY or
whatever. In particular I don't think we should commit these patches
that may quite easily cause regressions, and then hope some hypothetical
future patch fixes the costing.


I'm a bit concerned about this part of the discussion. There is an idea running
through the whole thread about avoiding the situation where a user knows which
order is better and we generate a different one by mistake. From what I see
right now, even if all the objections were addressed, there is a chance that
some coefficients will not be good enough (e.g. the width multiplier is based
on an average width, or it can suddenly happen that all the compared strings
differ at the very beginning) and the chosen order will not be optimal.
Does that mean that in any case the implementation of such an optimization
should provide a way to override it?


I don't think we have to provide an override no matter what. Otherwise
we'd have to have an override for everything, because no optimizer
decision is perfect - it's all built on estimates, after all.

But I assume we agree that optimizations based on estimates that are
known to be 

Re: improving wraparound behavior

2019-05-03 Thread Andres Freund
Hi,

On 2019-05-03 16:26:46 -0400, Robert Haas wrote:
> 2. Once you get to the point where you start to emit errors when
> attempting to assign an XID, you can still run plain old VACUUM
> because it doesn't consume an XID ... except that if it tries to
> truncate the relation, then it will take AccessExclusiveLock, which
> has to be logged, which forces an XID assignment, which makes VACUUM
> fail.  So if you burn through XIDs until the system gets to this
> point, and then you roll back the prepared transaction that caused the
> problem in the first place, autovacuum sits there trying to vacuum
> tables in a tight loop and fails over and over again as soon as hits a
> table that it thinks needs to be truncated.  This seems really lame,
> and also easily fixed.
> 
> Attached is a patch that disables vacuum truncation if xidWarnLimit
> has been reached.  With this patch, in my testing, autovacuum is able
> to recover the system once the prepared transaction has been rolled
> back.  Without this patch, not only does that not happen, but if you
> had a database with enough relations that need truncation, you could
> conceivably cause XID wraparound just from running a database-wide
> VACUUM, the one tool you have available to avoid XID wraparound.  I
> think that this amounts to a back-patchable bug fix.
> 
> (One could argue that truncation should be disabled sooner than this,
> like when we've exceed autovacuum_freeze_max_age, or later than this,
> like when we hit xidStopLimit, but I think xidWarnLimit is probably
> the best compromise.)

I'd actually say the proper fix would be to instead move the truncation
to *after* finishing updating relfrozenxid etc.  If we truncate, the
additional cost of another in-place pg_class update, to update relpages,
is basically insignificant.  And the risk of errors, or being cancelled,
during truncation is much higher than before (due to the AEL).



> Also, I think that old prepared transactions and stale replication
> slots should be emphasized more prominently.  Maybe something like:
> 
> HINT:  Commit or roll back old prepared transactions, drop stale
> replication slots, or kill long-running sessions.
> Ensure that autovacuum is progressing, or run a manual database-wide VACUUM.

I think it'd be good to instead compute what the actual problem is. It'd
not be particularly hard to show some of these in the errdetail:

1) the xid horizon (xid + age) of the problematic database; potentially,
   if connected to that database, additionally compute what the oldest
   xid is (although that's computationally potentially too expensive)
2) the xid horizon (xid + age) due to prepared transactions, and the
   oldest transaction's name
3) the xid horizon (xid + age) due to replication slot, and the "oldest"
   slot's name
4) the xid horizon (xid + age) and pid for the connection with the
   oldest snapshot.

I think that'd make it much easier for users to pinpoint what's going on.

In fact, I think we probably should additionally add a function that can
display the above. That'd make it much easier to write monitoring
queries.
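
In the meantime, something along these lines can already be cobbled together
from the existing system views (a rough monitoring sketch only, not the
proposed errdetail/function output; backend_xmin is used here as a stand-in
for "oldest snapshot"):

SELECT 'database' AS source, datname::text AS name, age(datfrozenxid) AS xid_age
  FROM pg_database
UNION ALL
SELECT 'prepared xact', gid, age(transaction)
  FROM pg_prepared_xacts
UNION ALL
SELECT 'replication slot', slot_name::text, age(xmin)
  FROM pg_replication_slots WHERE xmin IS NOT NULL
UNION ALL
SELECT 'backend', pid::text, age(backend_xmin)
  FROM pg_stat_activity WHERE backend_xmin IS NOT NULL
ORDER BY xid_age DESC;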


IMO we also ought to compute the *actual* relfrozenxid/relminmxid for a
table. I.e. the oldest xid actually present. It's pretty common for most
tables to have effective horizons that are much newer than what
GetOldestXmin()/vacuum_set_xid_limits() can return. Obviously we can do
so only when scanning all non-frozen pages. But being able to record
"more aggressive" horizons would often save unnecessary work.  And it
ought to not be hard.  I think especially for regular non-freeze,
non-wraparound vacuums that'll often result in a much newer relfrozenxid
(as we'll otherwise just have GetOldestXmin() - vacuum_freeze_min_age).


Greetings,

Andres Freund




Re: POC: GROUP BY optimization

2019-05-03 Thread Dmitry Dolgov
> On Tue, Apr 9, 2019 at 5:21 PM Tomas Vondra  
> wrote:
>
> So I personally would suggest to treat those patches as independent until
> the very last moment, develop the costing improvements needed by each
> of them, and then decide which of them are committable / in what order.

I had the same idea, but judging from the questions raised in this thread,
it's quite hard to go with reordering based only on frequency of values. I
hoped that the cost_sort improvement patch would be simple enough to
incorporate here, but of course it wasn't. Assuming that the amount of work
required for sorting depends only on the number of distinct groups and on how
costly it is to compare values of this data type, I've ended up extracting the
get_width_multiplier and get_func_cost parts from the cost_sort patch and
including them in 0003-Reorder-by-values-distribution. This makes it possible
to take into account situations where we compare e.g. long strings or a custom
data type with a high procost for comparison (but I've used these values
directly, without any adjusting coefficients yet).
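
As a side note, the kind of per-type comparison cost this relies on can also be
inspected from SQL; a hedged illustration only (it looks up the procost of the
btree ordering support function for text, where amprocnum = 1 is BTORDER_PROC):

SELECT ap.amproc, p.procost
FROM pg_amproc ap
JOIN pg_opfamily f ON f.oid = ap.amprocfamily
JOIN pg_am am ON am.oid = f.opfmethod
JOIN pg_proc p ON p.oid = ap.amproc
WHERE am.amname = 'btree'
  AND ap.amproclefttype = 'text'::regtype
  AND ap.amprocnum = 1;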

> On Wed, Jun 13, 2018 at 6:41 PM Teodor Sigaev  wrote:
>
> > So that's a nice improvement, although I think we should also consider
> > non-uniform distributions (using the per-column MCV lists).
>
> Could you clarify how to do that?

Since I'm not familiar with this topic, I would like to ask the same question,
how to do that and what are the advantages?

> On Sat, Jun 16, 2018 at 5:59 PM Tomas Vondra  
> wrote:
>
> I still think we need to be careful when introducing new optimizations
> in this area - reordering the grouping keys by ndistinct, ORDER BY or
> whatever. In particular I don't think we should commit these patches
> that may quite easily cause regressions, and then hope some hypothetical
> future patch fixes the costing.

I'm a bit concerned about this part of the discussion. There is an idea running
through the whole thread about avoiding the situation where a user knows which
order is better and we generate a different one by mistake. From what I see
right now, even if all the objections were addressed, there is a chance that
some coefficients will not be good enough (e.g. the width multiplier is based
on an average width, or it can suddenly happen that all the compared strings
differ at the very beginning) and the chosen order will not be optimal.
Does that mean that in any case the implementation of such an optimization
should provide a way to override it?


v12-0001-Add-tests-for-group-by-optimization.patch
Description: Binary data


v12-0002-Reorder-to-match-ORDER-BY-or-index.patch
Description: Binary data


v12-0003-Reorder-by-values-distribution.patch
Description: Binary data


Re: error messages in extended statistics

2019-05-03 Thread Tomas Vondra

On Fri, May 03, 2019 at 12:21:36PM -0400, Tom Lane wrote:

Alvaro Herrera  writes:

Error reporting in extended statistics is inconsistent -- many messages
that are ereport() in mvdistinct.c are elog() in the other modules.
...
I think this should be cleaned up, while at the same time not giving too
much hassle for translators; for example, this message
dependencies.c: elog(ERROR, "invalid MVDependencies size %zd (expected at least 
%zd)",
should not only be turned into an ereport(), but also the MVDependencies
part turned into a %s.  (Alternatively, we could decide I was wrong and
turn them all back into elogs, but I obviously vote against that.)


FWIW, I'd vote the other way: that seems like a clear "internal error",
so making translators deal with it is just make-work.  It should be an
elog.  If there's a reasonably plausible way for a user to trigger an
error condition, then yes ereport, but if we're reporting situations
that couldn't happen without a server bug then elog seems fine.



Yeah, I agree. Most (perhaps all) of those errors are internal errors,
and thus should be elog. I'll take care of cleaning this up a bit.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Why is infinite_recurse test suddenly failing?

2019-05-03 Thread Tom Lane
Mark Wong  writes:
> On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote:
>> While we're looking at this --- Mark, if you could install gdb
>> on your buildfarm hosts, that would be really handy.  I think that's
>> the only extra thing the buildfarm script needs to extract stack
>> traces from core dumps.  We'd likely already know where the problem
>> is if we had a stack trace ...

> Ok, I think I have gdb installed now...

Thanks!

regards, tom lane




Re: Why is infinite_recurse test suddenly failing?

2019-05-03 Thread Mark Wong
On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Hm, I just noticed:
> >'HEAD' => [
> >'force_parallel_mode = 
> > regress'
> >  ]
> 
> Oooh, I didn't see that.
> 
> > on all those animals. So it's not necessarily the case that HEAD and
> > backbranch runs are behaving all that identical.  Note that isn't a
> > recent config change, so it's not an explanation as to why they started
> > to fail only recently.
> 
> No, but it does point at another area of the code in which a relevant
> change could've occurred.
> 
> While we're looking at this --- Mark, if you could install gdb
> on your buildfarm hosts, that would be really handy.  I think that's
> the only extra thing the buildfarm script needs to extract stack
> traces from core dumps.  We'd likely already know where the problem
> is if we had a stack trace ...

Ok, I think I have gdb installed now...

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Per-tablespace autovacuum settings

2019-05-03 Thread Robert Haas
On Thu, Apr 25, 2019 at 12:36 PM Oleksii Kliukin  wrote:
> - Fallbacks to autovacuum parameters in another scope. Right now in the
> absence of the per-table and per-tablespace autovacuum parameters the code
> uses the ones from the global scope. However, if only some of the reloptions
> are set on a per-table level (i.e. none of the autovacuum related ones), we
> assume defaults for the rest of reloptions without consulting the lower
> level (i.e .per-tablespace options). This is so because we don’t have the
> mechanism to tell whether the option is set to its default value (some of
> them use -1 to request the fallback to the outer level, but for some it’s
> not possible, i.e. autovacuum_enabled is just a boolean value).

That sounds like it's probably not acceptable?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: error messages in extended statistics

2019-05-03 Thread Tom Lane
Alvaro Herrera  writes:
> Error reporting in extended statistics is inconsistent -- many messages
> that are ereport() in mvdistinct.c are elog() in the other modules.
> ...
> I think this should be cleaned up, while at the same time not giving too
> much hassle for translators; for example, this message
> dependencies.c: elog(ERROR, "invalid MVDependencies size %zd (expected at 
> least %zd)",
> should not only be turned into an ereport(), but also the MVDependencies
> part turned into a %s.  (Alternatively, we could decide I was wrong and
> turn them all back into elogs, but I obviously vote against that.)

FWIW, I'd vote the other way: that seems like a clear "internal error",
so making translators deal with it is just make-work.  It should be an
elog.  If there's a reasonably plausible way for a user to trigger an
error condition, then yes ereport, but if we're reporting situations
that couldn't happen without a server bug then elog seems fine.

regards, tom lane




error messages in extended statistics

2019-05-03 Thread Alvaro Herrera
Hello

Error reporting in extended statistics is inconsistent -- many messages
that are ereport() in mvdistinct.c are elog() in the other modules.  I
think what happened is that I changed them from elog to ereport when
committing mvdistinct, but Tomas and Simon didn't follow suit when
committing the other two modules.  As a result, some messages that
should be essentially duplicates only show up once, because the elog()
ones are not marked translatable.

I think this should be cleaned up, while at the same time not giving too
much hassle for translators; for example, this message

dependencies.c: elog(ERROR, "invalid MVDependencies size %zd (expected at 
least %zd)",

should not only be turned into an ereport(), but also the MVDependencies
part turned into a %s.  (Alternatively, we could decide I was wrong and
turn them all back into elogs, but I obviously vote against that.)
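
For reference, a hedged sketch of the two styles under discussion (the struct
name is pulled out as a %s argument so translators only see one message; the
errcode and the actual_size/expected_size variables are illustrative, not the
real code):

/* internal "cannot happen" condition: plain elog(), message not translated */
elog(ERROR, "invalid %s size %zd (expected at least %zd)",
	 "MVDependencies", actual_size, expected_size);

/* user-visible condition: ereport() with an errcode and translatable errmsg */
ereport(ERROR,
		(errcode(ERRCODE_DATA_CORRUPTED),
		 errmsg("invalid %s size %zd (expected at least %zd)",
				"MVDependencies", actual_size, expected_size)));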

$ git grep 'elog\|errmsg' src/backend/statistics

dependencies.c: elog(ERROR, "cache lookup failed for ordering operator 
for type %u",
dependencies.c: elog(ERROR, "invalid MVDependencies size %zd (expected at 
least %zd)",
dependencies.c: elog(ERROR, "invalid dependency magic %d (expected %d)",
dependencies.c: elog(ERROR, "invalid dependency type %d (expected %d)",
dependencies.c:  errmsg("invalid zero-length item array in 
MVDependencies")));
dependencies.c: elog(ERROR, "invalid dependencies size %zd (expected at 
least %zd)",
dependencies.c: elog(ERROR, "cache lookup failed for statistics object %u", 
mvoid);
dependencies.c: elog(ERROR,
dependencies.c:  errmsg("cannot accept a value of type %s", 
"pg_dependencies")));
dependencies.c:  errmsg("cannot accept a value of type %s", 
"pg_dependencies")));
extended_stats.c:errmsg("statistics object \"%s.%s\" 
could not be computed for relation \"%s.%s\"",
extended_stats.c:   elog(ERROR, "unexpected statistics type requested: 
%d", type);
extended_stats.c:   elog(ERROR, "stxkind is not a 1-D char array");
extended_stats.c:   elog(ERROR, "cache lookup failed for statistics object 
%u", statOid);
mcv.c:  elog(ERROR, "cache lookup failed for ordering operator for type 
%u",
mcv.c:  elog(ERROR, "cache lookup failed for statistics object %u", mvoid);
mcv.c:  elog(ERROR,
mcv.c:  elog(ERROR, "invalid MCV size %zd (expected at least %zu)",
mcv.c:  elog(ERROR, "invalid MCV magic %u (expected %u)",
mcv.c:  elog(ERROR, "invalid MCV type %u (expected %u)",
mcv.c:  elog(ERROR, "invalid zero-length dimension array in MCVList");
mcv.c:  elog(ERROR, "invalid length (%d) dimension array in MCVList",
mcv.c:  elog(ERROR, "invalid zero-length item array in MCVList");
mcv.c:  elog(ERROR, "invalid length (%u) item array in MCVList",
mcv.c:  elog(ERROR, "invalid MCV size %zd (expected %zu)",
mcv.c:  elog(ERROR, "invalid MCV size %zd (expected %zu)",
mcv.c:   errmsg("function returning record called in context "
mcv.c:   errmsg("cannot accept a value of type %s", "pg_mcv_list")));
mcv.c:   errmsg("cannot accept a value of type %s", "pg_mcv_list")));
mcv.c:  elog(ERROR, "unknown clause type: %d", clause->type);
mvdistinct.c:   elog(ERROR, "cache lookup failed for statistics object %u", 
mvoid);
mvdistinct.c:   elog(ERROR,
mvdistinct.c:   elog(ERROR, "invalid MVNDistinct size %zd (expected at 
least %zd)",
mvdistinct.c:errmsg("invalid ndistinct magic %08x (expected 
%08x)",
mvdistinct.c:errmsg("invalid ndistinct type %d (expected %d)",
mvdistinct.c:errmsg("invalid zero-length item array in 
MVNDistinct")));
mvdistinct.c:errmsg("invalid MVNDistinct size %zd (expected at 
least %zd)",
mvdistinct.c:errmsg("cannot accept a value of type %s", 
"pg_ndistinct")));
mvdistinct.c:errmsg("cannot accept a value of type %s", 
"pg_ndistinct")));
mvdistinct.c:   elog(ERROR, "cache lookup failed for ordering operator 
for type %u",

-- 
Álvaro Herrera Developer, https://www.PostgreSQL.org/




Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?

2019-05-03 Thread Andres Freund
Hi,

On 2019-05-03 09:37:07 +0200, Peter Eisentraut wrote:
> REINDEX CONCURRENTLY is still deadlock prone because of
> WaitForOlderSnapshots(), so this doesn't actually fix your test case,
> but that seems unrelated to this particular issue.

Right.

I've not tested the change, but it looks reasonable to me. The change
of moving the reset of *heapOid to the unlock path is perhaps
debatable, but I think it's OK.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-05-03 Thread Robert Haas
On Fri, May 3, 2019 at 12:46 AM Dilip Kumar  wrote:
> I might be completely missing something, but (ptr - array_base) is only
> valid the first time you get the array; qsort will swap the elements
> around, and after that you will never be able to make out which element
> was at the lower index and which one was at the higher index.  Basically,
> our goal is to preserve the order of the undo records for the same block,
> but their order might get changed by a swap when they are compared with
> the undo record pointer of another block, and once the order is swapped
> we will never know what their initial positions were.

*facepalm*

Yeah, you're right.

Still, I think we should see if there's some way of getting rid of
that structure, or at least making it an internal detail that is used
by the code that's doing the sorting rather than something that is
exposed as an external interface.
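
One way to keep that detail internal while still preserving the same-block
ordering is the usual stable-sort trick: carry the original array position and
use it as a tiebreaker. A hedged sketch only - the struct and function names
here are illustrative, not the patch's API:

typedef struct SortedUndoRec
{
	BlockNumber blkno;		/* block the undo record applies to */
	int			orig_pos;	/* position in the array before sorting */
	void	   *rec;		/* the undo record itself */
} SortedUndoRec;

static int
cmp_undo_rec(const void *a, const void *b)
{
	const SortedUndoRec *ra = (const SortedUndoRec *) a;
	const SortedUndoRec *rb = (const SortedUndoRec *) b;

	if (ra->blkno != rb->blkno)
		return (ra->blkno < rb->blkno) ? -1 : 1;
	/* same block: fall back to the original ordering, making the sort stable */
	if (ra->orig_pos != rb->orig_pos)
		return (ra->orig_pos < rb->orig_pos) ? -1 : 1;
	return 0;
}

An array of these structs could then be built and passed to qsort() entirely
inside the sorting code, without the caller ever seeing it.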

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: make \d pg_toast.foo show its indices

2019-05-03 Thread Justin Pryzby
On Fri, May 03, 2019 at 02:55:47PM +0200, Rafia Sabih wrote:
> On Mon, 22 Apr 2019 at 17:49, Justin Pryzby  wrote:
> >
> > It's deliberate that \dt doesn't show toast tables.
> > \d shows them, but doesn't show their indices.
> >
> > It seems to me that their indices should be shown, without having to think 
> > and
> > know to query pg_index.
> >
> > postgres=# \d pg_toast.pg_toast_2600
> > TOAST table "pg_toast.pg_toast_2600"
> >Column   |  Type
> > +-
> >  chunk_id   | oid
> >  chunk_seq  | integer
> >  chunk_data | bytea
> > Indexes:
> > "pg_toast_2600_index" PRIMARY KEY, btree (chunk_id, chunk_seq)
>
> +1.

Thanks - what about also showing the associated non-toast table?

postgres=# \d pg_toast.pg_toast_2620
TOAST table "pg_toast.pg_toast_2620"
   Column   |  Type
+-
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea
FOR TABLE: "pg_catalog.pg_trigger"
Indexes:
"pg_toast_2620_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

That could be displayed differently, perhaps in the header, but I think this is
more consistent with other display.
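
For comparison, the manual lookup this saves is roughly the following (a
hedged sketch against the catalogs; the toast table name is illustrative):

SELECT ci.relname AS toast_index, t.relname AS owning_table
FROM pg_class ct
JOIN pg_index i ON i.indrelid = ct.oid
JOIN pg_class ci ON ci.oid = i.indexrelid
LEFT JOIN pg_class t ON t.reltoastrelid = ct.oid
WHERE ct.relname = 'pg_toast_2620';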

Justin
>From 7c15ebe408cc5f2af51120ea152e7997ee768f81 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 3 May 2019 09:24:51 -0500
Subject: [PATCH v2 1/2] make \d pg_toast.foo show its indices

---
 src/bin/psql/describe.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d7390d5..d26d986 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2274,6 +2274,7 @@ describeOneTableDetails(const char *schemaname,
 	else if (tableinfo.relkind == RELKIND_RELATION ||
 			 tableinfo.relkind == RELKIND_MATVIEW ||
 			 tableinfo.relkind == RELKIND_FOREIGN_TABLE ||
+			 tableinfo.relkind == RELKIND_TOASTVALUE ||
 			 tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		/* Footer information about a table */
-- 
2.7.4

>From 38f50cdb727c67ae7aece8e85caf2960e824cb65 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 30 Apr 2019 19:05:53 -0500
Subject: [PATCH v2 2/2] print table associated with given TOAST table

---
 src/bin/psql/describe.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d26d986..ebdf18a 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2152,6 +2152,28 @@ describeOneTableDetails(const char *schemaname,
 		}
 	}
 
+	/* print table associated with given TOAST table */
+	if (tableinfo.relkind == RELKIND_TOASTVALUE)
+	{
+		PGresult   *result = NULL;
		printfPQExpBuffer(&buf,
+		  "SELECT relnamespace::pg_catalog.regnamespace, relname FROM pg_class WHERE reltoastrelid = '%s'",
+		  oid);
+		result = PSQLexec(buf.data);
+		if (!result) {
+			goto error_return;
+		} else if (1 != PQntuples(result)) {
+			PQclear(result);
+			goto error_return;
+		} else {
+			char	   *schemaname = PQgetvalue(result, 0, 0);
+			char	   *relname = PQgetvalue(result, 0, 1);
			appendPQExpBuffer(&tmpbuf, _("For table: \"%s.%s\""),
							  schemaname, relname);
			printTableAddFooter(&cont, tmpbuf.data);
+		}
+	}
+
 	if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		/* Get the partition key information  */
-- 
2.7.4



Re: Google Season of Docs 2019 - Starter

2019-05-03 Thread Stephen Frost
Greetings,

* Evangelos Karatarakis (baggeliskap...@gmail.com) wrote:
> I am interested in participating in GSoD 2019, and more specifically I am
> interested in working on the project of improving the Introductory Tutorial
> for people new to PostgreSQL and to databases. I have a practical and
> theoretical background in PostgreSQL from previous work, so I do not lack
> the prior knowledge that is needed. Besides, I am about to get my degree as
> an Electrical and Computer Engineer in a while. On the other hand, I am a
> starter in the field of Technical Writing, but I am willing to engage with
> this profession. For this reason I consider GSoD and this project idea a
> great opportunity for me. Do you think it is a good idea for me to work on
> a proposal since I am a starting Technical Writer? I am looking forward to
> your opinion.

This is covered here, I believe:

https://developers.google.com/season-of-docs/docs/

Specifically:

"Technical writers are technical writers worldwide who’re accepted to
take part in this year’s Season of Docs. Applicants should be able to
demonstrate prior technical writing experience by submitting role
descriptions and work samples. See the technical writer guide and
responsibilities."

This program isn't intended, as I understand the above to say, as a
starter for technical writers.  I recommend you ask Google directly
regarding your interest in the program.

Thanks,

Stephen


signature.asc
Description: PGP signature


Google Season of Docs 2019 - Starter

2019-05-03 Thread Evangelos Karatarakis
Greetings

I am interested in participating in GSoD 2019, and more specifically I am
interested in working on the project of improving the Introductory Tutorial
for people new to PostgreSQL and to databases. I have a practical and
theoretical background in PostgreSQL from previous work, so I do not lack the
prior knowledge that is needed. Besides, I am about to get my degree as an
Electrical and Computer Engineer in a while. On the other hand, I am a starter
in the field of Technical Writing, but I am willing to engage with this
profession. For this reason I consider GSoD and this project idea a great
opportunity for me. Do you think it is a good idea for me to work on a
proposal since I am a starting Technical Writer? I am looking forward to your
opinion.

Thank you in advance. Regards.
Evangelos Karatarakis
Electrical and Computer Engineering Student
Technical University of Crete
Greece


Re: pg_upgrade --clone error checking

2019-05-03 Thread Jeff Janes
On Fri, May 3, 2019 at 3:53 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-05-02 20:03, Jeff Janes wrote:
> > It looks like it was designed for early checking, it just wasn't placed
> > early enough.  So changing it is pretty easy, as check_file_clone does
> > not need to be invented, and there is no additional code duplication
> > over what was already there.
> >
> > This patch moves the checking to near the beginning.
>
> I think the reason it was ordered that way is that it wants to do all
> the checks of the old cluster before doing any checks touching the new
> cluster.
>

But is there a reason to want to do that?  I understand we don't want to
keep starting and stopping the clusters needlessly, so we should do
everything we can in one before moving to the other.  But for checks that
don't need a running cluster, why would it matter?  The existence and
contents of PG_VERSION in the new cluster directory are already checked at
the very beginning (and pg_upgrade even tries to start it up and shut it
down again if a pid file also exists), so there is precedent for touching
the new cluster directory at the filesystem level early (albeit in a
read-only manner), and if a pid file exists then doing even more than that.
I didn't move check_file_clone to before the liveness check is done, out of
an abundance of caution.  But creating a transient file with a name of no
significance ("PG_VERSION.clonetest") in a cluster that is not even running
seems like a very low-risk thing to do.  The payoff is that we get an
inevitable error message much sooner.
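
For illustration, the kind of probe being discussed amounts to something like
the sketch below - create a throwaway file in the new cluster's data directory
and attempt a clone into it. This is only a hedged, Linux-only approximation,
not the actual pg_upgrade code; the file name handling and error reporting are
illustrative:

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/ioctl.h>
#include <linux/fs.h>			/* FICLONE */
#endif

/* Return true if cloning src_file into dst_dir appears to work. */
static bool
clone_works(const char *src_file, const char *dst_dir)
{
	char		dst[1024];
	int			src_fd,
				dst_fd;
	bool		ok = false;

	snprintf(dst, sizeof(dst), "%s/PG_VERSION.clonetest", dst_dir);
	src_fd = open(src_file, O_RDONLY);
	dst_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
#ifdef __linux__
	if (src_fd >= 0 && dst_fd >= 0)
		ok = (ioctl(dst_fd, FICLONE, src_fd) == 0);
#endif
	if (src_fd >= 0)
		close(src_fd);
	if (dst_fd >= 0)
	{
		close(dst_fd);
		unlink(dst);			/* remove the transient file again */
	}
	return ok;
}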

Cheers,

Jeff


Re: make \d pg_toast.foo show its indices

2019-05-03 Thread Rafia Sabih
On Mon, 22 Apr 2019 at 17:49, Justin Pryzby  wrote:
>
> It's deliberate that \dt doesn't show toast tables.
> \d shows them, but doesn't show their indices.
>
> It seems to me that their indices should be shown, without having to think and
> know to query pg_index.
>
> postgres=# \d pg_toast.pg_toast_2600
> TOAST table "pg_toast.pg_toast_2600"
>Column   |  Type
> +-
>  chunk_id   | oid
>  chunk_seq  | integer
>  chunk_data | bytea
> Indexes:
> "pg_toast_2600_index" PRIMARY KEY, btree (chunk_id, chunk_seq)
>
+1.


-- 
Regards,
Rafia Sabih




Re: Statistical aggregate functions are not working with partitionwise aggregate

2019-05-03 Thread Jeevan Chalke
On Fri, May 3, 2019 at 2:56 PM Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> wrote:

> Hi,
>
> On PG head, some of the statistical aggregate functions are not giving
> correct output when partitionwise aggregate is enabled, while the same
> queries work on v11.
>

I had a quick look over this and observed that something is broken with the
PARTIAL aggregation.

I can reproduce the same issue with a larger dataset, which results in a
parallel scan.

CREATE TABLE tbl1(a int2,b float4) partition by range(a);
create table tbl1_p1 partition of tbl1 for values from (minvalue) to (0);
create table tbl1_p2 partition of tbl1 for values from (0) to (maxvalue);
insert into tbl1 select i%2, i from generate_series(1, 100) i;

# SELECT regr_count(b, a) FROM tbl1;
 regr_count
------------
          0
(1 row)

postgres:5432 [120536]=# explain SELECT regr_count(b, a) FROM tbl1;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=15418.08..15418.09 rows=1 width=8)
   ->  Gather  (cost=15417.87..15418.08 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=14417.87..14417.88 rows=1 width=8)
               ->  Parallel Append  (cost=0.00..11091.62 rows=443500 width=6)
                     ->  Parallel Seq Scan on tbl1_p2  (cost=0.00..8850.00 rows=442500 width=6)
                     ->  Parallel Seq Scan on tbl1_p1  (cost=0.00..24.12 rows=1412 width=6)
(7 rows)

postgres:5432 [120536]=# set max_parallel_workers_per_gather to 0;
SET
postgres:5432 [120536]=# SELECT regr_count(b, a) FROM tbl1;
 regr_count
------------
        100
(1 row)

After looking further, it seems that it got broken by the following commit:

commit a9c35cf85ca1ff72f16f0f10d7ddee6e582b62b8
Author: Andres Freund 
Date:   Sat Jan 26 14:17:52 2019 -0800

Change function call information to be variable length.


This commit is too big to digest quickly, and thus I could not get to the
exact cause.

Thanks


> below are some of examples.
>
> CREATE TABLE tbl(a int2,b float4) partition by range(a);
> create table tbl_p1 partition of tbl for values from (minvalue) to (0);
> create table tbl_p2 partition of tbl for values from (0) to (maxvalue);
> insert into tbl values (-1,-1),(0,0),(1,1),(2,2);
>
> --when partitionwise aggregate is off
> postgres=# SELECT regr_count(b, a) FROM tbl;
>  regr_count
> ------------
>           4
> (1 row)
> postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl;
>  regr_avgx | regr_avgy
> -----------+-----------
>        0.5 |       0.5
> (1 row)
> postgres=# SELECT corr(b, a) FROM tbl;
>  corr
> ------
>     1
> (1 row)
>
> --when partitionwise aggregate is on
> postgres=# SET enable_partitionwise_aggregate = true;
> SET
> postgres=# SELECT regr_count(b, a) FROM tbl;
>  regr_count
> ------------
>           0
> (1 row)
> postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl;
>  regr_avgx | regr_avgy
> -----------+-----------
>            |
> (1 row)
> postgres=# SELECT corr(b, a) FROM tbl;
>  corr
> ------
>
> (1 row)
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
> QMG, EnterpriseDB Corporation
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Statistical aggregate functions are not working with partitionwise aggregate

2019-05-03 Thread Rajkumar Raghuwanshi
Hi,

On PG head, some of the statistical aggregate functions are not giving correct
output when partitionwise aggregate is enabled, while the same queries work on
v11.

below are some of examples.

CREATE TABLE tbl(a int2,b float4) partition by range(a);
create table tbl_p1 partition of tbl for values from (minvalue) to (0);
create table tbl_p2 partition of tbl for values from (0) to (maxvalue);
insert into tbl values (-1,-1),(0,0),(1,1),(2,2);

--when partitionwise aggregate is off
postgres=# SELECT regr_count(b, a) FROM tbl;
 regr_count
------------
          4
(1 row)
postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl;
 regr_avgx | regr_avgy
-----------+-----------
       0.5 |       0.5
(1 row)
postgres=# SELECT corr(b, a) FROM tbl;
 corr
------
    1
(1 row)

--when partitionwise aggregate is on
postgres=# SET enable_partitionwise_aggregate = true;
SET
postgres=# SELECT regr_count(b, a) FROM tbl;
 regr_count
------------
          0
(1 row)
postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl;
 regr_avgx | regr_avgy
-----------+-----------
           |
(1 row)
postgres=# SELECT corr(b, a) FROM tbl;
 corr
------

(1 row)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-03 Thread Amit Kapila
On Fri, May 3, 2019 at 11:43 AM John Naylor  wrote:
> On Thu, May 2, 2019 at 4:57 PM Amit Kapila  wrote:
> > On Thu, May 2, 2019 at 12:39 PM John Naylor  
> > wrote:
> > >
> > Can you please test/review?
>
> There isn't enough time. But since I already wrote some debugging
> calls earlier (attached), I gave it a brief spin, I found this patch
> isn't as careful as HEAD making sure we don't try the same block twice
> in a row. If you insert enough tuples into an empty table such that we
> need to extend, you get something like this:
>
> DEBUG:  Not enough space on block 0
> DEBUG:  Now trying block 0
> DEBUG:  Not enough space on block 0
> DEBUG:  Updating local map for block 0
>
> At this point, I'm sorry to say, but I'm in favor of reverting.
>

Fair enough.  I think we have tried to come up with a patch for an
alternative approach, but it needs time.  I will revert this tomorrow.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pg_upgrade --clone error checking

2019-05-03 Thread Peter Eisentraut
On 2019-05-02 20:03, Jeff Janes wrote:
> It looks like it was designed for early checking, it just wasn't placed
> early enough.  So changing it is pretty easy, as check_file_clone does
> not need to be invented, and there is no additional code duplication
> over what was already there.
> 
> This patch moves the checking to near the beginning.

I think the reason it was ordered that way is that it wants to do all
the checks of the old cluster before doing any checks touching the new
cluster.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH v1] Show whether tables are logged in \dt+

2019-05-03 Thread Rafia Sabih
On Sat, 27 Apr 2019 at 06:18, David Fetter  wrote:
>
> On Fri, Apr 26, 2019 at 04:22:18PM +0200, Rafia Sabih wrote:
> > On Fri, 26 Apr 2019 at 14:49, Rafia Sabih  wrote:
> > >
> > > On Wed, 24 Apr 2019 at 10:30, Fabien COELHO  wrote:
> > > >
> > > >
> > > > Hello David,
> > > >
> > > > >>> I noticed that there wasn't a bulk way to see table logged-ness in 
> > > > >>> psql,
> > > > >>> so I made it part of \dt+.
> > > > >>
> > > > >> Applies, compiles, works for me.
> > > > >>
> > > > >> ISTM That temporary-ness is not shown either. Maybe the persistence 
> > > > >> column
> > > > >> should be shown as is?
> > > > >
> > > > > Temporariness added, but not raw.
> > > >
> > > > Ok, it is better like this way.
> > > >
> > > > >> Tests?
> > > > >
> > > > > Included, but they're not stable for temp tables. I'm a little stumped
> > > > > as to how to either stabilize them or test some other way.
> > > >
> > > > Hmmm. First there is the username which appears, so there should be a
> > > > dedicated user for the test.
> > > >
> > > > I'm unsure how to work around the temporary schema number, which is
> > > > nondeterministic with parallel execution. I'm afraid the only viable
> > > > approach is not to show temporary tables, too bad:-(
> > > >
> > > > >> Doc?
> > > > >
> > > > > What further documentation does it need?
> > > >
> > > > Indeed, there is no precise doc, so nothing to update :-)/:-(
> > > >
> > > >
> > > > Maybe you could consider adding a case for prior 9.1 version, something
> > > > like:
> > > >... case when c.relistemp then 'temporary' else 'permanent' end as ...
> > > >
> > > >
> > > I was reviewing this patch and found a bug,
> > >
> > > create table t (i int);
> > > create index idx on t(i);
> > > \di+
> > > psql: print.c:3452: printQuery: Assertion `opt->translate_columns ==
> > > ((void *)0) || opt->n_translate_columns >= cont.ncolumns' failed.
> >
> > Looking into this further, apparently the position of
> >
> >   if (verbose)
> >   {
> > + /*
> > + * Show whether the table is permanent, temporary, or unlogged.
> > + */
> > + if (pset.sversion >= 91000)
> > + appendPQExpBuffer(,
> > +   ",\n  case c.relpersistence when 'p' then 'permanent' when 't'
> > then 'temporary' when 'u' then 'unlogged' else 'unknown' end as
> > \"%s\"",
> > +   gettext_noop("Persistence"));
> >
> > is not right; it is being called for indexes with the verbose option also.
> > There should be an extra check so that it is not called in the index case.
> > Something like,
> > if (verbose)
> > {
> > /*
> > * Show whether the table is permanent, temporary, or unlogged.
> > */
> > if (!showIndexes)
> > if (pset.sversion >= 91000)
> > appendPQExpBuffer(,
> >   ",\n  case c.relpersistence when 'p' then 'permanent' when 't' then
> > 'temporary' when 'u' then 'unlogged' else 'unknown' end as \"%s\"",
> >   gettext_noop("Persistence"));
> >
> > Not sure how to modify it in a neater way.
>
> I suspect that this may get a little messier, but I've made it
> fairly neat, short of a major refactor.
>
I found the following warning on the compilation,
describe.c: In function ‘listTables’:
describe.c:3705:7: warning: this ‘if’ clause does not guard...
[-Wmisleading-indentation]
  else if (pset.sversion >= 80100)
   ^~
describe.c:3710:3: note: ...this statement, but the latter is
misleadingly indented as if it were guarded by the ‘if’
   appendPQExpBuffer(,

Talking of indentation, you might want to run pgindent once. Other
than that the patch looks good to me.
-- 
Regards,
Rafia Sabih




Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?

2019-05-03 Thread Peter Eisentraut
On 2019-05-02 10:44, Peter Eisentraut wrote:
>> so there's a lock upgrade hazard.
> Confirmed.

Here is a patch along the lines of your sketch.  I cleaned up the
variable naming a bit too.

REINDEX CONCURRENTLY is still deadlock prone because of
WaitForOlderSnapshots(), so this doesn't actually fix your test case,
but that seems unrelated to this particular issue.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 8a4dcc3e6b849fd72bed29c9308c9a897eec0cc5 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 3 May 2019 09:15:29 +0200
Subject: [PATCH] Fix table lock levels for REINDEX INDEX CONCURRENTLY

REINDEX CONCURRENTLY locks tables with ShareUpdateExclusiveLock rather
than the ShareLock used by a plain REINDEX.  However,
RangeVarCallbackForReindexIndex() was not updated for that and still
used the ShareLock only.  This would lead to lock upgrades later,
leading to possible deadlocks.

Reported-by: Andres Freund 
Discussion: 
https://www.postgresql.org/message-id/flat/20190430151735.wi52sxjvxsjvaxxt%40alap3.anarazel.de
---
 src/backend/commands/indexcmds.c | 45 +++-
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index fabba3bacd..acfb2d24c9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -91,6 +91,15 @@ static bool ReindexRelationConcurrently(Oid relationOid, int 
options);
 static void ReindexPartitionedIndex(Relation parentIdx);
 static void update_relispartition(Oid relationId, bool newval);
 
+/*
+ * callback argument type for RangeVarCallbackForReindexIndex()
+ */
+struct ReindexIndexCallbackState
+{
+   boolconcurrent; /* flag from statement */
+   Oid locked_table_oid;   /* tracks previously locked table */
+};
+
 /*
  * CheckIndexCompatible
  * Determine whether an existing index definition is compatible 
with a
@@ -2303,8 +2312,8 @@ ChooseIndexColumnNames(List *indexElems)
 void
 ReindexIndex(RangeVar *indexRelation, int options, bool concurrent)
 {
+   struct ReindexIndexCallbackState state;
Oid indOid;
-   Oid heapOid = InvalidOid;
Relationirel;
charpersistence;
 
@@ -2313,11 +2322,13 @@ ReindexIndex(RangeVar *indexRelation, int options, bool 
concurrent)
 * obtain lock on table first, to avoid deadlock hazard.  The lock level
 * used here must match the index lock obtained in reindex_index().
 */
+   state.concurrent = concurrent;
+   state.locked_table_oid = InvalidOid;
	indOid = RangeVarGetRelidExtended(indexRelation,
									  concurrent ? ShareUpdateExclusiveLock : AccessExclusiveLock,
									  0,
									  RangeVarCallbackForReindexIndex,
-									  (void *) &heapOid);
+									  &state);
 
/*
 * Obtain the current persistence of the existing index.  We already 
hold
@@ -2350,7 +2361,15 @@ RangeVarCallbackForReindexIndex(const RangeVar *relation,
Oid relId, Oid 
oldRelId, void *arg)
 {
charrelkind;
-   Oid*heapOid = (Oid *) arg;
+   struct ReindexIndexCallbackState *state = arg;
+   LOCKMODEtable_lockmode;
+
+   /*
+* Lock level here should match table lock in reindex_index() for
+* non-concurrent case and table locks used by index_concurrently_*() 
for
+* concurrent case.
+*/
+   table_lockmode = state->concurrent ? ShareUpdateExclusiveLock : 
ShareLock;
 
/*
 * If we previously locked some other index's heap, and the name we're
@@ -2359,9 +2378,8 @@ RangeVarCallbackForReindexIndex(const RangeVar *relation,
 */
if (relId != oldRelId && OidIsValid(oldRelId))
{
-   /* lock level here should match reindex_index() heap lock */
-   UnlockRelationOid(*heapOid, ShareLock);
-   *heapOid = InvalidOid;
+   UnlockRelationOid(state->locked_table_oid, table_lockmode);
+   state->locked_table_oid = InvalidOid;
}
 
/* If the relation does not exist, there's nothing more to do. */
@@ -2389,14 +2407,17 @@ RangeVarCallbackForReindexIndex(const RangeVar 
*relation,
/* Lock heap before index to avoid deadlock. */
if (relId != oldRelId)
{
+   Oid table_oid = IndexGetRelation(relId, 
true);
+
/*
-* Lock level 

Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?

2019-05-03 Thread Peter Eisentraut
On 2019-05-02 22:42, Andres Freund wrote:
> RangeVarGetRelidExtended() can call the callback multiple times, if
> there are any concurrent schema changes. That's why it's unlocking the
> previously locked heap oid.

Ah that explains it then.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Identity columns should own only one sequence

2019-05-03 Thread Laurenz Albe
On Thu, 2019-05-02 at 22:43 +0200, Peter Eisentraut wrote:
> On 2019-04-29 18:28, Laurenz Albe wrote:
> > I still think that there is merit to Michael's idea of removing
> > sequence "ownership" (which is just a dependency) when the DEFAULT
> > on the column is dropped, but this approach is possibly cleaner.
> 
> I think the proper way to address this would be to create some kind of
> dependency between the sequence and the default.

That is certainly true.  But that's hard to retrofit into existing databases,
so it would probably be a modification that is not backpatchable.

Yours,
Laurenz Albe





Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-03 Thread John Naylor
On Thu, May 2, 2019 at 4:57 PM Amit Kapila  wrote:
>
> On Thu, May 2, 2019 at 12:39 PM John Naylor  
> wrote:
> >
> Okay,  I have updated the patch to incorporate your changes and call
> relcache invalidation at required places. I have updated comments at a
> few places as well.  The summarization of this patch is that (a) it
> moves the local map to relation cache (b) performs the cache
> invalidation whenever we create fsm (either via backend or vacuum),
> when some space in a page is freed by vacuum (provided fsm doesn't
> exist) or whenever the local map is cleared (c) additionally, we clear
> the local map when we found a block from FSM, when we have already
> tried all the blocks present in cache or when we are allowed to create
> FSM.
>
> If we agree on this, then we can update the README accordingly.
>
> Can you please test/review?

There isn't enough time. But since I already wrote some debugging
calls earlier (attached), I gave it a brief spin, and I found this patch
isn't as careful as HEAD about making sure we don't try the same block
twice in a row. If you insert enough tuples into an empty table such that we
need to extend, you get something like this:

DEBUG:  Not enough space on block 0
DEBUG:  Now trying block 0
DEBUG:  Not enough space on block 0
DEBUG:  Updating local map for block 0

At this point, I'm sorry to say, but I'm in favor of reverting. There
just wasn't enough time to redesign and debug a feature like this
during feature freeze.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


debug-free-space.patch
Description: Binary data