Re: [HACKERS] Block level parallel vacuum

2019-12-30 Thread Amit Kapila
On Mon, Dec 30, 2019 at 6:37 PM Tomas Vondra
 wrote:
>
> On Mon, Dec 30, 2019 at 10:40:39AM +0530, Amit Kapila wrote:
> >On Mon, Dec 30, 2019 at 2:53 AM Tomas Vondra
> > wrote:
> >>
> >
> >+1.  It is already a separate patch and I think we can even discuss
> >more on it in a new thread once the main patch is committed or do you
> >think we should have a conclusion about it now itself?  To me, this
> >option appears to be an extension to the main feature which can be
> >useful for some users and people might like to have a separate option,
> >so we can discuss it and get broader feedback after the main patch is
> >committed.
> >
>
> I don't think it's an extension of the main feature - it does not depend
> on it, it could be committed before or after the parallel vacuum (with
> some conflicts, but the feature itself is not affected).
>
> My point was that by moving it into a separate thread we're more likely
> to get feedback on it, e.g. from people who don't feel like reviewing
> the parallel vacuum feature and/or feel intimidated by the 100+ messages in
> this thread.
>

I agree with this point.

> >> >>
> >> >> The same thing applies to the PARALLEL flag, added in 0002, BTW. Why do
> >> >> we need a separate VACUUM option, instead of just using the existing
> >> >> max_parallel_maintenance_workers GUC?
> >> >>
> >
> >How will user specify parallel degree?  The parallel degree is helpful
> >because in some cases users can decide how many workers should be
> >launched based on size and type of indexes.
> >
>
> By setting max_parallel_maintenance_workers.
>
> >> >> It's good enough for CREATE INDEX
> >> >> so why not here?
> >> >
> >
> >That is a different feature and I think here users can make a better
> >judgment based on the size of indexes.  Moreover, users have an option
> >to control a parallel degree for 'Create Index' via Alter Table
> > Set (parallel_workers = ) which I am not sure is a good
> >idea for parallel vacuum as the parallelism is more derived from size
> >and type of indexes.  Now, we can think of a similar parameter at the
> >table/index level for parallel vacuum, but I don't see it equally
> >useful in this case.
> >
>
> I'm a bit skeptical about users being able to pick good parallel degree.
> If we (i.e. experienced developers/hackers with quite a bit of
> knowledge) can't come up with a reasonable heuristics, how likely is it
> that a regular user will come up with something better?
>

In this case, it is highly dependent on the number of indexes (as we
can spawn one worker per index), so it is a bit easier for
users to specify it.  Now, we can also determine this internally,
and we do so when the user doesn't specify it; however, that can
easily lead to more resource (CPU, I/O) usage than the user would like
for a particular vacuum.  So, giving the user an option
sounds quite reasonable to me.  Anyway, if the user doesn't specify
the parallel degree, we select one internally.

> Not sure I understand why "parallel_workers" would not be suitable for
> parallel vacuum. I mean, even for CREATE INDEX the size/type of the
> indexes certainly matters, no?
>

The difference here is that in a parallel vacuum each worker can scan a
separate index, whereas parallel_workers is more of an option for
scanning the heap in parallel.  So, if the heap is bigger,
then increasing that value helps, whereas here, if there are more
indexes on the table, increasing the corresponding value for parallel
vacuum can help.
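
To make the comparison concrete, here is a rough sketch of how the two
knobs would be used (the PARALLEL option syntax is the one proposed in
this thread; table t, column a, and the index name are hypothetical):

    -- parallel vacuum: the useful degree follows the number of indexes,
    -- since each worker processes a separate index (proposed syntax)
    VACUUM (PARALLEL 3) t;

    -- parallel CREATE INDEX: the degree follows the heap size, via the
    -- existing reloption and GUC
    ALTER TABLE t SET (parallel_workers = 4);
    SET max_parallel_maintenance_workers = 4;
    CREATE INDEX t_a_idx ON t (a);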

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Reorderbuffer crash during recovery

2019-12-30 Thread vignesh C
On Mon, Dec 30, 2019 at 11:17 AM Amit Kapila  wrote:
>
> On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera  
> wrote:
> >
> > On 2019-Dec-27, vignesh C wrote:
> >
> > > I felt Amit's solution also solves the problem. The attached patch has the
> > > fix based on the solution proposed.
> > > Thoughts?
> >
> > This seems a sensible fix to me, though I didn't try to reproduce the
> > failure.
> >
> > > @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, 
> > > ReorderBufferTXN *txn)
> > >   }
> > >
> > >   ReorderBufferSerializeChange(rb, txn, fd, change);
> > > + txn->final_lsn = change->lsn;
> > >   dlist_delete(&change->node);
> > >   ReorderBufferReturnChange(rb, change);
> >
> > Should this be done inside ReorderBufferSerializeChange itself, instead
> > of in its caller?
> >
>
> makes sense.  But, I think we should add a comment specifying the
> reason why it is important to set final_lsn while serializing the
> change.

Fixed

> >  Also, would it be sane to verify that the TXN
> > doesn't already have a newer final_lsn?  Maybe as an Assert.
> >
>
> I don't think this is a good idea because we update the final_lsn with
> commit_lsn in ReorderBufferCommit after which we can try to serialize
> the remaining changes.  Instead, we should update it only if the
> change_lsn value is greater than final_lsn.
>

Fixed.
Thanks, Alvaro & Amit, for your suggestions. I have made the changes
accordingly; please find the updated patch attached.
I have also verified the patch on the back branches. A separate patch was
required for the REL_10 branch; it is attached as
0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From dcd6c495c32d1dfd8ca3e723a0d45584ab5d10e7 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Mon, 30 Dec 2019 22:59:05 +0530
Subject: [PATCH] Reorder buffer crash while aborting old transactions.

While aborting old transactions, the server crashed because no reorderbuffer
changes were present in the in-memory list: all of the reorderbuffer changes
had been serialized. Fix this by storing the last change's LSN while
serializing the reorderbuffer changes. This LSN is used as final_lsn, which
helps in cleaning up spilled files in pg_replslot.
---
 src/backend/replication/logical/reorderbuffer.c | 26 +++--
 src/include/replication/reorderbuffer.h |  4 ++--
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 53affeb..a807a75 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1956,21 +1956,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
 		if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
 		{
-			/*
-			 * We set final_lsn on a transaction when we decode its commit or
-			 * abort record, but we never see those records for crashed
-			 * transactions.  To ensure cleanup of these transactions, set
-			 * final_lsn to that of their last change; this causes
-			 * ReorderBufferRestoreCleanup to do the right thing.
-			 */
-			if (txn->serialized && txn->final_lsn == 0)
-			{
-ReorderBufferChange *last =
-dlist_tail_element(ReorderBufferChange, node, &txn->changes);
-
-txn->final_lsn = last->lsn;
-			}
-
 			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
 			/* remove potential on-disk data, and deallocate this tx */
@@ -2679,6 +2664,17 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * We set final_lsn on a transaction when we decode its commit or abort
+	 * record, but we never see those records for crashed transactions.  To
+	 * ensure cleanup of these transactions, set final_lsn to that of their
+	 * last change; this causes ReorderBufferRestoreCleanup to do the right
+	 * thing.  final_lsn would already have been set to commit_lsn when we
+	 * decoded the commit record, so there is no need to update it in that case.
+	 */
+	if (txn->final_lsn < change->lsn)
+		txn->final_lsn = change->lsn;
+
 	Assert(ondisk->change.action == change->action);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0867ee9..b58c19a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -188,8 +188,8 @@ typedef struct ReorderBufferTXN
 	 * * plain abort record
 	 * * prepared transaction abort
 	 * * error during decoding
-	 * * for a crashed transaction, the LSN of the last change, regardless of
-	 *   what it was.
+	 * * for a transaction with serialized changes, the LSN of the latest
+	 *   serialized change, unless one of the above cases applies.
 	 * 
 	 */
 	XLogRecPtr	final_lsn;
-- 
1.8.3.1


Re: Cache relation sizes?

2019-12-30 Thread Thomas Munro
On Tue, Dec 31, 2019 at 4:43 PM Kyotaro HORIGUCHI
 wrote:
> I still believe that one shared memory element for every
> non-mapped relation is not only too complex but also too much, as
> Andres (and implicitly I) wrote. I feel that just one flag for
> all works fine, but partitioned flags (that is, relations or files
> corresponding to the same hash value share one flag) can reduce the
> shared memory elements to a fixed small number.

There is one potentially interesting case that doesn't require any
kind of shared cache invalidation AFAICS.  XLogReadBufferExtended()
calls smgrnblocks() for every buffer access, even if the buffer is
already in our buffer pool.  I tried to make yet another quick
experiment-grade patch to cache the size[1], this time for use in
recovery only.

initdb -D pgdata
postgres -D pgdata -c checkpoint_timeout=60min

In another shell:
pgbench -i -s100 postgres
pgbench -M prepared -T60 postgres
killall -9 postgres
mv pgdata pgdata-save

Master branch:

cp -r pgdata-save pgdata
strace -c -f postgres -D pgdata
[... wait for "redo done", then hit ^C ...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
...
 18.61   22.492286          26    849396           lseek
  6.95    8.404369          30    277134           pwrite64
  6.63    8.009679          28    277892           pread64
  0.50    0.604037          39     15169           sync_file_range
...

Patched:

rm -fr pgdata
cp -r pgdata-save pgdata
strace -c -f ~/install/bin/postgres -D pgdata
[... wait for "redo done", then hit ^C ...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
...
 16.33    8.097631          29    277134           pwrite64
 15.56    7.715052          27    277892           pread64
  1.13    0.559648          39     14137           sync_file_range
...
  0.00    0.001505          25        59           lseek

> Note: I'm still not sure how much lseek impacts performance.

It doesn't seem great that we are effectively making system calls for
most WAL records we replay, but, sadly, in this case the patch didn't
really make any measurable difference when run without strace on this
Linux VM.  I suspect there is some workload and stack where it would
make a difference (cf. the read(postmaster pipe) call for every WAL
record that was removed), but this is just something I noticed in
passing while working on something else, so I haven't investigated
much.

[1] https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks
(just a test, unfinished, probably has bugs)




Re: Clarifying/rationalizing Vars' varno/varattno/varnoold/varoattno

2019-12-30 Thread Tom Lane
I wrote:
> Anyway, I had started to work on getting parse analysis to label
> outer-join-nullable Vars properly, and soon decided that no matter how
> we do it, there's not enough information available at the point where
> parse analysis makes a Var.  The referenced RTE is not, in itself,
> enough info, and I don't think we want to decorate RTEs with more info
> that's only needed during parse analysis.  What would be saner is to add
> any extra info to the ParseNamespaceItem structs.

Here is a further step on this journey.  It's still just parser
refactoring, and doesn't (AFAICT) result in any change in generated
parse trees, but it seems worth posting and committing separately.

The two key ideas here are:

1. Integrate ParseNamespaceItems a bit further into the parser's
relevant APIs.  In particular, the addRangeTableEntryXXX functions
no longer return just a bare RTE, but a ParseNamespaceItem wrapper
for it.  This gets rid of a number of kluges we had for finding out
the RT index of the new RTE, since that's now carried along in the
nsitem --- we no longer need fragile assumptions about how the new
RTE is still the last one in the rangetable, at some point rather
distant from where it was actually appended to the list.

Most of the callers of addRangeTableEntryXXX functions just turn
around and pass the result to addRTEtoQuery, which I've renamed
to addNSItemtoQuery; it doesn't gin up a new nsitem anymore but
just installs the one it's passed.  It's perhaps a bit inconsistent
that I renamed that function but not addRangeTableEntryXXX.
I considered making those addNamespaceItemXXX, but desisted on the
perhaps-thin grounds that they don't link the new nsitem into the
parse state, only the RTE.  This could be argued of course.

2. Add per-column information to the ParseNamespaceItems.  As of
this patch, the useful part of that is column type/typmod/collation
info which can be used to generate Vars referencing this RTE.
I envision that the next step will involve generating the Vars'
identity (varno/varattno columns) from that as well, and this
patch includes logic to set up some associated per-column fields.
But those are not actually getting fed into the Vars quite yet.
(The step after that will be to add outer-join-nullability info.)

But independently of those future improvements, this patch is
a win because it allows carrying forward column-type info that's
known at the time we do addRangeTableEntryXXX, and using that
when we make a Var, instead of having to do the rather expensive
computations involved in expandRTE() or get_rte_attribute_type().
get_rte_attribute_type() is indeed gone altogether, and while
expandRTE() is still needed, it's not used in any performance-critical
parse analysis code paths.

On a complex-query test case that I've used before [1], microbenchmarking
just raw parsing plus parse analysis shows a full 20% speedup over HEAD,
which I think can mostly be attributed to getting rid of the syscache
lookups that get_rte_attribute_type() did for Vars referencing base
relations.  The total impact over a complete query execution cycle
is a lot less of course.  Still, it's pretty clearly a performance win,
and to my mind the code is also cleaner --- this is paying down some
technical debt from when we bolted JOIN syntax onto pre-existing
parsing code.

Barring objections, I plan to commit this fairly soon and get onto the
next step, which will start to have ramifications outside the parser.

regards, tom lane

[1] https://www.postgresql.org/message-id/6970.1545327857%40sss.pgh.pa.us

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 8b68fb7..07533f6 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -2531,7 +2531,7 @@ AddRelationNewConstraints(Relation rel,
 	TupleConstr *oldconstr;
 	int			numoldchecks;
 	ParseState *pstate;
-	RangeTblEntry *rte;
+	ParseNamespaceItem *nsitem;
 	int			numchecks;
 	List	   *checknames;
 	ListCell   *cell;
@@ -2554,13 +2554,13 @@ AddRelationNewConstraints(Relation rel,
 	 */
 	pstate = make_parsestate(NULL);
 	pstate->p_sourcetext = queryString;
-	rte = addRangeTableEntryForRelation(pstate,
-		rel,
-		AccessShareLock,
-		NULL,
-		false,
-		true);
-	addRTEtoQuery(pstate, rte, true, true, true);
+	nsitem = addRangeTableEntryForRelation(pstate,
+		   rel,
+		   AccessShareLock,
+		   NULL,
+		   false,
+		   true);
+	addNSItemtoQuery(pstate, nsitem, true, true, true);
 
 	/*
 	 * Process column default expressions.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 42a147b..04d5bde 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -882,6 +882,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 	if (stmt->relation)
 	{
 		LOCKMODE	lockmode = is_from ? RowExclusiveLock : AccessShareLock;
+		ParseNamespaceItem *nsitem;
 		RangeTblEntry *rte;
 		TupleDesc	

Re: [HACKERS] Block level parallel vacuum

2019-12-30 Thread Amit Kapila
On Mon, Dec 30, 2019 at 6:46 PM Tomas Vondra
 wrote:
>
> On Mon, Dec 30, 2019 at 08:25:28AM +0530, Amit Kapila wrote:
> >On Mon, Dec 30, 2019 at 2:53 AM Tomas Vondra
> > wrote:
> >> I think there's another question we need to ask - why do we introduce a
> >> bitmask, instead of using regular boolean struct members? Until now, the
> >> IndexAmRoutine struct had simple boolean members describing capabilities
> >> of the AM implementation. Why shouldn't this patch do the same thing,
> >> i.e. add one boolean flag for each AM feature?
> >>
> >
> >This structure member mostly describes one property of the index, which
> >is about parallel vacuum, and I am not sure that is true for other members.
> >Now, we could use separate bool variables for it, which we were initially
> >using in the patch, but that takes more space in the
> >structure without any advantage.  Also, using one variable makes the
> >code a bit better because otherwise, in many places, we need to check and
> >set four variables instead of one.  This is also the reason we used
> >parallel in its name (we also use *parallel* for parallel index scan
> >related things).  Having said that, we can remove parallel from its
> >name if we want to extend/use it for something other than a parallel
> >vacuum.  I think we might need to add a flag or two for parallelizing
> >the heap scan of vacuum when we enhance this feature, so keeping it for
> >just a parallel vacuum is not completely insane.
> >
> >Keeping amusemaintenanceworkmem separate from this variable seems to
> >me like a better idea, as it doesn't describe whether an index AM
> >can participate in a parallel vacuum or not.  You can see more
> >discussion about that variable in the thread [1].
> >
>
> I don't know, but IMHO it's somewhat easier to work with separate flags.
> Bitmasks make sense when space usage matters a lot, e.g. for on-disk
> representation, but that doesn't seem to be the case here I think (if it
> was, we'd probably use bitmasks already).
>
> It seems like we're mixing two ways to design the struct unnecessarily,
> but I'm not going to nag about this any further.
>

Fair enough.  I see your point.  As mentioned earlier, we started with
the approach of separate booleans, but later found that this is a better
way, as it was easier to set and check the different parallel options
for a parallel vacuum.  I think we can go back to the individual
booleans if we want, but I am not sure that is a better approach for
this usage.  Sawada-San, others, do you have any opinion here?

> >> >>
> >> >>
> >> >> v40-0004-Add-ability-to-disable-leader-participation-in-p.patch
> >> >> ---
> >> >>
> >> >> IMHO this should be simply merged into 0002.
> >> >
> >> >We discussed it's still unclear whether we really want to commit this
> >> >code and therefore it's separated from the main part. Please see more
> >> >details here[2].
> >> >
> >>
> >> IMO there's not much reason for the leader not to participate.
> >>
> >
> >The only reason for this is just a debugging/testing aid because
> >during the development of other parallel features we required such a
> >knob.  The other way could be to have something similar to
> >force_parallel_mode and there is some discussion about that as well on
> >this thread but we haven't concluded which is better.  So, we decided
> >to keep it as a separate patch which we can use to test this feature
> >during development and decide later whether we really need to commit
> >it.  BTW, we have found a few bugs by using this knob in the patch.
> >
>
> OK, understood. Then why not just use force_parallel_mode?
>

Because we are not sure what its behavior should be under the different
modes, especially what we should do when the user sets force_parallel_mode =
on.  We could even consider introducing a new GUC specific to this, but
as of now, I am not convinced that is required.  See some more
discussion around this parameter in emails [1][2].  I think we can
decide on this later (probably once the main patch is committed) as we
already have one way to test the patch.

[1] - 
https://www.postgresql.org/message-id/CAFiTN-sUuLASVXm2qOjufVH3tBZHPLdujMJ0RHr47Tnctjk9YA%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CA%2Bfd4k6VgA_DG%3D8%3Dui7UvHhqx9VbQ-%2B72X%3D_GdTzh%3DJ_xN%2BVEg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-30 Thread Alvaro Herrera
On 2019-Dec-31, Andrew Dunstan wrote:

> Turns out my arithmetic was a bit off. Animal Farm is 90 kb, Madame
> Bovary 678 Kb.

Ten animal farms should be enough for everybody.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: archive status ".ready" files may be created too early

2019-12-30 Thread Bossart, Nathan
On 12/23/19, 6:09 PM, "Kyotaro Horiguchi"  wrote:
> You're right. That doesn't seem to work. Another thing I had in
> mind was that XLogWrite could take an additional flag so that
> AdvanceXLInsertBuffer can tell it not to mark .ready. The function is
> called while it *is* writing a complete record. So even if
> AdvanceXLInsertBuffer inhibits marking .ready, the succeeding bytes
> come soon and we can mark the old segment as .ready at that time.
>
> ..
> + * If record_write == false, we don't mark the last segment as .ready
> + * if the caller requested to write up to segment boundary.
> ..
>   static void
> - XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
> + XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool record_write)
>
> When XLogWrite is called with record_write = false, we don't mark
> .ready and don't advance lastSegSwitchTime/LSN. At the next time
> XLogWrite is called with record_write=true, if lastSegSwitchLSN is
> behind the latest segment boundary before or equal to
> LogwrtResult.Write, mark the skipped segments as .ready and update
> lastSegSwitchTime/LSN.

Thanks for the suggestion.  I explored this proposal a bit today.
It looks like there are three places where XLogWrite() is called:
AdvanceXLInsertBuffer(), XLogFlush(), and XLogBackgroundFlush().  IIUC
while XLogFlush() generally seems to be used to write complete records
to disk, this might not be true for XLogBackgroundFlush(), and we're
reasonably sure we cannot make such an assumption for
AdvanceXLInsertBuffer().  Therefore, we would likely only set
record_write to true for XLogFlush() and for certain calls to
XLogBackgroundFlush (e.g. flushing asynchronous commits).

I'm worried that this approach could be fragile and that we could end
up waiting an arbitrarily long time before marking segments as ready
for archival.  Even if we pay very close attention to the latest
flushed LSN, it seems possible that a non-record_write call to
XLogWrite() advances things such that we avoid ever calling it with
record_write = true.  For example, XLogBackgroundFlush() may have
flushed the completed blocks, which we cannot assume are complete
records.  Then, XLogFlush() would skip calling XLogWrite() if
LogwrtResult.Flush is sufficiently far ahead.  In this scenario, I
don't think we would mark any eligible segments as ".ready" until the
next call to XLogWrite() with record_write = true, which may never
happen.

The next approach I'm going to try is having the callers of
XLogWrite() manage marking segments ready.  That might make it easier
to mitigate some of my concerns above, but I'm not tremendously
optimistic that this approach will fare any better.

Nathan



Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-30 Thread Andrew Dunstan
On Tue, Dec 31, 2019 at 11:16 AM Tomas Vondra
 wrote:
>
> On Tue, Dec 31, 2019 at 10:25:26AM +1030, Andrew Dunstan wrote:
> >On Tue, Dec 31, 2019 at 10:16 AM Andrew Dunstan
> > wrote:
> >>
> >> On Tue, Dec 31, 2019 at 9:38 AM Tom Lane  wrote:
> >>
> >> >
> >> > This doesn't seem like a reason not to allow a higher limit, like a
> >> > megabyte or so, but I'm not sure that pushing it to the moon would be
> >> > wise.
> >> >
> >>
> >>
> >> Just to get a mental handle on the size of queries we might be
> >> allowing before truncation, I did some very rough arithmetic on what
> >> well known texts might fit in a megabyte. By my calculations you could
> >> fit about four Animal Farms or one Madame Bovary in about a megabyte.
> >> So I think that seems like more than enough :-). (My mind kinda
> >> explodes at the thought of debugging a query as long as Animal Farm.)
> >>
> >
> >
> >Turns out my arithmetic was a bit off. Animal Farm is 90 kb, Madame
> >Bovary 678 Kb.
> >
>
> Not sure, but the Animal Farm text I found is about ~450kB (~120 pages,
> with ~3kB per page) ...


My browser has led me astray.
http://gutenberg.net.au/ebooks01/0100011.txt is in fact 172618 bytes.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[PATCH] Fix parallel query doc typos

2019-12-30 Thread Jon Jensen

Hi, hackers.

Attached are 2 tiny doc typo fixes.

Jon
From ed4e539d5eade0a3e20d42e35239141f14a3b24b Mon Sep 17 00:00:00 2001
From: Jon Jensen 
Date: Mon, 30 Dec 2019 17:45:30 -0700
Subject: [PATCH] Fix doc typos

---
 doc/src/sgml/parallel.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index af5d48a5c7..95306287e2 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -277,7 +277,7 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
 scan, the cooperating processes take turns reading data from the
 index.  Currently, parallel index scans are supported only for
 btree indexes.  Each process will claim a single index block and will
-scan and return all tuples referenced by that block; other process can
+scan and return all tuples referenced by that block; other processes can
 at the same time be returning tuples from a different index block.
 The results of a parallel btree scan are returned in sorted order
 within each worker process.
@@ -410,7 +410,7 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
 involve appending multiple results sets can therefore achieve
 coarse-grained parallelism even when efficient partial plans are not
 available.  For example, consider a query against a partitioned table
-which can be only be implemented efficiently by using an index that does
+which can only be implemented efficiently by using an index that does
 not support parallel scans.  The planner might choose a Parallel
 Append of regular Index Scan plans; each
 individual index scan would have to be executed to completion by a single
-- 
2.24.1



Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-30 Thread Tomas Vondra

On Tue, Dec 31, 2019 at 10:25:26AM +1030, Andrew Dunstan wrote:

On Tue, Dec 31, 2019 at 10:16 AM Andrew Dunstan
 wrote:


On Tue, Dec 31, 2019 at 9:38 AM Tom Lane  wrote:

>
> This doesn't seem like a reason not to allow a higher limit, like a
> megabyte or so, but I'm not sure that pushing it to the moon would be
> wise.
>


Just to get a mental handle on the size of queries we might be
allowing before truncation, I did some very rough arithmetic on what
well known texts might fit in a megabyte. By my calculations you could
fit about four Animal Farms or one Madame Bovary in about a megabyte.
So I think that seems like more than enough :-). (My mind kinda
explodes at the thought of debugging a query as long as Animal Farm.)




Turns out my arithmetic was a bit off. Animal Farm is 90 kb, Madame
Bovary 678 Kb.



Not sure, but the Animal Farm text I found is about ~450kB (~120 pages,
with ~3kB per page) ...

Anyway, the longest queries I personally saw in production were a couple
of kB long (~32kB IIRC, it's been a couple of years). The queries were
generated by the application (not quite a traditional ORM, but something
like it), with identifiers (e.g. table names) made pretty long by an
embedded hash (so usually the full 63 characters). Plus the
columns were always fully qualified, with multiple joins, etc.

Not sure what a good limit would be. Obviously, if we pick value X, the
next day someone will come asking for X+1 ... ;-)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




remove separate postgres.(sh)description files

2019-12-30 Thread John Naylor
I'm guessing the initial data for pg_(sh)description is output into
separate files because it was too difficult for the traditional shell
script to maintain enough state to do otherwise. With Perl, it's just
as easy to assemble the data into the same format as the rest of the
catalogs and then let the generic code path output it into
postgres.bki. The attached patch does that and simplifies the catalog
makefile and initdb.c. I'll add a commitfest entry for this.
--
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v1-0001-Remove-separate-files-for-the-initial-contents-of.patch
Description: Binary data


Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-30 Thread Peter Geoghegan
On Tue, Dec 24, 2019 at 4:29 AM Anastasia Lubennikova
 wrote:
> I updated the patchset.
> The first patch now contains only infrastructure changes
> and the second one sets opcisbitwise for btree opclasses in pg_opclass.dat.

We should try to formally define what we're trying to represent about
B-Tree opclasses here -- the definition of
"opcisbitwise"/preciseness/whatever should be tightened up. In
particular, it should be clear how the "image" binary row comparators
[1] (i.e. "operator *= equality" stuff) fit in. This new concept
should be defined in terms of that existing concept -- we're talking
about exactly the same variety of "internal binary equality" here, I
think.

I propose that we adopt the following definition: For an operator
class to be safe, its equality operator has to always agree with
datum_image_eq() (i.e. two datums must be bitwise equal after
detoasting).
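
As a concrete illustration (mine, not something from the patch):
numeric's btree equality operator treats values as equal even when the
stored datums differ in display scale, so under this definition
btree/numeric_ops would not be safe:

    -- equal according to the opclass's equality operator ...
    SELECT 1.0::numeric = 1.000::numeric;             -- true
    -- ... but the datums are not bitwise identical, as the preserved
    -- display scale shows, so datum_image_eq() would disagree
    SELECT 1.0::numeric::text, 1.000::numeric::text;  -- '1.0' vs '1.000'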

(Maybe we should say something about "operator *= equality" as well
(or instead), since that is already documented in [1].)

We may also want to say something about foreign keys in this formal
definition of "opcisbitwise"/preciseness. Discussion around the bug
fixed by commit 1ffa59a85cb [2] showed that there was plenty of
confusion in this area. Commit 1ffa59a85cb simply solved the problem
that existed with foreign keys, without "joining the dots". It reused
the rowtypes.c "operator *= equality" stuff to fix the problem, but
only in an ad-hoc and undocumented way. Let's not do that again now.

Note: In theory this definition is stricter than truly necessary to
make deduplication safe, because we can imagine a contrived case in
which an operator class exists where datum_image_eq() does not always
agree with the equality operator, even though the equality operator
will reliably consider two datums to be equal only when they have
identical outputs from the underlying type's output function. This
could happen when an operator class author wasn't very careful about
zeroing padding -- this may not have mattered to the opclass author
because nobody relied on that padding anyway. I think that stuff like
this is not worth worrying about -- it can only happen because the
datatype/operator class author was very sloppy.

Note also: We seem to make this assumption already. Maybe this
uninitialized bytes side issue doesn't even need to be pointed out or
discussed. The comment just above VALGRIND_CHECK_MEM_IS_DEFINED()
within PageAddItemExtended() seems to suggest this. The comment
specifically mentions datumIsEqual() (not datum_image_eq()), but it's
exactly the same issue.

[1] 
https://www.postgresql.org/docs/devel/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON
[2] 
https://www.postgresql.org/message-id/flat/3326fc2e-bc02-d4c5-e3e5-e54da466e89a%402ndquadrant.com

-- 
Peter Geoghegan




Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-30 Thread Andrew Dunstan
On Tue, Dec 31, 2019 at 10:16 AM Andrew Dunstan
 wrote:
>
> On Tue, Dec 31, 2019 at 9:38 AM Tom Lane  wrote:
>
> >
> > This doesn't seem like a reason not to allow a higher limit, like a
> > megabyte or so, but I'm not sure that pushing it to the moon would be
> > wise.
> >
>
>
> Just to get a mental handle on the size of queries we might be
> allowing before truncation, I did some very rough arithmetic on what
> well known texts might fit in a megabyte. By my calculations you could
> fit about four Animal Farms or one Madame Bovary in about a megabyte.
> So I think that seems like more than enough :-). (My mind kinda
> explodes at the thought of debugging a query as long as Animal Farm.)
>


Turns out my arithmetic was a bit off. Animal Farm is 90 kb, Madame
Bovary 678 Kb.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-30 Thread Andrew Dunstan
On Tue, Dec 31, 2019 at 9:38 AM Tom Lane  wrote:

>
> This doesn't seem like a reason not to allow a higher limit, like a
> megabyte or so, but I'm not sure that pushing it to the moon would be
> wise.
>


Just to get a mental handle on the size of queries we might be
allowing before truncation, I did some very rough arithmetic on what
well known texts might fit in a megabyte. By my calculations you could
fit about four Animal Farms or one Madame Bovary in about a megabyte.
So I think that seems like more than enough :-). (My mind kinda
explodes at the thought of debugging a query as long as Animal Farm.)

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-30 Thread Tom Lane
Tomas Vondra  writes:
> 2) What's the overhead for increasing the value for short/long queries?

> My assumption is that for short queries, it's going to be negligible.
> For longer queries it may be measurable, but I'd expect longer queries
> to be more expensive in general, so maybe it's still negligible.

The thing that has been bothering me is the idea that backends reading
st_activity_raw might palloc the max possible length and/or memcpy the
whole buffer rather than just the valid part.  Having now rooted through
pgstat.c, that appears to be half true: the local allocation made by
pgstat_read_current_status() will be just as large as the shared-memory
arena, but we use strcpy() or equivalent so that each query copy should
stop upon hitting a '\0'.  So the run-time cost should be negligible, but
you might be eating a lot of memory if multiple sessions are inspecting
pg_stat_activity and you cranked the setting up imprudently high.

This doesn't seem like a reason not to allow a higher limit, like a
megabyte or so, but I'm not sure that pushing it to the moon would be
wise.

Meanwhile, I noted what seems like a pretty obvious bug in
pg_stat_get_backend_activity():

clipped_activity = pgstat_clip_activity(activity);
ret = cstring_to_text(activity);
pfree(clipped_activity);

We're not actually applying the intended clip to the returned
value, so that an invalidly-encoded result is possible.

(Of course, since we also don't seem to be making any attempt
to translate from the source backend's encoding to our own,
there's more problems here than just that.)

regards, tom lane




Re: Allow an alias to be attached directly to a JOIN ... USING

2019-12-30 Thread Vik Fearing
On 30/12/2019 22:25, Peter Eisentraut wrote:
> On 2019-12-24 19:13, Fabien COELHO wrote:
 Indeed, that seems like a problem, and it's a good question.  You can
 see this on unpatched master with SELECT x.filler FROM
 (pgbench_tellers AS t JOIN b USING (bid)) AS x.
>>>
>>> I'm not sure I understand why that problem is a blocker for this patch.
>>
>> As discussed on another thread,
>>
>>  
>> https://www.postgresql.org/message-id/flat/2aa57950-b1d7-e9b6-0770-fa592d565...@2ndquadrant.com
>>
>> the patch does not conform to spec
>>
>>     SQL:2016 Part 2 Foundation Section 7.10 
>>
>> Basically "x" is expected to include *ONLY* joined attributes with
>> USING,
>> i.e. above only x.bid should exist, and per-table aliases are
>> expected to
>> still work for other attributes.
>
> I took another crack at this.  Attached is a new patch that addresses
> the semantic comments from this and the other thread.  It's all a bit
> tricky, comments welcome.


Excellent!  Thank you for working on this, Peter.


One thing I notice is that the joined columns are still accessible from
their respective table names when they should not be per spec.  That
might be one of those "silly restrictions" that we choose to ignore, but
it should probably be noted somewhere, at the very least in a code
comment if not in user documentation. (This is my reading of SQL:2016 SR
11.a.i)
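
For example (my sketch, with hypothetical tables t1 and t2 that share a
column a), my reading is that only the first of these should be
accepted, but the patch currently accepts both:

    SELECT x.a  FROM t1 JOIN t2 USING (a) AS x;   -- per spec: OK
    SELECT t1.a FROM t1 JOIN t2 USING (a) AS x;   -- per spec: arguably should be rejected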

-- 

Vik Fearing





Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-30 Thread Peter Geoghegan
On Mon, Dec 30, 2019 at 9:45 AM Robert Haas  wrote:
> > For example, float and numeric types are "never bitwise equal", while array,
> > text, and other container types are "maybe bitwise equal". An array of
> > integers
> > or text with C collation can be treated as bitwise equal attributes, and it
> > would be too harsh to restrict them from deduplication.

We might as well support container types (like array) in the first
Postgres version that has nbtree deduplication, I suppose. Even still,
I don't think that it actually matters much to users. B-Tree indexes
on arrays are probably very rare. Note that I don't consider text to
be a container type here -- obviously btree/text_ops is a very
important opclass for the deduplication feature. It may be the most
important opclass overall.

Recursively invoking a support function for the "contained" data type
in the btree/array_ops support function seems like it might be messy.
Not sure about that, though.

> > What bothers me is that this option will unlikely be helpful on its own
> > and we
> > should also provide some kind of recheck function along with opclass, which
> > complicates this idea even further and doesn't seem very clear.
>
> It seems like the simplest thing might be to forget about the 'char'
> column and just have a support function which can be used to assess
> whether a given opclass's notion of equality is bitwise.

I like the idea of relying only on a support function.

This approach makes collations a problem that the opclass author has
to deal with directly, as is the case within a SortSupport support
function. Also seems like it would make life easier for third party
data types that want to make use of these optimizations (if in fact
there are any).

I also see little downside to this approach. The extra cycles
shouldn't be noticeable. As far as the B-Tree deduplication logic is
concerned, the final boolean value (is deduplication safe?) comes from
the index metapage -- we pass that down through an insertion scankey.
We only need to determine whether or not the optimization is safe at
CREATE INDEX time. (Actually, I don't want to commit to the idea that
nbtree should only call this support function at CREATE INDEX time
right now. I'm sure that it will hardly ever need to be called,
though.)

-- 
Peter Geoghegan




Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-30 Thread Tomas Vondra

On Mon, Dec 30, 2019 at 12:46:40PM -0800, Robert Treat wrote:

On Mon, Dec 23, 2019 at 9:11 PM Robert Haas  wrote:


On Sat, Dec 21, 2019 at 1:25 PM Tom Lane  wrote:
> > What is the overhead here except the memory consumption?
>
> The time to copy those strings out of shared storage, any time
> you query pg_stat_activity.

It seems like you're masterminding this, and I don't know why. It
seems unlikely that anyone will raise the value unless they have very
long queries, and if those people would rather pay the overhead of
copying more data than have their queries truncated, who are we to
argue?

If increasing the maximum imposed some noticeable cost on
installations that kept the default setting, that might well be a good
argument for not raising the maximum. But I don't think that's the
case. I also suspect that the overhead would be pretty darn small even
for people who *do* raise the default setting. It looks to me like
both reading and write operations on st_activity_raw stop when they
hit a NUL byte, so any performance costs on short queries must come
from second-order effects (e.g. the main shared memory segment is
bigger, so the OS cache is smaller) which are likely irrelevant in
practice.



I'm generally in favor of the idea of allowing people to make
trade-offs that best work for them, but Tom's concern does give me
pause, because it isn't clear to me how people will measure the
overhead of upping this setting. If given the option people will
almost certainly start raising this limit because the benefits are
obvious ("I can see all my query now!") but so far the explanation of
the downsides have been either hand-wavy or, in the case of your
second paragraph, an argument they are non-existent, which doesn't
seem right either; so how do we explain to people how to measure the
overhead for them?



I think there are two questions that we need to answer:

1) Does allowing higher values for the GUC mean overhead for people who
don't actually increase it?

I don't think so.

2) What's the overhead for increasing the value for short/long queries?

My assumption is that for short queries, it's going to be negligible.
For longer queries it may be measurable, but I'd expect longer queries
to be more expensive in general, so maybe it's still negligible.

Of course, the easiest thing we can do is actually measuring this.
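
As a cheap first data point (my suggestion; it measures truncation, not
the copying overhead itself), one can check how many sessions are even
hitting the current limit, which tells you whether raising it would
matter at all:

    -- sessions whose reported query text appears to be truncated at the
    -- current limit (clipping can stop slightly earlier at a multibyte
    -- character boundary)
    SELECT pid, octet_length(query) AS query_bytes
      FROM pg_stat_activity
     WHERE octet_length(query) >=
           pg_size_bytes(current_setting('track_activity_query_size')) - 1;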

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow an alias to be attached directly to a JOIN ... USING

2019-12-30 Thread Peter Eisentraut

On 2019-12-24 19:13, Fabien COELHO wrote:

Indeed, that seems like a problem, and it's a good question.  You can
see this on unpatched master with SELECT x.filler FROM
(pgbench_tellers AS t JOIN b USING (bid)) AS x.


I'm not sure I understand why that problem is a blocker for this patch.


As discussed on another thread,

  
https://www.postgresql.org/message-id/flat/2aa57950-b1d7-e9b6-0770-fa592d565...@2ndquadrant.com

the patch does not conform to spec

SQL:2016 Part 2 Foundation Section 7.10 

Basically "x" is expected to include *ONLY* joined attributes with USING,
i.e. above only x.bid should exist, and per-table aliases are expected to
still work for other attributes.


I took another crack at this.  Attached is a new patch that addresses 
the semantic comments from this and the other thread.  It's all a bit 
tricky, comments welcome.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From e3118a93b6df9ee8144a9c0e8454860f66b8e3f5 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 30 Dec 2019 22:15:46 +0100
Subject: [PATCH v2] Allow an alias to be attached to a JOIN ... USING

This allows something like

SELECT ... FROM t1 JOIN t2 USING (a, b, c) AS x

where x has the columns a, b, c and unlike a regular alias it does not
hide the range variables of the tables being joined t1 and t2.

Per SQL:2016 feature F404 "Range variable for common column names".
---
 doc/src/sgml/ref/select.sgml  | 11 +++-
 src/backend/catalog/sql_features.txt  |  2 +-
 src/backend/nodes/copyfuncs.c |  2 +
 src/backend/nodes/equalfuncs.c|  2 +
 src/backend/nodes/outfuncs.c  |  2 +
 src/backend/nodes/readfuncs.c |  2 +
 src/backend/parser/analyze.c  |  1 +
 src/backend/parser/gram.y | 69 +++
 src/backend/parser/parse_clause.c | 44 +++
 src/backend/parser/parse_relation.c   | 42 +-
 src/backend/utils/adt/ruleutils.c |  4 ++
 src/include/nodes/parsenodes.h| 13 -
 src/include/nodes/primnodes.h |  1 +
 src/include/parser/parse_node.h   |  1 +
 src/include/parser/parse_relation.h   |  1 +
 src/test/regress/expected/create_view.out | 52 -
 src/test/regress/expected/join.out| 31 ++
 src/test/regress/sql/create_view.sql  | 11 
 src/test/regress/sql/join.sql |  8 +++
 19 files changed, 258 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 691e402803..36416085c1 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -59,7 +59,7 @@
 [ LATERAL ] function_name ( [ 
argument [, ...] ] ) AS ( 
column_definition [, ...] )
 [ LATERAL ] ROWS FROM( function_name ( [ argument [, ...] ] ) [ AS ( column_definition [, ...] ) ] [, ...] )
 [ WITH ORDINALITY ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
-from_item [ NATURAL ] 
join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
+from_item [ NATURAL ] 
join_type from_item [ ON join_condition | USING ( join_column [, ...] ) [ AS join_using_alias ] ]
 
 and grouping_element can 
be one of:
 
@@ -638,6 +638,15 @@ FROM 
Clause
 equivalent columns will be included in the join output, not
 both.

+
+   
+If a join_using_alias is
+specified, it gives a correlation name to the join columns.  Only the
+join columns in the USING clause are addressable by
+this name.  Unlike an alias, this does not hide the names of
+the joined tables from the rest of the query.
+   
   
  
 
diff --git a/src/backend/catalog/sql_features.txt 
b/src/backend/catalog/sql_features.txt
index ab3e381cff..cb8fd9ddb9 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -264,7 +264,7 @@ F401Extended joined table   02  FULL OUTER JOIN 
YES
 F401   Extended joined table   04  CROSS JOIN  YES 
 F402   Named column joins for LOBs, arrays, and multisets  
YES 
 F403   Partitioned joined tables   NO  
-F404   Range variable for common column names  NO  
+F404   Range variable for common column names  YES 
 F411   Time zone specification YES differences regarding 
literal interpretation
 F421   National character  YES 
 F431   Read-only scrollable cursorsYES 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a9b8b84b8f..117079409b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2168,6 +2168,7 @@ _copyJoinExpr(const JoinExpr *from)
COPY_NODE_FIELD(rarg);
COPY_NODE_FIELD(usingClause);
  

Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-30 Thread Robert Treat
On Mon, Dec 23, 2019 at 9:11 PM Robert Haas  wrote:
>
> On Sat, Dec 21, 2019 at 1:25 PM Tom Lane  wrote:
> > > What is the overhead here except the memory consumption?
> >
> > The time to copy those strings out of shared storage, any time
> > you query pg_stat_activity.
>
> It seems like you're masterminding this, and I don't know why. It
> seems unlikely that anyone will raise the value unless they have very
> long queries, and if those people would rather pay the overhead of
> copying more data than have their queries truncated, who are we to
> argue?
>
> If increasing the maximum imposed some noticeable cost on
> installations that kept the default setting, that might well be a good
> argument for not raising the maximum. But I don't think that's the
> case. I also suspect that the overhead would be pretty darn small even
> for people who *do* raise the default setting. It looks to me like
> both reading and write operations on st_activity_raw stop when they
> hit a NUL byte, so any performance costs on short queries must come
> from second-order effects (e.g. the main shared memory segment is
> bigger, so the OS cache is smaller) which are likely irrelevant in
> practice.
>

I'm generally in favor of the idea of allowing people to make
trade-offs that best work for them, but Tom's concern does give me
pause, because it isn't clear to me how people will measure the
overhead of upping this setting. If given the option people will
almost certainly start raising this limit because the benefits are
obvious ("I can see all my query now!") but so far the explanation of
the downsides have been either hand-wavy or, in the case of your
second paragraph, an argument they are non-existent, which doesn't
seem right either; so how do we explain to people how to measure the
overhead for them?

Robert Treat
https://xzilla.net




Re: proposal: schema variables

2019-12-30 Thread Pavel Stehule
Hi

On Mon, Dec 30, 2019 at 17:27, Philippe BEAUDOIN wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:tested, failed
>
> Hi Pavel,
>
> I have tested the latest version of your patch.
> Both issues I reported are now fixed. And you largely applied my
> proposals. That's great!
>
> I have also spent some time to review more closely the documentation. I
> will send you a direct mail with an attached file for some minor comments
> on this topic.
>
> Except these documentation remarks to come, I haven't any other issue or
> suggestion to report.
> Note that I have not closely looked at the C code itself. But may be some
> other reviewers have already done that job.
> If yes, my feeling is that the patch could soon be set as "Ready for
> committer".
>
> Best regards. Philippe.
>
> The new status of this patch is: Waiting on Author
>

Thank you very much for your comments, and notes. Updated patch attached.

Regards

Pavel


schema-variables-20191230.patch.gz
Description: application/gzip


Re: Recognizing superuser in pg_hba.conf

2019-12-30 Thread David Fetter
On Mon, Dec 30, 2019 at 11:56:17AM +0100, Vik Fearing wrote:
> On 29/12/2019 23:10, Vik Fearing wrote:
> > On 29/12/2019 17:31, Tom Lane wrote:
> >> Robert Haas  writes:
> >>> On Sat, Dec 28, 2019 at 2:02 PM Vik Fearing  
> >>> wrote:
>  I'm all for this (and even suggested it during the IRC conversation that
>  prompted this patch). It's rife with bikeshedding, though.  My original
>  proposal was to use '&' and Andrew Gierth would have used ':'.
> >>> I think this is a good proposal regardless of which character we
> >>> decide to use. My order of preference from highest-to-lowest would
> >>> probably be :*&, but maybe that's just because I'm reading this on
> >>> Sunday rather than on Tuesday.
> >> I don't have any particular objection to '&' if people prefer that.
> >
> > I wrote the patch so I got to decide. :-)  I will also volunteer to do
> > the grunt work of changing the symbol if consensus wants that, though.
> >
> >
> > It turns out that my original patch didn't really change, all the meat
> > is in the keywords patch.  The superuser patch is to be applied on top
> > of the keywords patch.
> >
> 
> I missed a few places in the tap tests.  New keywords patch attached,
> superuser patch unchanged.
> 
> -- 
> 
> Vik Fearing
> 


Patches apply cleanly to 0ce38730ac72029f3f2c95ae80b44f5b9060cbcc, and
include documentation. They could use an example of the new
capability, possibly included in the sample pg_hba.conf, e.g. 

host    &all    &superuser    0.0.0.0/0    reject

or similar.

The feature works as described, and is useful. I have thus far been
unable to make it crash.

I haven't used intentionally hostile strings to test it, as I didn't
see those as an important attack surface. This is because by the time
someone hostile can write to pg_hba.conf, they've got all the control
they need to manipulate the entire node, including root exploits.

I've marked this as Ready for Committer.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: comment regarding double timestamps; and, infinite timestamps and NaN

2019-12-30 Thread Tom Lane
Justin Pryzby  writes:
> On Mon, Dec 30, 2019 at 09:05:24AM -0500, Tom Lane wrote:
>> Uh, what?  This seems completely wrong to me.  We could possibly
>> promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
>> I don't really see the point.  They'll compare to other timestamp
>> values correctly without that, cf timestamp_cmp_internal().
>> The example you give seems to me to be working sanely, or at least
>> as sanely as it can given the number of histogram points available,
>> with the existing code.  In any case, shoving NaNs into the
>> computation is not going to make anything better.

> As I see it, the problem is that the existing code tests for isnan(), but
> infinite timestamps are PG_INT64_MIN/MAX (here, stored in a double), so
> there are absurdly large values being used as if they were isnormal().

I still say that (1) you're confusing NaN with Infinity, and (2)
you haven't actually shown that there's a problem to fix.
These endpoint values are *not* NaNs.

> On v12, my test gives:
> |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> |INSERT INTO t VALUES('-infinity');
> |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> | Seq Scan on t  (cost=0.00..5.62 rows=3 width=8) (actual time=0.012..0.042 rows=289 loops=1)

This is what it should do.  There's only one histogram bucket, and
it extends down to -infinity, so the conclusion is going to be that
the WHERE clause excludes all but a small part of the bucket.  This
is the correct answer based on the available stats; the problem is
not with the calculation, but with the miserable granularity of the
available stats.

> vs patched master:
> |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> |INSERT INTO t VALUES('-infinity');
> |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> | Seq Scan on t  (cost=0.00..5.62 rows=146 width=8) (actual time=0.048..0.444 rows=289 loops=1)

This answer is simply broken.  You've caused it to estimate half
of the bucket, which is an insane estimate for the given bucket
boundaries and WHERE constraint.

> IMO 146 rows is a reasonable estimate given a single histogram bucket of
> infinite width,

No, it isn't.

regards, tom lane




Re: [HACKERS] pg_shmem_allocations view

2019-12-30 Thread Robert Haas
On Wed, Dec 18, 2019 at 9:53 PM Kyotaro Horiguchi
 wrote:
> The doc is saying that "size" is "Size of the allocation" and
> "allocated_size" is "size including padding". It seems somewhat
> confusing to me. I'm not sure what wording is best, but I think people
> use net/gross wording to describe something like that.

I don't think I'd find that particularly clear. It seems to me that if
the second size includes padding, then the first one differs in not
including padding, so I'm not really sure where the confusion is. I
thought about writing, for the first one, that it is the requested size
of the allocation, but that seemed like it might confuse someone by
suggesting that the actual size of the allocation might be less than
what was requested. I also thought about describing the first one as
the size excluding padding, but that seems redundant. Maybe it would
be good to change the second one to say "Size of the allocation
including padding added by the allocator itself."
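
For what it's worth, here is a quick query sketch (column names as
described in this thread; anonymous allocations would show up with a
NULL name) that makes the relationship between the two sizes easy to
eyeball:

    SELECT name, size, allocated_size, allocated_size - size AS padding
      FROM pg_shmem_allocations
     ORDER BY allocated_size DESC
     LIMIT 10;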

> > All of this makes me think that we might want to do some followup to
> > (1) convert anonymous allocations to non-anonymous allocations, for
> > inspectability, and (2) do some renaming to get better stylistic
> > agreement between the names of various shared memory chunks. But I
> > think that work is properly done after this patch is committed, not
> > before.
>
> I agree with (2), but regarding (1), most or perhaps all of the
> anonymous allocations seem to belong to one of the shared hashes but
> are recognized as holes, as shown by the above statement. So the current
> output of the view is wrong in that sense. I think (1) should be
> resolved before adding the view.

I guess I don't understand how this makes the output wrong. Either the
allocations have a name, or they are anonymous. This dumps everything
that has a name. What would you suggest? It seems to me that it's more
appropriate for this patch to just tell us about what's in shared
memory, not change what's in shared memory. If we want to do the
latter, that's a job for another patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests

2019-12-30 Thread Robert Haas
On Tue, Dec 24, 2019 at 5:42 AM Suraj Kharage
 wrote:
> Made the change in the backup manifest as well as in the backup validator patch. Thanks 
> to Rushabh Lathia for the offline discussion and help.
>
> To examine the first word of each line, I am using below check:
> if (strncmp(line, "File", 4) == 0)
> {
> ..
> }
> else if (strncmp(line, "Manifest-Checksum", 17) == 0)
> {
> ..
> }
> else
> error
>
> strncmp might not be right here, but we cannot put a '\0' in the middle of the line
> (to find the first word) before we recognize the line type.
> All the lines except the last one (where we have the manifest checksum) are fed
> to the checksum machinery to calculate the manifest checksum,
> so update_checksum() should be called after recognizing the type, i.e. if it
> is a File type record. Do you see any issues with this?

I see the problem, but I don't think your solution is right, because
the first test would pass if the line said FiletMignon rather than
just File, which we certainly don't want. You've got to write the test
so that you're checking against the whole first word, not just some
prefix of it. There are several possible ways to accomplish that, but
this isn't one of them.
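(A minimal sketch of one such way, assuming the manifest fields are separated by
spaces or tabs; this is an illustration, not the patch's code:)

#include <stdbool.h>
#include <string.h>

static bool
first_word_is(const char *line, const char *word)
{
    size_t  len = strlen(word);

    /* match only if the keyword is followed by a word boundary */
    return strncmp(line, word, len) == 0 &&
        (line[len] == ' ' || line[len] == '\t' ||
         line[len] == '\n' || line[len] == '\0');
}

/* first_word_is(line, "File") is false for a line starting "FiletMignon". */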

>> + pg_log_error("invalid record found in \"%s\"", manifest_path);
>>
>> Error message needs work.

Looks better now, but you have a message that says "invalid checksums
type \"%s\" found in \"%s\"". This is wrong because checksums would
need to be singular in this context (checksum). Also, I think it could
be better phrased as "manifest file \"%s\" specifies unknown checksum
algorithm \"%s\" at line %d".

>> Your function names should be consistent with the surrounding style,
>> and with each other, as far as possible. Three different conventions
>> within the same patch and source file seems over the top.

This appears to be fixed.

>> Also keep in mind that you're not writing code in a vacuum. There's a
>> whole file of code here, and around that, a whole project.
>> scan_data_directory() is a good example of a function whose name is
>> clearly too generic. It's not a general-purpose function for scanning
>> the data directory; it's specifically a support function for verifying
>> a backup. Yet, the name gives no hint of this.

But this appears not to be fixed.

>> if (strcmp(newpathsuffix, "/pg_wal") == 0 || strcmp(newpathsuffix,
>> "/backup_manifest") == 0)
>> continue;
>
> Thanks for the suggestion. Corrected as per the above inputs.

You need a comment here, like "Ignore the possible presence of a
backup_manifest file and/or a pg_wal directory in the backup being
verified." and then maybe another sentence explaining why that's the
right thing to do.
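(For example, with that comment attached; the second sentence is only an assumed
rationale, not taken from the patch:)

/*
 * Ignore the possible presence of a backup_manifest file and/or a pg_wal
 * directory in the backup being verified.  Neither is expected to have an
 * entry in the manifest, so checking them here would only raise spurious
 * complaints.  (assumed rationale)
 */
if (strcmp(newpathsuffix, "/pg_wal") == 0 ||
    strcmp(newpathsuffix, "/backup_manifest") == 0)
    continue;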

+ * The forth parameter to VerifyFile() will pass the relative path
+ * of file to match exactly with the filename present in manifest.

I don't know what this comment is trying to tell me, which might be
something you want to try to fix. However, I'm pretty sure it's
supposed to say "fourth" not "forth".

>> and the result would be that everything inside that long if-block is
>> now at the top level of the function and indented one level less. And
>> I think if you look at this function you'll see a way that you can
>> save a *second* level of indentation for much of that code. Please
>> check the rest of the patch for similar cases, too.
>
> Make sense. corrected.

I don't agree. A large chunk of VerifyFile() is still subject to a
quite unnecessary level of indentation.

> I have added a check for EOF, but not sure whether that would be right here.
> Do we need to check the length of buffer as well?

That's really, really not right. EOF is not a character that can
appear in the buffer. It's chosen on purpose to be a value that never
matches any actual character when both the character and the EOF value
are regarded as values of type 'int'. That guarantee doesn't apply
here though because you're dealing with values of type 'char'. So what
this code is doing is searching for an impossible value using
incorrect logic, which has very little to do with the actual need
here, which is to avoid running off the end of the buffer. To see what
the problem is, try creating a file with no terminating newline, like
this:

echo -n this file has no terminating newline >> some-file

I doubt it will be very hard to make this patch crash horribly. Even
if you can't, it seems pretty clear that the logic isn't right.

I don't really know what the \0 tests in NextLine() and NextWord()
think they're doing either. If there's a \0 in the buffer before you
add one, it was in the original input data, and pretending like that
marks a word or line boundary seems like a fairly arbitrary choice.

What I suggest is:

(1) Allocate one byte more than the file size for the buffer that's
going to hold the file, so that if you write a \0 just after the last
byte of the file, you don't overrun the allocated buffer.

(2) Compute char *endptr = buf + len.

(3) Pass endptr to NextLine and NextWord.
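(A minimal sketch of that scheme, assuming the whole file was read into buf with
length len and one spare byte was allocated as in point 1; the function name
matches the one under discussion, but the body is only an illustration:)

static char *
NextLine(char *p, char *endptr)
{
    while (p < endptr && *p != '\n')
        p++;
    if (p < endptr)
        *p++ = '\0';        /* terminate the line in place of the newline */
    else
        *p = '\0';          /* no trailing newline: use the spare byte */
    return p;               /* equals endptr once the input is exhausted */
}

/*
 * Usage sketch:
 *     char *endptr = buf + len;
 *     for (char *line = buf; line < endptr; )
 *     {
 *         char *next = NextLine(line, endptr);
 *         ...process line...
 *         line = next;
 *     }
 * NextWord() would take the same (pointer, endptr) pair and stop at spaces
 * or tabs instead of newlines, so neither function can run off the buffer.
 */
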

Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-30 Thread Robert Haas
On Mon, Dec 30, 2019 at 10:57 AM Anastasia Lubennikova
 wrote:
> In this design "maybe" category reflects the need for an extra recheck.
>
> For example, float and numeric types are "never bitwise equal", while array,
> text, and other container types are "maybe bitwise equal". An array of
> integers
> or text with C collation can be treated as bitwise equal attributes, and it
> would be too harsh to restrict them from deduplication.
>
> What bothers me is that this option is unlikely to be helpful on its own
> and we
> should also provide some kind of recheck function along with opclass, which
> complicates this idea even further and doesn't seem very clear.

It seems like the simplest thing might be to forget about the 'char'
column and just have a support function which can be used to assess
whether a given opclass's notion of equality is bitwise. If equality
is always bitwise, the function can always return true. If it's
sometimes bitwise, it can return true or false as appropriate. If it's
never bitwise, then it can either always return false or the support
function can be omitted altogether (so that the safe value is the
default).

I don't think you're going to save very much by avoiding an indirect
function call in the "always" case. It doesn't really seem worth the
complexity of making that a special case.
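(A minimal sketch of what such a support function could look like, written as an
ordinary V1 C function; the function name and the convention of passing the
opclass input type are assumptions for illustration, not an existing API:)

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;                    /* only needed when built as a loadable module */

PG_FUNCTION_INFO_V1(int4_equality_is_bitwise);

Datum
int4_equality_is_bitwise(PG_FUNCTION_ARGS)
{
    Oid     opcintype = PG_GETARG_OID(0);

    (void) opcintype;               /* integer equality is always bit-by-bit */
    PG_RETURN_BOOL(true);
}

/*
 * An opclass whose equality is only sometimes bitwise (e.g. text, depending
 * on collation) would inspect its situation and return false in the unsafe
 * cases; omitting the function altogether would mean "assume not bitwise".
 */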

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: 12's AND CHAIN doesn't chain when transaction raised an error

2019-12-30 Thread Alvaro Herrera
On 2019-Aug-13, Philip Dubé wrote:

> The easiest way to see this is to BEGIN READ ONLY & then attempt an
> insert. Execute either of COMMIT AND CHAIN or ROLLBACK AND CHAIN &
> attempt the insert a second time
> 
> This seems incorrect. The documentation should at least point out this
> behavior if it's intended

What do you mean with "doesn't chain"?

A simple experiment shows that "ROLLBACK AND CHAIN" in an aborted
transaction does indeed start a new transaction; so the "chain" part is
working to some extent.  It is also true that if the original
transaction was READ ONLY, then the followup transaction after an error
is not READ ONLY; but if the first transaction is successful and you do
COMMIT AND CHAIN, then the second transaction *is* READ ONLY.
So there is some discrepancy here.

<commit statement> (17.7 in SQL:2016) General Rule 10) a) says
  If  contains AND CHAIN, then an SQL-transaction is
  initiated. Any branch transactions of the SQL-transaction are
  initiated with the same transaction access mode, transaction isolation
  level, and condition area limit as the corresponding branch of the
  SQL-transaction just terminated.

... which is exactly the same wording used in 17.8 <rollback statement>,
General Rule 2) h) i).

(4.41.3 defines "An SQL-transaction has a transaction access mode that
is either read-only or read-write.")

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: proposal: schema variables

2019-12-30 Thread Philippe BEAUDOIN
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, failed

Hi Pavel,

I have tested the latest version of your patch.
Both issues I reported are now fixed, and you largely applied my proposals.
That's great!

I have also spent some time reviewing the documentation more closely. I will
send you a direct mail with an attached file containing some minor comments on
this topic.

Apart from these documentation remarks still to come, I don't have any other
issue or suggestion to report.
Note that I have not looked closely at the C code itself, but maybe some other
reviewers have already done that job.
If so, my feeling is that the patch could soon be set to "Ready for committer".

Best regards. Philippe.

The new status of this patch is: Waiting on Author


Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-30 Thread Anastasia Lubennikova

29.12.2019 2:56, Robert Haas wrote:

On Tue, Dec 24, 2019 at 7:29 AM Anastasia Lubennikova
 wrote:

We can make this 'opcisbitwise' parameter enum (or char) instead of
boolean to mark
"always bitwise", "never bitwise" and "maybe bitwise". Though, I doubt
if it will be helpful in any real use case.

What would be the difference between "never bitwise" and "maybe
bitwise" in that scheme?


In this design, the "maybe" category reflects the need for an extra recheck.

For example, float and numeric types are "never bitwise equal", while array,
text, and other container types are "maybe bitwise equal". An array of integers
or text with C collation can be treated as bitwise equal attributes, and it
would be too harsh to restrict them from deduplication.

What bothers me is that this option is unlikely to be helpful on its own, and we
should also provide some kind of recheck function along with the opclass, which
complicates this idea even further and doesn't seem very clear.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: comment regarding double timestamps; and, infinite timestamps and NaN

2019-12-30 Thread Justin Pryzby
On Mon, Dec 30, 2019 at 09:05:24AM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > That seems to be only used for ineq_histogram_selectivity() interpolation of
> > histogram bins.  It looks to me that at least isn't working for "special
> > values", and needs to use something other than isnan().
> 
> Uh, what?  This seems completely wrong to me.  We could possibly
> promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
> I don't really see the point.  They'll compare to other timestamp
> values correctly without that, cf timestamp_cmp_internal().
> The example you give seems to me to be working sanely, or at least
> as sanely as it can given the number of histogram points available,
> with the existing code.  In any case, shoving NaNs into the
> computation is not going to make anything better.

As I see it, the problem is that the existing code tests for isnan(), but
infinite timestamps are PG_INT64_MIN/MAX (here, stored in a double), so there
are absurdly large finite values being used as if they satisfied isnormal().

src/include/datatype/timestamp.h:#define DT_NOBEGIN PG_INT64_MIN
src/include/datatype/timestamp.h-#define DT_NOEND   PG_INT64_MAX

On v12, my test gives:
|DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
|INSERT INTO t VALUES('-infinity');
|ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
|explain analyze SELECT * FROM t WHERE t>='2010-12-29';
| Seq Scan on t  (cost=0.00..5.62 rows=3 width=8) (actual time=0.012..0.042 rows=289 loops=1)

vs patched master:
|DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
|INSERT INTO t VALUES('-infinity');
|ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
|explain analyze SELECT * FROM t WHERE t>='2010-12-29';
| Seq Scan on t  (cost=0.00..5.62 rows=146 width=8) (actual time=0.048..0.444 rows=289 loops=1)

IMO 146 rows is a reasonable estimate given a single histogram bucket of
infinite width, and 3 rows is a less reasonable result of returning INT64_MAX
in one place and then handling it as a normal value.  The comments in
ineq_histogram seem to indicate that this case is intended to get binfrac=0.5:

|  Watch out for the possibility that we got a NaN or Infinity from the
|  division.  This can happen despite the previous checks, if for example
|  "low" is -Infinity.

I changed it to use INFINITY, -INFINITY and !isnormal() rather than nan() and
isnan() (although binfrac is actually NAN at that point, so the existing test
is OK).
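(Roughly the shape of the mapping being described, as a standalone illustration
rather than the actual selfuncs.c hunk:)

#include <math.h>
#include <stdint.h>

#define DT_NOBEGIN  INT64_MIN       /* as in the timestamp.h lines quoted above */
#define DT_NOEND    INT64_MAX

static double
timestamp_to_scalar_sketch(int64_t t)
{
    if (t == DT_NOBEGIN)
        return -INFINITY;           /* let the infinity checks see a real infinity */
    if (t == DT_NOEND)
        return INFINITY;
    return (double) t;              /* ordinary timestamps just get promoted */
}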

Justin




Re: TAP testing for psql's tab completion code

2019-12-30 Thread Fabien COELHO



Hello Tom,


I do not think it is a good idea, because help output is quite large,
there are many of them, and we should certainly not want it stored
repeatedly in output files for diffs.


Hm, I don't follow --- we are most certainly not going to exercise
\help for every possible SQL keyword, that'd just be silly.


I am silly.

The price is pretty low: it helps with coverage in sql_help.c, and it checks
that the help entries return adequate results, so that adding new help content
does not hide existing stuff. I do not see why we should not do it in TAP
tests.


The alternative is that the project tolerates substandard test coverage. 
The "psql" command is currently around 40-44%.



Having said that, the fact that \help now includes a version-dependent
URL in its output is probably enough to break the idea of testing it
with a conventional expected-output test, so maybe TAP is the only
way for that.


The URL is a good thing, though.

--
Fabien.




Re: Disallow cancellation of waiting for synchronous replication

2019-12-30 Thread Bruce Momjian
On Sat, Dec 28, 2019 at 04:55:55PM -0500, Robert Haas wrote:
> On Fri, Dec 20, 2019 at 12:04 AM Andrey Borodin  wrote:
> > Currently, we can have split brain with the combination of following steps:
> > 0. Setup cluster with synchronous replication. Isolate primary from 
> > standbys.
> > 1. Issue upsert query INSERT .. ON CONFLICT DO NOTHING
> > 2. CANCEL 1 during wait for synchronous replication
> > 3. Retry 1. Idempotent query will succeed and client have confirmation of 
> > written data, while it is not present on any standby.
> 
> It seems to me that in order for synchronous replication to work
> reliably, you've got to be very careful about any situation where a
> commit might or might not have completed, and this is one such
> situation. When the client sends the query cancel, it does not and
> cannot know whether the INSERT statement has not yet completed, has
> already completed but not yet replicated, or has completed and
> replicated but not yet sent back a response. However, the server's
> response will be different in each of those cases, because in the
> second case, there will be a WARNING about synchronous replication
> having been interrupted. If the client ignores that WARNING, there are
> going to be problems.

This gets to the heart of something I was hoping to discuss.  When is
something committed?  You would think it is when the client receives the
commit message, but Postgres can commit something and try to inform the
client but fail to do so, perhaps due to network problems.  In Robert's
case above, we send a "success", but it is only a success on the primary
and not on the synchronous standby.

In the first case I mentioned, we commit without guaranteeing the client
knows, but in the second case, we tell the client success with a warning
that the synchronous standby didn't get the commit.  Are clients even
checking warning messages?  You see it in psql, but what about
applications that use Postgres?  Do they even check for warnings?
Should administrators be informed via email or some command when this
happens?
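(As an illustration of what "checking warnings" could look like on the client
side, a libpq notice receiver that at least surfaces them; connection string,
table name and error handling are placeholders:)

#include <stdio.h>
#include <libpq-fe.h>

static void
notice_receiver(void *arg, const PGresult *res)
{
    const char *severity = PQresultErrorField(res, PG_DIAG_SEVERITY);
    const char *message  = PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY);

    (void) arg;
    fprintf(stderr, "server says: %s: %s\n",
            severity ? severity : "?", message ? message : "?");
    /* a cautious client might refuse to treat the commit as durable here */
}

int
main(void)
{
    PGconn *conn = PQconnectdb("");     /* rely on PG* environment variables */

    PQsetNoticeReceiver(conn, notice_receiver, NULL);
    PQclear(PQexec(conn, "INSERT INTO t VALUES (1) ON CONFLICT DO NOTHING"));
    PQfinish(conn);
    return 0;
}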

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: TAP testing for psql's tab completion code

2019-12-30 Thread Tom Lane
Fabien COELHO  writes:
>> Well, I think that where possible we ought to test using the existing 
>> test infrastructure -- help, for example, seems like it could perfectly 
>> well be tested in src/test/regress/sql/psql.sql, or we could move stuff 
>> out to a new set of SQL test scripts under src/bin/psql/sql/,

> I do not think it is a good idea, because help output is quite large, 
> there are many of them, and we should certainly not want it stored 
> repeatedly in output files for diffs.

Hm, I don't follow --- we are most certainly not going to exercise
\help for every possible SQL keyword, that'd just be silly.

Having said that, the fact that \help now includes a version-dependent
URL in its output is probably enough to break the idea of testing it
with a conventional expected-output test, so maybe TAP is the only
way for that.

regards, tom lane




Re: comment regarding double timestamps; and, infinite timestamps and NaN

2019-12-30 Thread Tom Lane
Justin Pryzby  writes:
> selfuncs.c convert_to_scalar() says:
> |* The several datatypes representing absolute times are all converted
> |* to Timestamp, which is actually a double, and then we just use that
> |* double value.

> So I propose it should say something like:

> |* The several datatypes representing absolute times are all converted
> |* to Timestamp, which is actually an int64, and then we just promote that
> |* to double.

Check, obviously this comment never got updated.

> That seems to be only used for ineq_histogram_selectivity() interpolation of
> histogram bins.  It looks to me that at least isn't working for "special
> values", and needs to use something other than isnan().

Uh, what?  This seems completely wrong to me.  We could possibly
promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
I don't really see the point.  They'll compare to other timestamp
values correctly without that, cf timestamp_cmp_internal().
The example you give seems to me to be working sanely, or at least
as sanely as it can given the number of histogram points available,
with the existing code.  In any case, shoving NaNs into the
computation is not going to make anything better.

regards, tom lane




Re: Windows v readline

2019-12-30 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-09-29 22:55, Andrew Dunstan wrote:
>> It would certainly be nice to have readline-enabled psql on Windows if
>> possible.

> I tried this out.  First, it doesn't build, because readline doesn't do 
> the dllimport/dllexport dance on global variables, so all references to 
> rl_* global variables in tab-complete.c fail (similar to [0]).  After 
> patching those out, it builds, but it doesn't work.

What do you mean by "patching those out" --- you removed all of
tab-complete's external variable assignments?  That would certainly
disable tab completion, at a minimum.

> It doesn't print a 
> prompt, keys don't do anything sensible.  I can enter SQL commands and 
> get results back, but the readline part doesn't do anything sensible AFAICT.

I wonder if readline was confused about the terminal type, or if it
decided that the input file was not-a-tty for some reason.

regards, tom lane




Re: [HACKERS] Block level parallel vacuum

2019-12-30 Thread Tomas Vondra

On Mon, Dec 30, 2019 at 08:25:28AM +0530, Amit Kapila wrote:

On Mon, Dec 30, 2019 at 2:53 AM Tomas Vondra
 wrote:


On Sun, Dec 29, 2019 at 10:06:23PM +0900, Masahiko Sawada wrote:
>> v40-0001-Introduce-IndexAM-fields-for-parallel-vacuum.patch
>> ---
>>
>> I wonder if 'amparallelvacuumoptions' is unnecessarily specific. Maybe
>> it should be called just 'amvacuumoptions' or something like that? The
>> 'parallel' part is actually encoded in names of the options.
>>
>
>amvacuumoptions seems good to me.
>
>> Also, why do we need a separate amusemaintenanceworkmem option? Why
>> don't we simply track it using a separate flag in 'amvacuumoptions'
>> (or whatever we end up calling it)?
>>
>
>It also seems like a good idea.
>

I think there's another question we need to ask - why to we introduce a
bitmask, instead of using regular boolean struct members? Until now, the
IndexAmRoutine struct had simple boolean members describing capabilities
of the AM implementation. Why shouldn't this patch do the same thing,
i.e. add one boolean flag for each AM feature?



This structure member mostly describes one property of the index, which is
about parallel vacuum, and I am not sure that is true for other members.
Now, we can use separate bool variables for it which we were initially
using in the patch but that seems to be taking more space in a
structure without any advantage.  Also, using one variable makes the
code a bit better because otherwise, in many places, we need to check and
set four variables instead of one.  This is also the reason we used
parallel in its name (we also use *parallel* for parallel index scan
related things).  Having said that, we can remove parallel from its
name if we want to extend/use it for something other than a parallel
vacuum.  I think we might need to add a flag or two for parallelizing
heap scan of vacuum when we enhance this feature, so keeping it for
just a parallel vacuum is not completely insane.

I think keeping amusemaintenanceworkmem separate from this variable
seems to me like a better idea as it doesn't describe whether IndexAM
can participate in a parallel vacuum or not.  You can see more
discussion about that variable in the thread [1].



I don't know, but IMHO it's somewhat easier to work with separate flags.
Bitmasks make sense when space usage matters a lot, e.g. for on-disk
representation, but that doesn't seem to be the case here I think (if it
was, we'd probably use bitmasks already).

It seems like we're mixing two ways to design the struct unnecessarily,
but I'm not going to nag about this any further.
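(For comparison, the two layouts being discussed, with made-up flag and member
names; neither is the patch's actual code:)

#include <stdbool.h>
#include <stdint.h>

/* bitmask style: one member, several flag bits */
#define VACUUM_OPTION_PARALLEL_BULKDEL  (1 << 0)
#define VACUUM_OPTION_PARALLEL_CLEANUP  (1 << 1)

typedef struct AmCapsBitmask
{
    uint8_t amvacuumoptions;        /* OR of the flags above */
} AmCapsBitmask;

/* boolean-per-capability style, as used for the other IndexAmRoutine fields */
typedef struct AmCapsBooleans
{
    bool    amcanparallelbulkdel;
    bool    amcanparallelcleanup;
} AmCapsBooleans;

/*
 * A caller then writes either
 *     (am->amvacuumoptions & VACUUM_OPTION_PARALLEL_BULKDEL) != 0
 * or simply
 *     am->amcanparallelbulkdel
 */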


>>
>>
>> v40-0004-Add-ability-to-disable-leader-participation-in-p.patch
>> ---
>>
>> IMHO this should be simply merged into 0002.
>
>We discussed it's still unclear whether we really want to commit this
>code and therefore it's separated from the main part. Please see more
>details here[2].
>

IMO there's not much reason for the leader not to participate.



The only reason for this is just a debugging/testing aid because
during the development of other parallel features we required such a
knob.  The other way could be to have something similar to
force_parallel_mode and there is some discussion about that as well on
this thread but we haven't concluded which is better.  So, we decided
to keep it as a separate patch which we can use to test this feature
during development and decide later whether we really need to commit
it.  BTW, we have found a few bugs by using this knob in the patch.



OK, understood. Then why not just use force_parallel_mode?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Block level parallel vacuum

2019-12-30 Thread Tomas Vondra

On Mon, Dec 30, 2019 at 10:40:39AM +0530, Amit Kapila wrote:

On Mon, Dec 30, 2019 at 2:53 AM Tomas Vondra
 wrote:


On Sun, Dec 29, 2019 at 10:06:23PM +0900, Masahiko Sawada wrote:
>>
>> v40-0003-Add-FAST-option-to-vacuum-command.patch
>> 
>>
>> I do have a bit of an issue with this part - I'm not quite convinved we
>> actually need a FAST option, and I actually suspect we'll come to regret
>> it sooner than later. AFAIK it pretty much does exactly the same thing
>> as setting vacuum_cost_delay to 0, and IMO it's confusing to provide
>> multiple ways to do the same thing - I do expect reports from confused
>> users on pgsql-bugs etc. Why is setting vacuum_cost_delay directly not a
>> sufficient solution?
>
>I think the motivation of this option is similar to FREEZE. I think
>it's sometimes a good idea to have a shortcut of popular usage and
>make it have an name corresponding to its job. From that perspective I
>think having FAST option would make sense but maybe we need more
>discussion the combination parallel vacuum and vacuum delay.
>

OK. I think it's mostly independent piece, so maybe we should move it to
a separate patch. It's more likely to get attention/feedback when not
buried in this thread.



+1.  It is already a separate patch and I think we can even discuss
more on it in a new thread once the main patch is committed or do you
think we should have a conclusion about it now itself?  To me, this
option appears to be an extension to the main feature which can be
useful for some users and people might like to have a separate option,
so we can discuss it and get broader feedback after the main patch is
committed.



I don't think it's an extension of the main feature - it does not depend
on it, it could be committed before or after the parallel vacuum (with
some conflicts, but the feature itself is not affected).

My point was that by moving it into a separate thread we're more likely
to get feedback on it, e.g. from people who don't feel like reviewing
the parallel vacuum feature and/or feel intimidated by the 100+ messages in
this thread.


>>
>> The same thing applies to the PARALLEL flag, added in 0002, BTW. Why do
>> we need a separate VACUUM option, instead of just using the existing
>> max_parallel_maintenance_workers GUC?
>>


How will user specify parallel degree?  The parallel degree is helpful
because in some cases users can decide how many workers should be
launched based on size and type of indexes.



By setting max_parallel_maintenance_workers.


>> It's good enough for CREATE INDEX
>> so why not here?
>


That is a different feature and I think here users can make a better
judgment based on the size of indexes.  Moreover, users have an option
to control a parallel degree for 'Create Index' via Alter Table
 Set (parallel_workers = ) which I am not sure is a good
idea for parallel vacuum as the parallelism is more derived from size
and type of indexes.  Now, we can think of a similar parameter at the
table/index level for parallel vacuum, but I don't see it equally
useful in this case.



I'm a bit skeptical about users being able to pick good parallel degree.
If we (i.e. experienced developers/hackers with quite a bit of
knowledge) can't come up with a reasonable heuristics, how likely is it
that a regular user will come up with something better?

Not sure I understand why "parallel_workers" would not be suitable for
parallel vacuum? I mean, even for CREATE INDEX it certainly matters the
size/type of indexes, no?

I may be wrong in both cases, of course.


>AFAIR There was no such discussion so far but I think one reason could
>be that parallel vacuum should be disabled by default. If the parallel
>vacuum uses max_parallel_maintenance_workers (2 by default) rather
>than having the option the parallel vacuum would work with default
>setting but I think that it would become a big impact for user because
>the disk access could become random reads and writes when some indexes
>are on the same tablespace.
>

I'm not quite convinced VACUUM should have parallelism disabled by
default. I know some people argued we should do that because making
vacuum faster may put pressure on other parts of the system. Which is
true, but I don't think the solution is to make vacuum slower by
default. IMHO we should do the opposite - have it parallel by default
(as driven by max_parallel_maintenance_workers), and have an option
to disable parallelism.



I think driving parallelism for vacuum by
max_parallel_maintenance_workers might not be sufficient.  We need to
give finer control as it depends a lot on the size of indexes. Also,
unlike Create Index, Vacuum can be performed on an entire database and
it is quite possible that some tables/indexes are relatively smaller
and forcing parallelism on them by default might slow down the
operation.



Why wouldn't it be sufficient? Why couldn't this use similar logic to
what we have in plan_create_index_workers for CREATE INDEX?

Re: [PATCH] fix a performance issue with multiple logical-decoding walsenders

2019-12-30 Thread Pierre Ducroquet
On Sunday, December 29, 2019 1:32:31 PM CET Julien Rouhaud wrote:
> On Sat, Dec 28, 2019 at 1:56 PM Pierre Ducroquet  
wrote:
> > Thank you for your comments.
> > Attached to this email is a patch with better comments regarding the
> > XLogSendLogical change.
> 
> Arguably the first test to compare to InvalidXLogRecPtr is unneeded,
> as any value of EndRecPtr is greater than or equal to that value.  It
> will only save at best 1 GetFlushRecPtr() per walsender process
> lifetime, so I'm not sure it's worth arguing about it.  Other than
> that I still think that it's a straightforward optimization that
> brings nice speedup, and I don't see any problem with this patch.  I
> think that given the time of the year you should create a commitfest
> entry for this patch to make sure it won't be forgotten (and obviously
> I'll mark it as RFC, unless someone objects by then).
> 
> > We've spent quite some time yesterday benching it again, this time with
> > changes that must be fully processed by the decoder. The speed-up is
> > obviously much smaller, we are only ~5% faster than without the patch.
> 
> I'm assuming that it's benchmarking done with multiple logical slots?
> Anyway, a 5% speedup in the case that this patch is not aimed to
> optimize is quite nice!


I've created a commitfest entry for this patch.
https://commitfest.postgresql.org/26/2403/
I would like to know if it would be acceptable to backport this to PostgreSQL 
12. I have to write a clean benchmark for that (our previous benchmarks are either
PG10 or PG12 specific), but the change from Thomas Munro that removed the
calls to PostmasterIsAlive is very likely to have the same side effect we
observed in PG10 when patching IsAlive, i.e. moving the pressure from the pipe
reads to the PostgreSQL locks between processes, and this made the whole 
process slower: the benchmark showed a serious regression, going from 3m45s to 
5m15s to decode the test transactions.
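(The core of the optimization, sketched with an assumed helper name; the real
change lives in XLogSendLogical, but the idea is just to reuse the cached flush
position until decoding catches up to it:)

#include "postgres.h"
#include "access/xlog.h"
#include "access/xlogdefs.h"

static XLogRecPtr cached_flush_ptr = InvalidXLogRecPtr;

static XLogRecPtr
GetFlushRecPtrCached(XLogRecPtr end_rec_ptr)
{
    /*
     * Any valid EndRecPtr is >= InvalidXLogRecPtr, so the first call always
     * refreshes the cache; afterwards GetFlushRecPtr() is only called once
     * decoding has reached the previously seen flush position.
     */
    if (end_rec_ptr >= cached_flush_ptr)
        cached_flush_ptr = GetFlushRecPtr();
    return cached_flush_ptr;
}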

Regards

 Pierre






Re: Recognizing superuser in pg_hba.conf

2019-12-30 Thread Vik Fearing
On 29/12/2019 23:10, Vik Fearing wrote:
> On 29/12/2019 17:31, Tom Lane wrote:
>> Robert Haas  writes:
>>> On Sat, Dec 28, 2019 at 2:02 PM Vik Fearing  
>>> wrote:
 I'm all for this (and even suggested it during the IRC conversation that
 prompted this patch). It's rife with bikeshedding, though.  My original
 proposal was to use '&' and Andrew Gierth would have used ':'.
>>> I think this is a good proposal regardless of which character we
>>> decide to use. My order of preference from highest-to-lowest would
>>> probably be :*&, but maybe that's just because I'm reading this on
>>> Sunday rather than on Tuesday.
>> I don't have any particular objection to '&' if people prefer that.
>
> I wrote the patch so I got to decide. :-)  I will also volunteer to do
> the grunt work of changing the symbol if consensus wants that, though.
>
>
> It turns out that my original patch didn't really change, all the meat
> is in the keywords patch.  The superuser patch is to be applied on top
> of the keywords patch.
>

I missed a few places in the tap tests.  New keywords patch attached,
superuser patch unchanged.

-- 

Vik Fearing

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 55694c4368..add25b2b43 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8953,8 +8953,8 @@ SCRAM-SHA-256$:&l
  text
  
   Host name or IP address, or one
-  of all, samehost,
-  or samenet, or null for local connections
+  of &all, &samehost,
+  or &samenet, or null for local connections
  
 
 
diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 5f1eec78fb..96fc4b01e9 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -81,11 +81,26 @@
A record is made
up of a number of fields which are separated by spaces and/or tabs.
Fields can contain white space if the field value is double-quoted.
-   Quoting one of the keywords in a database, user, or address field (e.g.,
-   all or replication) makes the word lose its special
-   meaning, and just match a database, user, or host with that name.
   
 
+  
+   
+As of version 13, keywords are preceded by the character "&".
+For compatibility, the following legacy keywords are still accepted
+on their own:
+all, replication,
+samegroup, samehost,
+samenet, samerole,
+sameuser.
+   
+
+   
+Quoting one of these legacy keywords in a database, user, or address
+field makes the word lose its special meaning, and just match a
+database, user, or host with that name.
+   
+  
+
   
Each record specifies a connection type, a client IP address range
(if relevant for the connection type), a database name, a user name,
@@ -221,18 +236,18 @@ hostnogssenc database  user
   
Specifies which database name(s) this record matches.  The value
-   all specifies that it matches all databases.
-   The value sameuser specifies that the record
+   &all specifies that it matches all databases.
+   The value &sameuser specifies that the record
matches if the requested database has the same name as the
-   requested user.  The value samerole specifies that
+   requested user.  The value &samerole specifies that
the requested user must be a member of the role with the same
-   name as the requested database.  (samegroup is an
-   obsolete but still accepted spelling of samerole.)
+   name as the requested database.  (&samegroup is an
+   obsolete but still accepted spelling of &samerole.)
Superusers are not considered to be members of a role for the
-   purposes of samerole unless they are explicitly
+   purposes of &samerole unless they are explicitly
members of the role, directly or indirectly, and not just by
virtue of being a superuser.
-   The value replication specifies that the record
+   The value &replication specifies that the record
matches if a physical replication connection is requested (note that
replication connections do not specify any particular database).
Otherwise, this is the name of
@@ -249,7 +264,7 @@ hostnogssenc database  user
   
Specifies which database user name(s) this record
-   matches. The value all specifies that it
+   matches. The value &all specifies that it
matches all users.  Otherwise, this is either the name of a specific
database user, or a group name preceded by +.
(Recall that there is no real distinction between users and groups
@@ -312,9 +327,9 @@ hostnogssenc database  user
 
   
-   You can also write all to match any IP address,
-   samehost to match any of the server's own IP
-   addresses, or samenet to match any address in any
+   You can also write &all to match any IP address,
+   &samehost to match any of the server's own IP
+   addresses, or &

Re: TAP testing for psql's tab completion code

2019-12-30 Thread Fabien COELHO



That is what my patch does: it tests prompts, tab completion, help,
command options… and I added tests till I covered most psql source.


Well, I think that where possible we ought to test using the existing 
test infrastructure -- help, for example, seems like it could perfectly 
well be tested in src/test/regress/sql/psql.sql, or we could move stuff 
out to a new set of SQL test scripts under src/bin/psql/sql/,


I do not think it is a good idea, because help output is quite large, 
there are many of them, and we should certainly not want it stored 
repeatedly in output files for diffs. I would rather trigger the output and only 
check for some related keywords, so that it fits TAP tests reasonably 
well.


--
Fabien.

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-12-30 Thread Amit Kapila
On Thu, Dec 26, 2019 at 12:36 PM Masahiko Sawada
 wrote:
>
> On Tue, 24 Dec 2019 at 17:21, Amit Kapila  wrote:
> >
> > >
> > > Thank you for explanation. The plan makes sense. But I think in the
> > > current design it's a problem that logical replication worker doesn't
> > > receive changes (and doesn't check interrupts) during applying
> > > committed changes even if we don't have a worker dedicated for
> > > applying. I think the worker should continue to receive changes and
> > > save them to temporary files even during applying changes.
> > >
> >
> > Won't it defeat the purpose of this feature, which is to reduce the apply
> > lag?  Basically, it can so happen that while applying commit, it
> > constantly gets changes of other transactions which will delay the
> > apply of the current transaction.
>
> You're right. But it seems to me that it optimizes the apply lag of
> only a transaction that made many changes. On the other hand, if a
> transaction made many changes, the applying of subsequent changes is
> delayed.
>

Hmm, how would it be worse than the current situation where once
commit is encountered on the publisher, we won't start with other
transactions until the replay of the same is finished on subscriber?

>
> >   I think the best way as
> > discussed is to launch new workers for streamed transactions, but we
> > can do that as an additional feature. Anyway, as proposed, users can
> > choose the streaming mode for subscriptions, so there is an option to
> > turn this selectively.
>
> Yes. But a user who wants to use this feature would want to replicate
> many changes, but I guess the side effect is quite big. I think that at
> least we need to make logical replication tolerate such a situation.
>

What exactly do you mean by "at least we need to make logical
replication tolerate such a situation"?  Do you have something specific
in mind?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-12-30 Thread Amit Kapila
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar  wrote:
>
> I have observed some more issues
>
> 1. Currently, In ReorderBufferCommit, it is always expected that
> whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in
> SPEC_CONFIRM we send the tuple we got in SPEC_INSERT.  But, now those
> two messages can be in different streams.  So we need to find a way to
> handle this.  Maybe once we get SPEC_INSERT then we can remember the
> tuple and then if we get the SPEC_CONFIRM in the next stream we can
> send that tuple?
>

Your suggestion makes sense to me.  So, we can try it.
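(A very rough sketch of that suggestion, using stand-in types rather than the
real reorderbuffer structures, just to show the stashing across stream
boundaries:)

#include <stdbool.h>
#include <stddef.h>

typedef struct StandInTuple { int dummy; } StandInTuple;   /* placeholder type */

static StandInTuple *pending_spec_insert = NULL;

static void
on_spec_insert(StandInTuple *tuple)
{
    pending_spec_insert = tuple;        /* remember it, do not send it yet */
}

static void
on_spec_confirm(void (*send_tuple)(StandInTuple *))
{
    if (pending_spec_insert != NULL)    /* may have arrived in an earlier stream */
    {
        send_tuple(pending_spec_insert);
        pending_spec_insert = NULL;
    }
}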

> 2. During commit time in DecodeCommit we check whether we need to skip
> the changes of the transaction or not by calling
> SnapBuildXactNeedsSkip but since now we support streaming so it's
> possible that before we decode the commit WAL, we might have already
> sent the changes to the output plugin even though we could have
> skipped those changes.  So my question is instead of checking at the
> commit time can't we check before adding to ReorderBuffer itself
>

I think if we can do that then the same will be true for current code
irrespective of this patch.  I think it is possible that we can't take
that decision while decoding because we haven't assembled a consistent
snapshot yet.  I think we might be able to do that while we try to
stream the changes.  I think we need to take care of all the
conditions during streaming (when the logical_decoding_workmem limit
is reached) as we do in DecodeCommit.  This needs a bit more study.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Windows v readline

2019-12-30 Thread Peter Eisentraut

On 2019-09-29 22:55, Andrew Dunstan wrote:


The configure code currently has this:


# readline on MinGW has problems with backslashes in psql and other bugs.
# This is particularly a problem with non-US code pages.
# Therefore disable its use until we understand the cause. 2004-07-20
if test "$PORTNAME" = "win32"; then
   if test "$with_readline" = yes; then
     AC_MSG_WARN([*** Readline does not work on MinGW --- disabling])
     with_readline=no
   fi
fi


2004 is a very long time ago. Has anyone looked at this more recently?
It would certainly be nice to have readline-enabled psql on Windows if
possible.


I tried this out.  First, it doesn't build, because readline doesn't do 
the dllimport/dllexport dance on global variables, so all references to 
rl_* global variables in tab-complete.c fail (similar to [0]).  After 
patching those out, it builds, but it doesn't work.  It doesn't print a 
prompt, keys don't do anything sensible.  I can enter SQL commands and 
get results back, but the readline part doesn't do anything sensible AFAICT.


Perhaps I did something wrong.  You can still use readline without 
global variables, but it seems like a serious restriction and makes me 
wonder whether this has actually ever been used before.  It's curious 
that MSYS2 ships a readline build for mingw.  Is there other software 
that uses it on Windows?



[0]: 
https://www.postgresql.org/message-id/001101c3eb6a$3b275500$f800a...@kuczek.pl


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services