Re: Minimal logical decoding on standbys

2020-01-16 Thread Andreas Joseph Krogh

On Thursday, 16 January 2020 at 05:42:24, Amit Khandekar <amitdkhan...@gmail.com> wrote:
On Fri, 10 Jan 2020 at 17:50, Rahila Syed  wrote:
 >
 > Hi Amit,
 >
 > Can you please rebase the patches as they don't apply on latest master?

 Thanks for notifying. Attached is the rebased version. 

Will this patch enable logical replication from a standby server?


--
 Andreas Joseph Krogh 


Re: Making psql error out on output failures

2020-01-16 Thread David Zhang

On 2020-01-16 5:20 a.m., Daniel Verite wrote:

David Zhang wrote:


If I change the error log message like below, where "%m" is used to pass the
value of strerror(errno), "could not write to output file:" is copied from
function "WRITE_ERROR_EXIT".
-   pg_log_error("Error printing tuples");
+   pg_log_error("could not write to output file: %m");
then the output message is something like below, which, I believe, is more
consistent with pg_dump.

The problem is that errno may not be reliable to tell us what was
the problem that leads to ferror(fout) being nonzero, since it isn't
saved at the point of the error and the output code may have called
many libc functions between the first occurrence of the output error
and when pg_log_error() is called.

The Linux manpage on errno warns specifically about this:

NOTES
A common mistake is to do

   if (somecall() == -1) {
   printf("somecall() failed\n");
   if (errno == ...) { ... }
   }

where errno no longer needs to have the value it had upon return from
somecall() (i.e., it may have been changed by the printf(3)).  If the
value of errno should be preserved across a library call, it must be
saved:
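(The manpage's corrected example, reconstructed here for completeness:
save errno into a local immediately after the failing call.)

   if (somecall() == -1) {
       int errsv = errno;
       printf("somecall() failed\n");
       if (errsv == ...) { ... }
   }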


This other bit from the POSIX spec [1] is relevant:

   "The value of errno shall be defined only after a call to a function
   for which it is explicitly stated to be set and until it is changed
   by the next function call or if the application assigns it a value."

To use errno in a way that complies with the above, the psql code
should be refactored. I don't know if having a more precise error
message justifies that refactoring. I've elaborated a bit about that
upthread with the initial submission.


Yes, I agree with you. For case 2 "select repeat('111', 100) \g
/mnt/ramdisk/file", it can easily be fixed with a more accurate error
message similar to pg_dump's; one example could be something like below.
But for case 1 "psql -d postgres -At -c "select repeat('111', 100)"
> /mnt/ramdisk/file", it may require a lot of refactoring work.


diff --git a/src/port/snprintf.c b/src/port/snprintf.c
index 8fd997553e..e6c239fd9f 100644
--- a/src/port/snprintf.c
+++ b/src/port/snprintf.c
@@ -309,8 +309,10 @@ flushbuffer(PrintfTarget *target)

    written = fwrite(target->bufstart, 1, nc, target->stream);
    target->nchars += written;
-   if (written != nc)
+   if (written != nc) {
    target->failed = true;
+   fprintf(stderr, "could not write to output file: %s\n", strerror(errno));
+   }


Besides, I'm not even
sure that errno is necessarily set on non-POSIX platforms when fputc
or fputs fails.

Verified: fputs does set errno, at least on Ubuntu Linux.

That's why this patch uses the safer approach to emit a generic
error message.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/errno.html


Best regards,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca






Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Dilip Kumar
On Fri, Jan 17, 2020 at 11:39 AM Dilip Kumar  wrote:
I have performed cost delay testing on the latest patch (I have used the
same scripts as attached in [1] and [2]).
vacuum_cost_delay = 10
vacuum_cost_limit = 2000

Observation: as we concluded earlier, each worker's delay time is in sync
with the I/O it performs, and the total delay (heap + index) is almost the
same as for the non-parallel operation.
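As a rough cross-check on the numbers below (my own arithmetic from the
reported figures, assuming the default page cost factors of 1/10/20 for
hit/miss/dirty): worker 1 in test1 accumulates 17890*1 + 0*10 + 1*20 =
17910 cost units, so the expected sleep is (17910 / vacuum_cost_limit) *
vacuum_cost_delay = (17910 / 2000) * 10 ms = 89.55 ms, closely matching
the reported delay of ~88 ms.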

test1:[1]

Vacuum non-parallel

WARNING:  VacuumCostTotalDelay=11332.32

Vacuum 2 workers
WARNING:  worker 0 delay=171.085000 total io=34288 hit=22208 miss=0 dirty=604
WARNING:  worker 1 delay=87.79 total io=17910 hit=17890 miss=0 dirty=1
WARNING:  worker 2 delay=88.62 total io=17910 hit=17890 miss=0 dirty=1

WARNING:  VacuumCostTotalDelay=11505.65

Vacuum 4 workers
WARNING:  worker 0 delay=87.75 total io=17910 hit=17890 miss=0 dirty=1
WARNING:  worker 1 delay=89.155000 total io=17910 hit=17890 miss=0 dirty=1
WARNING:  worker 2 delay=87.08 total io=17910 hit=17890 miss=0 dirty=1
WARNING:  worker 3 delay=78.745000 total io=16378 hit=4318 miss=0 dirty=603

WARNING:  VacuumCostTotalDelay=11590.68


test2:[2]

Vacuum non-parallel
WARNING:  VacuumCostTotalDelay=22835.97

Vacuum 2 workers
WARNING:  worker 0 delay=345.55 total io=69338 hit=45338 miss=0 dirty=1200
WARNING:  worker 1 delay=177.15 total io=35807 hit=35787 miss=0 dirty=1
WARNING:  worker 2 delay=178.105000 total io=35807 hit=35787 miss=0 dirty=1
WARNING:  VacuumCostTotalDelay=23191.405000


Vacuum 4 workers
WARNING:  worker 0 delay=177.265000 total io=35807 hit=35787 miss=0 dirty=1
WARNING:  worker 1 delay=177.175000 total io=35807 hit=35787 miss=0 dirty=1
WARNING:  worker 2 delay=177.385000 total io=35807 hit=35787 miss=0 dirty=1
WARNING:  worker 3 delay=166.515000 total io=33531 hit=9551 miss=0 dirty=1199
WARNING:  VacuumCostTotalDelay=23357.115000



[1] 
https://www.postgresql.org/message-id/CAFiTN-tFLN%3Dvdu5Ra-23E9_7Z1JXkk5MkRY3Bkj2zAoWK7fULA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFiTN-tC%3DNcvcEd%2B5J62fR8-D8x7EHuVi2xhS-0DMf1bnJs4hw%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add pg_file_sync() to adminpack

2020-01-16 Thread Fujii Masao




On 2020/01/13 22:46, Michael Paquier wrote:

On Sat, Jan 11, 2020 at 02:12:15AM +0900, Fujii Masao wrote:

I'm not sure if returning false with a WARNING only in some error cases
is really a good idea or not. At least for me, it's more intuitive to
return true on success and emit an ERROR otherwise. I'd like to hear
more opinions about this. Also if returning true on success is rather
confusing, we can change its return type to void.


An advantage of not issuing an ERROR is that when working on a list of
files (for example a WITH RECURSIVE on the whole data directory?), you
can then know which files could not be synced, instead of seeing one
ERROR about one file while being unsure about the state of the
others.


Could you elaborate why? But if it's not good to sync an existing directory
in the regression test, we may need to give up testing the sync of a
directory. Another idea is to add a function like pg_mkdir() to adminpack
and, for the test, use a directory newly created by that function.
Or is there a better idea?


We should avoid potentially costly tests in any regression scenario if
we have a way to do so.  I like your idea of having a pg_mkdir(); that
feels more natural given that there is already pg_file_write().


BTW, in the latest patch that I posted upthread, I changed
the directory to sync for the test from "global" to "pg_stat",
because pg_stat is empty while the server is running,
and syncing it would not be so costly.
Introducing pg_mkdir() (maybe pg_rmdir() would also be necessary)
is an idea, but it's better to do that as a separate patch
if it's really necessary.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Add pg_file_sync() to adminpack

2020-01-16 Thread Fujii Masao



On 2020/01/17 13:36, Michael Paquier wrote:

On Thu, Jan 16, 2020 at 09:51:24AM -0500, Robert Haas wrote:

On Tue, Jan 14, 2020 at 10:08 AM Atsushi Torikoshi  wrote:

fails we can get error messages. So it seems straightforward for me to
  'return true on success and emit an ERROR otherwise'.


I agree with reporting errors using ERROR, but it seems to me that we
ought to then make the function return 'void' rather than 'bool'.


Yeah, that should be either ERROR and return a void result, or issue a
WARNING/ERROR (depending on the code path, maybe PANIC?) with a
boolean status returned if there is a WARNING.  Mixing both concepts,
with an ERROR all the time and always a true status, is just weird,
because you know that if no errors are raised then the status will
always be true.  So there is no point in having a boolean status to
begin with.


OK, so our consensus is to return void on success and throw an error
otherwise. Attached is the updated version of the patch.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
diff --git a/contrib/adminpack/Makefile b/contrib/adminpack/Makefile
index 9a464b11bc..630fea7726 100644
--- a/contrib/adminpack/Makefile
+++ b/contrib/adminpack/Makefile
@@ -7,7 +7,8 @@ OBJS = \
 PG_CPPFLAGS = -I$(libpq_srcdir)
 
 EXTENSION = adminpack
-DATA = adminpack--1.0.sql adminpack--1.0--1.1.sql adminpack--1.1--2.0.sql
+DATA = adminpack--1.0.sql adminpack--1.0--1.1.sql adminpack--1.1--2.0.sql\
+   adminpack--2.0--2.1.sql
 PGFILEDESC = "adminpack - support functions for pgAdmin"
 
 REGRESS = adminpack
diff --git a/contrib/adminpack/adminpack--2.0--2.1.sql b/contrib/adminpack/adminpack--2.0--2.1.sql
new file mode 100644
index 00..1c6712e816
--- /dev/null
+++ b/contrib/adminpack/adminpack--2.0--2.1.sql
@@ -0,0 +1,17 @@
+/* contrib/adminpack/adminpack--2.0--2.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION adminpack UPDATE TO '2.1'" to load this file. \quit
+
+/* ***
+ * Administrative functions for PostgreSQL
+ * *** */
+
+/* generic file access functions */
+
+CREATE OR REPLACE FUNCTION pg_catalog.pg_file_sync(text)
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_file_sync'
+LANGUAGE C VOLATILE STRICT;
+
+REVOKE EXECUTE ON FUNCTION pg_catalog.pg_file_sync(text) FROM PUBLIC;
diff --git a/contrib/adminpack/adminpack.c b/contrib/adminpack/adminpack.c
index 710f4ea32d..7b5a531e08 100644
--- a/contrib/adminpack/adminpack.c
+++ b/contrib/adminpack/adminpack.c
@@ -43,6 +43,7 @@ PG_MODULE_MAGIC;
 
 PG_FUNCTION_INFO_V1(pg_file_write);
 PG_FUNCTION_INFO_V1(pg_file_write_v1_1);
+PG_FUNCTION_INFO_V1(pg_file_sync);
 PG_FUNCTION_INFO_V1(pg_file_rename);
 PG_FUNCTION_INFO_V1(pg_file_rename_v1_1);
 PG_FUNCTION_INFO_V1(pg_file_unlink);
@@ -215,6 +216,30 @@ pg_file_write_internal(text *file, text *data, bool replace)
return (count);
 }
 
+/* 
+ * pg_file_sync
+ *
+ * We REVOKE EXECUTE on the function from PUBLIC.
+ * Users can then grant access to it based on their policies.
+ */
+Datum
+pg_file_sync(PG_FUNCTION_ARGS)
+{
+   char   *filename;
+   struct stat fst;
+
+   filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0), false);
+
+   if (stat(filename, &fst) < 0)
+   ereport(ERROR,
+   (errcode_for_file_access(),
+   errmsg("could not stat file \"%s\": %m", filename)));
+
+   fsync_fname_ext(filename, S_ISDIR(fst.st_mode), false, ERROR);
+
+   PG_RETURN_VOID();
+}
+
 /* 
  * pg_file_rename - old version
  *
diff --git a/contrib/adminpack/adminpack.control b/contrib/adminpack/adminpack.control
index 12569dcdd7..ae35d22156 100644
--- a/contrib/adminpack/adminpack.control
+++ b/contrib/adminpack/adminpack.control
@@ -1,6 +1,6 @@
 # adminpack extension
 comment = 'administrative functions for PostgreSQL'
-default_version = '2.0'
+default_version = '2.1'
 module_pathname = '$libdir/adminpack'
 relocatable = false
 schema = pg_catalog
diff --git a/contrib/adminpack/expected/adminpack.out b/contrib/adminpack/expected/adminpack.out
index 8747ac69a2..5738b0f6c4 100644
--- a/contrib/adminpack/expected/adminpack.out
+++ b/contrib/adminpack/expected/adminpack.out
@@ -56,6 +56,21 @@ RESET ROLE;
 REVOKE EXECUTE ON FUNCTION pg_file_write(text,text,bool) FROM regress_user1;
 REVOKE pg_read_all_settings FROM regress_user1;
 DROP ROLE regress_user1;
+-- sync
+SELECT pg_file_sync('test_file1'); -- sync file
+ pg_file_sync 
+--------------
+ 
+(1 row)
+
+SELECT pg_file_sync('pg_stat'); -- sync directory
+ pg_file_sync 
+--------------
+ 
+(1 row)
+
+SELECT pg_file_sync('test_file2'); -- not there
+ERROR:  could not stat file "test_file2": No such file or directory
 -- rename file
 SELECT 

Re: Batch insert in CTAS/MatView code

2020-01-16 Thread Paul Guo
I took some time digging into the issue yesterday; the main concern with
the patch is how to get the tuple length from ExecFetchSlotHeapTuple().

+   ExecCopySlot(batchslot, slot);
+   tuple = ExecFetchSlotHeapTuple(batchslot, true, NULL);
+
+   myState->mi_slots_num++;
+   myState->mi_slots_size += tuple->t_len;

We definitely should remove ExecFetchSlotHeapTuple() here, but we need to
know the tuple length (at least a rough length). One solution might be to
use the memory context stats interface as mentioned; the code looks like this.

tup_len = MemoryContextUsedSize(batchslot->tts_mcxt);
ExecCopySlot(batchslot, slot);
ExecMaterializeSlot(batchslot);
tup_len = MemoryContextUsedSize(batchslot->tts_mcxt) - tup_len;

MemoryContextUsedSize() is added to calculate the total used size (a
simple hack on top of the stats interface).

+int
+MemoryContextUsedSize(MemoryContext context)
+{
+   MemoryContextCounters total;
+
+   memset(&total, 0, sizeof(total));
+   context->methods->stats(context, NULL, NULL, &total);
+
+   return total.totalspace - total.freespace;
+}

This basically works, but there are concerns:

   The length is not accurate (though we do not need it to be exact), since
   there are some additional memory allocations. We cannot be sure that the
   reported size will not be much larger than the real tuple length for
   some slot types in the future, nor that every slot type will allocate at
   least the tuple length in the memory context after materialization.
   Lastly, the code seems a bit ugly.

   As a reference, for "create table t1 as select * from t2", the above
   code reports a "tuple length" of 88 (the real tuple length is 4).

Another solution might be to return the real size from
ExecMaterializeSlot(), e.g. ExecMaterializeSlot(slot, &tup_len). For this
we probably want to store the length in the slot struct for performance.
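To make the idea concrete, here is a minimal sketch of what such a variant
could look like; ExecMaterializeSlotWithLen and the tts_len field are
hypothetical, not existing APIs:

/* Hypothetical sketch only: neither this wrapper nor tts_len exists today. */
static inline void
ExecMaterializeSlotWithLen(TupleTableSlot *slot, uint32 *tup_len)
{
	ExecMaterializeSlot(slot);	/* existing call: make the slot self-contained */

	/*
	 * Assumes each slot type's materialize callback would record the length
	 * of the materialized tuple in a new field, say slot->tts_len, so that
	 * callers get the size without re-forming a HeapTuple.
	 */
	*tup_len = slot->tts_len;
}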

For the COPY case the tuple length is known in advance, but I can imagine
more cases that do not know the size yet need it for the multi_insert
interface; at least I'm wondering if we should use that in 'refresh
matview' and in a fast path for the Insert node (I heard complaints about
the performance of "insert into tbl from select..." from some of our
users). So the concern is not just for the case in this patch.

Besides, my colleagues Ashwin Agrawal and Adam Lee suggested we could try
raw_heap_insert()-style code for CTAS and compare, since CTAS inserts into
a newly created table. That would involve more discussion, much more code
change, and more testing (stability and performance). So multi-insert
seems to be the more stable short-term solution, given that it has been
used in COPY for a long time?

Whatever the solution for CTAS, we need to address the concern about
tuple size for the multi-insert cases.


Re: SQL/JSON: functions

2020-01-16 Thread Pavel Stehule
Hi

On Thu, 14 Nov 2019 at 17:46, Nikita Glukhov wrote:

> Attached 40th version of the patches.
>
>
> I have added some documentation which has been simply copied from [1].
>
> Also, I have split patches more correctly, and extracted patch #2 with common
> SQL/JSON clauses, that are used by both constructors and query functions.
> Patches #3 and #4 are necessary only for constructors (patch #5).
> Patches #5 and #6 are almost independent of patch #1, which is needed only for
> query functions.
>
> So, patches #5, #6, #7 are independent now, but the code changes in them are
> still conflicting.  I can rebase #6 and #7 on top of #2, if it is necessary
> for separate review/commit.
>
>
> [1] 
> https://www.postgresql.org/message-id/732208d3-56c3-25a4-8f08-3be1d54ad51b%40postgrespro.ru
>
I tested the cumulative patch, sent in the json_table patch thread.

I am almost satisfied with the quality of this patch. There is very good
conformance with the standard and with Oracle. Unfortunately, MySQL is not
compatible in this part of its JSON support.

I found one issue when I tested some examples from Oracle.

SELECT JSON_VALUE('{a:100}', '$.a' RETURNING int) AS value;

then the result was NULL.

But that is wrong: it should raise an exception, because this JSON is
invalid in Postgres (Postgres requires quoted attribute names).

json_query has the same problem:

postgres=# SELECT JSON_QUERY('{a:100, b:200, c:300}', '$') AS value;
┌───┐
│ value │
╞═══╡
│ ∅ │
└───┘
(1 row)

It should check whether the input is valid JSON.
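For comparison (my expectation, based on how plain json input behaves), a
direct cast rejects this value:

postgres=# SELECT '{a:100}'::json;
ERROR:  invalid input syntax for type json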

Everything else was good.

I'll mark this patch as waiting on author.

Regards

Pavel



> --
> Nikita Glukhov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: FETCH FIRST clause PERCENT option

2020-01-16 Thread Kyotaro Horiguchi
Thank you for the notification, Tomas.

At Thu, 16 Jan 2020 22:33:58 +0100, Tomas Vondra  
wrote in 
> This patch is marked as RFC since September. Since then there was no
> discussion on this thread, but Andrew proposed an alternative approach
> based on window functions in a separate thread [1] (both for this and
> for the WITH TIES case).
> 
> I'll set this patch back to "needs review" - at this point we need to
> decide which of the approaches is the right one.

Even considering that we have only two uses of the mechanism for now, WITH
TIES and PERCENT, I like [1] for its generic nature, simplicity, and
smaller invasiveness to the executor. The fact that a cursor needs to
materialize to allow backward scans would not be a big problem.
It seems that PERCENT will be easily implemented on top of that.

However, isn't there a possibility that we allow FETCH FIRST x PERCENT
WITH TIES, though I'm not sure what the SQL spec says about that?  I
cannot immediately come up with a way to do that using window functions.
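For plain PERCENT (without TIES), a rough sketch of the window-function
emulation I have in mind; the table and column names are made up for
illustration:

SELECT a, b
FROM (SELECT a, b,
             row_number() OVER (ORDER BY a) AS rn,
             count(*) OVER () AS total
      FROM tbl) s
WHERE rn <= ceil(total * 0.10)   -- FETCH FIRST 10 PERCENT ROWS ONLY
ORDER BY a;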

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Errors when update a view with conditional-INSTEAD rules

2020-01-16 Thread Pengzhou Tang
Thanks a lot, Dean, for looking into this, and sorry for the late reply.

On Sun, Jan 5, 2020 at 12:08 AM Dean Rasheed 
wrote:

> Tracing it through, this seems to be a result of
> cab5dc5daf2f6f5da0ce79deb399633b4bb443b5 which added support for
> updatable views with a mix of updatable and non-updatable columns.
> That included a change to rewriteTargetListIU() to prevent it from
> adding dummy targetlist entries for unassigned-to attributes for
> auto-updatable views, in case they are no longer simple references to
> the underlying relation. Instead, that is left to expand_targetlist(),
> as for a normal table. However, in this case (an UPDATE on a view with
> a conditional rule), the target relation of the original query isn't
> rewritten (we leave it to the executor to report the error), and so
> expand_targetlist() ends up adding a new targetlist entry that
> references the target relation, which is still the original view. But
> when the planner builds the simple_rel_array, it only adds entries for
> relations referenced in the query's jointree, which only includes the
> base table by this point, not the view. Thus the new targetlist entry
> added by expand_targetlist() refers to a NULL slot in the
> simple_rel_array, and it blows up.

That's a great analysis of this issue.


> Given that this is a query that's going to fail anyway, I'm inclined
> to think that the right thing to do is to throw the error sooner, in
> rewriteQuery(), rather than attempting to plan a query that cannot be
> executed.
>

I am wondering whether a simple auto-updatable view can have a conditional
UPDATE INSTEAD rule.
For the test case I added, does the plan below look reasonable?
gpadmin=# explain UPDATE v1 SET b = 2 WHERE a = 1;
QUERY PLAN
---
 Insert on t2  (cost=0.00..49.55 rows=1 width=8)
   ->  Seq Scan on t1  (cost=0.00..49.55 rows=1 width=8)
 Filter: ((b > 100) AND (a > 2) AND (a = 1))

 Update on t1  (cost=0.00..49.55 rows=3 width=14)
   ->  Seq Scan on t1  (cost=0.00..49.55 rows=3 width=14)
 Filter: (((a > 2) IS NOT TRUE) AND (b > 100) AND (a = 1))
(7 rows)

gpadmin=# UPDATE v1 SET b = 2 WHERE a = 1;
UPDATE 1
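For reference, a hypothetical reconstruction of the kind of schema that
would produce the plan above (not the exact test case from the patch):

CREATE TABLE t1 (a int, b int);
CREATE TABLE t2 (a int, b int);
CREATE VIEW v1 AS SELECT a, b FROM t1 WHERE b > 100;
CREATE RULE r1 AS ON UPDATE TO v1
    WHERE old.a > 2
    DO INSTEAD INSERT INTO t2 VALUES (old.a, old.b);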

The document also says that:
"There is a catch if you try to use conditional rules for *complex view*
updates: there must be an unconditional
INSTEAD rule for each action you wish to allow on the view" which makes me
think a simple view can have a
conditional INSTEAD rule. And the documentation was explicitly changed in commit
a99c42f291421572aef2:
-   There is a catch if you try to use conditional rules for view
+  There is a catch if you try to use conditional rules for complex view

Does that mean we should support conditional rules for a simple view?

Regards,
Pengzhou Tang


Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Dilip Kumar
On Fri, Jan 17, 2020 at 11:34 AM Amit Kapila  wrote:
>
> On Fri, Jan 17, 2020 at 11:00 AM Dilip Kumar  wrote:
> >
> > On Fri, Jan 17, 2020 at 10:44 AM Amit Kapila  
> > wrote:
> > >
> > > On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar  wrote:
> > > >
> > > > I have few small comments.
> > > >
> > > > 1.
> > > > logical streaming for large in-progress transactions+
> > > > + /* Can't perform vacuum in parallel */
> > > > + if (parallel_workers <= 0)
> > > > + {
> > > > + pfree(can_parallel_vacuum);
> > > > + return lps;
> > > > + }
> > > >
> > > > why are we checking parallel_workers <= 0, Function
> > > > compute_parallel_vacuum_workers only returns 0 or greater than 0
> > > > so isn't it better to just check if (parallel_workers == 0) ?
> > > >
> > >
> > > Why to have such an assumption about
> > > compute_parallel_vacuum_workers()?  The function
> > > compute_parallel_vacuum_workers() returns int, so such a check
> > > (<= 0) seems reasonable to me.
> >
> > Okay so I should probably change my statement to why
> > compute_parallel_vacuum_workers is returning "int" instead of uint?
> >
>
> Hmm, I think the number of workers at most places is int, so it is
> better to return int here which will keep it consistent with how we do
> at other places.  See, the similar usage in compute_parallel_worker.

Okay, I see.

>
>   I
> > mean when this function is designed to return 0 or more worker why to
> > make it return int and then handle extra values on caller.  Am I
> > missing something, can it really return negative in some cases?
> >
> > I find the below code in "compute_parallel_vacuum_workers" a bit confusing
> >
> > +static int
> > +compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int 
> > nrequested,
> > + bool *can_parallel_vacuum)
> > +{
> > ..
> > + /* The leader process takes one index */
> > + nindexes_parallel--;--> nindexes_parallel can become -1
> > +
> > + /* No index supports parallel vacuum */
> > + if (nindexes_parallel == 0) .  -> Now if it is 0 then return 0 but
> > if its -1 then continue. seems strange no?  I think here itself we can
> > handle if (nindexes_parallel <= 0), that will make code cleaner.
> > + return 0;
> > +
>
> I think this got recently introduced by one of my changes, based on a
> comment by Mahendra; we can adjust this check.

Ok
>
> > > >
> > > > I don't like the idea of first initializing the
> > > > VacuumSharedCostBalance with lps->lvshared->cost_balance and then
> > > > uninitialize if nworkers_launched is 0.
> > > > I am not sure why do we need to initialize VacuumSharedCostBalance
> > > > here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance,
> > > > VacuumCostBalance);?
> > > > I think we can initialize it only if nworkers_launched > 0 then we can
> > > > get rid of the else branch completely.
> > > >
> > >
> > > No, we can't initialize after nworkers_launched > 0 because by that
> > > time some workers would have already tried to access the shared cost
> > > balance.  So, it needs to be done before launching the workers as is
> > > done in code.  We can probably add a comment.
> > I don't think so; VacuumSharedCostBalance is process-local and just
> > points to the shared memory variable, right?
> >
> > and each process has to point it to the shared memory and that we are
> > already doing in parallel_vacuum_main.  So we can initialize it after
> > worker is launched.
> > Basically code will look like below
> >
> > pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
> > pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0);
> >
>
> Oh, I thought you were suggesting initializing the shared memory itself
> after launching the workers.  However, you are asking to change the
> usage of the local variable; I think we can do that.

Okay.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Amit Kapila
On Fri, Jan 17, 2020 at 11:00 AM Dilip Kumar  wrote:
>
> On Fri, Jan 17, 2020 at 10:44 AM Amit Kapila  wrote:
> >
> > On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar  wrote:
> > >
> > > I have few small comments.
> > >
> > > 1.
> > > logical streaming for large in-progress transactions+
> > > + /* Can't perform vacuum in parallel */
> > > + if (parallel_workers <= 0)
> > > + {
> > > + pfree(can_parallel_vacuum);
> > > + return lps;
> > > + }
> > >
> > > why are we checking parallel_workers <= 0, Function
> > > compute_parallel_vacuum_workers only returns 0 or greater than 0
> > > so isn't it better to just check if (parallel_workers == 0) ?
> > >
> >
> > Why to have such an assumption about
> > compute_parallel_vacuum_workers()?  The function
> > compute_parallel_vacuum_workers() returns int, so such a check
> > (<= 0) seems reasonable to me.
>
> Okay so I should probably change my statement to why
> compute_parallel_vacuum_workers is returning "int" instead of uint?
>

Hmm, I think the number of workers at most places is int, so it is
better to return int here which will keep it consistent with how we do
at other places.  See, the similar usage in compute_parallel_worker.

  I
> mean when this function is designed to return 0 or more worker why to
> make it return int and then handle extra values on caller.  Am I
> missing something, can it really return negative in some cases?
>
> I find the below code in "compute_parallel_vacuum_workers" a bit confusing
>
> +static int
> +compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
> + bool *can_parallel_vacuum)
> +{
> ..
> + /* The leader process takes one index */
> + nindexes_parallel--;--> nindexes_parallel can become -1
> +
> + /* No index supports parallel vacuum */
> + if (nindexes_parallel == 0) .  -> Now if it is 0 then return 0 but
> if its -1 then continue. seems strange no?  I think here itself we can
> handle if (nindexes_parallel <= 0), that will make code cleaner.
> + return 0;
> +

I think this got recently introduced by one of my changes, based on a
comment by Mahendra; we can adjust this check.

> > >
> > > I don't like the idea of first initializing the
> > > VacuumSharedCostBalance with lps->lvshared->cost_balance and then
> > > uninitialize if nworkers_launched is 0.
> > > I am not sure why do we need to initialize VacuumSharedCostBalance
> > > here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance,
> > > VacuumCostBalance);?
> > > I think we can initialize it only if nworkers_launched > 0 then we can
> > > get rid of the else branch completely.
> > >
> >
> > No, we can't initialize after nworkers_launched > 0 because by that
> > time some workers would have already tried to access the shared cost
> > balance.  So, it needs to be done before launching the workers as is
> > done in code.  We can probably add a comment.
> I don't think so; VacuumSharedCostBalance is process-local and just
> points to the shared memory variable, right?
>
> and each process has to point it to the shared memory and that we are
> already doing in parallel_vacuum_main.  So we can initialize it after
> worker is launched.
> Basically code will look like below
>
> pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
> pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0);
>

Oh, I thought you were suggesting initializing the shared memory itself
after launching the workers.  However, you are asking to change the
usage of the local variable; I think we can do that.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Unnecessary delay in streaming replication due to replay lag

2020-01-16 Thread Michael Paquier
On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote:
> The standby does not start the walreceiver process until the startup
> process finishes WAL replay.  The more WAL there is to replay, the
> longer the delay in starting streaming replication.  If the replication
> connection is temporarily broken, this delay becomes a major problem,
> and we are proposing a solution to avoid it.

Yeah, that's documented:
https://www.postgresql.org/message-id/20190910062325.gd11...@paquier.xyz

> We propose to address this by starting walreceiver without waiting for
> startup process to finish replay of WAL.  Please see attached
> patchset.  It can be summarized as follows:
> 
> 0001 - TAP test to demonstrate the problem.

There is no real need for debug_replay_delay because we already have
recovery_min_apply_delay, no?  That would count only after consistency
has been reached, and only for COMMIT records, but that would be enough
for your test.
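A minimal sketch of that approach in a TAP test, untested and with a
made-up node name:

# Keep the standby's replay artificially behind while streaming continues.
$node_standby->append_conf('postgresql.conf',
	"recovery_min_apply_delay = '5s'");
$node_standby->restart;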

> 0002 - The standby startup sequence is changed such that
>walreceiver is started by startup process before it begins
>to replay WAL.

See below.

> 0003 - Postmaster starts walreceiver if it finds that a
>walreceiver process is no longer running and the state
>indicates that it is operating as a standby.

I have not checked in detail, but I smell some race conditions
between the postmaster and the startup process here.

> This is a POC; we are looking for early feedback on whether the
> problem is worth solving and whether it makes sense to solve it along
> this route.

You are not the first person interested in this problem; we have a
patch registered in this CF to control when a WAL receiver is started
during recovery:
https://commitfest.postgresql.org/26/1995/
https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f...@postgrespro.ru

I am pretty sure that we should not change the default behavior, which
starts the WAL receiver only after replaying everything available from
the archives, so as to avoid copying some WAL segments for nothing; a
GUC switch should be the way to go, and Konstantin's latest patch was
using this approach.  Your patch 0002 visibly adds a third mode, start
immediately, on top of the two already proposed:
- Start after replaying all WAL available locally and in the
archives.
- Start after reaching a consistent point.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Dilip Kumar
On Fri, Jan 17, 2020 at 10:44 AM Amit Kapila  wrote:
>
> On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar  wrote:
> >
> > I have few small comments.
> >
> > 1.
> > logical streaming for large in-progress transactions+
> > + /* Can't perform vacuum in parallel */
> > + if (parallel_workers <= 0)
> > + {
> > + pfree(can_parallel_vacuum);
> > + return lps;
> > + }
> >
> > why are we checking parallel_workers <= 0, Function
> > compute_parallel_vacuum_workers only returns 0 or greater than 0
> > so isn't it better to just check if (parallel_workers == 0) ?
> >
>
> Why to have such an assumption about
> compute_parallel_vacuum_workers()?  The function
> compute_parallel_vacuum_workers() returns int, so such a check
> (<= 0) seems reasonable to me.

Okay so I should probably change my statement to why
compute_parallel_vacuum_workers is returning "int" instead of uint?  I
mean when this function is designed to return 0 or more worker why to
make it return int and then handle extra values on caller.  Am I
missing something, can it really return negative in some cases?

I find the below code in "compute_parallel_vacuum_workers" a bit confusing

+static int
+compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+ bool *can_parallel_vacuum)
+{
..
+ /* The leader process takes one index */
+ nindexes_parallel--;--> nindexes_parallel can become -1
+
+ /* No index supports parallel vacuum */
+ if (nindexes_parallel == 0) .  -> Now if it is 0 then return 0 but
if its -1 then continue. seems strange no?  I think here itself we can
handle if (nindexes_parallel <= 0), that will make code cleaner.
+ return 0;
+
+ /* Compute the parallel degree */
+ parallel_workers = (nrequested > 0) ?
+ Min(nrequested, nindexes_parallel) : nindexes_parallel;



>
> > 2.
> > +/*
> > + * Macro to check if we are in a parallel vacuum.  If true, we are in the
> > + * parallel mode and the DSM segment is initialized.
> > + */
> > +#define ParallelVacuumIsActive(lps) (((LVParallelState *) (lps)) != NULL)
> >
> > (LVParallelState *) (lps) -> this typecast is not required, just (lps)
> > != NULL should be enough.
> >
>
> I think the better idea would be to just replace it with PointerIsValid
> like below. I see similar usage in other places.
> #define ParallelVacuumIsActive(lps)  PointerIsValid(lps)
Makes sense to me.
>
> > 3.
> >
> > + shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
> > + prepare_index_statistics(shared, can_parallel_vacuum, nindexes);
> > + pg_atomic_init_u32(&(shared->idx), 0);
> > + pg_atomic_init_u32(&(shared->cost_balance), 0);
> > + pg_atomic_init_u32(&(shared->active_nworkers), 0);
> >
> > I think it will look cleaner if we can initialize in the order they
> > are declared in structure.
> >
>
> Okay.
>
> > 4.
> > + VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
> > + VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
> > +
> > + /*
> > + * Set up shared cost balance and the number of active workers for
> > + * vacuum delay.
> > + */
> > + pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance);
> > + pg_atomic_write_u32(VacuumActiveNWorkers, 0);
> > +
> > + /*
> > + * The number of workers can vary between bulkdelete and cleanup
> > + * phase.
> > + */
> > + ReinitializeParallelWorkers(lps->pcxt, nworkers);
> > +
> > + LaunchParallelWorkers(lps->pcxt);
> > +
> > + if (lps->pcxt->nworkers_launched > 0)
> > + {
> > + /*
> > + * Reset the local cost values for leader backend as we have
> > + * already accumulated the remaining balance of heap.
> > + */
> > + VacuumCostBalance = 0;
> > + VacuumCostBalanceLocal = 0;
> > + }
> > + else
> > + {
> > + /*
> > + * Disable shared cost balance if we are not able to launch
> > + * workers.
> > + */
> > + VacuumSharedCostBalance = NULL;
> > + VacuumActiveNWorkers = NULL;
> > + }
> > +
> >
> > I don't like the idea of first initializing the
> > VacuumSharedCostBalance with lps->lvshared->cost_balance and then
> > uninitialize if nworkers_launched is 0.
> > I am not sure why do we need to initialize VacuumSharedCostBalance
> > here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance,
> > VacuumCostBalance);?
> > I think we can initialize it only if nworkers_launched > 0 then we can
> > get rid of the else branch completely.
> >
>
> No, we can't initialize after nworkers_launched > 0 because by that
> time some workers would have already tried to access the shared cost
> balance.  So, it needs to be done before launching the workers as is
> done in code.  We can probably add a comment.
I don't think so; VacuumSharedCostBalance is process-local and just
points to the shared memory variable, right?

and each process has to point it to the shared memory and that we are
already doing in parallel_vacuum_main.  So we can initialize it after
worker is launched.
Basically code will look like below

pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);

Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Amit Kapila
On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar  wrote:
>
> I have few small comments.
>
> 1.
> logical streaming for large in-progress transactions+
> + /* Can't perform vacuum in parallel */
> + if (parallel_workers <= 0)
> + {
> + pfree(can_parallel_vacuum);
> + return lps;
> + }
>
> why are we checking parallel_workers <= 0, Function
> compute_parallel_vacuum_workers only returns 0 or greater than 0
> so isn't it better to just check if (parallel_workers == 0) ?
>

Why to have such an assumption about
compute_parallel_vacuum_workers()?  The function
compute_parallel_vacuum_workers() returns int, so such a check
(<= 0) seems reasonable to me.

> 2.
> +/*
> + * Macro to check if we are in a parallel vacuum.  If true, we are in the
> + * parallel mode and the DSM segment is initialized.
> + */
> +#define ParallelVacuumIsActive(lps) (((LVParallelState *) (lps)) != NULL)
>
> (LVParallelState *) (lps) -> this typecast is not required, just (lps)
> != NULL should be enough.
>

I think the better idea would be to just replace it with PointerIsValid
like below. I see similar usage in other places.
#define ParallelVacuumIsActive(lps)  PointerIsValid(lps)

> 3.
>
> + shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
> + prepare_index_statistics(shared, can_parallel_vacuum, nindexes);
> + pg_atomic_init_u32(&(shared->idx), 0);
> + pg_atomic_init_u32(&(shared->cost_balance), 0);
> + pg_atomic_init_u32(&(shared->active_nworkers), 0);
>
> I think it will look cleaner if we can initialize in the order they
> are declared in structure.
>

Okay.

> 4.
> + VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
> + VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
> +
> + /*
> + * Set up shared cost balance and the number of active workers for
> + * vacuum delay.
> + */
> + pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance);
> + pg_atomic_write_u32(VacuumActiveNWorkers, 0);
> +
> + /*
> + * The number of workers can vary between bulkdelete and cleanup
> + * phase.
> + */
> + ReinitializeParallelWorkers(lps->pcxt, nworkers);
> +
> + LaunchParallelWorkers(lps->pcxt);
> +
> + if (lps->pcxt->nworkers_launched > 0)
> + {
> + /*
> + * Reset the local cost values for leader backend as we have
> + * already accumulated the remaining balance of heap.
> + */
> + VacuumCostBalance = 0;
> + VacuumCostBalanceLocal = 0;
> + }
> + else
> + {
> + /*
> + * Disable shared cost balance if we are not able to launch
> + * workers.
> + */
> + VacuumSharedCostBalance = NULL;
> + VacuumActiveNWorkers = NULL;
> + }
> +
>
> I don't like the idea of first initializing the
> VacuumSharedCostBalance with lps->lvshared->cost_balance and then
> uninitialize if nworkers_launched is 0.
> I am not sure why do we need to initialize VacuumSharedCostBalance
> here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance,
> VacuumCostBalance);?
> I think we can initialize it only if nworkers_launched > 0 then we can
> get rid of the else branch completely.
>

No, we can't initialize after nworkers_launched > 0 because by that
time some workers would have already tried to access the shared cost
balance.  So, it needs to be done before launching the workers as is
done in code.  We can probably add a comment.

>
> + /* Carry the shared balance value to heap scan */
> + if (VacuumSharedCostBalance)
> + VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
> +
> + if (nworkers > 0)
> + {
> + /* Disable shared cost balance */
> + VacuumSharedCostBalance = NULL;
> + VacuumActiveNWorkers = NULL;
> + }
>
> Doesn't make sense to keep them as two conditions, we can combine them as 
> below
>
> /* If shared costing is enabled, carry the shared balance value to heap
> scan and disable the shared costing */
>  if (VacuumSharedCostBalance)
> {
>  VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
>  VacuumSharedCostBalance = NULL;
>  VacuumActiveNWorkers = NULL;
> }
>

makes sense to me, will change.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

2020-01-16 Thread Michael Paquier
On Thu, Jan 16, 2020 at 11:17:36PM +0900, Fujii Masao wrote:
> OK, I updated the patch that way.
> Attached is the updated version of the patch.

Thanks.  I have few tweaks to propose to the docs.

+raise a PANIC-level error, aborting the recovery. Setting
Instead of "PANIC-level error", I would just use "PANIC error", and
instead of "aborting the recovery" just "crashing the server".

+causes the system to ignore those WAL records
WAL records are not ignored, but errors caused by incorrect page
references in those WAL records are.  The current phrasing sounds like
the WAL records are not applied.

Another thing I just recalled: do you think that it would be better to
mention that invalid page references can only be seen after reaching
the consistent point during recovery?  The information given looks
sufficient, but I was just wondering whether that's worth documenting
or not.
--
Michael


signature.asc
Description: PGP signature


Re: Add pg_file_sync() to adminpack

2020-01-16 Thread Michael Paquier
On Thu, Jan 16, 2020 at 09:51:24AM -0500, Robert Haas wrote:
> On Tue, Jan 14, 2020 at 10:08 AM Atsushi Torikoshi  wrote:
>> fails we can get error messages. So it seems straightforward for me to
>>  'return true on success and emit an ERROR otherwise'.
> 
> I agree with reporting errors using ERROR, but it seems to me that we
> ought to then make the function return 'void' rather than 'bool'.

Yeah, that should be either ERROR and return a void result, or issue a
WARNING/ERROR (depending on the code path, maybe PANIC?) with a
boolean status returned if there is a WARNING.  Mixing both concepts,
with an ERROR all the time and always a true status, is just weird,
because you know that if no errors are raised then the status will
always be true.  So there is no point in having a boolean status to
begin with.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Dilip Kumar
On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar  wrote:
>
> On Thu, Jan 16, 2020 at 5:34 PM Amit Kapila  wrote:
> >
> > On Thu, Jan 16, 2020 at 4:46 PM Masahiko Sawada
> >  wrote:
> > >
> > > Right. Most indexes (all?) of tables that are used in the regression
> > > tests are smaller than min_parallel_index_scan_size. And we set
> > > min_parallel_index_scan_size to 0 in vacuum.sql but VACUUM would not
> > > be sped up much because of the relation size. Since we instead
> > > populate a new table for parallel vacuum testing, the regression
> > > test for vacuum would take longer.
> > >
> >
> > Fair enough and I think it is good in a way that it won't change the
> > coverage of existing vacuum code.  I have fixed all the issues
> > reported by Mahendra and have fixed a few other cosmetic things in the
> > attached patch.
> >
> I have few small comments.
>
> 1.
> logical streaming for large in-progress transactions+
> + /* Can't perform vacuum in parallel */
> + if (parallel_workers <= 0)
> + {
> + pfree(can_parallel_vacuum);
> + return lps;
> + }
>
> why are we checking parallel_workers <= 0, Function
> compute_parallel_vacuum_workers only returns 0 or greater than 0
> so isn't it better to just check if (parallel_workers == 0) ?
>
> 2.
> +/*
> + * Macro to check if we are in a parallel vacuum.  If true, we are in the
> + * parallel mode and the DSM segment is initialized.
> + */
> +#define ParallelVacuumIsActive(lps) (((LVParallelState *) (lps)) != NULL)
>
> (LVParallelState *) (lps) -> this typecast is not required, just (lps)
> != NULL should be enough.
>
> 3.
>
> + shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
> + prepare_index_statistics(shared, can_parallel_vacuum, nindexes);
> + pg_atomic_init_u32(&(shared->idx), 0);
> + pg_atomic_init_u32(&(shared->cost_balance), 0);
> + pg_atomic_init_u32(&(shared->active_nworkers), 0);
>
> I think it will look cleaner if we can initialize in the order they
> are declared in structure.
>
> 4.
> + VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
> + VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
> +
> + /*
> + * Set up shared cost balance and the number of active workers for
> + * vacuum delay.
> + */
> + pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance);
> + pg_atomic_write_u32(VacuumActiveNWorkers, 0);
> +
> + /*
> + * The number of workers can vary between bulkdelete and cleanup
> + * phase.
> + */
> + ReinitializeParallelWorkers(lps->pcxt, nworkers);
> +
> + LaunchParallelWorkers(lps->pcxt);
> +
> + if (lps->pcxt->nworkers_launched > 0)
> + {
> + /*
> + * Reset the local cost values for leader backend as we have
> + * already accumulated the remaining balance of heap.
> + */
> + VacuumCostBalance = 0;
> + VacuumCostBalanceLocal = 0;
> + }
> + else
> + {
> + /*
> + * Disable shared cost balance if we are not able to launch
> + * workers.
> + */
> + VacuumSharedCostBalance = NULL;
> + VacuumActiveNWorkers = NULL;
> + }
> +
>
> I don't like the idea of first initializing the
> VacuumSharedCostBalance with lps->lvshared->cost_balance and then
> uninitialize if nworkers_launched is 0.
> I am not sure why do we need to initialize VacuumSharedCostBalance
> here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance,
> VacuumCostBalance);?
> I think we can initialize it only if nworkers_launched > 0 then we can
> get rid of the else branch completely.

I missed one of my comments.

+ /* Carry the shared balance value to heap scan */
+ if (VacuumSharedCostBalance)
+ VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+
+ if (nworkers > 0)
+ {
+ /* Disable shared cost balance */
+ VacuumSharedCostBalance = NULL;
+ VacuumActiveNWorkers = NULL;
+ }

Doesn't make sense to keep them as two conditions, we can combine them as below

/* If shared costing is enabled, carry the shared balance value to heap
scan and disable the shared costing */
 if (VacuumSharedCostBalance)
{
 VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
 VacuumSharedCostBalance = NULL;
 VacuumActiveNWorkers = NULL;
}

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Dilip Kumar
On Thu, Jan 16, 2020 at 5:34 PM Amit Kapila  wrote:
>
> On Thu, Jan 16, 2020 at 4:46 PM Masahiko Sawada
>  wrote:
> >
> > Right. Most indexes (all?) of tables that are used in the regression
> > tests are smaller than min_parallel_index_scan_size. And we set
> > min_parallel_index_scan_size to 0 in vacuum.sql but VACUUM would not
> > be sped up much because of the relation size. Since we instead
> > populate a new table for parallel vacuum testing, the regression
> > test for vacuum would take longer.
> >
>
> Fair enough and I think it is good in a way that it won't change the
> coverage of existing vacuum code.  I have fixed all the issues
> reported by Mahendra and have fixed a few other cosmetic things in the
> attached patch.
>
I have few small comments.

1.
logical streaming for large in-progress transactions+
+ /* Can't perform vacuum in parallel */
+ if (parallel_workers <= 0)
+ {
+ pfree(can_parallel_vacuum);
+ return lps;
+ }

why are we checking parallel_workers <= 0, Function
compute_parallel_vacuum_workers only returns 0 or greater than 0
so isn't it better to just check if (parallel_workers == 0) ?

2.
+/*
+ * Macro to check if we are in a parallel vacuum.  If true, we are in the
+ * parallel mode and the DSM segment is initialized.
+ */
+#define ParallelVacuumIsActive(lps) (((LVParallelState *) (lps)) != NULL)

(LVParallelState *) (lps) -> this typecast is not required, just (lps)
!= NULL should be enough.

3.

+ shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
+ prepare_index_statistics(shared, can_parallel_vacuum, nindexes);
+ pg_atomic_init_u32(&(shared->idx), 0);
+ pg_atomic_init_u32(&(shared->cost_balance), 0);
+ pg_atomic_init_u32(&(shared->active_nworkers), 0);

I think it will look cleaner if we can initialize in the order they
are declared in structure.

4.
+ VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
+ VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
+
+ /*
+ * Set up shared cost balance and the number of active workers for
+ * vacuum delay.
+ */
+ pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance);
+ pg_atomic_write_u32(VacuumActiveNWorkers, 0);
+
+ /*
+ * The number of workers can vary between bulkdelete and cleanup
+ * phase.
+ */
+ ReinitializeParallelWorkers(lps->pcxt, nworkers);
+
+ LaunchParallelWorkers(lps->pcxt);
+
+ if (lps->pcxt->nworkers_launched > 0)
+ {
+ /*
+ * Reset the local cost values for leader backend as we have
+ * already accumulated the remaining balance of heap.
+ */
+ VacuumCostBalance = 0;
+ VacuumCostBalanceLocal = 0;
+ }
+ else
+ {
+ /*
+ * Disable shared cost balance if we are not able to launch
+ * workers.
+ */
+ VacuumSharedCostBalance = NULL;
+ VacuumActiveNWorkers = NULL;
+ }
+

I don't like the idea of first initializing the
VacuumSharedCostBalance with lps->lvshared->cost_balance and then
uninitialize if nworkers_launched is 0.
I am not sure why do we need to initialize VacuumSharedCostBalance
here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance,
VacuumCostBalance);?
I think we can initialize it only if nworkers_launched > 0 then we can
get rid of the else branch completely.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Unnecessary delay in streaming replication due to replay lag

2020-01-16 Thread Asim R P
Hi

The standby does not start the walreceiver process until the startup
process finishes WAL replay.  The more WAL there is to replay, the
longer the delay in starting streaming replication.  If the replication
connection is temporarily broken, this delay becomes a major problem,
and we are proposing a solution to avoid it.

WAL replay is likely to fall behind when the master is processing a
write-heavy workload, because WAL is generated by concurrently running
backends on the master, while a single startup process on the standby
replays WAL records in sequence as new WAL is received.

The replication connection between walsender and walreceiver may break
for reasons such as a transient network issue or the standby going
through a restart.  The delay in resuming the replication connection
undermines high availability: only one copy of the WAL is available
during this period.

The problem worsens when the replication is configured to be
synchronous.  Commits on the master must wait until WAL replay finishes
on the standby and walreceiver starts and confirms the flush of WAL up
to the commit LSN.  If the synchronous_commit GUC is set to
remote_write, this behavior is equivalent to tacitly changing it to
remote_apply until the replication connection is re-established!

Has anyone encountered such a problem with streaming replication?

We propose to address this by starting walreceiver without waiting for
startup process to finish replay of WAL.  Please see attached
patchset.  It can be summarized as follows:

0001 - TAP test to demonstrate the problem.

0002 - The standby startup sequence is changed such that
   walreceiver is started by startup process before it begins
   to replay WAL.

0003 - Postmaster starts walreceiver if it finds that a
   walreceiver process is no longer running and the state
   indicates that it is operating as a standby.

This is a POC; we are looking for early feedback on whether the
problem is worth solving and whether it makes sense to solve it along
this route.

Hao and Asim


0001-Test-that-replay-of-WAL-logs-on-standby-does-not-aff.patch
Description: Binary data


0003-Start-WAL-receiver-when-it-is-found-not-running.patch
Description: Binary data


0002-Start-WAL-receiver-before-startup-process-replays-ex.patch
Description: Binary data


Re: Reorderbuffer crash during recovery

2020-01-16 Thread Dilip Kumar
On Fri, Jan 17, 2020 at 7:42 AM vignesh C  wrote:
>
> On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar  wrote:
> >
> > One minor comment.  Otherwise, the patch looks fine to me.
> > + /*
> > + * We set final_lsn on a transaction when we decode its commit or abort
> > + * record, but we never see those records for crashed transactions.  To
> > + * ensure cleanup of these transactions, set final_lsn to that of their
> > + * last change; this causes ReorderBufferRestoreCleanup to do the right
> > + * thing. Final_lsn would have been set with commit_lsn earlier when we
> > + * decode it commit, no need to update in that case
> > + */
> > + if (txn->final_lsn < change->lsn)
> > + txn->final_lsn = change->lsn;
> >
> > /decode it commit,/decode its commit,
> >
>
> Thanks Dilip for reviewing.
> I have fixed the comments you have suggested.
>
Thanks for the updated patch.  It looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pgindent && weirdness

2020-01-16 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Jan-16, Tom Lane wrote:
>> ... But I was hoping to
>> hear Piotr's opinion before moving forward.

> FWIW, I think this code predates Piotr's involvement; at least,
> it was already there in the FreeBSD code he imported:
> https://github.com/pstef/freebsd_indent/commit/55c29a8774923f2d40fef7919b9490f61e57e7bb#diff-85c94ae15198235e2363f96216b9a1b2R565

The roots of that code are even older than Postgres, I believe,
and there may not be anybody left who understands it completely.
But Piotr has certainly spent more time looking at it than I have,
so I'd still like to hear what he thinks.

regards, tom lane




Re: pgindent && weirdness

2020-01-16 Thread Alvaro Herrera
On 2020-Jan-16, Tom Lane wrote:

> Alvaro Herrera  writes:
> > On 2020-Jan-17, Michael Paquier wrote:
> >> Nice find!  Could you commit that?  I can see many places improved as
> >> well, among explain.c, tablecmds.c, typecmds.c, and much more. 
> 
> > I think Tom is the only one who can commit that,
> > https://git.postgresql.org/gitweb/?p=pg_bsd_indent.git
> 
> I don't *think* that repo is locked down that hard --- IIRC,
> PG committers should have access to it.  But I was hoping to
> hear Piotr's opinion before moving forward.

FWIW, I think this code predates Piotr's involvement; at least,
it was already there in the FreeBSD code he imported:
https://github.com/pstef/freebsd_indent/commit/55c29a8774923f2d40fef7919b9490f61e57e7bb#diff-85c94ae15198235e2363f96216b9a1b2R565

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgindent && weirdness

2020-01-16 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Jan-17, Michael Paquier wrote:
>> Nice find!  Could you commit that?  I can see many places improved as
>> well, among explain.c, tablecmds.c, typecmds.c, and much more. 

> I think Tom is the only one who can commit that,
> https://git.postgresql.org/gitweb/?p=pg_bsd_indent.git

I don't *think* that repo is locked down that hard --- IIRC,
PG committers should have access to it.  But I was hoping to
hear Piotr's opinion before moving forward.

regards, tom lane




Re: Run-time pruning for ModifyTable

2020-01-16 Thread Michael Paquier
On Thu, Jan 16, 2020 at 10:45:25PM +0100, Tomas Vondra wrote:
> David, this patch is marked as "waiting on author" since 11/27, and
> there have been no updates or responses since then. Do you plan to
> submit a new patch version in this CF? We're already half-way through,
> so there's not much time ...

The reason why I moved it to 2020-01 is that there was not enough time
for David to reply back.  At this stage, it seems more appropriate to
me to mark it as returned with feedback and move on.
--
Michael


signature.asc
Description: PGP signature


Re: Setting min/max TLS protocol in clientside libpq

2020-01-16 Thread Michael Paquier
On Fri, Jan 17, 2020 at 10:09:54AM +0900, Michael Paquier wrote:
> Could you please rebase and fix the remaining pieces of the patch?

And while I remember, you may want to add checks for incorrect bounds
when validating the values in fe-connect.c...  The same arguments as
for the backend part apply, because we'd want to keep the implementation
as pluggable as possible across all SSL libraries.
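A sketch of the kind of bounds check I mean; the helper names and the
mapping function are hypothetical, not the patch's actual code:

/* Hypothetical sketch: reject a range where min > max. */
static bool
ssl_protocol_range_ok(const char *min, const char *max)
{
	int		cmin = ssl_protocol_version_to_int(min);	/* hypothetical mapper */
	int		cmax = ssl_protocol_version_to_int(max);

	if (cmin < 0 || cmax < 0)
		return false;			/* unrecognized setting value */
	return cmin <= cmax;
}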
--
Michael


signature.asc
Description: PGP signature


Re: pgindent && weirdness

2020-01-16 Thread Alvaro Herrera
On 2020-Jan-17, Michael Paquier wrote:

> On Thu, Jan 16, 2020 at 06:13:36PM -0300, Alvaro Herrera wrote:
> > This is indeed a very good fix!  Several badly formatted sites in our
> > code are improved with this change.
> 
> Nice find!  Could you commit that?  I can see many places improved as
> well, among explain.c, tablecmds.c, typecmds.c, and much more. 

I think Tom is the only one who can commit that,
https://git.postgresql.org/gitweb/?p=pg_bsd_indent.git

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Reorderbuffer crash during recovery

2020-01-16 Thread vignesh C
On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar  wrote:
>
> One minor comment.  Otherwise, the patch looks fine to me.
> + /*
> + * We set final_lsn on a transaction when we decode its commit or abort
> + * record, but we never see those records for crashed transactions.  To
> + * ensure cleanup of these transactions, set final_lsn to that of their
> + * last change; this causes ReorderBufferRestoreCleanup to do the right
> + * thing. Final_lsn would have been set with commit_lsn earlier when we
> + * decode it commit, no need to update in that case
> + */
> + if (txn->final_lsn < change->lsn)
> + txn->final_lsn = change->lsn;
>
> /decode it commit,/decode its commit,
>

Thanks Dilip for reviewing.
I have fixed the comments you have suggested.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From 6c5de8db4b6ac4da9d4ada0e9fa56133af0ed008 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Fri, 17 Jan 2020 07:34:31 +0530
Subject: [PATCH] Reorder buffer crash while aborting old transactions.

While aborting old transactions, the server crashed because no reorderbuffer
changes were present in the in-memory list: all of the reorderbuffer changes
had been serialized. Fix this by storing the last change's LSN while
serializing the reorderbuffer changes. This LSN is used as final_lsn, which
helps in cleaning up the spilled files in pg_replslot.
---
 src/backend/replication/logical/reorderbuffer.c | 26 +++--
 src/include/replication/reorderbuffer.h |  4 ++--
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbd908a..fa8a6b0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1974,21 +1974,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 
 		if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
 		{
-			/*
-			 * We set final_lsn on a transaction when we decode its commit or
-			 * abort record, but we never see those records for crashed
-			 * transactions.  To ensure cleanup of these transactions, set
-			 * final_lsn to that of their last change; this causes
-			 * ReorderBufferRestoreCleanup to do the right thing.
-			 */
-			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
-			{
-ReorderBufferChange *last =
-dlist_tail_element(ReorderBufferChange, node, >changes);
-
-txn->final_lsn = last->lsn;
-			}
-
 			elog(DEBUG2, "aborting old transaction %u", txn->xid);
 
 			/* remove potential on-disk data, and deallocate this tx */
@@ -2697,6 +2682,17 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * We set final_lsn on a transaction when we decode its commit or abort
+	 * record, but we never see those records for crashed transactions.  To
+	 * ensure cleanup of these transactions, set final_lsn to that of their
+	 * last change; this causes ReorderBufferRestoreCleanup to do the right
+	 * thing.  final_lsn will already have been set to commit_lsn when we
+	 * decoded the transaction's commit record; no need to update it then.
+	 */
+	if (txn->final_lsn < change->lsn)
+		txn->final_lsn = change->lsn;
+
 	Assert(ondisk->change.action == change->action);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4f6c65d..5b67457 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -208,8 +208,8 @@ typedef struct ReorderBufferTXN
 	 * * plain abort record
 	 * * prepared transaction abort
 	 * * error during decoding
-	 * * for a crashed transaction, the LSN of the last change, regardless of
-	 *   what it was.
+	 * * for a transaction with serialized changes, the LSN of the latest
+	 *   serialized one, unless one of the above cases.
 	 * 
 	 */
 	XLogRecPtr	final_lsn;
-- 
1.8.3.1


Re: making the backend's json parser work in frontend code

2020-01-16 Thread Andrew Dunstan
On Thu, Jan 16, 2020 at 7:33 AM Robert Haas  wrote:
>

>
> 0002 does some basic header cleanup to make it possible to include the
> existing header file jsonapi.h in frontend code. The state of the JSON
> headers today looks generally poor. There seems not to have been much
> attempt to get the prototypes for a given source file, say foo.c, into
> a header file with the same name, say foo.h. Also, dependencies
> between various header files seem to be have added somewhat freely.
> This patch does not come close to fixing all that, but I consider it a
> modest down payment on a cleanup that probably ought to be taken
> further.
>
> 0003 splits json.c into two files, json.c and jsonapi.c. All the
> lexing and parsing stuff (whose prototypes are in jsonapi.h) goes into
> jsonapi.c, while the stuff that pertains to the 'json' data type
> remains in json.c. This also seems like a good cleanup, because to me,
> at least, it's not a great idea to mix together code that is used by
> both the json and jsonb data types as well as other things in the
> system that want to generate or parse json together with things that
> are specific to the 'json' data type.
>

I'm probably responsible for a good deal of the mess, so let me say thank you.

I'll have a good look at these.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgindent && weirdness

2020-01-16 Thread Michael Paquier
On Thu, Jan 16, 2020 at 06:13:36PM -0300, Alvaro Herrera wrote:
> This is indeed a very good fix!  Several badly formatted sites in our
> code are improved with this change.

Nice find!  Could you commit that?  I can see many places improved as
well, among explain.c, tablecmds.c, typecmds.c, and much more. 
--
Michael




Re: Amcheck: do rightlink verification with lock coupling

2020-01-16 Thread Peter Geoghegan
On Mon, Jan 13, 2020 at 8:47 PM Andrey Borodin  wrote:
> > On Jan 11, 2020, at 7:49, Peter Geoghegan  wrote:
> > I'm curious why Andrey's corruption problems were not detected by the
> > cross-page amcheck test, though. We compare the first non-pivot tuple
> > on the right sibling leaf page with the last one on the target page,
> > towards the end of bt_target_page_check() -- isn't that almost as good
> > as what you have here in practice? I probably would have added
> > something like this myself earlier, if I had reason to think that
> > verification would be a lot more effective that way.
>
> We were dealing with corruption caused by a lost page update. Consider two
> pages:
> A->B
> If A is split into A` and C we get:
> A`->C->B
> But if the update of A is lost we still have
> A->B, while B's backward pointer points to C.
> B's smallest key is bigger than the high key of A`, but this does not
> violate the cross-page invariant.
>
> Page updates may be lost due to a bug in backup software doing incremental
> backups, a bug in the storage layer of an Aurora-style system, a bug in the
> page cache, incorrect fsync error handling, a bug in SSD firmware, etc. And
> our data checksums do not detect this kind of corruption. BTW, I think it
> would be better if our checksums were not stored on the page itself; then
> they could detect this kind of fault.

I find this argument convincing. I'll try to get this committed soon.

While you could have used bt_index_parent_check() or heapallindexed to
detect the issue, those two options are a lot more expensive (plus the
former option won't work on a standby). Relaxing the principle that
says that we shouldn't hold multiple buffer locks concurrently doesn't
seem like that much to ask for to get such a clear benefit.

I think that this is safe, but page deletion/half-dead pages need more
thought. In general, the target page could have become "ignorable"
when rechecked.
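
To make the locking order concrete, here is a rough sketch of the kind
of check under discussion -- not the patch as submitted; BtreeCheckState
fields are as in amcheck's verify_nbtree.c:

static void
bt_rightlink_check(BtreeCheckState *state, Page target,
				   BlockNumber targetblock)
{
	BTPageOpaque topaque = (BTPageOpaque) PageGetSpecialPointer(target);
	Buffer		sbuf;
	Page		sibling;
	BTPageOpaque sopaque;

	if (P_RIGHTMOST(topaque))
		return;					/* no right sibling to cross-check */

	/* couple: take the sibling's lock while still holding the target's */
	sbuf = ReadBufferExtended(state->rel, MAIN_FORKNUM, topaque->btpo_next,
							  RBM_NORMAL, state->checkstrategy);
	LockBuffer(sbuf, BT_READ);
	sibling = BufferGetPage(sbuf);
	sopaque = (BTPageOpaque) PageGetSpecialPointer(sibling);

	/* half-dead/deleted pages need extra thought, per the above */
	if (!P_IGNORE(sopaque) && sopaque->btpo_prev != targetblock)
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("right sibling of block %u does not point back to it",
						targetblock)));

	UnlockReleaseBuffer(sbuf);
}

This matches nbtree's own left-to-right locking order, which is why it
cannot deadlock with concurrent writers.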

> We were able to detect all those problems on primaries in a timely manner
> in our testing environment, but found out much later that some standbys
> were corrupted; the problem appeared only when they were promoted.
> Also, in a nearby thread Grygory Rylov (g0djan) is trying to enable one
> more invariant in standby checks.

I looked at that thread just now, but Grygory didn't say why this true
root check was particularly important, so I can't see much upside.
Plus that seems riskier than what you have in mind here.

Does it have something to do with the true root looking like a deleted
page? The details matter.

-- 
Peter Geoghegan




Re: Setting min/max TLS protocol in clientside libpq

2020-01-16 Thread Michael Paquier
On Thu, Jan 16, 2020 at 09:56:01AM +0100, Daniel Gustafsson wrote:
> The patch looks fine to me, I don't see an issue with splitting it into a
> refactoring patch and a TLS min/max version patch.

Thanks, committed the refactoring part then.  If the buildfarm breaks
for one reason or another, the window to look at is narrower than if we
had the full set of changes, and the git history is cleaner.  I also
noticed a compilation warning when compiling protocol_openssl.c with
OpenSSL 1.0.2, caused by missing declarations of the two routines; the
header declaration was incorrect.

Could you please rebase and fix the remaining pieces of the patch?
--
Michael




Re: Verify true root on replicas with amcheck

2020-01-16 Thread Peter Geoghegan
On Thu, Jan 9, 2020 at 12:55 AM godjan •  wrote:
> Hi, we have trouble detecting true-root corruption on replicas. I made a
> patch that resolves it by locking the meta page and the potential root page.

What do you mean by true root corruption? What is the cause of the
problem? What symptom does it have in your application?

While I was the one that wrote the existing !readonly/parent check for
the true root (a check which your patch makes work with the regular
bt_check_index() function), I wasn't thinking of any particular
corruption scenario at the time. I wrote the check simply because it
was easy to do so (with a heavyweight ShareLock on the index).

> I heard that amcheck has an invariant of locking no more than one page at
> a time, to avoid deadlocks. Is a deadlock situation possible here?

This is a conservative principle that I came up with when I wrote the
original version of amcheck. It's not strictly necessary, but it
seemed like a good idea. It should be safe to "couple" buffer locks in
a way that matches the B-Tree code -- as long as it is thought through
very carefully. I am probably going to relax the rule for one specific
case soon -- see:

https://postgr.es/m/f7527087-6e95-4077-b964-d2cafef62...@yandex-team.ru

Your patch looks like it gets it right (it won't deadlock with other
sessions that access the metapage), but I hesitate to commit it
without a strong justification. Acquiring multiple buffer locks
concurrently is worth avoiding wherever possible.

--
Peter Geoghegan




Re: row filtering for logical replication

2020-01-16 Thread Euler Taveira
On Thu, Jan 16, 2020 at 6:57 PM, Tomas Vondra
 wrote:
>
> Euler, this patch is still in "waiting on author" since 11/25. Do you
> plan to review changes made by Amit in the patches he submitted, or what
> are your plans with this patch?
>
Yes, I'm working on Amit's suggestions. I'll post a new patch as soon as possible.


-- 
   Euler Taveira                   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support and Training




Re: making the backend's json parser work in frontend code

2020-01-16 Thread Tom Lane
Robert Haas  writes:
> It's possible that we're chasing a real problem here, and if there's
> something we can agree on and get done I'd rather do that than argue,
> but I am still quite suspicious that there's no actually serious
> technical problem here.

It's entirely possible that you're right.  But if this is a file format
that is meant to be exposed to user tools, we need to take a very long
view of the requirements for it.  Five or ten years down the road, we
might be darn glad we spent extra time now.

regards, tom lane




Re: Improve search for missing parent downlinks in amcheck

2020-01-16 Thread Tomas Vondra

On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote:

On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:

The revised patch seems to fix all of the above.


The latest patch is failing to apply.  Please provide a rebase.


This still does not apply (per cputube). Can you provide a fixed patch?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [PATCH] ltree, lquery, and ltxtquery binary protocol support

2020-01-16 Thread Tomas Vondra

On Fri, Nov 29, 2019 at 11:29:03AM +0900, Michael Paquier wrote:

On Mon, Nov 11, 2019 at 03:44:54PM +0100, Nino Floris wrote:

Alright, as usual life got in the way. I've attached a new version of
the patch with pgindent changes.

> What do you think about writing a patch for ALTER TYPE?
I'd rather not :$

> 1) Write a migration script, which directly updates pg_type.
This sounds like the best option right now, if we don't want people to
do manual migrations to ltree 1.2.
How would I best go about this?

> Travis has a small complaint:
Should be fixed!


The latest patch provided fails to apply for me on HEAD.  Please
provide a rebase.  For now I am bumping this patch to next CF with
"waiting on author" as status.


Nino, any plans to submit a rebased/fixed patch, so that people can
review it? Not sure if this needs a simple rebase or something more
complex; all I know is that cputube can't apply it.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: row filtering for logical replication

2020-01-16 Thread Tomas Vondra

On Thu, Nov 28, 2019 at 11:32:01AM +0900, Michael Paquier wrote:

On Mon, Nov 25, 2019 at 11:48:29AM +0900, Amit Langote wrote:

On Mon, Nov 25, 2019 at 11:38 AM Amit Langote  wrote:

Needed to be rebased, which I did, to be able to test them; patches attached.


Oops, really attached this time.


Euler, this thread is waiting for input from you regarding the latest
comments from Amit.


Euler, this patch is still in "waiting on author" since 11/25. Do you
plan to review changes made by Amit in the patches he submitted, or what
are your plans with this patch?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Run-time pruning for ModifyTable

2020-01-16 Thread Tomas Vondra

On Wed, Nov 27, 2019 at 05:17:06PM +0900, Michael Paquier wrote:

On Tue, Nov 05, 2019 at 04:04:25PM +1300, Thomas Munro wrote:

On Thu, Sep 12, 2019 at 10:10 AM Alvaro Herrera
 wrote:
> Here's a rebased version of this patch (it had a trivial conflict).

Hi, FYI partition_prune.sql currently fails (maybe something to do
with commit d52eaa09?):


David, perhaps you did not notice that?  For now I have moved this
patch to next CF waiting on author to look after the failure.

Amit, Kato-san, both of you are marked as reviewers of this patch.
Are you planning to look at it?


David, this patch is marked as "waiting on author" since 11/27, and
there have been no updates or responses since then. Do you plan to
submit a new patch version in this CF? We're already half-way through,
so there's not much time ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: remove separate postgres.(sh)description files

2020-01-16 Thread Tomas Vondra

On Wed, Jan 08, 2020 at 02:33:23PM +0200, Heikki Linnakangas wrote:

On 31/12/2019 02:08, John Naylor wrote:

I'm guessing the initial data for pg_(sh)description is output into
separate files because it was too difficult for the traditional shell
script to maintain enough state to do otherwise.


Yeah, I guess so. The roots of postgres.description go all the way
back to 1997, when not only genbki but also initdb was a shell
script.



With Perl, it's just as easy to assemble the data into the same
format as the rest of the catalogs and then let the generic code path
output it into postgres.bki. The attached patch does that and
simplifies the catalog makefile and initdb.c.
Nice cleanup! Looks like we didn't have any mention of the 
postgres.(sh)description files in the docs, so no doc updates needed. 
Grepping around, there are a few stray references to 
postgres.description still:


$ git grep -r -I postgres.shdescript .
src/backend/catalog/.gitignore:/postgres.shdescription
src/backend/catalog/Makefile:# postgres.bki, postgres.description, postgres.shdescription,
src/tools/msvc/clean.bat:if %DIST%==1 if exist src\backend\catalog\postgres.shdescription del /q src\backend\catalog\postgres.shdescription


Barring objections, I'll remove those too, and commit this.



+1 from me. Let's remove these small RFC patches out of the way.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Recognizing superuser in pg_hba.conf

2020-01-16 Thread Tomas Vondra

Hi,

I see this patch is marked as RFC since 12/30, but there seems to be
quite a lot of discussion about the syntax, keywords and how exactly to
identify the superuser. So I'll switch it back to needs review, which I
think is a better representation of the current state.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: psql - add SHOW_ALL_RESULTS option

2020-01-16 Thread Fabien COELHO




My own vote would be to use the initial patch (after applying any
unrelated changes per later review), ie. add the feature with its
disable button.


I can do that, but not if there is a veto from Tom on the feature.

I wish definite negative opinions from senior committers were expressed 
earlier, so that people do not spend time reviewing dead code and 
developing even deader code.


--
Fabien.




Re: FETCH FIRST clause PERCENT option

2020-01-16 Thread Tomas Vondra

Hi,

This patch is marked as RFC since September. Since then there was no
discussion on this thread, but Andrew proposed an alternative approach
based on window functions in a separate thread [1] (both for this and
for the WITH TIES case).

I'll set this patch back to "needs review" - at this point we need to
decide which of the approaches is the right one.


regards

[1] 
https://www.postgresql.org/message-id/flat/87o8wvz253@news-spur.riddles.org.uk

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: psql - add SHOW_ALL_RESULTS option

2020-01-16 Thread Fabien COELHO



Hello Tom,


This is one of the patches already marked as RFC (since September by
Alvaro). Anyone interested in actually pushing it, so that it does not
fall through to yet another commitfest?


TBH, I think we'd be better off to reject it.  This makes a nontrivial
change in a very long-standing psql behavior, with AFAICS no way to
get back the old semantics.  (The thread title is completely misleading
about that; there's no "option" in the patch as it stands.)


The thread title was not misleading; the initial version of the patch did 
offer an option. Then I was told "the current behavior is stupid (which I 
agree with), let us change it to the sane behavior without an option", and 
now I'm told the contrary. Sigh.


I still have the patch with the option, though.

Sure, in a green field this behavior would likely be more sensible ... 
but that has to be weighed against the fact that it's behaved the way it 
does for a long time, and any existing scripts that are affected by that 
behavior have presumably deliberately chosen to use it.


I cannot imagine many people actually relying on the current insane 
behavior.



I can't imagine that changing this will make very many people happier.
It seems much more likely that people who are affected will be unhappy.

The compatibility issue could be resolved by putting in the option
that I suppose was there at the beginning.


Indeed.

But then we'd have to have a debate about which behavior would be 
default,


The patch kept the current behavior as the default, because people do 
not like change.


and there would still be the question of who would find this to 
be an improvement. If you're chaining together commands with \; then 
it's likely that you are happy with the way it behaves today. 
Certainly there's been no drumbeat of bug reports about it.


Why would there be bug reports if this is a feature? :-)

The behavior has been irritating me for a long time. It is plain stupid to 
be able to send queries but not see their results.


--
Fabien.




Re: making the backend's json parser work in frontend code

2020-01-16 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jan 16, 2020 at 3:11 PM Tom Lane  wrote:
>> Here's a reviewed version of 0001.  You missed fixing the MSVC build,
>> and there were assorted comments and other things referencing wchar.c
>> that needed to be cleaned up.

> Wow, thanks.

Pushed that.

>> Another thing I'm wondering about is if any of the #ifndef FRONTEND
>> code should get moved *back* to src/backend/utils/mb.  But that
>> could be a separate commit, too.

> +1 for moving that stuff to a separate backend-only file.

After a brief look, I propose the following:

* I think we should just shove the "#ifndef FRONTEND" stuff in
wchar.c into mbutils.c.  It doesn't seem worth inventing a whole
new file for that code, especially when it's arguably within the
remit of mbutils.c anyway.

* Let's remove the "#ifndef FRONTEND" restriction on the ICU-related
stuff in encnames.c.  Even if we don't need that stuff in frontend
today, it's hardly unlikely that we will need it tomorrow.  And there's
not that much bulk there anyway.

* The one positive reason for that restriction is the ereport() in
get_encoding_name_for_icu.  We could change that to be the usual
#ifdef-ereport-or-printf dance, but I think there's a better way: put
the ereport at the caller, by redefining that function to return NULL
for an unsupported encoding.  There's only one caller today anyhow.
(A sketch of what I mean follows below.)

* PG_char_to_encoding() and PG_encoding_to_char() can be moved to
mbutils.c; they'd fit reasonably well beside getdatabaseencoding and
pg_client_encoding.  (I also thought about utils/adt/misc.c, but
that's not obviously better.)

Barring objections I'll go make this happen shortly.
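
Concretely, that third point would look about like this.  The lookup
shown is schematic -- the real function will consult encnames.c's
tables -- and the predicate and table names are illustrative:

const char *
get_encoding_name_for_icu(int encoding)
{
	if (!is_encoding_supported_by_icu(encoding))	/* assumed predicate */
		return NULL;
	return icu_encoding_name_tbl[encoding];		/* assumed table */
}

	/* ... and at the sole caller, something like: */
	icu_enc_name = get_encoding_name_for_icu(GetDatabaseEncoding());
	if (!icu_enc_name)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("encoding \"%s\" not supported by ICU",
						pg_encoding_to_char(GetDatabaseEncoding()))));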

>> Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should
>> migrate over to src/include/common.  But that'd be far more invasive
>> to other source files, so I've not touched the issue here.

> I don't have a view on this.

If anyone is hot to do this part, please have at it.  I'm not.

regards, tom lane




Re: pgindent && weirdness

2020-01-16 Thread Alvaro Herrera
On 2020-Jan-17, Thomas Munro wrote:

> On Thu, Jan 16, 2020 at 3:59 PM Thomas Munro  wrote:
> > On Wed, Jan 15, 2020 at 11:30 AM Tom Lane  wrote:
> > > Yeah, it's been doing that for decades.  I think the triggering
> > > factor is the typedef name (Var, here) preceding the &&.
> 
> Here's a better fix:

This is indeed a very good fix!  Several badly formatted sites in our
code are improved with this change.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: psql - add SHOW_ALL_RESULTS option

2020-01-16 Thread Alvaro Herrera
On 2020-Jan-16, Tom Lane wrote:

> The compatibility issue could be resolved by putting in the option
> that I suppose was there at the beginning.  But then we'd have to
> have a debate about which behavior would be default, and there would
> still be the question of who would find this to be an improvement.
> If you're chaining together commands with \; then it's likely that
> you are happy with the way it behaves today.  Certainly there's been
> no drumbeat of bug reports about it.

The patch originally submitted did indeed have the option (defaulting to
"off", that is, the original behavior), and it was removed at request of
reviewers Daniel Vérité, Peter Eisentraut and Kyotaro Horiguchi.

My own opinion is that any scripts that rely heavily on the current
behavior are stepping on shaky ground anyway.  I'm not saying we should
break them on every chance we get -- just that keeping them unharmed is
not necessarily a priority, and that if this patch enables other psql
features, it might be a good step forward.

My own vote would be to use the initial patch (after applying any
unrelated changes per later review), ie. add the feature with its
disable button.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgindent && weirdness

2020-01-16 Thread Thomas Munro
On Thu, Jan 16, 2020 at 3:59 PM Thomas Munro  wrote:
> On Wed, Jan 15, 2020 at 11:30 AM Tom Lane  wrote:
> > Yeah, it's been doing that for decades.  I think the triggering
> > factor is the typedef name (Var, here) preceding the &&.

Here's a better fix:

diff --git a/indent.c b/indent.c
index 9faf57a..51a60a6 100644
--- a/indent.c
+++ b/indent.c
@@ -570,8 +570,11 @@ check_type:
ps.in_or_st = false;/* turn off flag for structure decl or
 * initialization */
}
-   /* parenthesized type following sizeof or offsetof is not a cast */
-   if (ps.keyword == 1 || ps.keyword == 2)
+   /*
+    * parenthesized type following sizeof or offsetof is not a cast;
+    * likewise for function-like macros that take a type
+    */
+   if (ps.keyword == 1 || ps.keyword == 2 || ps.last_token == ident)
ps.not_cast_mask |= 1 << ps.p_l_follow;
break;




Re: psql - add SHOW_ALL_RESULTS option

2020-01-16 Thread Tomas Vondra

On Thu, Jan 16, 2020 at 01:08:16PM -0500, Tom Lane wrote:

Tomas Vondra  writes:

This is one of the patches already marked as RFC (since September by
Alvaro). Anyone interested in actually pushing it, so that it does not
fall through to yet another commitfest?


TBH, I think we'd be better off to reject it.  This makes a nontrivial
change in a very long-standing psql behavior, with AFAICS no way to
get back the old semantics.  (The thread title is completely misleading
about that; there's no "option" in the patch as it stands.)  Sure,
in a green field this behavior would likely be more sensible ... but
that has to be weighed against the fact that it's behaved the way it
does for a long time, and any existing scripts that are affected by
that behavior have presumably deliberately chosen to use it.

I can't imagine that changing this will make very many people happier.
It seems much more likely that people who are affected will be unhappy.

The compatibility issue could be resolved by putting in the option
that I suppose was there at the beginning.  But then we'd have to
have a debate about which behavior would be default, and there would
still be the question of who would find this to be an improvement.
If you're chaining together commands with \; then it's likely that
you are happy with the way it behaves today.  Certainly there's been
no drumbeat of bug reports about it.



I don't know, really, I only pinged this as a CFM who sees a patch
marked as RFC for months ...

The current behavior certainly seems strange/wrong to me - if I send
multiple queries to psql, I'd certainly expect results for all of them,
not just the last one. So the current behavior seems pretty surprising.

I'm unable to make any judgments about risks/benefits of this change. I
can't imagine anyone intentionally relying on the current behavior, so
I'd say the patch is unlikely to break anything (which is not already
broken). But I don't have any data to support this ...

Essentially, I'm just advocating to make a decision - we should either
commit or reject the patch, not just move it to the next commitfest over
and over.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: making the backend's json parser work in frontend code

2020-01-16 Thread Robert Haas
On Thu, Jan 16, 2020 at 1:58 PM David Steele  wrote:
> To do page-level incrementals (which this feature is intended to enable)
> the user will need to be able to associate full and incremental backups
> and the only way I see to do that (currently) is to read the manifests,
> since the prior backup should be stored there.  I think this means that
> parsing the manifest is not really optional -- it will be required to do
> any kind of automation with incrementals.

My current belief is that enabling incremental backup will require
extending the manifest format either not at all or by adding one
additional line with some LSN info.
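
Just to show the scale involved -- the snippet below is invented purely
for illustration and is not the actual manifest format -- the change
might amount to nothing more than the final line here:

File base/13432/2613 8192 2020-01-16T21:04:05Z a1b2c3d4
Start-LSN 0/16000028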

If we could foresee a need to store a bunch of additional *per-file*
details, I'd be a lot more receptive to the argument that we ought to
be using a more structured format like JSON. And it doesn't seem
impossible that such a thing could happen, but I don't think it's at
all clear that it actually will happen, or that it will happen soon
enough that we ought to be worrying about it now.

It's possible that we're chasing a real problem here, and if there's
something we can agree on and get done I'd rather do that than argue,
but I am still quite suspicious that there's no actually serious
technical problem here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: our checks for read-only queries are not great

2020-01-16 Thread Robert Haas
On Thu, Jan 16, 2020 at 12:22 PM Stephen Frost  wrote:
> I think I agree with you regarding the original intent, though even
> there, as discussed elsewhere, it seems like there's perhaps either a
> bug or a disagreement about the specifics of what that means when it
> relates to committing a 2-phase transaction.  Still, setting that aside
> for the moment, do we feel like this is enough to be able to update our
> documentation with?

I think that would be possible. What did you have in mind?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: making the backend's json parser work in frontend code

2020-01-16 Thread Robert Haas
On Thu, Jan 16, 2020 at 3:11 PM Tom Lane  wrote:
> Robert Haas  writes:
> > 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
> > missing something, this seems like an overdue cleanup.
>
> Here's a reviewed version of 0001.  You missed fixing the MSVC build,
> and there were assorted comments and other things referencing wchar.c
> that needed to be cleaned up.

Wow, thanks.

> Also, it seemed to me that if we are going to move wchar.c, we should
> also move encnames.c, so that libpq can get fully out of the
> symlinking-source-files business.  It makes initdb less weird too.

OK.

> I took the liberty of sticking proper copyright headers onto these
> two files, too.  (This makes the diff a lot more bulky :-(.  Would
> it help to add the headers in a separate commit?)

I wouldn't bother making it a separate commit, but please do whatever you like.

> Another thing I'm wondering about is if any of the #ifndef FRONTEND
> code should get moved *back* to src/backend/utils/mb.  But that
> could be a separate commit, too.

+1 for moving that stuff to a separate backend-only file.

> Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should
> migrate over to src/include/common.  But that'd be far more invasive
> to other source files, so I've not touched the issue here.

I don't have a view on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Tomas Vondra

On Thu, Jan 16, 2020 at 03:15:41PM -0500, Tom Lane wrote:

Tomas Vondra  writes:

I think the one possible argument against this approach might be that it
adds a field to the struct, so if you have an extension using a Slab
context, it'll break if you don't rebuild it. But that only matters if
we want to backpatch it (which I think is not the plan) and with memory
context checking enabled (which does not apply to regular packages).


Huh?  That struct is private in slab.c, no?  Any outside code relying
on its contents deserves to break.



Ah, right. Silly me.


I do think we ought to back-patch this, given the horrible results
Andres showed.



OK.

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Tom Lane
Tomas Vondra  writes:
> I think the one possible argument against this approach might be that it
> adds a field to the struct, so if you have an extension using a Slab
> context, it'll break if you don't rebuild it. But that only matters if
> we want to backpatch it (which I think is not the plan) and with memory
> context checking enabled (which does not apply to regular packages).

Huh?  That struct is private in slab.c, no?  Any outside code relying
on its contents deserves to break.

I do think we ought to back-patch this, given the horrible results
Andres showed.

regards, tom lane




Re: making the backend's json parser work in frontend code

2020-01-16 Thread Tom Lane
Robert Haas  writes:
> 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
> missing something, this seems like an overdue cleanup.

Here's a reviewed version of 0001.  You missed fixing the MSVC build,
and there were assorted comments and other things referencing wchar.c
that needed to be cleaned up.

Also, it seemed to me that if we are going to move wchar.c, we should
also move encnames.c, so that libpq can get fully out of the
symlinking-source-files business.  It makes initdb less weird too.

I took the liberty of sticking proper copyright headers onto these
two files, too.  (This makes the diff a lot more bulky :-(.  Would
it help to add the headers in a separate commit?)

Another thing I'm wondering about is if any of the #ifndef FRONTEND
code should get moved *back* to src/backend/utils/mb.  But that
could be a separate commit, too.

Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should
migrate over to src/include/common.  But that'd be far more invasive
to other source files, so I've not touched the issue here.

regards, tom lane

diff --git a/src/backend/utils/mb/Makefile b/src/backend/utils/mb/Makefile
index cd4a016..b19a125 100644
--- a/src/backend/utils/mb/Makefile
+++ b/src/backend/utils/mb/Makefile
@@ -14,10 +14,8 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	conv.o \
-	encnames.o \
 	mbutils.o \
 	stringinfo_mb.o \
-	wchar.o \
 	wstrcmp.o \
 	wstrncmp.o
 
diff --git a/src/backend/utils/mb/README b/src/backend/utils/mb/README
index 7495ca5..ef36626 100644
--- a/src/backend/utils/mb/README
+++ b/src/backend/utils/mb/README
@@ -3,12 +3,8 @@ src/backend/utils/mb/README
 Encodings
 =========
 
-encnames.c:	public functions for both the backend and the frontend.
 conv.c:		static functions and a public table for code conversion
-wchar.c:	mostly static functions and a public table for mb string and
-		multibyte conversion
 mbutils.c:	public functions for the backend only.
-		requires conv.c and wchar.c
 stringinfo_mb.c: public backend-only multibyte-aware stringinfo functions
 wstrcmp.c:	strcmp for mb
 wstrncmp.c:	strncmp for mb
@@ -16,6 +12,12 @@ win866.c:	a tool to generate KOI8 <--> CP866 conversion table
 iso.c:		a tool to generate KOI8 <--> ISO8859-5 conversion table
 win1251.c:	a tool to generate KOI8 <--> CP1251 conversion table
 
+See also in src/common/:
+
+encnames.c:	public functions for encoding names
+wchar.c:	mostly static functions and a public table for mb string and
+		multibyte conversion
+
 Introduction
 
 	http://www.cprogramming.com/tutorial/unicode.html
diff --git a/src/backend/utils/mb/encnames.c b/src/backend/utils/mb/encnames.c
deleted file mode 100644
index 12b61cd..000
--- a/src/backend/utils/mb/encnames.c
+++ /dev/null
@@ -1,629 +0,0 @@
-/*
- * Encoding names and routines for work with it. All
- * in this file is shared between FE and BE.
- *
- * src/backend/utils/mb/encnames.c
- */
-#ifdef FRONTEND
-#include "postgres_fe.h"
-#else
-#include "postgres.h"
-#include "utils/builtins.h"
-#endif
-
-#include 
-#include 
-
-#include "mb/pg_wchar.h"
-
-
-/* --
- * All encoding names, sorted:		 *** A L P H A B E T I C ***
- *
- * All names must be without irrelevant chars, search routines use
- * isalnum() chars only. It means ISO-8859-1, iso_8859-1 and Iso8859_1
- * are always converted to 'iso88591'. All must be lower case.
- *
- * The table doesn't contain 'cs' aliases (like csISOLatin1). It's needed?
- *
- * Karel Zak, Aug 2001
- * --
- */
-typedef struct pg_encname
-{
-	const char *name;
-	pg_enc		encoding;
-} pg_encname;
-
-static const pg_encname pg_encname_tbl[] =
-{
-	{
-		"abc", PG_WIN1258
-	},			/* alias for WIN1258 */
-	{
-		"alt", PG_WIN866
-	},			/* IBM866 */
-	{
-		"big5", PG_BIG5
-	},			/* Big5; Chinese for Taiwan multibyte set */
-	{
-		"euccn", PG_EUC_CN
-	},			/* EUC-CN; Extended Unix Code for simplified
- * Chinese */
-	{
-		"eucjis2004", PG_EUC_JIS_2004
-	},			/* EUC-JIS-2004; Extended UNIX Code fixed
- * Width for Japanese, standard JIS X 0213 */
-	{
-		"eucjp", PG_EUC_JP
-	},			/* EUC-JP; Extended UNIX Code fixed Width for
- * Japanese, standard OSF */
-	{
-		"euckr", PG_EUC_KR
-	},			/* EUC-KR; Extended Unix Code for Korean , KS
- * X 1001 standard */
-	{
-		"euctw", PG_EUC_TW
-	},			/* EUC-TW; Extended Unix Code for
- *
- * traditional Chinese */
-	{
-		"gb18030", PG_GB18030
-	},			/* GB18030;GB18030 */
-	{
-		"gbk", PG_GBK
-	},			/* GBK; Chinese Windows CodePage 936
- * simplified Chinese */
-	{
-		"iso88591", PG_LATIN1
-	},			/* ISO-8859-1; RFC1345,KXS2 */
-	{
-		"iso885910", PG_LATIN6
-	},			/* ISO-8859-10; RFC1345,KXS2 */
-	{
-		"iso885913", PG_LATIN7
-	},			/* ISO-8859-13; RFC1345,KXS2 */
-	{
-		"iso885914", PG_LATIN8
-	},			/* ISO-8859-14; RFC1345,KXS2 */
-	{
-		"iso885915", PG_LATIN9
-	},			/* ISO-8859-15; RFC1345,KXS2 */
-	{
-		

Re: making the backend's json parser work in frontend code

2020-01-16 Thread David Steele

On 1/16/20 12:26 PM, Andres Freund wrote:

Hi,

On 2020-01-16 14:20:28 -0500, Tom Lane wrote:

David Steele  writes:

The way we handle this in pgBackRest is to put a TRY ... CATCH block in
main() to log and exit on any uncaught THROW.  That seems like a
reasonable way to start here.  Without memory contexts that almost
certainly will mean memory leaks but I'm not sure how much that matters
if the action is to exit immediately.


If that's the expectation, we might as well replace backend ereport(ERROR)
with something that just prints a message and does exit(1).


Well, the process might still want to do some cleanup of half-finished
work. You'd not need to be resistant against memory leaks to do so, if
followed by an exit. Obviously you can also do all the necessary
cleanup from within the ereport(ERROR) itself, but that doesn't seem
appealing to me (not composable, harder to reuse for other programs,
etc).


In pgBackRest we have a default handler that just logs the message to 
stderr and exits (though we consider it a coding error if it gets 
called).  Seems like we could do the same here.  Default message and 
exit if no handler, but optionally allow a handler (which could RETHROW 
to get to the default handler afterwards).
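
As a minimal sketch of that scheme -- every name below is invented for
illustration; this is not the pgBackRest implementation:

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf fe_exception_buf;
static const char *fe_exception_msg = "unknown error";

#define FE_TRY()	(setjmp(fe_exception_buf) == 0)
#define FE_THROW(msg) \
	do { fe_exception_msg = (msg); longjmp(fe_exception_buf, 1); } while (0)

static void
do_frontend_work(void)
{
	/* any code here may FE_THROW() on failure */
	FE_THROW("could not parse manifest");
}

int
main(int argc, char **argv)
{
	if (FE_TRY())
		do_frontend_work();
	else
	{
		/* default handler: log the uncaught throw and exit */
		fprintf(stderr, "error: %s\n", fe_exception_msg);
		exit(1);
	}
	return 0;
}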


It seems like we've been wanting a frontend version of ereport() for a 
while, so I'll take a look at that and see what it involves.


Regards,
--
-David
da...@pgmasters.net




Re: Enabling B-Tree deduplication by default

2020-01-16 Thread Peter Geoghegan
On Thu, Jan 16, 2020 at 10:55 AM Robert Haas  wrote:
> On Wed, Jan 15, 2020 at 6:38 PM Peter Geoghegan  wrote:
> > There are some outstanding questions about how B-Tree deduplication
> > [1] should be configured, and whether or not it should be enabled by
> > default. I'm starting this new thread in the hopes of generating
> > discussion on these high level questions.
>
> It seems like the issue here is that you're pretty confident that
> deduplication will be a win for unique indexes, but not so confident
> that this will be true for non-unique indexes.

Right.

> I don't know that I understand why.

The main reason that I am confident about unique indexes is that we
only do a deduplication pass in a unique index when we observe that
the incoming tuple (the one that might end up splitting the page) is a
duplicate of some existing tuple. Checking that much is virtually
free, since we already have the information close at hand today (we
cache the _bt_check_unique() binary search bounds for reuse within
_bt_findinsertloc() today). This seems to be an excellent heuristic,
since we really only want to target unique index leaf pages where all
or almost all insertions must be duplicates caused by non-HOT updates
-- this category includes all the pgbench indexes, and includes all of
the unique indexes in TPC-C. Whereas with non-unique indexes, we
aren't specifically targeting version churn (though it will help with
that too).

Granted, the fact that the incoming tuple happens to be a duplicate is
not a sure sign that the index is in this informal "suitable for
deduplication" category of mine. The incoming duplicate could just be
a once off. Even still, it's extremely unlikely to matter -- a failed
deduplication pass really isn't that expensive anyway, since it takes
place just before we split the page (we'll need the page in L1 cache
anyway). If we consistently attempt deduplication in a unique index,
then we're virtually guaranteed to consistently benefit from it.
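
In pseudocode-level terms, the decision point looks about like this;
"found_duplicate" and the dedup function name are stand-ins, not the
patch's actual code:

	if (PageGetFreeSpace(page) < insertstate->itemsz)
	{
		bool		try_dedup;

		if (checkingunique)
			try_dedup = insertstate->found_duplicate;	/* assumed field */
		else
			try_dedup = true;	/* non-unique index: always attempt it */

		if (try_dedup)
			_bt_dedup_one_page(rel, buf, insertstate->itemsz);	/* assumed name */

		if (PageGetFreeSpace(page) < insertstate->itemsz)
			must_split = true;	/* deduplication didn't free enough space */
	}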

In general, the way that deduplication is only considered at the point
where we'd otherwise have to split the page buys *a lot*. The idea of
delaying page splits by doing something like load balancing or
compression in a lazy fashion has a long history -- it was not my
idea. I'm not talking about the LP_DEAD bit set deletion stuff here --
this goes back to the 1970s.

> It does seem odd to me to treat them differently, but it's possible
> that this is a reflection of my own lack of understanding. What do
> other database systems do?

Other database systems treat unique indexes very differently, albeit
in a way that we're not really in a position to take too much away
from -- other than the general fact that unique indexes can be thought
of as very different things.

In general, the unique indexes in other systems are expected to be
unique in every sense, even during an "UPDATE foo SET unique_key =
unique_key + 1" style query. Index tuples are slightly smaller in a
unique index compared to an equivalent non-unique index in the case of
one such system. Also, that same system has something called a "unique
index scan" that can only be used with a unique index (and only when
all columns appear in the query qual).

> I wonder whether we could avoid the downside of having only one
> LP_DEAD bit per line pointer by having a bit per TID within the
> compressed tuples themselves. I assume you already thought about that,
> though.

So far, this lack of LP_DEAD bit granularity is only a theoretical
problem; I haven't been able to demonstrate it in any meaningful way.
Setting LP_DEAD bits is bound to be correlated, and we only deduplicate
to avoid a page split.

Just last night I tried a variant pgbench workload with a tiny
accounts table, an extremely skewed Zipf distribution, and lots of
clients relative to the size of the machine. I used a non-unique index
instead of a unique index, since that is likely to be where the patch
was weakest (no _bt_check_unique() LP_DEAD bit setting that way). The
patch still came out ahead of the master branch by about 3%. It's very
hard to prove that there is no real downside to having only one
LP_DEAD bit per posting list tuple, since absence of evidence isn't
evidence of absence. I believe that it's much easier to make the
argument that it's okay to have one LP_DEAD bit per posting list
within unique indexes specifically, though (because we understand that
there can be no duplicates in the long run there).

Throughout this work, and the v12 B-Tree work, I consistently made
conservative decisions about space accounting in code like
nbtsplitloc.c (the new nbtdedup.c code has to think about space in
about the same way). My intuition is that space accounting is one area
where we really ought to be conservative, since it's so hard to test.
That's the main reason why I find the idea of having LP_DEAD bits
within posting list tuples unappealing, whatever the benefits may be
-- it adds complexity in the one area that I really 

Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Tomas Vondra

On Thu, Jan 16, 2020 at 12:33:03PM -0500, Tom Lane wrote:

Tomas Vondra  writes:

On Thu, Jan 16, 2020 at 08:48:49AM -0800, Andres Freund wrote:

I don't get why it's advantageous to allocate this once for each slab,
rather than having it as a global once for all slabs. But anyway, still
clearly better than the current situation.



It's largely a matter of personal preference - I agree there are cases
when global variables are the best solution, but I kinda dislike them.
It seems cleaner to just allocate it as part of the slab, not having to
deal with different numbers of chunks per block between slabs.
Plus we don't have all that many slabs (like 2), and it's only really
used in debug builds anyway. So I'm not all that worried about this
wasting a couple extra kB of memory.


A positive argument for doing it like this is that the overhead goes away
when the SlabContexts are all deallocated, while a global variable would
presumably stick around indefinitely.  But I concur that in current usage,
there's hardly any point in worrying about the relative benefits.  We
should just keep it simple, and this seems marginally simpler than the
other way.



I think the one possible argument against this approach might be that it
adds a field to the struct, so if you have an extension using a Slab
context, it'll break if you don't rebuild it. But that only matters if
we want to backpatch it (which I think is not the plan) and with memory
context checking enabled (which does not apply to regular packages).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Patch to document base64 encoding

2020-01-16 Thread Alvaro Herrera
I just wanted to throw this in the archives; this doesn't need to affect
your patch.

Because of how the new tables look in the PDF docs, I thought it might
be a good time to research how to make each function-entry occupy two
rows: one for prototype, return type and description, and the other for
the example and its result.  Below is a first cut of how you'd implement
that idea -- see colspec/spanspec/spanname ... only the output looks
almost as bad (though the benefit is that it doesn't overwrite cell
contents anymore).

I think we have two choices.  One is to figure out how to make this
work (ie. make it pretty; maybe by using alternate cell backgrounds, say
one white and one very light gray; maybe by using thinner/thicker
inter-cell lines); the other is to forget tables altogether and format
the info in some completely different way.

   
Binary/String Conversion Functions








 
  
   Function
   Return Type
   Description
  
  
   Example
   Result
  
 

 
   

 
  convert_from
 
 convert_from(bytes bytea, src_encoding name)

text

 Convert binary string to the database encoding.  The original encoding
 is specified by src_encoding. The
 bytes must be valid in this encoding.  See
  for available conversions.

   
   
convert_from('text_in_utf8', 'UTF8')
text_in_utf8 represented in the current database encoding
   


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: making the backend's json parser work in frontend code

2020-01-16 Thread Andres Freund
Hi,

On 2020-01-16 14:20:28 -0500, Tom Lane wrote:
> David Steele  writes:
> > The way we handle this in pgBackRest is to put a TRY ... CATCH block in
> > main() to log and exit on any uncaught THROW.  That seems like a
> > reasonable way to start here.  Without memory contexts that almost
> > certainly will mean memory leaks but I'm not sure how much that matters
> > if the action is to exit immediately.
>
> If that's the expectation, we might as well replace backend ereport(ERROR)
> with something that just prints a message and does exit(1).

Well, the process might still want to do some cleanup of half-finished
work. You'd not need to be resistant against memory leaks to do so, if
followed by an exit. Obviously you can also do all the necessary
cleanup from within the ereport(ERROR) itself, but that doesn't seem
appealing to me (not composable, harder to reuse for other programs,
etc).

Greetings,

Andres Freund




Re: making the backend's json parser work in frontend code

2020-01-16 Thread Tom Lane
David Steele  writes:
> On 1/15/20 7:39 PM, Robert Haas wrote:
>>> I agree that it's far from obvious that the hacks in the patch are
>>> best; to the contrary, they are hacks. That said, I feel that the
>>> semantics of throwing an error are not very well-defined in a
>>> front-end environment. I mean, in a backend context, throwing an error
>>> is going to abort the current transaction, with all that this implies.
>>> If the frontend equivalent is to do nothing and hope for the best, I
>>> doubt it will survive anything more than the simplest use cases. This
>>> is one of the reasons I've been very reluctant to go do down this
>>> whole path in the first place.

> The way we handle this in pgBackRest is to put a TRY ... CATCH block in 
> main() to log and exit on any uncaught THROW.  That seems like a 
> reasonable way to start here.  Without memory contexts that almost 
> certainly will mean memory leaks but I'm not sure how much that matters 
> if the action is to exit immediately.

If that's the expectation, we might as well replace backend ereport(ERROR)
with something that just prints a message and does exit(1).

The question comes down to whether there are use-cases where a frontend
application would really want to recover and continue processing after
a JSON syntax problem.  I'm not seeing that that's a near-term
requirement, so maybe we could leave it for somebody to solve when
and if they want to do it.

regards, tom lane




Re: making the backend's json parser work in frontend code

2020-01-16 Thread David Steele

On 1/15/20 7:39 PM, Robert Haas wrote:
> On Wed, Jan 15, 2020 at 6:40 PM Andres Freund  wrote:
>> It's not obvious why the better approach here wouldn't be to just have a
>> very simple ereport replacement, that needs to be explicitly included
>> from frontend code. It'd not be meaningfully harder, imo, and it'd
>> require fewer adaptions, and it'd look more familiar.
>
> I agree that it's far from obvious that the hacks in the patch are
> best; to the contrary, they are hacks. That said, I feel that the
> semantics of throwing an error are not very well-defined in a
> front-end environment. I mean, in a backend context, throwing an error
> is going to abort the current transaction, with all that this implies.
> If the frontend equivalent is to do nothing and hope for the best, I
> doubt it will survive anything more than the simplest use cases. This
> is one of the reasons I've been very reluctant to go do down this
> whole path in the first place.

The way we handle this in pgBackRest is to put a TRY ... CATCH block in 
main() to log and exit on any uncaught THROW.  That seems like a 
reasonable way to start here.  Without memory contexts that almost 
certainly will mean memory leaks but I'm not sure how much that matters 
if the action is to exit immediately.


Regards,
--
-David
da...@pgmasters.net




Re: making the backend's json parser work in frontend code

2020-01-16 Thread David Steele

On 1/15/20 4:40 PM, Andres Freund wrote:
>
> I'm not sure where I come down between using json and a simple ad-hoc
> format, when the dependency for the former is making the existing json
> parser work in the frontend. But if the alternative is to add a second
> json parser, it very clearly shifts towards using an ad-hoc
> format. Having to maintain a simple ad-hoc parser is a lot less
> technical debt than having a second full blown json parser.

Maybe at first, but it will grow and become more complex as new features 
are added.  This has been our experience with pgBackRest, at least.


> Imo even
> when an external project or three also has to have that simple parser.

I don't agree here.  Especially if we outgrow the format and they need 
two parsers, depending on the version of PostgreSQL.


To do page-level incrementals (which this feature is intended to enable), 
the user will need to be able to associate full and incremental backups, 
and the only way I see to do that (currently) is to read the manifests, 
since the prior backup should be stored there.  I think this means that 
parsing the manifest is not really optional -- it will be required to do 
any kind of automation with incrementals.


It's easy enough for a tool like pgBackRest to do something like that, 
much harder for a user hacking together a tool in bash based on 
pg_basebackup.


> If the alternative were to use that newly proposed json parser to
> *replace* the backend one too, the story would again be different.

That was certainly not my intention.

Regards,
--
-David
da...@pgmasters.net




Re: Enabling B-Tree deduplication by default

2020-01-16 Thread Robert Haas
On Wed, Jan 15, 2020 at 6:38 PM Peter Geoghegan  wrote:
> There are some outstanding questions about how B-Tree deduplication
> [1] should be configured, and whether or not it should be enabled by
> default. I'm starting this new thread in the hopes of generating
> discussion on these high level questions.

It seems like the issue here is that you're pretty confident that
deduplication will be a win for unique indexes, but not so confident
that this will be true for non-unique indexes. I don't know that I
understand why.

It does seem odd to me to treat them differently, but it's possible
that this is a reflection of my own lack of understanding. What do
other database systems do?

I wonder whether we could avoid the downside of having only one
LP_DEAD bit per line pointer by having a bit per TID within the
compressed tuples themselves. I assume you already thought about that,
though.

What are the characteristics of this system if you have an index that
is not declared as UNIQUE but actually happens to be UNIQUE?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: making the backend's json parser work in frontend code

2020-01-16 Thread Robert Haas
On Thu, Jan 16, 2020 at 1:37 PM David Steele  wrote:
> I was starting to wonder if it wouldn't be simpler to go back to the
> Postgres JSON parser and see if we can adapt it.  I'm not sure that it
> *is* simpler, but it would almost certainly be more acceptable.

That is my feeling also.

> So the idea here is that json.c will have the JSON SQL functions,
> jsonb.c the JSONB SQL functions, and jsonapi.c the parser, and
> jsonfuncs.c the utility functions?

Uh, I think roughly that, yes. Although I can't claim to fully
understand everything that's here.

> This seems like a good first step.  I wonder if the remainder of the SQL
> json/jsonb functions should be moved to json.c/jsonb.c respectively?
>
> That does represent a lot of code churn though, so perhaps not worth it.

I don't have an opinion on this right now.

> Well, with the caveat that it doesn't work, it's less than I expected.
>
> Obviously ereport() is a pretty big deal and I agree with Michael
> downthread that we should port this to the frontend code.

Another possibly-attractive option would be to defer throwing the
error: i.e. set some flags in the lex or parse state or something, and
then just return. The caller notices the flags and has enough
information to throw an error or whatever it wants to do. The reason I
think this might be attractive is that it dodges the whole question of
what exactly throwing an error is supposed to do in a world without
transactions, memory contexts, resource owners, etc. However, it has
some pitfalls of its own, like maybe being too much code churn or
hurting performance in non-error cases.
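
As a rough illustration of that approach (hypothetical names; this is not
the actual jsonapi.h interface), the lexer would record the failure and
each environment would report it its own way:

	typedef enum JsonParseError
	{
		JSON_PARSE_OK,
		JSON_PARSE_INVALID_TOKEN,
		JSON_PARSE_UNEXPECTED_END
	} JsonParseError;

	typedef struct JsonLexState
	{
		const char *input;		/* string being lexed */
		int			position;	/* byte offset of last token */
		JsonParseError errcode; /* set on failure instead of throwing */
	} JsonLexState;

	/* backend caller */
	if (lex.errcode != JSON_PARSE_OK)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid JSON at offset %d", lex.position)));

	/* frontend caller */
	if (lex.errcode != JSON_PARSE_OK)
	{
		fprintf(stderr, "invalid JSON at offset %d\n", lex.position);
		exit(1);
	}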

> It would also be nice to unify functions like PQmblen() and pg_mblen()
> if possible.

I don't see how to do that at the moment, but I agree that it would be
nice if we can figure it out.

> The next question in my mind is given the caveat that the error handling
> is questionable in the front end, can we at least render/parse valid
> JSON with the code?

That's a real good question. Thanks for offering to test it; I think
that would be very helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Thoughts on "killed tuples" index hint bits support on standby

2020-01-16 Thread Peter Geoghegan
On Thu, Jan 16, 2020 at 9:54 AM Andres Freund  wrote:
> I don't think we can rely on hot_standby_feedback at all. We can use it to
> avoid unnecessary cancellations, etc, and even assume it's set up
> reasonably for some configurations, but there always needs to be an
> independent correctness backstop.

+1

> I'm less clear on how we can make sure that we can *rely* on LP_DEAD to
> skip over entries during scans, however. The bits set as described above
> would be safe, but we also can see LP_DEAD set by the primary (and even
> upstream cascading standbys at least in case of new base backups taken
> from them), due to them not being WAL logged. As we don't WAL log,
> there is no conflict associated with the LP_DEADs being set.  My gut
> feeling is that it's going to be very hard to get around this, without
> adding WAL logging for _bt_killitems et al (including an interface for
> kill_prior_tuple to report the used horizon to the index).

I agree.

What about calling _bt_vacuum_one_page() more often than strictly
necessary to avoid a page split on the primary? The B-Tree
deduplication patch sometimes does that, albeit for completely
unrelated reasons. (We don't want to have to unset an LP_DEAD bit in
the case when a new/incoming duplicate tuple has a TID that overlaps
with the posting list range of some existing duplicate posting list
tuple.)

I have no idea how you'd determine that it was time to call
_bt_vacuum_one_page(). Seems worth considering.

> I'm wondering if we could recycle BTPageOpaqueData.xact to store the
> horizon applying to killed tuples on the page. We don't need to store
> the level for leaf pages, because we have BTP_LEAF, so we could make
> space for that (potentially signalled by a new BTP flag).  Obviously we
> have to be careful with storing xids in the index, due to potential
> wraparound danger - but I think such page would have to be vacuumed
> anyway, before a potential wraparound.

You would think that, but unfortunately we don't currently do it that
way. We store XIDs in deleted leaf pages that can sometimes be missed
until the next wraparound.

We need to do something like commit
6655a7299d835dea9e8e0ba69cc5284611b96f29, but for B-Tree. It's
somewhere on my TODO list.

> I think we could safely unset
> the xid during nbtree single page cleanup, and vacuum, by making sure no
> LP_DEAD entries survive, and by including the horizon in the generated
> WAL record.
>
> That however still doesn't really fully allow us to set LP_DEAD on
> standbys, however - but it'd allow us to take the primary's LP_DEADs
> into account on a standby. I think we'd run into torn page issues, if we
> were to do so without WAL logging, because we'd rely on the LP_DEAD bits
> and BTPageOpaqueData.xact to be in sync.  I *think* we might be safe to
> do so *iff* the page's LSN indicates that there has been a WAL record
> covering it since the last redo location.

That sounds like a huge mess.

--
Peter Geoghegan




Re: making the backend's json parser work in frontend code

2020-01-16 Thread David Steele

Hi Robert,

On 1/16/20 11:37 AM, David Steele wrote:


The next question in my mind is given the caveat that the error handling 
is questionable in the front end, can we at least render/parse valid 
JSON with the code?


Hrm, this bit was from an earlier edit.  I meant:

The next question in my mind is what will it take to get this working in 
a limited form so we can at least prototype it with pg_basebackup.  I 
can hack on this with some static strings in front end code tomorrow to 
see what works and what doesn't if that makes sense.


Regards,
--
-David
da...@pgmasters.net




Re: making the backend's json parser work in frontend code

2020-01-16 Thread David Steele

Hi Robert,

On 1/15/20 2:02 PM, Robert Haas wrote:
> The discussion on the backup manifest thread has gotten bogged down on
> the issue of the format that should be used to store the backup
> manifest file. I want something simple and ad-hoc; David Steele and
> Stephen Frost prefer JSON. That is problematic because our JSON parser
> does not work in frontend code, and I want to be able to validate a
> backup against its manifest, which involves being able to parse the
> manifest from frontend code. The latest development over there is that
> David Steele has posted the JSON parser that he wrote for pgbackrest
> with an offer to try to adapt it for use in front-end PostgreSQL code,
> an offer which I genuinely appreciate. I'll write more about that over
> on that thread. However, I decided to spend today doing some further
> investigation of an alternative approach, namely making the backend's
> existing JSON parser work in frontend code as well. I did not solve
> all the problems there, but I did come up with some patches which I
> think would be worth committing on independent grounds, and I think
> the whole series is worth posting. So here goes.

I was starting to wonder if it wouldn't be simpler to go back to the 
Postgres JSON parser and see if we can adapt it.  I'm not sure that it 
*is* simpler, but it would almost certainly be more acceptable.


> 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm
> missing something, this seems like an overdue cleanup. It's long been
> the case that wchar.c is actually compiled and linked into both
> frontend and backend code. Commit
> 60f11b87a2349985230c08616fa8a34ffde934c8 added code into src/common
> that depends on wchar.c being available, but didn't actually make
> wchar.c part of src/common, which seems like an odd decision: the
> functions in the library are dependent on code that is not part of any
> library but whose source files get copied around where needed. Eh?

This looks like an obvious improvement to me.

> 0002 does some basic header cleanup to make it possible to include the
> existing header file jsonapi.h in frontend code. The state of the JSON
> headers today looks generally poor. There seems not to have been much
> attempt to get the prototypes for a given source file, say foo.c, into
> a header file with the same name, say foo.h. Also, dependencies
> between various header files seem to be have added somewhat freely.
> This patch does not come close to fixing all that, but I consider it a
> modest down payment on a cleanup that probably ought to be taken
> further.

Agreed that these header files are fairly disorganized.  In general the 
names json, jsonapi, jsonfuncs don't tell me a whole lot.  I feel like 
I'd want to include json.h to get a json parser but it only contains one 
utility function before these patches.  I can see that json.c primarily 
contains SQL functions so that's why.


So the idea here is that json.c will have the JSON SQL functions, 
jsonb.c the JSONB SQL functions, jsonapi.c the parser, and 
jsonfuncs.c the utility functions?


> 0003 splits json.c into two files, json.c and jsonapi.c. All the
> lexing and parsing stuff (whose prototypes are in jsonapi.h) goes into
> jsonapi.c, while the stuff that pertains to the 'json' data type
> remains in json.c. This also seems like a good cleanup, because to me,
> at least, it's not a great idea to mix together code that is used by
> both the json and jsonb data types as well as other things in the
> system that want to generate or parse json together with things that
> are specific to the 'json' data type.

This seems like a good first step.  I wonder if the remainder of the SQL 
json/jsonb functions should be moved to json.c/jsonb.c respectively?


That does represent a lot of code churn though, so perhaps not worth it.

> As far as I know all three of the above patches are committable as-is;
> review and contrary opinions welcome.

Agreed, with some questions as above.

> On the other hand, 0004, 0005, and 0006 are charitably described as
> experimental or WIP.  0004 and 0005 hack up jsonapi.c so that it can
> still be compiled even if #include "postgres.h" is changed to #include
> "postgres-fe.h" and 0006 moves it into src/common. Note that I say
> that they make it compile, not work. It's not just untested; it's
> definitely broken. But it gives a feeling for what the remaining
> obstacles to making this code available in a frontend environment are.
> Since I wrote my very first email complaining about the difficulty of
> making the backend's JSON parser work in a frontend environment, one
> obstacle has been knocked down: StringInfo is now available in
> front-end code (commit 26aaf97b683d6258c098859e6b1268e1f5da242f). The
> remaining problems (that I know about) have to do with error reporting
> and multibyte character support; a read of the patches is suggested
> for those wanting further details.

Well, with the caveat that it doesn't work, it's less than I expected.

Re: psql - add SHOW_ALL_RESULTS option

2020-01-16 Thread Tom Lane
Tomas Vondra  writes:
> This is one of the patches already marked as RFC (since September by
> Alvaro). Anyone interested in actually pushing it, so that it does not
> fall through to yet another commitfest?

TBH, I think we'd be better off to reject it.  This makes a nontrivial
change in a very long-standing psql behavior, with AFAICS no way to
get back the old semantics.  (The thread title is completely misleading
about that; there's no "option" in the patch as it stands.)  Sure,
in a green field this behavior would likely be more sensible ... but
that has to be weighed against the fact that it's behaved the way it
does for a long time, and any existing scripts that are affected by
that behavior have presumably deliberately chosen to use it.

I can't imagine that changing this will make very many people happier.
It seems much more likely that people who are affected will be unhappy.

The compatibility issue could be resolved by putting in the option
that I suppose was there at the beginning.  But then we'd have to
have a debate about which behavior would be default, and there would
still be the question of who would find this to be an improvement.
If you're chaining together commands with \; then it's likely that
you are happy with the way it behaves today.  Certainly there's been
no drumbeat of bug reports about it.

regards, tom lane




Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Andres Freund
Hi,

On 2020-01-16 18:01:53 +0100, Tomas Vondra wrote:
> Plus we don't have all that many slabs (like 2)

FWIW, I have a local patch that adds additional ones, for the relcache
and catcache, that's how I noticed the leak. Because a test pgbench
absolutely *tanked* in performance.

Just for giggles. With leak:
pgbench -n -M prepared -P1 -c20 -j20 -T6000 -S
progress: 1.0 s, 81689.4 tps, lat 0.242 ms stddev 0.087
progress: 2.0 s, 51228.5 tps, lat 0.390 ms stddev 0.107
progress: 3.0 s, 42297.4 tps, lat 0.473 ms stddev 0.141
progress: 4.0 s, 34885.9 tps, lat 0.573 ms stddev 0.171
progress: 5.0 s, 31211.2 tps, lat 0.640 ms stddev 0.182
progress: 6.0 s, 27307.9 tps, lat 0.732 ms stddev 0.216
progress: 7.0 s, 25698.9 tps, lat 0.778 ms stddev 0.228

without:
pgbench -n -M prepared -P1 -c20 -j20 -T6000 -S
progress: 1.0 s, 144119.1 tps, lat 0.137 ms stddev 0.047
progress: 2.0 s, 148092.8 tps, lat 0.135 ms stddev 0.039
progress: 3.0 s, 148757.0 tps, lat 0.134 ms stddev 0.032
progress: 4.0 s, 148553.7 tps, lat 0.134 ms stddev 0.038

I do find the size of the impact quite impressive. It's all due to the
TopMemoryContext's AllocSetCheck() taking longer and longer.


> and it's only really used in debug builds anyway. So I'm not all that
> woried about this wasting a couple extra kB of memory.

IDK, making memory usage look different makes optimizing it harder. Not
a hard rule, obviously, but ...

Greetings,

Andres Freund




Re: Thoughts on "killed tuples" index hint bits support on standby

2020-01-16 Thread Andres Freund
Hi,

On 2020-01-16 14:30:12 +0300, Michail Nikolaev wrote:
> The first thing we need to consider is that checksums and wal_log_hints
> are widely used these days. So, at any moment the master could send an
> FPW page with new "killed tuples" hints and overwrite hints set by the
> standby. Moreover, it is not possible to distinguish whether hints were
> set by the primary or the standby.

Note that the FPIs are only going to be sent for the first write to a
page after a checksum. I don't think you're suggesting we rely on them
for correctness (this far into the email at least), but it still seems
worthwhile to point out.


> And this is where hot_standby_feedback comes into play. The master node
> considers the xmin of hot_standby_feedback replicas (RecentGlobalXmin)
> while setting "killed tuples" bits.  So, if hot_standby_feedback is
> enabled on a standby for a while - it could safely trust hint bits from
> the master.

Well, not easily. There's no guarantee that the node it reports
hot_standby_feedback to is actually the primary. It could be a
cascading replica setup that doesn't report hot_standby_feedback
upstream. Also, hot_standby_feedback only takes effect while a streaming
connection is active; if that is even temporarily interrupted, the
primary will lose all knowledge of the standby's horizon - unless
replication slots are in use, that is.

Additionally, we also need to handle the case where the replica
currently has restarted, and is recovering using local WAL, and/or
archive based recovery. In that case the standby could already have sent
a *newer* horizon as feedback upstream, but currently again have an
older view. It is entirely possible that the standby is consistent and
queryable in such a state (if nothing forced disk writes during WAL
replay, minRecoveryLSN will not be set to something too new).


> Also, the standby could set its own hints using the xmin it sends to the
> primary during feedback (but without marking the page as dirty).

We do something similar for heap hint bits already...


> Of course, all is not so easy; there are a few things and corner cases
> to care about:
> * Looks like RecentGlobalXmin could be moved backwards in case a new
> replica with a lower xmin is connected (or by switching some replica to
> hot_standby_feedback=on). We must ensure RecentGlobalXmin is moved
> strictly forward.

I'm not sure this is a problem. If that happens we cannot rely on the
different xmin horizon anyway, because action may have been taken on the
old RecentGlobalXmin. Thus we need to be safe against that anyway.


> * hot_standby_feedback could be enabled on the fly. In such a case we
> need to distinguish transactions which are safe or unsafe to deal with
> hints. The standby could receive a fresh RecentGlobalXmin as a response
> to a feedback message. All standby transactions with xmin >=
> RecentGlobalXmin are safe to use hints.
> * hot_standby_feedback could be disabled on the fly. In such a situation
> the standby needs to continue to send feedback while canceling all
> queries with ignore_killed_tuples=true. Once all such queries are
> canceled, feedback is no longer needed and should be disabled.

I don't think we can rely on hot_standby_feedback at all. We can use it to
avoid unnecessary cancellations, etc, and even assume it's set up
reasonably for some configurations, but there always needs to be an
independent correctness backstop.



I think it might be more promising to improve the kill logic based
on the WAL-logged horizons from the primary. All we need to do, I think,
is to use a more conservatively computed RecentGlobalXmin when
determining whether tuples are dead. We already regularly log an
xl_running_xacts record; adding information about the primary's horizon
to that, and stashing it in shared memory on the standby, shouldn't be too
hard.  Then we can make correct, albeit likely overly pessimistic,
visibility determinations about tuples, and go on to set LP_DEAD.

There are some complexities around how to avoid unnecessary query
cancellations. We'd not want to trigger recovery conflicts based on the
new xl_running_xacts field, as that'd make the conflict rate go through
the roof - but I think we could safely use the logical minimum of the
local RecentGlobalXmin and the primary's.
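
A minimal sketch of that combination (the shared-memory field is invented
here for illustration; it does not exist):

	/*
	 * Conservative horizon for setting LP_DEAD on a standby: take the
	 * logically older of our own horizon and the last one the primary
	 * logged in xl_running_xacts.  TransactionIdPrecedes() does the
	 * wraparound-aware comparison.
	 */
	static TransactionId
	GetConservativeKillHorizon(void)
	{
		TransactionId local_xmin = RecentGlobalXmin;
		TransactionId primary_xmin = RecoveryShmem->primaryHorizon; /* assumed */

		return TransactionIdPrecedes(local_xmin, primary_xmin) ?
			local_xmin : primary_xmin;
	}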

That should allow us to set additional LP_DEAD safely, I believe. We
could even rely on those LP_DEAD bits. But:

I'm less clear on how we can make sure that we can *rely* on LP_DEAD to
skip over entries during scans, however. The bits set as described above
would be safe, but we also can see LP_DEAD set by the primary (and even
upstream cascading standbys at least in case of new base backups taken
from them), due to them not being WAL logged. As we don't WAL log,
there is no conflict associated with the LP_DEADs being set.  My gut
feeling is that it's going to be very hard to get around this, without
adding WAL logging for _bt_killitems et al (including an interface for
kill_prior_tuple to report the used horizon to the index).


I'm wondering if we could recycle BTPageOpaqueData.xact to store the
horizon applying to killed tuples on the page.

Re: psql - add SHOW_ALL_RESULTS option

2020-01-16 Thread Tomas Vondra

Hi,

This is one of the patches already marked as RFC (since September by
Alvaro). Anyone interested in actually pushing it, so that it does not
fall through to yet another commitfest?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Tom Lane
Tomas Vondra  writes:
> On Thu, Jan 16, 2020 at 08:48:49AM -0800, Andres Freund wrote:
>> I don't get why it's advantageous to allocate this once for each slab,
>> rather than having it as a global once for all slabs. But anyway, still
>> clearly better than the current situation.

> It's largely a matter of personal preference - I agree there are cases
> when global variables are the best solution, but I kinda dislike them.
> It seems cleaner to just allocate it as part of the slab, not having to
> deal with different number of chunks per block between slabs.
> Plus we don't have all that many slabs (like 2), and it's only really
> used in debug builds anyway. So I'm not all that woried about this
> wasting a couple extra kB of memory.

A positive argument for doing it like this is that the overhead goes away
when the SlabContexts are all deallocated, while a global variable would
presumably stick around indefinitely.  But I concur that in current usage,
there's hardly any point in worrying about the relative benefits.  We
should just keep it simple, and this seems marginally simpler than the
other way.

regards, tom lane




Re: our checks for read-only queries are not great

2020-01-16 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Jan 14, 2020 at 1:46 PM Stephen Frost  wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > Robert Haas  writes:
> > > > Speaking of sensible progress, I think we've drifted off on a tangent
> > > > here about ALTER SYSTEM.
> > >
> > > Agreed, that's not terribly relevant for the proposed patch.
> >
> > I agree that the proposed patch seems alright by itself, as the changes
> > it's making to existing behavior seem to all be bug-fixes and pretty
> > clear improvements not really related to 'read-only' transactions.
> 
> There seems to be no disagreement on this point, so I have committed the 
> patch.

Works for me.

> > It's unfortunate that we haven't been able to work through to some kind
> > of agreement around what "SET TRANSACTION READ ONLY" means, so that
> > users of it can know what to expect.
> 
> I at least feel like we have a pretty good handle on what it was
> intended to mean; that is, "doesn't cause semantically significant
> changes to pg_dump output." I do hear some skepticism as to whether
> that's the best definition, but it has pretty good explanatory power
> relative to the current state of the code, which is something.

I think I agree with you regarding the original intent, though even
there, as discussed elsewhere, it seems like there's perhaps either a
bug or a disagreement about the specifics of what that means when it
relates to committing a 2-phase transaction.  Still, setting that aside
for the moment, do we feel like this is enough to be able to update our
documentation with?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: our checks for read-only queries are not great

2020-01-16 Thread Robert Haas
On Tue, Jan 14, 2020 at 1:46 PM Stephen Frost  wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Robert Haas  writes:
> > > Speaking of sensible progress, I think we've drifted off on a tangent
> > > here about ALTER SYSTEM.
> >
> > Agreed, that's not terribly relevant for the proposed patch.
>
> I agree that the proposed patch seems alright by itself, as the changes
> it's making to existing behavior seem to all be bug-fixes and pretty
> clear improvements not really related to 'read-only' transactions.

There seems to be no disagreement on this point, so I have committed the patch.

> It's unfortunate that we haven't been able to work through to some kind
> of agreement around what "SET TRANSACTION READ ONLY" means, so that
> users of it can know what to expect.

I at least feel like we have a pretty good handle on what it was
intended to mean; that is, "doesn't cause semantically significant
changes to pg_dump output." I do hear some skepticism as to whether
that's the best definition, but it has pretty good explanatory power
relative to the current state of the code, which is something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Physical replication slot advance is not persistent

2020-01-16 Thread Alexey Kondratov

On 09.01.2020 09:36, Kyotaro Horiguchi wrote:

Hello.

At Sun, 29 Dec 2019 15:12:16 +0300, Alexey Kondratov 
 wrote in

On 2019-12-26 16:35, Alexey Kondratov wrote:

Another concern is that ReplicationSlotIsDirty is added with only one
user. It also cannot be used by SaveSlotToPath due to the
simultaneous usage of both flags dirty and just_dirtied there.
Given that, I think we should call ReplicationSlotSave
unconditionally in pg_replication_slot_advance, so the slot will be
saved or not automatically based on the slot->dirty flag. At the same
time, ReplicationSlotsComputeRequiredXmin and
ReplicationSlotsComputeRequiredLSN should be called by anyone who
modifies the xmin and LSN fields in the slot. Otherwise, we currently
end up with some leaky abstractions.

Sounds reasonable.


Great, so it seems that we have reached some agreement about who should 
mark slot as dirty, at least for now.





If someone utilises old WAL and after that a crash happens
between steps 2) and 3), then we start with the old value of restart_lsn,
but without the required WAL. I do not know how to properly reproduce this
without gdb and a power-off, so the chance is pretty low, but it could
still happen.

In the first place, we advance the required LSN for every reply message but
save slot data only at checkpoint for physical replication.  Such a
strict guarantee seems too much.

...

I think we shouldn't touch the paths used by the replication protocol.
Shouldn't we instead focus on how we make a change of a replication slot
from the SQL interface persistent?  It seems to me that generally we don't
need to save dirty slots other than at checkpoint, but the SQL function
seems to want the change to be saved immediately.

As the result, please find the attached, which is following only the
first paragraph cited above.


OK, I have definitely overthought that, thanks. This looks like a 
minimal subset of changes that actually solves the bug. I would only 
prefer to keep some additional comments (something like the attached), 
otherwise after half a year it will be unclear again why we save the slot 
unconditionally here.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index bb69683e2a..084e0c2960 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 		MyReplicationSlot->data.restart_lsn = moveto;
 		SpinLockRelease(&MyReplicationSlot->mutex);
 		retlsn = moveto;
+
+		ReplicationSlotMarkDirty();
+
+		/* We moved restart_lsn, update the global value. */
+		ReplicationSlotsComputeRequiredLSN();
 	}
 
 	return retlsn;
@@ -564,7 +569,10 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 		(uint32) (moveto >> 32), (uint32) moveto,
 		(uint32) (minlsn >> 32), (uint32) minlsn)));
 
-	/* Do the actual slot update, depending on the slot type */
+	/*
+	 * Do the actual slot update, depending on the slot type.  Slot will be
+	 * marked as dirty by pg_*_replication_slot_advance if changed.
+	 */
 	if (OidIsValid(MyReplicationSlot->data.database))
 		endlsn = pg_logical_replication_slot_advance(moveto);
 	else
@@ -573,14 +581,11 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
 
-	/* Update the on disk state when lsn was updated. */
-	if (XLogRecPtrIsInvalid(endlsn))
-	{
-		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredXmin(false);
-		ReplicationSlotsComputeRequiredLSN();
-		ReplicationSlotSave();
-	}
+	/*
+	 * Update the on disk state.  No work here if
+	 * pg_*_replication_slot_advance call was a no-op.
+	 */
+	ReplicationSlotSave();
 
 	ReplicationSlotRelease();
 


Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Tomas Vondra

On Thu, Jan 16, 2020 at 11:43:34AM -0500, Tom Lane wrote:

Tomas Vondra  writes:

The attached fix should do the trick - it pre-allocates the space when
creating the context. There is a bit of complexity because we want to
allocate the space as part of the context header, but nothing too bad. We
might optimize it a bit by using a regular bitmap (instead of just an
array of bools), but I haven't done that.


Hmm ... so if this is an array of bools, why isn't it declared bool*
rather than char* ?  (Pre-existing ugliness, sure, but we might as
well fix it while we're here.  Especially since you used sizeof(bool)
in the space calculation.)



True. Will fix.


I agree that maxaligning the start point of the array is pointless.

I'd write "free chunks in a block" not "free chunks on a block",
the latter seems rather shaky English.  But that's getting picky.

LGTM otherwise.



OK. Barring objections I'll push and backpatch this later today.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Code cleanup for build_regexp_split_result

2020-01-16 Thread Tom Lane
Li Japin  writes:
> I find that build_regexp_split_result() has redundant code; we can move it to
> before the condition check, can't we?

Hm, yeah, that looks a bit strange.  It was less strange before
c8ea87e4bd950572cba4575e9a62284cebf85ac5, I think.

Pushed with some additional simplification to get rid of the
rather ugly (IMO) PG_USED_FOR_ASSERTS_ONLY variable.

regards, tom lane




Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Tomas Vondra

On Thu, Jan 16, 2020 at 10:27:01AM -0500, Tom Lane wrote:

Andres Freund  writes:

... I thought you were asking whether
any additional memory could just be avoided...


Well, I was kind of wondering that, but if it's not practical then
preallocating the space instead will do.



I don't think it's practical to rework the checks in a way that would
not require allocations. Maybe it's possible, but I think it's not worth
the extra complexity.

The attached fix should do the trick - it pre-allocates the space when
creating the context. There is a bit of complexity because we want to
allocate the space as part of the context header, but nothing too bad. We
might optimize it a bit by using a regular bitmap (instead of just an
array of bools), but I haven't done that.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index c5866d9cc3..7f9749a73f 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -70,6 +70,9 @@ typedef struct SlabContext
 	int			chunksPerBlock; /* number of chunks per block */
 	int			minFreeChunks;	/* min number of free chunks in any block */
 	int			nblocks;		/* number of blocks allocated */
+#ifdef MEMORY_CONTEXT_CHECKING
+   char   *freechunks; /* bitmap of free chunks on a block */
+#endif
/* blocks with free space, grouped by number of free chunks: */
dlist_head  freelist[FLEXIBLE_ARRAY_MEMBER];
 } SlabContext;
@@ -229,6 +232,15 @@ SlabContextCreate(MemoryContext parent,
/* Size of the memory context header */
headerSize = offsetof(SlabContext, freelist) + freelistSize;
 
+#ifdef MEMORY_CONTEXT_CHECKING
+   /*
+* With memory checking, we need to allocate extra space for the bitmap
+* of free chunks. The space is allocated at the end, and we need proper
+* alignment (it's an array of bools, so maybe MAXALIGN is not needed).
+*/
+   headerSize = MAXALIGN(headerSize) + chunksPerBlock * sizeof(bool);
+#endif
+
slab = (SlabContext *) malloc(headerSize);
if (slab == NULL)
{
@@ -258,6 +270,12 @@ SlabContextCreate(MemoryContext parent,
for (i = 0; i < (slab->chunksPerBlock + 1); i++)
 		dlist_init(&slab->freelist[i]);
 
+#ifdef MEMORY_CONTEXT_CHECKING
+   /* set the freechunks pointer after the freelists array (aligned) */
+   slab->freechunks = (char *) slab +
+		MAXALIGN(offsetof(SlabContext, freelist) + freelistSize);
+#endif
+
/* Finally, do the type-independent part of context creation */
MemoryContextCreate((MemoryContext) slab,
T_SlabContext,
@@ -701,14 +719,10 @@ SlabCheck(MemoryContext context)
int i;
SlabContext *slab = castNode(SlabContext, context);
const char *name = slab->header.name;
-   char   *freechunks;
 
Assert(slab);
Assert(slab->chunksPerBlock > 0);
 
-   /* bitmap of free chunks on a block */
-   freechunks = palloc(slab->chunksPerBlock * sizeof(bool));
-
/* walk all the freelists */
for (i = 0; i <= slab->chunksPerBlock; i++)
{
@@ -731,7 +745,7 @@ SlabCheck(MemoryContext context)
 name, block->nfree, block, i);
 
/* reset the bitmap of free chunks for this block */
-			memset(freechunks, 0, (slab->chunksPerBlock * sizeof(bool)));
+			memset(slab->freechunks, 0, (slab->chunksPerBlock * sizeof(bool)));
idx = block->firstFreeChunk;
 
/*
@@ -748,7 +762,7 @@ SlabCheck(MemoryContext context)
 
 				/* count the chunk as free, add it to the bitmap */
nfree++;
-   freechunks[idx] = true;
+   slab->freechunks[idx] = true;
 
/* read index of the next free chunk */
chunk = SlabBlockGetChunk(slab, block, idx);
@@ -759,7 +773,7 @@ SlabCheck(MemoryContext context)
for (j = 0; j < slab->chunksPerBlock; j++)
{
 			/* non-zero bit in the bitmap means the chunk is used */
-   if (!freechunks[j])
+   if (!slab->freechunks[j])
{
 				SlabChunk  *chunk = SlabBlockGetChunk(slab, block, j);
 


Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Tom Lane
Andres Freund  writes:
> ... I thought you were asking whether
> any additional memory could just be avoided...

Well, I was kind of wondering that, but if it's not practical then
preallocating the space instead will do.

regards, tom lane




Re: [HACKERS] WIP: Aggregation push-down

2020-01-16 Thread Antonin Houska
Tomas Vondra  wrote:

> Hi,
> 
> I've been looking at the last version (v14) of this patch series,
> submitted way back in July and unfortunately quiet since then. Antonin
> is probably right that one of the reasons for the lack of reviews is that
> it requires interest/experience with the planner.
> 
> Initially it was also a bit hard to understand what the benefits are
> (and the patch shifted a bit), which is now mostly addressed by the
> README in the last patch. The trouble is that's hidden in the patch and
> so not immediately obvious to people considering reviewing it :-( Tom
> posted a nice summary in November 2018, but it was perhaps focused on
> the internals.
> 
> So my first suggestion it to write a short summary with example how it
> affects practical queries (plan change, speedup) etc.

I think the README plus regression test output should be enough for someone
who is about to review a patch as complex as this.

> My second suggestion is to have meaningful commit messages for each part
> of the patch series, explaining why we need that particular part. It
> might have been explained somewhere in the thread, but as a reviewer I
> really don't want to reverse-engineer the whole thread.

ok, done.

> Now, regarding individual parts of the patch:
> 
> 
> 1) v14-0001-Introduce-RelInfoList-structure.patch
> -
> 
> - I'm not entirely sure why we need this change. We had the list+hash
>   before, so I assume we do this because we need the output functions?

I believe that this is what Tom proposed in [1], see "Maybe an appropriate
preliminary patch is to refactor the support code ..." near the end of that
message. The point is that now we need two lists: one for "plain" relations
and one for grouped ones.

> - The RelInfoList naming is pretty confusing, because it's actually not
>   a list but a combination of list+hash. And the embedded list is called
>   items, to make it yet a bit more confusing. I suggest we call this
>   just RelInfo or RelInfoLookup or something else not including List.

I think it can be considered a list by the caller of add_join_rel() etc. The
hash table is hidden from the user and is empty until the list becomes long
enough. The word "list" in the structure name may just indicate that new
elements can be added to the end, which shouldn't be expected if the structure
was an array.

I searched a bit in tools/pgindent/typedefs.list and found at least a few
structures that also do have "list" in the name but are not lists internally:
CatCList, FuncCandidateList, MCVList.

Actually I'm not opposed to renaming the structure, but I don't have a better
idea right now. As for *Lookup: the following is the only use of such a
structure name in PG code, and it's not something to store data in:

typedef enum
{
IDENTIFIER_LOOKUP_NORMAL,   /* normal processing of var names */
IDENTIFIER_LOOKUP_DECLARE,  /* In DECLARE --- don't look up names */
IDENTIFIER_LOOKUP_EXPR  /* In SQL expression --- special case */
} IdentifierLookup;

> - RelInfoEntry seems severely under-documented, particularly the data
>   part (which is void* making it even harder to understand what it's for
>   without having to read the functions). AFAICS it's just a link
>   to the embedded list, but maybe I'm wrong.

This is just JoinHashEntry (which was also undocumented) renamed. I've added a
short comment now.

> - I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c
>   right before build_join_rel. This makes diff clearer/smaller and
>   visual diffs nicer.

Hm, it might improve readability of the diff, but this API is exactly where I
still need feedback from Tom. I'm not eager to optimize the diff as long as
there's a risk that these functions will be removed or renamed.

> 2) v14-0002-Introduce-make_join_rel_common-function.patch
> -
> 
I see no point in keeping this change in a separate patch, as it pretty
> much just renames make_join_rel to make_join_rel_common and then adds 
> 
>   make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
>   {
>   return make_join_rel_common(root, rel1, rel2);
>   }
> 
> which seems rather trivial and useless on it's own. I'd just merge it
> into 0003 where we use the _common function more extensively.

ok. I thought that this improves readability of the diffs, but it doesn't look
that bad if this is included in 0003.

> 
> 3) v14-0003-Aggregate-push-down-basic-functionality.patch
> -
> 
> I haven't done much review on this yet, but I've done some testing and
> benchmarking so let me share at least that. The queries I used were of
> the type I mentioned earlier in this thread - essentially a star schema,
> i.e. fact table referencing dimensions, with aggregation per columns in
> the dimension. So something like
> 
>   SELECT d.c, sum(f) FROM fact JOIN dimension d ON 

Re: empty range

2020-01-16 Thread Tom Lane
Emre Hasegeli  writes:
>> It's only a suggestion, I don't know if somebody wants to store an empty
>> range without bounds.

> I thought about the same while developing the BRIN inclusion operator
> class.  I am not sure how useful empty ranges are in practice, but
> keeping their bounds would only bring more flexibility, and eliminate
> special cases on most of the range operators.  For reference, we allow
> empty boxes, and none of the geometric operators has to handle them
> specially.

I think it'd just move the special cases somewhere else.  Consider

regression=# select int4range(4,4) = int4range(5,5);
 ?column? 
--
 t
(1 row)

How do you preserve that behavior ... or if you don't, how much
damage does that do to the semantics of ranges?  Right now there's
a pretty solid set-theoretic basis for understanding what a range is,
ie two ranges are the same if they include the same sets of elements.
It seems like that goes out the window if we don't consider that
all empty ranges are the same.

BTW, I think the main reason for all the bound-normalization pushups
is to try to have a rule that ranges that are set-theoretically equal
will look the same.  That also goes out the window if we make
empty ranges look like this.
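
For instance, the discrete range types canonicalize to
inclusive-lower/exclusive-upper form, so two spellings of the same set
print identically:

regression=# select int4range(1, 4, '[]'), int4range(1, 5, '[)');
 int4range | int4range 
-----------+-----------
 [1,5)     | [1,5)
(1 row)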

regards, tom lane




Re: progress report for ANALYZE

2020-01-16 Thread Justin Pryzby
On Wed, Jan 15, 2020 at 02:11:10PM -0300, Alvaro Herrera wrote:
> I just pushed this after some small extra tweaks.
> 
> Thanks, Yamada-san, for seeing this to completion!

Find attached minor fixes to docs - sorry I didn't look earlier.

Possibly you'd also want to change the other existing instances of "preparing
to begin".
>From de108e69b5d33c881074b0a04697d7061684f823 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 15 Jan 2020 23:10:29 -0600
Subject: [PATCH v1] Doc review for ANALYZE progress (a166d408)

---
 doc/src/sgml/monitoring.sgml | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8b44fb1..10871b7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3525,7 +3525,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   
Whenever ANALYZE is running, the
pg_stat_progress_analyze view will contain a
-   row for each backend that is currently running that command.  The tables
+   row for each backend currently running ANALYZE.  The tables
below describe the information that will be reported and provide
information about how to interpret it.
   
@@ -3635,7 +3635,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
  initializing
  
-   The command is preparing to begin scanning the heap.  This phase is
+   The command is preparing to scan the heap.  This phase is
expected to be very brief.
  
 
@@ -3643,7 +3643,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
  acquiring sample rows
  
The command is currently scanning the table given by
-   current_relid to obtain sample rows.
+   relid to obtain sample rows.
  
 
 
@@ -3659,14 +3659,14 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
  computing statistics
  
-   The command is computing statistics from the samples rows obtained during
+   The command is computing statistics from the sample rows obtained during
the table scan.
  
 
 
  computing extended statistics
  
-   The command is computing extended statistics from the samples rows obtained
+   The command is computing extended statistics from the sample rows obtained
during the table scan.
  
 
-- 
2.7.4



Code cleanup for build_regexp_split_result

2020-01-16 Thread Li Japin
Hi hackers,

I find that build_regexp_split_result() has redundant code; we can move it to
before the condition check, can't we?

Best regards.

Japin Li






0001-Code-cleanup-for-build_regexp_split_result.patch
Description: 0001-Code-cleanup-for-build_regexp_split_result.patch


Re: Add pg_file_sync() to adminpack

2020-01-16 Thread Robert Haas
On Tue, Jan 14, 2020 at 10:08 AM Atsushi Torikoshi  wrote:
> fails we can get error messages. So it seems straightforward for me to
>  'return true on success and emit an ERROR otherwise'.

I agree with reporting errors using ERROR, but it seems to me that we
ought to then make the function return 'void' rather than 'bool'.
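
For illustration, the adminpack SQL declaration under that suggestion would
look roughly like this (a sketch, not the committed definition):

	CREATE FUNCTION pg_catalog.pg_file_sync(filename text)
	RETURNS void
	AS 'MODULE_PATHNAME', 'pg_file_sync'
	LANGUAGE C VOLATILE STRICT;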

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH v1] pg_ls_tmpdir to show directories

2020-01-16 Thread Justin Pryzby
On Thu, Jan 16, 2020 at 09:34:32AM +0100, Fabien COELHO wrote:
> Also, I'm not fully sure why ".*" files should be skipped, maybe it should
> be an option? Or the user can filter it with SQL if they do not want them?

I think if someone wants the full generality, they can do this:

postgres=# SELECT name, s.size, s.modification, s.isdir
           FROM (SELECT 'base/pgsql_tmp' p) p,
                pg_ls_dir(p) name,
                pg_stat_file(p||'/'||name) s;
 name | size |      modification      | isdir 
------+------+------------------------+-------
 .foo | 4096 | 2020-01-16 08:57:04-05 | t

In my mind, pg_ls_tmpdir() is for showing tmpfiles, not just a shortcut to
SELECT pg_ls_dir((SELECT 'base/pgsql_tmp'p)); -- or, for all tablespaces:
WITH x AS (
    SELECT format('/PG_%s_%s',
                  split_part(current_setting('server_version'), '.', 1),
                  catalog_version_no) AS suffix
    FROM pg_control_system()
), y AS (
    SELECT a, pg_ls_dir(a) AS d
    FROM (SELECT DISTINCT
                 COALESCE(NULLIF(pg_tablespace_location(oid), '') || suffix,
                          'base') AS a
          FROM pg_tablespace, x) a
)
SELECT a, pg_ls_dir(a || '/pgsql_tmp')
FROM y
WHERE d = 'pgsql_tmp';

I think changing dotfiles is a topic for another patch.
That would also affect pg_ls_dir, and everything else that uses the backing
function pg_ls_dir_files_recurse.  I'd have to ask why not also show . and .. ?

(In fact, if I were to change anything, I would propose to limit pg_ls_tmpdir()
to files matching PG_TEMP_FILE_PREFIX).

Justin




Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

2020-01-16 Thread Fujii Masao
On Tue, Dec 17, 2019 at 2:19 PM Michael Paquier  wrote:
>
> On Mon, Dec 16, 2019 at 12:22:18PM +0900, Fujii Masao wrote:
> > > +Detection of WAL records having references to invalid pages 
> > > during
> > > +recovery causes PostgreSQL to report
> > > +an error, aborting the recovery. Setting
> > > Well, that's not really an error.  This triggers a PANIC, aka crashes
> > > the server.  And in this case the actual problem is that you may not
> > > be able to move on with recovery when restarting the server again,
> > > except if luck is on your side because you would continuously face
> > > it..
> >
> > So you're thinking that "report an error" should be changed to
> > "trigger a PANIC"? Personally "report an error" sounds ok because
> > PANIC is one of "error", I think. But if that misleads people,
> > I will change the sentence.
>
> In the context of a recovery, an ERROR is promoted to a FATAL, but
> here we are talking about something that bypasses the crash of the
> server.  So this could bring confusion.  I think that the
> documentation should be crystal clear about that, with two aspects
> outlined when the parameter is disabled, somewhat like data_sync_retry
> actually:
> - A PANIC-level error is triggered.
> - It crashes the cluster.

OK, I updated the patch that way.
Attached is the updated version of the patch.

Regards,

-- 
Fujii Masao


ignore_invalid_pages_v3.patch
Description: Binary data


Re: Duplicate Workers entries in some EXPLAIN plans

2020-01-16 Thread Georgios Kokolatos
> Sounds good, I'll try that format. Any idea how to test YAML? With the
> JSON format, I was able to rely on Postgres' own JSON-manipulating
> functions to strip or canonicalize fields that can vary across
> executions--I can't really do that with YAML. 

Yes, this approach was clear in the patch and works great with JSON. Also
you are correct, this cannot be done with YAML. I spent a bit of time
looking around and I could not find any tests on the YAML format.

> Or should I run EXPLAIN
> with COSTS OFF, TIMING OFF, SUMMARY OFF and assume that for simple
> queries the BUFFERS output (and other fields I can't turn off like
> Sort Space Used) *is* going to be stable?

I have to admit that with the current diff tool used in pg_regress, this is
not possible.
I am pretty certain that it *is not* going to be stable. Not for long, anyway.
I withdraw my suggestion for YAML and am currently awaiting the TEXT format only.

Re: empty range

2020-01-16 Thread Emre Hasegeli
> It's only a suggestion, I don't know if somebody wants to store an empty range
> without bounds.

I thought about the same while developing the BRIN inclusion operator
class.  I am not sure how useful empty ranges are in practice, but
keeping their bounds would only bring more flexibility, and eliminate
special cases on most of the range operators.  For reference, we allow
empty boxes, and none of the geometric operators has to handle them
specially.




Re: SlabCheck leaks memory into TopMemoryContext

2020-01-16 Thread Tomas Vondra

On Wed, Jan 15, 2020 at 10:41:43PM -0800, Andres Freund wrote:

Hi,

On 2020-01-16 01:25:00 -0500, Tom Lane wrote:

Andres Freund  writes:
> On 2020-01-16 00:09:53 -0500, Tom Lane wrote:
>> It's basically assuming that the memory management mechanism is sane,
>> which makes the whole thing fundamentally circular, even if it's
>> relying on some other context to be sane.  Is there a way to do the
>> checking without extra allocations?

> Probably not easily.

In the AllocSet code, we don't hesitate to expend extra space per-chunk
for debug support:

typedef struct AllocChunkData
{
...
#ifdef MEMORY_CONTEXT_CHECKING
Sizerequested_size;
#endif
...

I don't see a pressing reason why SlabContext couldn't do something
similar, either per-chunk or per-context, whatever makes sense.


Well, what I suggested upthread, was to just have two globals like

int slabcheck_freechunks_size;
bool *slabcheck_freechunks;

and realloc that to the larger size in SlabContextCreate() if necessary,
based on the computed chunksPerBlock?  I thought you were asking whether
any additional memory could just be avoided...



I don't see why not to just do what Tom proposed, i.e. allocate this as
part of the memory context in SlabContextCreate(), when memory context
checking is enabled. It seems much more convenient / simpler than the
globals (no repalloc, ...).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Patch to document base64 encoding

2020-01-16 Thread Fabien COELHO


Hello Karl,


New patch attached: doc_base64_v13.patch

This required surprisingly little re-wording.
Added word "binary" into the descriptions of convert(),
substring(), convert_from(), and convert_to().

I also added data types to the call syntax of set_bit()
and set_byte().

And this patch adds hyperlinks from the get_bit(), get_byte(),
set_bit(), and set_byte() descriptions to the note
that offsets are zero-based.

I also removed the hyperlinked asterisks about the hash
function results and instead hyperlinked the word "hash"
in the descriptions.  (Links to the note about md5()
returning hex text and the others returning bytea and how
to convert between the two.)


Patch applies cleanly and compiles.

My 0.02€: the overall restructuring and cross-references are an 
improvement.


Some comments about v13:

The note about get_byte reads:

  get_byte and set_byte number the first byte of a binary string as byte
  0. get_bit and set_bit number bits from the right within each byte; for
  example bit 0 is the least significant bit of the first byte, and bit 15
  is the most significant bit of the second byte.

The two sentences start with a lower-case letter, which looks strange to 
me. I'd suggest putting "Functions" at the beginning of the sentences:


  Functions get_byte and set_byte number the first byte of a binary string
  as byte 0. Functions get_bit and set_bit number bits from the right
  within each byte; for example bit 0 is the least significant bit of the
  first byte, and bit 15 is the most significant bit of the second byte.
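
A short example right after that note could make the numbering concrete,
e.g. (illustration only):

  SELECT get_bit('\x0180'::bytea, 0);   -- 1: bit 0 is the LSB of the first byte (0x01)
  SELECT get_bit('\x0180'::bytea, 7);   -- 0: bit 7 is the MSB of the first byte
  SELECT get_bit('\x0180'::bytea, 15);  -- 1: bit 15 is the MSB of the second byte (0x80)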

The note about hash provides an example for getting the hex representation 
out of sha*. I'd add an example to get the bytea representation from md5, 
e.g. "DECODE(MD5('hello world'), 'hex')"…


Maybe the encode/decode in the note could be linked to the function
description? Well, they come just after, so maybe it is not very useful.

The "Binary String Functions and Operators" 9.5 section has only one 
subsection, "9.5.1", which is about at two thirds of the page. This 
structure looks weird. ISTM that a subsection is missing for the beginning 
of the page, or that the subsection should just be dropped, because it is 
somehow redundant with the table title.


The "9.4" section has the same structural weirdness. Either remove the 
subsection, or add some for the other parts?


--
Fabien.

Re: Making psql error out on output failures

2020-01-16 Thread Daniel Verite
David Zhang wrote:

> If I change the error log message like below, where "%m" is used to pass the
> value of strerror(errno), "could not write to output file:" is copied from
> function "WRITE_ERROR_EXIT". 
> -   pg_log_error("Error printing tuples"); 
> +   pg_log_error("could not write to output file: %m"); 
> then the output message is something like below, which, I believe, is more
> consistent with pg_dump. 

The problem is that errno may not reliably tell us what
the problem was that led to ferror(fout) being nonzero, since it isn't
saved at the point of the error and the output code may have called
many libc functions between the first occurrence of the output error
and when pg_log_error() is called.

The linux manpage on errno warns specifically about this:

NOTES
   A common mistake is to do

   if (somecall() == -1) {
   printf("somecall() failed\n");
   if (errno == ...) { ... }
   }

   where errno no longer needs to have the value it had upon return from
   somecall() (i.e., it may have been changed by the printf(3)).  If the
   value of errno should be preserved across a library call, it must be
   saved:
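
The manpage example continues (completing the quote from errno(3)), saving
errno immediately after the failing call:

   if (somecall() == -1) {
       int errsv = errno;
       printf("somecall() failed\n");
       if (errsv == ...) { ... }
   }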


This other bit from the POSIX spec [1] is relevant:

  "The value of errno shall be defined only after a call to a function
  for which it is explicitly stated to be set and until it is changed
  by the next function call or if the application assigns it a value."

To use errno in a way that complies with the above, the psql code
should be refactored. I don't know if having a more precise error
message justifies that refactoring. I've elaborated a bit about that
upthread with the initial submission. Besides, I'm not even
sure that errno is necessarily set on non-POSIX platforms when fputc
or fputs fails.
That's why this patch uses the safer approach to emit a generic
error message.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/errno.html


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: remove some STATUS_* symbols

2020-01-16 Thread Robert Haas
On Thu, Jan 16, 2020 at 12:50 AM Michael Paquier  wrote:
> Thanks, that looks fine.  I am still not sure whether the second patch
> adding an enum via ProcWaitStatus improves the code readability
> though, so my take would be to discard it for now.  Perhaps others
> think differently, I don't know.

IMHO, custom enums for each particular case would be a big improvement
over supposedly-generic STATUS codes. It makes it clearer which values
are possible in each code path, and it comes out nicer in the
debugger, too.
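
For example, instead of returning generic STATUS_* integers from the
lock-wait paths, the proposed enum makes the domain explicit (member names
here are sketched from the patch under discussion; the final spelling may
differ):

	typedef enum ProcWaitStatus
	{
		PROC_WAIT_STATUS_OK,		/* lock was eventually acquired */
		PROC_WAIT_STATUS_WAITING,	/* still waiting */
		PROC_WAIT_STATUS_ERROR		/* deadlock detected or wait aborted */
	} ProcWaitStatus;

A switch over a ProcWaitStatus can be checked for completeness by the
compiler, and a debugger prints the member name rather than a bare integer.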

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Prabhat Sahu
Hi all,

I would like to share my observation on this PG feature "Block-level
parallel vacuum".
I have tested the earlier patch (i.e v48) with below high-level test
scenarios, and those are working as expected.

   - I have played around with these GUC parameters while testing:
     max_worker_processes
     autovacuum = off
     shared_buffers
     max_parallel_workers
     max_parallel_maintenance_workers
     min_parallel_index_scan_size
     vacuum_cost_limit
     vacuum_cost_delay

   - Tested the parallel vacuum with tables and partition tables having
   possible datatypes and columns having various indexes (like btree, gist,
   etc.) on part / full table.
   - Tested the pgbench tables data with multiple indexes created manually
   and ran script(vacuum_test.sql) with DMLs and VACUUM for multiple Clients,
   Jobs, and Time as below.

./pgbench  -c 8 -j 16 -T 900  postgres -f vacuum_test.sql

We observe the usage of parallel workers during VACUUM.


   - Ran few isolation schedule test cases(in regression) with huge data
   and indexes, perform DMLs -> VACUUM
   - Tested with PARTITION TABLEs -> global/local indexes ->  DMLs -> VACUUM
   - Tested with PARTITION TABLE having different TABLESPACE in different
   location -> global/local indexes -> DMLs -> VACUUM
   - With changing STORAGE options for columns (as PLAIN / EXTERNAL /
   EXTENDED) -> DMLs -> VACUUM
   - Create index with CONCURRENTLY option / Changing storage_parameter for
   index as below  -> DMLs -> VACUUM

with(buffering=auto) / with(buffering=on) / with(buffering=off) /
with(fillfactor=30);


   - Tested with creating Simple and Partitioned tables ->  DMLs  ->
   pg_dump/pg_restore/pg_upgrade -> VACUUM

Verified the data after restore / upgrade / VACUUM.


   - Indexes on UUID-OSSP data ->  DMLs -> pg_upgrade -> VACUUM
   - Verified with various test scenarios for better performance of
   parallel VACUUM as compared to Non-parallel VACUUM.

     Time taken by VACUUM on PG HEAD + PATCH (with PARALLEL) < Time taken
     by VACUUM on PG HEAD (without PARALLEL)


Machine configuration: (16 VCPUs / RAM: 16GB / Disk size: 640GB)

*PG HEAD:*
VACUUM tab1;
Time: 38915.384 ms (00:38.915)
Time: 48389.006 ms (00:48.389)
Time: 41324.223 ms (00:41.324)
*Time: 37640.874 ms (00:37.641) --median*
Time: 36897.325 ms (00:36.897)
Time: 36351.022 ms (00:36.351)
Time: 36198.890 ms (00:36.199)

*PG HEAD + v48 Patch:*
VACUUM tab1;
Time: 37051.589 ms (00:37.052)
*Time: 33647.459 ms (00:33.647) --median*
Time: 31580.894 ms (00:31.581)
Time: 34442.046 ms (00:34.442)
Time: 31335.960 ms (00:31.336)
Time: 34441.245 ms (00:34.441)
Time: 31159.639 ms (00:31.160)



-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Amit Kapila
On Thu, Jan 16, 2020 at 4:46 PM Masahiko Sawada
 wrote:
>
> Right. Most indexes (all?) of tables that are used in the regression
> tests are smaller than min_parallel_index_scan_size. And we set
> min_parallel_index_scan_size to 0 in vacuum.sql, but VACUUM would not
> be sped up much because of the relation size. Since we instead
> populate a new table for parallel vacuum testing, the regression test for
> vacuum would take a longer time.
>

Fair enough, and I think it is good in a way that it won't change the
coverage of the existing vacuum code.  I have fixed all the issues
reported by Mahendra and cleaned up a few other cosmetic things in the
attached patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v49-0001-Allow-vacuum-command-to-process-indexes-in-parallel.patch
Description: Binary data


Thoughts on "killed tuples" index hint bits support on standby

2020-01-16 Thread Michail Nikolaev
Hello, hackers.

Currently, hint bits in index pages (dead tuples) are set and taken
into account only on the primary server; the standby just ignores them. This
is done for good reasons, of course (see RelationGetIndexScan and [1]):

* We do this because the xmin on the primary node could easily be
* later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea to
support "ignore_killed_tuples" on standby.

I hope I know a way to support it correctly with a reasonable amount of changes.

The first thing we need to consider is that checksums and wal_log_hints are
widely used these days. So, at any moment the primary could send an FPW page
with new "killed tuples" hints and overwrite hints set by the standby.
Moreover, it is not possible to distinguish whether hints were set by the
primary or the standby.

And this is where hot_standby_feedback comes into play. The primary node
considers the xmin of hot_standby_feedback replicas (RecentGlobalXmin)
while setting "killed tuples" bits.  So, if hot_standby_feedback has been
enabled on a standby for a while, that standby could safely trust hint bits
from the primary.
Also, the standby could set its own hints using the xmin it sends to the
primary during feedback (but without marking the page as dirty).

Of course, it is not all that easy; there are a few things and corner cases
to take care of:
* It looks like RecentGlobalXmin could be moved backwards in case a new
replica with a lower xmin is connected (or by switching some replica to
hot_standby_feedback=on). We must ensure RecentGlobalXmin moves
strictly forward.
* hot_standby_feedback could be enabled on the fly. In such a case we
need to distinguish transactions which are safe or unsafe to deal with
hints. The standby could receive a fresh RecentGlobalXmin as a response to
a feedback message. All standby transactions with xmin >=
RecentGlobalXmin are then safe to use hints.
* hot_standby_feedback could be disabled on the fly. In such a situation
the standby needs to continue sending feedback while canceling all queries
with ignore_killed_tuples=true. Once all such queries are canceled,
feedback is no longer needed and can be disabled.
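
To make the mechanics above concrete, here is the existing knob and where
the primary sees the standby's xmin (just stock behavior that the proposal
builds on, shown for illustration):

```sql
-- On the standby: hot_standby_feedback can be changed at runtime.
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();

-- On the primary: the xmin reported by the standby's feedback messages,
-- which feeds into RecentGlobalXmin-style horizon calculations.
SELECT application_name, backend_xmin FROM pg_stat_replication;
```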

Could someone validate my thoughts, please? If the idea is mostly
correct, I could try to implement and test it.

[1] - 
https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - 
https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - 
https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org




Re: [HACKERS] Block level parallel vacuum

2020-01-16 Thread Masahiko Sawada
On Thu, 16 Jan 2020 at 14:11, Amit Kapila  wrote:
>
> On Thu, Jan 16, 2020 at 10:11 AM Mahendra Singh Thalor
>  wrote:
> >
> > On Thu, 16 Jan 2020 at 08:22, Amit Kapila  wrote:
> > >
> > > > 2.
> > > > I checked time taken by vacuum.sql test. Execution time is almost same
> > > > with and without v45 patch.
> > > >
> > > > Without v45 patch:
> > > > Run1) vacuum   ... ok 701 ms
> > > > Run2) vacuum   ... ok 549 ms
> > > > Run3) vacuum   ... ok 559 ms
> > > > Run4) vacuum   ... ok 480 ms
> > > >
> > > > With v45 patch:
> > > > Run1) vacuum   ... ok 842 ms
> > > > Run2) vacuum   ... ok 808 ms
> > > > Run3)  vacuum   ... ok 774 ms
> > > > Run4) vacuum   ... ok 792 ms
> > > >
> > >
> > > I see some variance in results, have you run with autovacuum as off.
> > > I was expecting that this might speed up some cases where parallel
> > > vacuum is used by default.
> >
> > I think, this is expected difference in timing because we are adding
> > some vacuum related test. I am not starting server manually(means I am
> > starting server with only default setting).
> >
>
> Can you once test by setting autovacuum = off?  The autovacuum leads
> to variability in test timing.
>
>

I've also run the regression tests with and without the patch:

* w/o patch and autovacuum = on:  255 ms
* w/o patch and autovacuum = off: 258 ms
* w/ patch and autovacuum = on: 370 ms
* w/ patch and autovacuum = off: 375 ms

> > If we start server with default settings, then we will not hit vacuum
> > related test cases to parallel because size of index relation is very
> > small so we will not do parallel vacuum.

Right. Most indexes (all?) of the tables that are used in the regression
tests are smaller than min_parallel_index_scan_size. And we set
min_parallel_index_scan_size to 0 in vacuum.sql, but VACUUM would not
be sped up much because of the relation size. Since we instead
populate a new table for parallel vacuum testing, the regression test for
vacuum would take longer.
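
To make that concrete, here is a rough sketch of the kind of setup such a
test needs; this is only an approximation under assumed names (pvac_test),
not necessarily the committed test:

```sql
-- Build a dedicated table with several index types so that index
-- vacuuming can run in parallel, then let small indexes qualify by
-- zeroing min_parallel_index_scan_size.
SET min_parallel_index_scan_size = 0;
CREATE TABLE pvac_test (i int, a int[], p point)
  WITH (autovacuum_enabled = off);
INSERT INTO pvac_test
  SELECT g, ARRAY[1, 2, g], point(g, g + 1)
  FROM generate_series(1, 10000) g;
CREATE INDEX ON pvac_test USING btree (i);
CREATE INDEX ON pvac_test USING gin (a);
CREATE INDEX ON pvac_test USING gist (p);
DELETE FROM pvac_test WHERE i % 2 = 0;  -- produce garbage to clean up
VACUUM (PARALLEL 2) pvac_test;          -- two parallel index workers
```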

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: remove some STATUS_* symbols

2020-01-16 Thread Kyotaro Horiguchi
At Thu, 16 Jan 2020 14:50:01 +0900, Michael Paquier  wrote 
in 
> On Sat, Jan 11, 2020 at 08:14:17AM +0100, Peter Eisentraut wrote:
> > OK, pushed as it was then.
> 
> Thanks, that looks fine.  I am still not sure whether the second patch
> adding an enum via ProcWaitStatus improves the code readability
> though, so my take would be to discard it for now.  Perhaps others
> think differently, I don't know.

I feel the same about the second patch.

Actually, STATUS_WAITING is used only by ProcSleep and related
functions, and likewise STATUS_EOF is seen only in auth.c/h. The other
files (pqcomm.c, crypt.c, postmaster.c, hba.c, fe-auth.c, fe-connect.c,
fe-gssapi-common.c) use only STATUS_OK and STATUS_ERROR. I haven't had a
close look, but all of those usages would be equivalent to a bool.

On the other hand, many functions in fe-*.c and pqcomm.c return
EOF (-1)/0 instead of STATUS_EOF (-2)/STATUS_OK (0).

We could reorganize the values and their usage, but it doesn't seem to
be a big win.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




empty range

2020-01-16 Thread david . turon
Hello,

The default constructor for ranges uses a closed lower bound '[' and an open
upper bound ')'. This is correct behavior, but when the upper bound is the
same as the lower bound, the range is empty. Mathematically this is correct
again, but in the database the information about the range bounds is lost
(lower/upper is NULL). To prevent this situation we must check whether the
lower and upper arguments are the same and add some 0.1s to the upper bound,
or use another constructor like tstzrange(now(), now(), '[]').

Is there a chance to change the behavior of storing ranges? It would be
possible to store the range bounds in the internal structure so that
lower(tstzrange(now(), now())) shows a non-NULL value, or to change the
default behavior of tstzrange(timestamptz, timestamptz): if both args are
the same, store it as '[]', else as '[)', so that only
tstzrange(timestamptz, timestamptz, '[)') and
tstzrange(timestamptz, timestamptz, '()') store an empty range.

This is only a suggestion; I don't know if somebody wants to store an empty
range without bounds.

Currently we must have checks to prevent storing empty values in every place
where this empty range can occur, because we don't want to lose the bound
information.
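
For illustration, the behavior described above in stock PostgreSQL:

```sql
SELECT tstzrange(now(), now());            -- 'empty'
SELECT lower(tstzrange(now(), now()));     -- NULL, the bound is gone
SELECT isempty(tstzrange(now(), now()));   -- true
SELECT tstzrange(now(), now(), '[]');      -- singleton range, bounds kept
```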

Best regards,
-- 
-
Ing. David TUROŇ
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:+420 591 166 224
fax:+420 596 621 273
mobil:  +420 732 589 152
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


Re: Implementing Incremental View Maintenance

2020-01-16 Thread nuko yokohama
An error occurs when updating user-defined type columns.

Create an INCREMENTAL MATERIALIZED VIEW by specifying a query that includes
user-defined type columns. After the view is created, an error occurs when
inserting into the view's source table (which includes the user-defined type
column):
```
ERROR:  operator does not exist
```

An execution example is shown below.

```
[ec2-user@ip-10-0-1-10 ivm]$ psql testdb -a -f extension-insert.sql
--
-- pg_fraction: https://github.com/nuko-yokohama/pg_fraction
--
DROP EXTENSION IF EXISTS pg_fraction CASCADE;
psql:extension-insert.sql:4: NOTICE:  drop cascades to column data of table
foo
DROP EXTENSION
DROP TABLE IF EXISTS foo CASCADE;
DROP TABLE
CREATE EXTENSION IF NOT EXISTS pg_fraction;
CREATE EXTENSION
\dx
   List of installed extensions
Name | Version |   Schema   | Description
-+-++--
 pg_fraction | 1.0 | public | fraction data type
 plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)

\dT+ fraction
List of data types
 Schema |   Name   | Internal name | Size | Elements |  Owner   | Access
privileges | Description
+--+---+--+--+--+---+-
 public | fraction | fraction  | 16   |  | postgres |
|
(1 row)

CREATE TABLE foo (id int, data fraction);
CREATE TABLE
INSERT INTO foo (id, data) VALUES (1,'2/3'),(2,'1/3'),(3,'1/2');
INSERT 0 3
SELECT id, data FROM foo WHERE data >= '1/2';
 id | data
+--
  1 | 2/3
  3 | 1/2
(2 rows)

CREATE INCREMENTAL MATERIALIZED VIEW foo_imv AS SELECT id, data FROM foo
WHERE data >= '1/2';
SELECT 2
TABLE foo_imv;
 id | data
+--
  1 | 2/3
  3 | 1/2
(2 rows)

INSERT INTO foo (id, data) VALUES (4,'2/3'),(5,'2/5'),(6,'3/6'); -- error
psql:extension-insert.sql:17: ERROR:  operator does not exist: fraction
pg_catalog.= fraction
LINE 1: ...(mv.id IS NULL AND diff.id IS NULL)) AND (mv.data OPERATOR(p...
 ^
HINT:  No operator matches the given name and argument types. You might
need to add explicit type casts.
QUERY:  WITH updt AS (UPDATE public.foo_imv AS mv SET __ivm_count__ =
mv.__ivm_count__ OPERATOR(pg_catalog.+) diff.__ivm_count__  FROM
pg_temp_3.pg_temp_73900 AS diff WHERE (mv.id OPERATOR(pg_catalog.=) diff.id
OR (mv.id IS NULL AND diff.id IS NULL)) AND (mv.data OPERATOR(pg_catalog.=)
diff.data OR (mv.data IS NULL AND diff.data IS NULL)) RETURNING mv.id,
mv.data) INSERT INTO public.foo_imv SELECT * FROM pg_temp_3.pg_temp_73900
AS diff WHERE NOT EXISTS (SELECT 1 FROM updt AS mv WHERE (mv.id
OPERATOR(pg_catalog.=) diff.id OR (mv.id IS NULL AND diff.id IS NULL)) AND
(mv.data OPERATOR(pg_catalog.=) diff.data OR (mv.data IS NULL AND diff.data
IS NULL)));
TABLE foo;
 id | data
+--
  1 | 2/3
  2 | 1/3
  3 | 1/2
(3 rows)

TABLE foo_imv;
 id | data
+--
  1 | 2/3
  3 | 1/2
(2 rows)

DROP MATERIALIZED VIEW foo_imv;
DROP MATERIALIZED VIEW
INSERT INTO foo (id, data) VALUES (4,'2/3'),(5,'2/5'),(6,'3/6');
INSERT 0 3
TABLE foo;
 id | data
+--
  1 | 2/3
  2 | 1/3
  3 | 1/2
  4 | 2/3
  5 | 2/5
  6 | 1/2
(6 rows)

```
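
The failing query above points at the cause: the generated maintenance query
schema-qualifies the equality operator as OPERATOR(pg_catalog.=), while the
extension's operators live in the extension's schema (public here), so the
lookup fails. A minimal way to see the difference, assuming pg_fraction
defines = in public:

```sql
SELECT '1/2'::fraction OPERATOR(pg_catalog.=) '1/2'::fraction;  -- same error
SELECT '1/2'::fraction OPERATOR(public.=) '1/2'::fraction;      -- works
```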

Best regards.

On Thu, 27 Dec 2018 at 21:57, Yugo Nagata  wrote:

> Hi,
>
> I would like to implement Incremental View Maintenance (IVM) on
> PostgreSQL.
> IVM is a technique to maintain materialized views which computes and
> applies
> only the incremental changes to the materialized views rather than
> recomputing the contents as the current REFRESH command does.
>
> I had a presentation on our PoC implementation of IVM at PGConf.eu 2018
> [1].
> Our implementation uses row OIDs to compute deltas for materialized
> views.
> The basic idea is that if we have information about which rows in base
> tables
> are contributing to generate a certain row in a matview then we can
> identify
> the affected rows when a base table is updated. This is based on an idea of
> Dr. Masunaga [2] who is a member of our group and inspired from ID-based
> approach[3].
>
> In our implementation, the mapping of the row OIDs of the materialized view
> and the base tables are stored in "OID map". When a base relation is
> modified,
> an AFTER trigger is executed and the delta is recorded in delta tables using
> the transition table feature. The actual update of the matview is triggered
> by the REFRESH command with the INCREMENTALLY option.
>
> However, we realized problems with our implementation. First, WITH OIDS will
> be removed in PG12, so OIDs will no longer be available. Besides this, it
> would be hard to implement this since it needs many changes to executor
> nodes to collect base tables' OIDs while executing a query. Also, the cost
> of maintaining the OID map would be high.
>
> For these reasons, we started to think about implementing IVM without
> relying on OIDs, and we did a bit more surveying.
>
> We also looked at Kevin Grittner's discussion 

Re: Amcheck: do rightlink verification with lock coupling

2020-01-16 Thread Andrey Borodin



> On 14 Jan 2020, at 9:47, Andrey Borodin  wrote:
> 
> Page updates may be lost due to a bug in backup software with incremental
> backups, a bug in the storage layer of an Aurora-style system, a bug in the
> page cache, incorrect fsync error handling, a bug in SSD firmware, etc. And
> our data checksums do not detect this kind of corruption. BTW I think it
> would be better if our checksums were not stored on the page itself; then
> they could detect this kind of fault.

I observed this just now.
There is one HA cluster where a node was marked dead. The node was
disconnected from the cluster, but due to human error Postgres was still
running on it. The node managed to install a block-level incremental backup
into the chain, and the backup software did not detect that this backup step
was taken from a part of the timeline that was not in the actual timeline's
history.
The result of the restoration is:

man-w%/%db R # select bt_index_check('%.pk_%');
 bt_index_check 

 
(1 row)

Time: 1411.065 ms (00:01.411)
man-w%/%db R # select patched_index_check('%.pk_%');
ERROR:  XX002: left link/right link pair in index "pk_labels" not in agreement
DETAIL:  Block=42705 left block=42707 left link from block=45495.
LOCATION:  bt_recheck_block_rightlink, verify_nbtree.c:621
Time: 671.336 ms

('%' is replacing removed chars)

I understand that this corruption was not introduced by postgres itself, but
by a combination of bugs in two 3rd-party tools and human error.
But I can imagine similar corruption with different root causes.
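
For anyone who wants to sweep a whole database with such a check, something
along these lines should work (patched_index_check comes from the patch in
this thread; with stock amcheck the call would be bt_index_check, which is
what is shown here):

```sql
-- Run the verification over every btree index in the current database.
SELECT c.relname, bt_index_check(index => c.oid)
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i' AND am.amname = 'btree';
```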

Best regards, Andrey Borodin.


