Re: Have we tried to treat CTE as SubQuery in planner?

2020-11-13 Thread Andy Fan
On Sat, Nov 14, 2020 at 2:44 PM Jesse Zhang  wrote:

> Hi,
>
> On Fri, Nov 13, 2020 at 10:04 PM Andy Fan wrote:
> >
> > Hi:
> >
> > Take the following example:
> >
> > insert into cte1  select i, i from generate_series(1, 100)i;
> > create index on cte1(a);
> >
> > explain
> > with cte1 as  (select * from cte1)
> > select * from c where a = 1;
> >
>
> ITYM:
>
> EXPLAIN
> WITH c AS (SELECT * FROM cte1)
> SELECT * FROM c WHERE a = 1;
>
> I'm also guessing your table DDL is:
>
> CREATE TABLE cte1 (a int, b int);
>
> > It needs to do seq scan on the above format, however it is pretty
> > quick if we change the query to
> > select * from (select * from cte1) c where a = 1;
>
> Does it? On HEAD, I got the following plan:
>
>
You understand me correctly; I've just been too busy recently, which makes
me make mistakes like this. Sorry about that :(




> (without stats):
>  Bitmap Heap Scan on foo
>Recheck Cond: (a = 1)
>->  Bitmap Index Scan on foo_a_idx
>  Index Cond: (a = 1)
>
> (with stats):
>  Index Scan using foo_a_idx on foo
>Index Cond: (a = 1)
>
>
>

> >
> > I know how we treat cte and subqueries differently currently,
> > I just don't know why we can't treat cte as a subquery, so lots of
> > subquery related technology can apply to it.  Do we have any
> > discussion about this?
>
> This was brought up a few times, the most recent one I can recall was a
> little bit over two years ago [1]
>
> [1] https://postgr.es/m/87sh48ffhb@news-spur.riddles.org.uk


And I should at least have searched the archives for "CTE" first...


-- 
Best Regards
Andy Fan


Re: Have we tried to treat CTE as SubQuery in planner?

2020-11-13 Thread Andy Fan
On Sat, Nov 14, 2020 at 2:14 PM Tom Lane  wrote:

> Andy Fan  writes:
> > Take the following example:
>
> > insert into cte1  select i, i from generate_series(1, 100)i;
> > create index on cte1(a);
>
> > explain
> > with cte1 as  (select * from cte1)
> > select * from c where a = 1;
>
> > It needs to do seq scan on the above format, however it is pretty
> > quick if we change the query to
> > select * from (select * from cte1) c where a = 1;
>
> This example seems both confused and out of date.  Since we changed
> the rules on materializing CTEs (in 608b167f9), I get
>
>
Sorry, I should have tested it again on HEAD; 608b167f9 is exactly
the thing I meant.

regression=# create table c as select i as a, i from generate_series(1,
> 100)i;
> SELECT 100
> regression=# create index on c(a);
> CREATE INDEX
> regression=# explain
> regression-# with cte1 as (select * from c)
> regression-# select * from cte1 where a = 1;
> QUERY PLAN
> --
>  Bitmap Heap Scan on c  (cost=95.17..4793.05 rows=5000 width=8)
>Recheck Cond: (a = 1)
>->  Bitmap Index Scan on c_a_idx  (cost=0.00..93.92 rows=5000 width=0)
>  Index Cond: (a = 1)
> (4 rows)
>
> regards, tom lane
>


-- 
Best Regards
Andy Fan


Re: Have we tried to treat CTE as SubQuery in planner?

2020-11-13 Thread Jesse Zhang
Hi,

On Fri, Nov 13, 2020 at 10:04 PM Andy Fan wrote:
>
> Hi:
>
> Take the following example:
>
> insert into cte1  select i, i from generate_series(1, 100)i;
> create index on cte1(a);
>
> explain
> with cte1 as  (select * from cte1)
> select * from c where a = 1;
>

ITYM:

EXPLAIN
WITH c AS (SELECT * FROM cte1)
SELECT * FROM c WHERE a = 1;

I'm also guessing your table DDL is:

CREATE TABLE cte1 (a int, b int);

> It needs to do seq scan on the above format, however it is pretty
> quick if we change the query to
> select * from (select * from cte1) c where a = 1;

Does it? On HEAD, I got the following plan:

(without stats):
 Bitmap Heap Scan on foo
   Recheck Cond: (a = 1)
   ->  Bitmap Index Scan on foo_a_idx
 Index Cond: (a = 1)

(with stats):
 Index Scan using foo_a_idx on foo
   Index Cond: (a = 1)


>
> I know how we treat cte and subqueries differently currently,
> I just don't know why we can't treat cte as a subquery, so lots of
> subquery related technology can apply to it.  Do we have any
> discussion about this?

This was brought up a few times, the most recent one I can recall was a
little bit over two years ago [1]

[1] https://postgr.es/m/87sh48ffhb@news-spur.riddles.org.uk




Re: Have we tried to treat CTE as SubQuery in planner?

2020-11-13 Thread Julien Rouhaud
On Sat, Nov 14, 2020 at 2:04 PM Andy Fan  wrote:
>
> Hi:
>
> Take the following example:
>
> insert into cte1  select i, i from generate_series(1, 100)i;
> create index on cte1(a);
>
> explain
> with cte1 as  (select * from cte1)
> select * from c where a = 1;
>
> It needs to do seq scan on the above format, however it is pretty
> quick if we change the query to
> select * from (select * from cte1) c where a = 1;
>
> I know how we treat cte and subqueries differently currently,
> I just don't know why we can't treat cte as a subquery, so lots of
> subquery related technology can apply to it.  Do we have any
> discussion about this?

See 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=608b167f9f9c4553c35bb1ec0eab9ddae643989b




Re: Have we tried to treat CTE as SubQuery in planner?

2020-11-13 Thread Tom Lane
Andy Fan  writes:
> Take the following example:

> insert into cte1  select i, i from generate_series(1, 100)i;
> create index on cte1(a);

> explain
> with cte1 as  (select * from cte1)
> select * from c where a = 1;

> It needs to do seq scan on the above format, however it is pretty
> quick if we change the query to
> select * from (select * from cte1) c where a = 1;

This example seems both confused and out of date.  Since we changed
the rules on materializing CTEs (in 608b167f9), I get

regression=# create table c as select i as a, i from generate_series(1, 
100)i;
SELECT 100
regression=# create index on c(a);
CREATE INDEX
regression=# explain
regression-# with cte1 as (select * from c)
regression-# select * from cte1 where a = 1;
QUERY PLAN
--
 Bitmap Heap Scan on c  (cost=95.17..4793.05 rows=5000 width=8)
   Recheck Cond: (a = 1)
   ->  Bitmap Index Scan on c_a_idx  (cost=0.00..93.92 rows=5000 width=0)
 Index Cond: (a = 1)
(4 rows)
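
For reference, 608b167f9 also added per-query control over materialization,
so both behaviors can still be reproduced against the table above; a minimal
sketch (the plan shapes in the comments are indicative only):

explain
with cte1 as materialized (select * from c)
select * from cte1 where a = 1;
-- forces the old behavior: a CTE Scan with Filter (a = 1) over a Seq Scan on c

explain
with cte1 as not materialized (select * from c)
select * from cte1 where a = 1;
-- inlined like a subquery, so the index on c(a) can be used, as in the plan above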

regards, tom lane




Have we tried to treat CTE as SubQuery in planner?

2020-11-13 Thread Andy Fan
Hi:

Take the following example:

insert into cte1  select i, i from generate_series(1, 100)i;
create index on cte1(a);

explain
with cte1 as  (select * from cte1)
select * from c where a = 1;

It needs to do seq scan on the above format, however it is pretty
quick if we change the query to
select * from (select * from cte1) c where a = 1;

I know how we treat cte and subqueries differently currently,
I just don't know why we can't treat cte as a subquery, so lots of
subquery related technology can apply to it.  Do we have any
discussion about this?

Thanks

-- 
Best Regards
Andy Fan


Re: Supporting = operator in gin/gist_trgm_ops

2020-11-13 Thread Alexander Korotkov
Hi!

On Fri, Nov 13, 2020 at 1:47 PM Georgios  wrote:
> In short, I think v3 of the patch looks good to change to 'RFC' status.
> Given the possible costing concerns, I will refrain from changing the
> status just yet, to give the opportunity to more reviewers to chime in.
> If in the next few days there are no more reviews, I will update the
> status to 'RFC' to move the patch forward.
>
> Thoughts?

I went through and revised this patch.  I made the documentation
statement less categorical: pg_trgm gist/gin indexes might have lower
performance for equality operator searches than a B-tree, so we can't
claim that a B-tree index is never needed.  Also, the simple comparison
operators <, <=, >, and >= are not supported.

I have also checked that btree_gist is preferred over a pg_trgm gist
index for equality searches.  Even though our gist cost estimate is
quite dumb, it selects the btree_gist index due to its smaller size.
So this part also looks good to me.
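
As a concrete illustration of what the patch enables (a minimal sketch, not
taken from the patch's regression tests; table and index names are made up):

create extension if not exists pg_trgm;
create table words (w text);
create index words_w_trgm_idx on words using gin (w gin_trgm_ops);
-- with the patch applied, an equality search like this can use the
-- trigram index:
explain select * from words where w = 'postgres';
-- a plain btree index on w may still be cheaper, which is why the docs
-- shouldn't claim that a btree index is never needed.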

I'm going to push this if no objections.

--
Regards,
Alexander Korotkov


v4-0001-Handle-equality-operator-in-contrib-pg_trgm.patch
Description: Binary data


Re: "pg_ctl: the PID file ... is empty" at end of make check

2020-11-13 Thread Tom Lane
Thomas Munro  writes:
> My bug report got an automated-looking message telling me to retest in
> Big Sur beta 6 back in September, and I've just now upgraded an old
> x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so
> it looks like it was fixed!  Thanks, Apple.

Hah!  I've occasionally despaired that Apple actually reads anything
I file there, so this is glad tidings indeed.

regards, tom lane




Re: "pg_ctl: the PID file ... is empty" at end of make check

2020-11-13 Thread Thomas Munro
On Fri, Oct 18, 2019 at 2:26 PM Thomas Munro  wrote:
> On Fri, Oct 18, 2019 at 1:21 AM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > On Tue, Oct 15, 2019 at 1:55 PM Tom Lane  wrote:
> > >> and now prairiedog has shown it too:
> > >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47
> > >> which is positively fascinating, because prairiedog is running a
> > >> bronze-age version of macOS that surely never heard of APFS.
> > >> So this makes it look like this is a basic macOS bug that's not
> > >> as filesystem-dependent as one might think.
> >
> > > Does https://github.com/macdice/unlinktest show the problem on that 
> > > system?
> >
> > It does, though with a very low frequency:
> >
> > $ ./unlinktest
> > $ ./unlinktest 1
> > read 0 bytes, unexpected
> > $ ./unlinktest 1
> > read 0 bytes, unexpected
> > read 0 bytes, unexpected
> > $ ./unlinktest 1
> > $
> >
> > The failure rate on my recent-vintage laptop is more like one
> > failure every five loops.
>
> Wow.  Ok, I'll add a note to the bug report to say it's reproducible
> on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007;
> root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device
> that will let me log into the bug thing.  On the off-chance that
> someone from Apple stumbles on this and is interested, the Radar
> number is rdar://46318934 and the title is "unlink(2) is not atomic
> (kernel/file system bug)".

My bug report got an automated-looking message telling me to retest in
Big Sur beta 6 back in September, and I've just now upgraded an old
x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so
it looks like it was fixed!  Thanks, Apple.




Re: Misc typos

2020-11-13 Thread Michael Paquier
On Fri, Nov 13, 2020 at 12:43:41PM +0100, Daniel Gustafsson wrote:
> When spellchecking one of my patches for submission, a few other typos fell 
> out
> as per the attached.  The one change which isn't in a comment is the object
> address class description used for error messages:
> 
> - "extented statistics",
> + "extended statistics",
> 
> It's used in an elog for cache lookup failures, so maybe it should be
> backported?

This is new as of HEAD (b1d32d3), so applied there.  Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: Batch/pipelining support for libpq

2020-11-13 Thread Alvaro Herrera
Here's a v25.

I made a few more changes to the docs per David's suggestions; I also
reordered the sections quite a bit.  It's now like this:
 * Batch Mode
   * Using Batch Mode
 * Issuing Queries
 * Processing Results
 * Error Handling
 * Interleaving Result Processing and Query Dispatch
 * Ending Batch Mode
   * Functions Associated with Batch Mode
   * When to Use Batching

To me as a reader, this makes more sense, but if you disagree, I think
we should discuss further changes.  (For example, maybe we should move
the "Functions" section at the end?)  The original had "When to Use
Batching" at the start, but it seemed to me that the points it's making
are not as critical as understanding what it is.

I reworked the test program to better fit the TAP model; I found that if
one test failed in whatever way, the connection would be in a weird state
and cause the next test to also fail.  I changed it so that it runs one
test and exits; then the t/001_libpq_async.pl file (hmm, need to rename to
001_batch.pl I guess) calls it once for each test.  I adapted the test
code to our code style.  I also removed the "timings" stuff; I think
that's something better left to pgbench.

(I haven't looked at Daniel's pgbench stuff yet, but I will do that
next.)

While looking at how the tests worked, I gave a hard stare at the new
libpq code and cleaned it up also.  There are a lot of minor changes, but
nothing truly substantial.  I moved the code around a lot, to keep things
grouped together where they belong.

I'm not 100% clear on things like PGconn->batch_status and how
PGconn->asyncStatus works.  Currently everything seems to work
correctly, but I'm worried that because we add new status values to
asyncStatus, some existing code might not be doing everything correctly
(for example, when changing to/from ASYNC_BUSY in some cases, are we 100%
sure we shouldn't be changing to ASYNC_QUEUED?)


While looking this over I noticed a thread from 2014[1] where Matt
Newell tried to implement this stuff, and apparently the main review
comment he got before abandoning the patch was that the user would like
a way to access the query that corresponded to each result.  The current
patch set does not address that need; the approach to this problem is
that it's on the application's head to keep track of this.  Honestly I
don't understand why it would be otherwise ... I'm not sure that it
makes sense to expect that the application is stupid enough that it
doesn't keep track of the order in which it sent things, but bright
enough to keep pointers to the queries it sent (??).  So this seems okay
to me.  But I added Heikki and Claudio to CC because of that old thread.


[1] https://postgr.es/m/2059418.jZQvL3lH90@obsidian

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 73c6a6ea4d..60477ca85c 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3061,6 +3061,30 @@ ExecStatusType PQresultStatus(const PGresult *res);

   
  
+
+ 
+  PGRES_BATCH_END
+  
+   
+The PGresult represents the end of a batch.
+This status occurs only when batch mode has been selected.
+   
+  
+ 
+
+ 
+  PGRES_BATCH_ABORTED
+  
+   
+The PGresult represents a batch that's
+received an error from the server.  PQgetResult
+must be called repeatedly, and it will return this status code,
+until the end of the current batch, at which point it will return
+PGRES_BATCH_END and normal processing can resume.
+   
+  
+ 
+
 
 
 If the result status is PGRES_TUPLES_OK or
@@ -4819,6 +4843,482 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Batch Mode
+
+  
+   libpq
+   batch mode
+  
+
+  
+   pipelining
+   in libpq
+  
+
+  
+   libpq batch mode allows applications to
+   send a query without having to read the result of the previously
+   sent query.  Taking advantage of the batch mode, a client will wait
+   less for the server, since multiple queries/results can be sent/
+   received in a single network transaction.
+  
+
+  
+   While batch mode provides a significant performance boost, writing
+   clients using the batch mode is more complex because it involves
+   managing a queue of pending queries and finding which result
+   corresponds to which query in the queue.
+  
+
+  
+   Using Batch Mode
+
+   
+To issue batches the application must switch a connection into batch mode.
+Enter batch mode with 
+or test whether batch mode is active with
+.
+In batch mode, only asynchronous operations
+are permitted, and COPY is not recommended as it
+may trigger failure in batch processing.  Using any synchronous
+command execution functions such as PQfn,
+PQexec or one of its sibling functions are error
+conditions.
+   
+
+   
+
+ It is best to use batch mode 

Re: Zedstore - compressed in-core columnar storage

2020-11-13 Thread Tomas Vondra
On 11/13/20 8:07 PM, Jacob Champion wrote:
> On Nov 12, 2020, at 2:40 PM, Tomas Vondra  
> wrote:
>>
>> Hi,
>>
>> Thanks for the updated patch. It's quite a massive amount of code - I
>> don't think we had many 2MB patches in the past, so this is by no means
>> a full review.
> 
> Thanks for taking a look! You're not kidding about the patch size.
> 
> FYI, the tableam changes made recently have been extracted into their
> own patch, which is up at [1].
> 
>> 1) the psql_1.out is missing a bit of expected output (due to 098fb0079)
> 
> Yeah, this patch was rebased as of efc5dcfd8a.
> 
>> 2) I'm getting crashes in intarray contrib, due to hitting this error in
>> lwlock.c (backtrace attached):
>>
>>  /* Ensure we will have room to remember the lock */
>>  if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
>>  elog(ERROR, "too many LWLocks taken");
>>
>> I haven't investigated this too much, but it's a regular build with
>> asserts and TAP tests, so it should be simple to reproduce using "make
>> check-world" I guess.
> 
> I've only seen this intermittently in installcheck, and I'm not able to
> reproduce with the intarray tests on my machine (macOS). Definitely
> something we need to look into. What OS are you testing on?
> 

Fedora 32, nothing special. I'm not sure if I ran the tests with pglz or
lz4, maybe there's some dependence on that, but it does fail for me
quite reliably with this:

./configure --enable-debug  --enable-cassert --enable-tap-tests
--with-lz4 && make -s clean && make -s -j4 && make check-world

>> It's mostly expected lz4 beats pglz in performance and compression
>> ratio, but this seems a bit too extreme I guess. Per past benchmarks
>> (e.g. [1] and [2]) the difference in compression/decompression time
>> should be maybe 1-2x or something like that, not 35x like here.
> 
> Yeah, something seems off about that. We'll take a look.
> 
>> BTW the comments in general need updating and tidying up, to make
>> reviews easier. For example the merge_attstream comment references
>> attstream1 and attstream2, but those are not the current parameters of
>> the function.
> 
> Agreed.
> 
>> 5) IMHO there should be a #define specifying the maximum number of items
>> per chunk (60). Currently there are literal constants used in various
>> places, sometimes 60, sometimes 59 etc. which makes it harder to
>> understand the code. FWIW 60 seems a bit low, but maybe it's OK.
> 
> Yeah, that seems like a good idea.
> 
> I think the value 60 comes from the use of simple-8b encoding -- see the
> comment at the top of zedstore_attstream.c.
> 

Yeah, I understand where it comes from. I'm just saying that when you
see 59 hardcoded, it may not be obvious where it came from, and
something like ITEMS_PER_CHUNK would be better.

I wonder how complicated it would be to allow larger chunks, e.g. by
using one bit to say "there's another 64-bit codeword". Not sure if it's
worth the extra complexity, though - it's just that 60 feels a bit low.

>> 6) I do think ZSAttStream should track which compression is used by the
>> stream, for two main reasons. Firstly, there's another patch to support
>> "custom compression" methods, which (also) allows multiple compression
>> methods per column. It'd be a bit strange to support that for varlena
>> columns in heap table, and not here, I guess. Secondly, I think one of
>> the interesting columnstore features down the road will be execution on
>> compressed data, which however requires compression method designed for
>> that purpose, and it's often datatype-specific (delta encoding, ...).
>>
>> I don't think we need to go as far as supporting "custom" compression
>> methods here, but I think we should allow different built-in compression
>> methods for different attstreams.
> 
> Interesting. We'll need to read/grok that ML thread.
> 

That thread is a bit long; I'm not sure it's worth reading as a whole unless
you want to work on that feature. The gist is that to seamlessly support
multiple compression algorithms we need to store an ID of the algorithm
somewhere. For TOAST that's not too difficult - we can do that in the
TOAST pointer - the main challenge is doing it in a
backwards-compatible way. For zedstore we can actually design it from
the start.

I wonder if we should track a version of the format somewhere, to allow
future improvements. So that if/when we decide to change something in
the future, we don't have to scavenge bits etc. Or perhaps just a
"uint32 flags" field, unused/reserved for future use.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

2020-11-13 Thread Tom Lane
Alexander Lakhin  writes:
> 13.11.2020 23:16, Tom Lane wrote:
>> That would soon lead us to changing every stat() caller in the system
>> to have Windows-specific looping logic.  No thanks.  If we need to do
>> this, let's put in a Windows wrapper layer comparable to pgwin32_open()
>> for open().

> Maybe pgwin32_safestat() should perform this...

Uh ... now that you mention it, that's gone since bed90759f.

There is code in win32stat.c that purports to cope with this case, though
I've not tested it personally.

regards, tom lane




Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

2020-11-13 Thread Alexander Lakhin
13.11.2020 23:16, Tom Lane wrote:
> Alexander Lakhin  writes:
>> Shouldn't pg_ls_dir_files() retry stat() on ERROR_ACCESS_DENIED just
>> like the pgwin32_open() does to ignore files in "delete pending" state?
> That would soon lead us to changing every stat() caller in the system
> to have Windows-specific looping logic.  No thanks.  If we need to do
> this, let's put in a Windows wrapper layer comparable to pgwin32_open()
> for open().
Maybe pgwin32_safestat() should perform this... For now it checks
GetLastError() for ERROR_DELETE_PENDING, but as we've found out
previously this error code in fact is not returned by the OS.
And if the fix is not going to be quick, probably it should be discussed
in another thread...

Best regards,
Alexander




Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

2020-11-13 Thread Tom Lane
Alexander Lakhin  writes:
> Shouldn't pg_ls_dir_files() retry stat() on ERROR_ACCESS_DENIED just
> like the pgwin32_open() does to ignore files in "delete pending" state?

That would soon lead us to changing every stat() caller in the system
to have Windows-specific looping logic.  No thanks.  If we need to do
this, let's put in a Windows wrapper layer comparable to pgwin32_open()
for open().

regards, tom lane




Re: pg11+: pg_ls_*dir LIMIT 1: temporary files .. not closed at end-of-transaction

2020-11-13 Thread Alexander Lakhin
Hello hackers,
31.03.2020 19:41, Tom Lane wrote:
> Justin Pryzby  writes:
>> I suggest to leave stat() alone in your patch for stable releases.  I think
>> it's okay if we change behavior so that a broken symlink is skipped instead 
>> of
>> erroring (as a side effect of skipping ENOENT with stat()).  But not okay if 
>> we
> change pg_ls_logdir() to hide symlinks in back branches.
> Meh.  I'm not really convinced, but in the absence of anyone expressing
> support for my position, I'll do it that way.  I don't think it's worth
> doing both a stat and lstat to tell the difference between file-is-gone
> and file-is-a-broken-symlink.
As we've discovered in Bug #[16161], stat() for a "concurrently-deleted
file" can also return ERROR_ACCESS_DENIED on Windows. It seems that the
pg_upgradeCheck failures seen on
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=fairywren&br=REL_13_STABLE
are caused by the same issue.
Shouldn't pg_ls_dir_files() retry stat() on ERROR_ACCESS_DENIED just
like the pgwin32_open() does to ignore files in "delete pending" state?

[16161]
https://www.postgresql.org/message-id/16161-7a985d2f1bbe8f71%40postgresql.org

Best regards,
Alexander




Re: Hash support for row types

2020-11-13 Thread Tom Lane
Peter Eisentraut  writes:
> Here is an updated patch with the type cache integration added.

> To your point, this now checks each field's hashability before 
> considering a row type as hashable.  It can still have run-time errors 
> for untyped record datums, but that's not something we can do anything 
> about.

This looks good code-wise.  A couple small niggles on the tests:

* The new test in with.sql claims to be testing row hashing, but
it's not very apparent that any such thing actually happens.  Maybe
EXPLAIN the query, as well as execute it, to confirm that a
hash-based plan is used.

* Is it worth devising a test case in which hashing is not possible
because one of the columns isn't hashable?  I have mixed feelings
about this because the set of suitable column types may decrease
to empty over time, making it hard to maintain the test case.

I marked it RFC.
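
Regarding the first point, a minimal sketch of the kind of check that could
be added (illustrative only, not the actual regression test; the table name
is made up):

create table rowhash_t (a int, b int);
explain (costs off)
select distinct row(a, b) from rowhash_t;
-- with hash support for row types, the planner can choose a HashAggregate
-- node here; without it, de-duplicating the composite values has to fall
-- back to a sort-based plan.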

regards, tom lane




Re: Zedstore - compressed in-core columnar storage

2020-11-13 Thread Jacob Champion
On Nov 12, 2020, at 2:40 PM, Tomas Vondra  wrote:
> 
> Hi,
> 
> Thanks for the updated patch. It's quite a massive amount of code - I
> don't think we had many 2MB patches in the past, so this is by no means
> a full review.

Thanks for taking a look! You're not kidding about the patch size.

FYI, the tableam changes made recently have been extracted into their
own patch, which is up at [1].

> 1) the psql_1.out is missing a bit of expected output (due to 098fb0079)

Yeah, this patch was rebased as of efc5dcfd8a.

> 2) I'm getting crashes in intarray contrib, due to hitting this error in
> lwlock.c (backtrace attached):
> 
>   /* Ensure we will have room to remember the lock */
>   if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
>   elog(ERROR, "too many LWLocks taken");
> 
> I haven't investigated this too much, but it's a regular build with
> asserts and TAP tests, so it should be simple to reproduce using "make
> check-world" I guess.

I've only seen this intermittently in installcheck, and I'm not able to
reproduce with the intarray tests on my machine (macOS). Definitely
something we need to look into. What OS are you testing on?

> It's mostly expected lz4 beats pglz in performance and compression
> ratio, but this seems a bit too extreme I guess. Per past benchmarks
> (e.g. [1] and [2]) the difference in compression/decompression time
> should be maybe 1-2x or something like that, not 35x like here.

Yeah, something seems off about that. We'll take a look.

> BTW the comments in general need updating and tidying up, to make
> reviews easier. For example the merge_attstream comment references
> attstream1 and attstream2, but those are not the current parameters of
> the function.

Agreed.

> 5) IMHO there should be a #define specifying the maximum number of items
> per chunk (60). Currently there are literal constants used in various
> places, sometimes 60, sometimes 59 etc. which makes it harder to
> understand the code. FWIW 60 seems a bit low, but maybe it's OK.

Yeah, that seems like a good idea.

I think the value 60 comes from the use of simple-8b encoding -- see the
comment at the top of zedstore_attstream.c.

> 6) I do think ZSAttStream should track which compression is used by the
> stream, for two main reasons. Firstly, there's another patch to support
> "custom compression" methods, which (also) allows multiple compression
> methods per column. It'd be a bit strange to support that for varlena
> columns in heap table, and not here, I guess. Secondly, I think one of
> the interesting columnstore features down the road will be execution on
> compressed data, which however requires compression method designed for
> that purpose, and it's often datatype-specific (delta encoding, ...).
> 
> I don't think we need to go as far as supporting "custom" compression
> methods here, but I think we should allow different built-in compression
> methods for different attstreams.

Interesting. We'll need to read/grok that ML thread.

Thanks again for the review!

--Jacob

[1] 
https://www.postgresql.org/message-id/CAE-ML%2B9RmTNzKCNTZPQf8O3b-UjHWGFbSoXpQa3Wvuc8YBbEQw%40mail.gmail.com





Re: [PATCH] Support negative indexes in split_part

2020-11-13 Thread Tom Lane
Jacob Champion  writes:
> Patch looks good to me. Seems like a useful feature, and I agree that the 
> two-pass implementation makes the change very easy to review.

LGTM too.  I made a couple of cosmetic improvements and pushed it.

> Quick note on test coverage: gcov marks the "needle not found" branch (the 
> one marked `/* special case if fldsep not found at all */`) as being 
> completely uncovered. I don't think that needs to gate this patch; it looks 
> like it was uncovered before this feature was added.

We seem to be trying for full test coverage of this function now,
so I added a test case for that branch too.
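
For anyone following along, the behavior being tested looks roughly like
this (a sketch, with the expected outputs shown as comments):

select split_part('a,b,c', ',', -1);   -- 'c'  (negative numbers count from the end)
select split_part('a,b,c', ',', -3);   -- 'a'
-- the "needle not found" branch: the separator does not appear, so the
-- whole string is a single field; asking for the first/last field returns
-- it, and anything else is empty
select split_part('abc', ',', -1);     -- 'abc'
select split_part('abc', ',', -2);     -- ''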

> Doc builds are currently failing due to what appears to be an xmllint failure:

Unrelated, but see

https://www.postgresql.org/message-id/flat/E2EE6B76-2D96-408A-B961-CAE47D1A86F0%40yesql.se

regards, tom lane




Re: error_severity of brin work item

2020-11-13 Thread Justin Pryzby
On Fri, Nov 13, 2020 at 01:39:31PM -0300, Alvaro Herrera wrote:
> On 2020-Nov-13, Justin Pryzby wrote:
> 
> > I saw a bunch of these in my logs:
> > 
> > log_time | 2020-10-25 22:59:45.619-07
> > database | 
> > left | could not open relation with OID 292103095
> > left | processing work entry for relation 
> > "ts.child.alarms_202010_alarm_clear_time_idx"
> > 
> > Those happen following a REINDEX job on that index.
> > 
> > I think that should be more like an INFO message, since that's what vacuum 
> > does
> > (vacuum_open_relation), and a queued work item is even more likely to hit a
> > dropped relation.
> 
> Ah, interesting.  Yeah, I agree this is a bug.  I think it can be fixed
> by using try_relation_open() on the index; if that returns NULL, discard
> the work item.
> 
> Does this patch solve the problem?

Your patch didn't actually say "try_relation_open", so it didn't work.
But it does work if I do that and close the table.

I tested like:

pryzbyj=# ALTER SYSTEM SET 
backtrace_functions='try_relation_open,relation_open';
pryzbyj=# ALTER SYSTEM SET autovacuum_naptime=3; SELECT pg_reload_conf();
pryzbyj=# CREATE TABLE tt AS SELECT generate_series(1,)i;
pryzbyj=# CREATE INDEX ON tt USING brin(i) 
WITH(autosummarize,pages_per_range=1);
pryzbyj=# \! while :; do psql -h /tmp -qc 'SET client_min_messages=info' -c 
'REINDEX INDEX CONCURRENTLY tt_i_idx'; done&

-- run this 5-10 times and hit the "...was not recorded" message, which for
-- whatever reason causes the race condition involving work queue
pryzbyj=# UPDATE tt SET i=1+i;

2020-11-13 11:50:46.093 CST [30687] ERROR:  could not open relation with OID 
1110882
2020-11-13 11:50:46.093 CST [30687] CONTEXT:  processing work entry for 
relation "pryzbyj.public.tt_i_idx"
2020-11-13 11:50:46.093 CST [30687] BACKTRACE:
postgres: autovacuum worker pryzbyj(+0xb9ce8) [0x55acf2af0ce8]
postgres: autovacuum worker pryzbyj(index_open+0xb) [0x55acf2bab59b]
postgres: autovacuum worker pryzbyj(brin_summarize_range+0x8f) 
[0x55acf2b5b5bf]
postgres: autovacuum worker pryzbyj(DirectFunctionCall2Coll+0x62) 
[0x55acf2f40372]
...

-- 
Justin
>From e08c6d3e2b10964633904ff247e70330077d31b4 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 13 Nov 2020 13:39:31 -0300
Subject: [PATCH v2] error_severity of brin work item

---
 src/backend/access/brin/brin.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..8278a5209c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -887,8 +887,10 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	/*
 	 * We must lock table before index to avoid deadlocks.  However, if the
 	 * passed indexoid isn't an index then IndexGetRelation() will fail.
-	 * Rather than emitting a not-very-helpful error message, postpone
-	 * complaining, expecting that the is-it-an-index test below will fail.
+	 * Rather than emitting a not-very-helpful error message, prepare to
+	 * return without doing anything.  This allows autovacuum work-items to be
+	 * silently discarded rather than uselessly accumulating error messages in
+	 * the server log.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
 	if (OidIsValid(heapoid))
@@ -896,7 +898,14 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	else
 		heapRel = NULL;
 
-	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+	indexRel = try_relation_open(indexoid, ShareUpdateExclusiveLock);
+	if (heapRel == NULL || indexRel == NULL)
+	{
+		if (heapRel != NULL)
+			table_close(heapRel, ShareUpdateExclusiveLock);
+
+		PG_RETURN_INT32(0);
+	}
 
 	/* Must be a BRIN index */
 	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
-- 
2.17.0



Table AM modifications to accept column projection lists

2020-11-13 Thread Soumyadeep Chakraborty
Hello,

This patch introduces a set of changes to the table AM APIs, making them
accept a column projection list. That helps columnar table AMs, so that
they don't need to fetch all columns from disk, but only the ones
actually needed.

The set of changes in this patch is not exhaustive -
there are many more opportunities that are discussed in the TODO section
below. Before digging deeper, we want to elicit early feedback on the
API changes and the column extraction logic.

TableAM APIs that have been modified are:

1. Sequential scan APIs
2. Index scan APIs
3. API to lock and return a row
4. API to fetch a single row

We have seen performance benefits in Zedstore for many of the optimized
operations [0]. This patch is extracted from the larger patch shared in
[0].


Building the column projection set:

In terms of building the column projection set necessary for each of
these APIs, this patch builds off of the scanCols patch [1], which
Ashwin and Melanie had started earlier. As noted in [1], there are cases
where the scanCols set is not representative of the columns to be
projected. For instance, in a DELETE .. RETURNING query, there is
typically a sequential scan and a separate invocation of
tuple_fetch_row_version() in order to satisfy the RETURNING clause (see
ExecDelete()). So for a query such as:

  DELETE from foo WHERE i < 100 AND j < 1000 RETURNING k, l;

We need to pass the set (i, j) to the scan and (k, l) to the
tuple_fetch_row_version() invocation. This is why we had to introduce
the returningCols field.

In the same spirit, separate column projection sets are computed for any
operations that involve an EPQ check (INSERT, DELETE, UPDATE, row-level
locking etc), the columns involved in an ON CONFLICT UPDATE etc.

Recognizing and collecting these sets of columns is done at various
stages: analyze and rewrite, planner and executor - depending on the
type of operation for which the subset of columns is calculated. The
column bitmaps are stored in different places as well - such as the ones
for scans and RETURNING are stored in RangeTblEntry, whereas the set of
columns for ON CONFLICT UPDATE are stored in OnConflictSetState.


Table AM API changes:

The changes made to the table AM API, introducing the column projection
set, come in different flavors. We would like feedback on what style
we need to converge to or if we should use different styles depending
on the situation.

- A new function variant that takes a column projection list, such as:

  TableScanDesc (*scan_begin) (Relation rel,
Snapshot snapshot,
int nkeys, struct ScanKeyData *key,
ParallelTableScanDesc pscan,
uint32 flags);
->

  TableScanDesc (*scan_begin_with_column_projection)(Relation relation,
 Snapshot snapshot,
 int nkeys, struct ScanKeyData *key,
 ParallelTableScanDesc parallel_scan,
 uint32 flags,
 Bitmapset *project_columns);

- Modifying the existing function to take a column projection list, such
as:

  TM_Result (*tuple_lock) (Relation rel,
 ItemPointer tid,
 Snapshot snapshot,
 TupleTableSlot *slot,
 CommandId cid,
 LockTupleMode mode,
 LockWaitPolicy wait_policy,
 uint8 flags,
 TM_FailureData *tmfd);

->

  TM_Result (*tuple_lock) (Relation rel,
 ItemPointer tid,
 Snapshot snapshot,
 TupleTableSlot *slot,
 CommandId cid,
 LockTupleMode mode,
 LockWaitPolicy wait_policy,
 uint8 flags,
 TM_FailureData *tmfd,
 Bitmapset *project_cols);

- A new function index_fetch_set_column_projection() to be called after
index_beginscan() to set the column projection set, which will be used
later by index_getnext_slot().

  void (*index_fetch_set_column_projection) (struct IndexFetchTableData *data,
 Bitmapset *project_columns);

The set of columns expected by the new/modified functions is represented
as a Bitmapset of attnums for a specific base relation. An empty/NULL
bitmap signals to the AM that no data columns are needed. A bitmap
containing the single element 0 indicates that we want all data columns
to be fetched.

The bitmaps do not include system columns.

Additionally, the TupleTableSlots populated by functions such as
table_scan_getnextslot() need to be densely filled up to the highest
numbered column in the projection list (any column not in the projection
list should be populated with NULL). This is due to the implicit
assumptions of the slot_get_***() APIs.


TODOs:

- Explore opportunities to push the column extraction logic to the
planner or pre-planner stages from the executor stage (like scanCols and
returningCols), or at least elevate the column extraction logic to be
done once per executor run instead of once per tuple.

- As was requested in [1], we should guard column projection set
extraction 

Re: Add docs stub for recovery.conf

2020-11-13 Thread Bruce Momjian
On Fri, Nov 13, 2020 at 01:27:34PM +0800, Craig Ringer wrote:
> On Fri, Nov 13, 2020 at 11:50 AM Bruce Momjian  wrote:
>  
> 
> > So you are saying you don't think you are getting sufficient thought
> > into your proposal, and getting just a reflex?  Just because we don't
> > agree with you don't mean we didn't think about it.  In fact, we have
> > thought about it a lot, which is evident from the URL I sent you
> > already.
> 
> 
> I am mostly trying to say that I don't think the issues I raised were actually
> addressed in the proposed alternatives. I put in a fair bit of effort to
> clearly set out the problem that this is meant to solve, and was frustrated to
> perceive the response as "yeah, nah, lets just do this other thing that only
> addresses one part of the original issue." It wasn't clear why my proposal
> appeared to be being rejected. Perhaps I didn't fully grasp the context of the
> linked discussion.

I think the big problem, and I have seen this repeatedly, is showing up
with a patch without discussing whether people actually want the
feature.  I know it is a doc issue, but our TODO list has the order as:

Desirability -> Design -> Implement -> Test -> Review -> Commit

and there is a reason for that.  When you appear with a patch, you are
already far down the steps, and you have to back up to explain why it is
useful.

Clearly we have need for documenting these renamings somewhere. We were
going to go with a simple URL redirect and a "tip" for
default/pre-installed roles, but I like the idea of doing something more
wholistic that covers all of our recent renaming cases.  Let's get
buy-in from that, and then someone can work on a patch.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: error_severity of brin work item

2020-11-13 Thread Alvaro Herrera
On 2020-Nov-13, Justin Pryzby wrote:

> I saw a bunch of these in my logs:
> 
> log_time | 2020-10-25 22:59:45.619-07
> database | 
> left | could not open relation with OID 292103095
> left | processing work entry for relation 
> "ts.child.alarms_202010_alarm_clear_time_idx"
> 
> Those happen following a REINDEX job on that index.
> 
> I think that should be more like an INFO message, since that's what vacuum 
> does
> (vacuum_open_relation), and a queued work item is even more likely to hit a
> dropped relation.

Ah, interesting.  Yeah, I agree this is a bug.  I think it can be fixed
by using try_relation_open() on the index; if that returns NULL, discard
the work item.

Does this patch solve the problem?

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..6021b16a01 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -887,8 +887,10 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	/*
 	 * We must lock table before index to avoid deadlocks.  However, if the
 	 * passed indexoid isn't an index then IndexGetRelation() will fail.
-	 * Rather than emitting a not-very-helpful error message, postpone
-	 * complaining, expecting that the is-it-an-index test below will fail.
+	 * Rather than emitting a not-very-helpful error message, prepare to
+	 * return without doing anything.  This allows autovacuum work-items to be
+	 * silently discarded rather than uselessly accumulating error messages in
+	 * the server log.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
 	if (OidIsValid(heapoid))
@@ -897,6 +899,11 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 		heapRel = NULL;
 
 	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+	if (heapRel == NULL || indexRel == NULL)
+	{
+		/* Table or index don't exist anymore; ignore this request */
+		PG_RETURN_INT32(0);
+	}
 
 	/* Must be a BRIN index */
 	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||


Re: [PATCH] Combine same ternary types in GIN and TSearch

2020-11-13 Thread Tom Lane
Heikki Linnakangas  writes:
> On 13/11/2020 11:04, Pavel Borisov wrote:
>> For historical reasons, now we have two differently named but similar 
>> ternary data types in TSearch and Gin text-related types. Before v13 
>> there was also Gin's private TS_execute() version, from which we 
>> eventually shifted to Tsearch's TS_execute().
>> To make things more even and beautiful I've made a minor refactor to 
>> combine two left ternary types into one.

> GIN is not just for full-text search, so using TSTernaryValue in 
> GinScanKeyData is wrong. And it would break existing extensions.

> I didn't look much further than that, but I've got a feeling that 
> combining those is a bad idea. TSTernaryValue is used in text-search 
> code, even when there is no GIN involved. It's a separate concept, even 
> though it happens to have the same values.

I'm definitely not on board with importing a TS-specific type into GIN,
and even less with requiring major GIN headers to import random
TS-related headers.

There might be a case for having just one neutrally-named "ternary" enum
type, declared in a neutral (probably new) header, that both areas of
the code could use.  But it's not clear that it'd be worth the code
thrashing to do that.  As Heikki says, this will surely break some
extensions; and I'd prefer that there be some non-cosmetic benefit
if we ask extension authors to cope with that.

regards, tom lane




Re: Strange behavior with polygon and NaN

2020-11-13 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Tue, 10 Nov 2020 14:30:08 -0500, Tom Lane  wrote in 
>> For instance, {1,-1,0} is the line "x = y".  We could argue about
>> whether it'd be sensible to return zero for the distance between that
>> and the point (inf,inf), but surely any point with one inf and one
>> finite coordinate must be an infinite distance away from that line.
>> There's nothing ill-defined about that situation.

> Mmm... (swinging my arms to mimic lines..)
> dist(x = y, (1e300, Inf)) looks indeterminant to me..

Well, what you're showing is that we get an internal overflow,
essentially, on the way to calculating the result.  Which is true,
so it's sort of accidental that we got a sensible result before.
Nonetheless, we *did* get a sensible result, so producing NaN
instead seems like a regression.

We might need to introduce special-case handling to protect the
low-level calculations from ever seeing NaN or Inf in their inputs.
Getting the right answer to "just fall out" of those calculations
might be an unreasonable hope.

For example, for a line with positive slope (A and B of opposite
signs), I think that the right answer for points (Inf,Inf) and
(-Inf,-Inf) should be NaN, on much the same grounds that Inf
minus Inf is NaN not zero.  But all other points involving any Inf
coordinates are clearly an infinite distance away from that line.
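
For concreteness, the cases being discussed can be written directly in SQL
(a sketch; the exact outputs are precisely what is under debate and differ
before/after the proposed changes):

-- the line {1,-1,0} is "x = y"
select point(1e300, 'Infinity') <-> '{1,-1,0}'::line;
-- arguably should be Infinity rather than NaN: one finite and one infinite
-- coordinate is unambiguously infinitely far from the line
select point('Infinity', 'Infinity') <-> '{1,-1,0}'::line;
-- arguably NaN, on the same grounds that Inf minus Inf is NaN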

regards, tom lane




don't allocate HashAgg hash tables when running explain only

2020-11-13 Thread Alexey Bashtanov

Hi,

I got somewhat scared when my explain took a few seconds to complete and 
used a few gigs of RAM.

To reproduce try the following:

discard temp;
create temp table a as select to_timestamp(generate_series(1, 7000)) i;
analyze a;
set work_mem to '3GB';
explain select distinct a1.i - a2.i from a a1, a a2;

I would appreciate it if someone could have a look at the attached patch,
which makes the executor skip initializing hash tables when doing explain only.


Best, Alex
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index d87677d659..196fe09c52 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -3665,7 +3665,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			&aggstate->hash_ngroups_limit,
 			&aggstate->hash_planned_partitions);
 		find_hash_columns(aggstate);
-		build_hash_tables(aggstate);
+
+		/* Skip massive memory allocation when running only explain */
+		if (!(eflags & EXEC_FLAG_EXPLAIN_ONLY))
+			build_hash_tables(aggstate);
+
 		aggstate->table_filled = false;
 
 		/* Initialize this to 1, meaning nothing spilled, yet */


error_severity of brin work item

2020-11-13 Thread Justin Pryzby
I saw a bunch of these in my logs:

log_time | 2020-10-25 22:59:45.619-07
database | 
left | could not open relation with OID 292103095
left | processing work entry for relation 
"ts.child.alarms_202010_alarm_clear_time_idx"

Those happen following a REINDEX job on that index.

I think that should be more like an INFO message, since that's what vacuum does
(vacuum_open_relation), and a queued work item is even more likely to hit a
dropped relation.  It's easy to hit by setting autovacuum_naptime=1 and looping
around REINDEX CONCURRENTLY while updating a table.

Autovacuum is queueing work items for later:

src/backend/postmaster/autovacuum.c-            switch (workitem->avw_type)
src/backend/postmaster/autovacuum.c-            {
src/backend/postmaster/autovacuum.c-                case AVW_BRINSummarizeRange:
src/backend/postmaster/autovacuum.c:                    DirectFunctionCall2(brin_summarize_range,
src/backend/postmaster/autovacuum.c-                                        ObjectIdGetDatum(workitem->avw_relation),
src/backend/postmaster/autovacuum.c-                                        Int64GetDatum((int64) workitem->avw_blockNumber));
src/backend/postmaster/autovacuum.c-                    break;

And if the index is missing:

brin_summarize_range(PG_FUNCTION_ARGS)
|/*
| * We must lock table before index to avoid deadlocks.  However, if the
| * passed indexoid isn't an index then IndexGetRelation() will fail.
| * Rather than emitting a not-very-helpful error message, postpone
| * complaining, expecting that the is-it-an-index test below will fail.
| */
|heapoid = IndexGetRelation(indexoid, true);
|if (OidIsValid(heapoid))
|heapRel = table_open(heapoid, ShareUpdateExclusiveLock);
|else
|heapRel = NULL;
|
|indexRel = index_open(indexoid, ShareUpdateExclusiveLock);

table_open is successful and then index_open fails.  (I thought locks would
have prevented that?)

Justin




Re: [PATCH] remove deprecated v8.2 containment operators

2020-11-13 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-11-12 23:28, Tom Lane wrote:
>> I'm on board with pulling these now --- 8.2 to v14 is plenty of
>> deprecation notice.  However, the patch seems incomplete in that
>> the code support for these is still there -- look for
>> RTOldContainedByStrategyNumber and RTOldContainsStrategyNumber.
>> Admittedly, there's not much to be removed except some case labels,
>> but it still seems like we oughta do that to avoid future confusion.

> Yeah, the stuff in gistproc.c should be removed now.  But I wonder what 
> the mentions in brin_inclusion.c are and whether or how they should be 
> removed.

I think those probably got cargo-culted in there at some point.
They're visibly dead code, because there are no pg_amop entries
for BRIN opclasses with amopstrategy 13 or 14.

This comment that you removed in 2f70fdb06 is suspicious:

# we could, but choose not to, supply entries for strategies 13 and 14

I'm guessing that somebody was vacillating about whether it'd be
a feature to support these old operator names in BRIN, and
eventually didn't, but forgot to remove the code support.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2020-11-13 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 11/13/20 3:20 AM, Thomas Munro wrote:
> > I'm not really sure what to do about archive restore scripts that
> > block.  That seems to be fundamentally incompatible with what I'm
> > doing here.
> 
> IMHO we can't do much about that, except for documenting it - if the
> prefetch can't work because of blocking restore script, someone has to
> fix/improve the script. No way around that, I'm afraid.

I'm a bit confused about what the issue here is: is the concern that a
restore_command is specified that isn't allowed to run concurrently but
this patch is intending to run more than one concurrently..?  There's
another patch that I was looking at for doing pre-fetching of WAL
segments, so if this is also doing that we should figure out which
patch we want..

I don't know that it's needed, but it feels likely that we could provide
a better result if we consider making changes to the restore_command API
(eg: have a way to say "please fetch this many segments ahead, and you
can put them in this directory with these filenames" or something).  I
would think we'd be able to continue supporting the existing API and
accept that it might not be as performant.
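
For context, the existing restore_command contract fetches exactly one
segment per invocation; a typical (illustrative) setting looks like:

alter system set restore_command = 'cp /mnt/server/archivedir/%f "%p"';
-- %f is the WAL file name being requested and %p is the path to copy it
-- to; the server asks for one file at a time, which is why a "fetch this
-- many segments ahead" extension would need a new contract along the
-- lines sketched above.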

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [PATCH] remove deprecated v8.2 containment operators

2020-11-13 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Thu, Nov 12, 2020 at 11:28 PM Tom Lane  wrote:
> > > The changes to the contrib modules appear to be incomplete in some ways.
> > >   In cube, hstore, and seg, there are no changes to the extension
> > > scripts to remove the operators.  All you're doing is changing the C
> > > code to no longer recognize the strategy, but that doesn't explain what
> > > will happen if the operator is still used.  In intarray, by contrast,
> > > you're editing an existing extension script, but that should be done by
> > > an upgrade script instead.
> >
> > In the contrib modules, I'm afraid what you gotta do is remove the
> > SQL operator definitions but leave the opclass code support in place.
> > That's because there's no guarantee that users will update the extension's
> > SQL version immediately, so a v14 build of the .so might still be used
> > with the old SQL definitions.  It's not clear how much window we need
> > give for people to do that update, but I don't think "zero" is an
> > acceptable answer.
> 
> Based on my experience from the field, the answer is "never".
> 
> As in, most people have no idea they are even *supposed* to do such an
> upgrade, so they don't do it. Until we solve that problem, I think
> we're basically stuck with keeping them "forever". (and even if/when
> we do, "zero" is probably not going to cut it, no)

Yeah, this is a serious problem and one that we should figure out a way
to fix or at least improve on- maybe by having pg_upgrade say something
about extensions that could/should be upgraded..?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: recovery_target immediate timestamp

2020-11-13 Thread Fujii Masao




On 2020/11/13 8:39, Euler Taveira wrote:

On Wed, 11 Nov 2020 at 22:40, Fujii Masao  wrote:


On 2020/11/12 6:00, Euler Taveira wrote:

 > The first patch adds a new message that prints the latest completed 
checkpoint
 > when the consistent state is reached.

I'm not sure how useful this information is in practice.

Fujii, thanks for reviewing it. It provides the same information as the "last
completed transaction was" message.

 > It also exposes the checkpoint timestamp
 > in debug messages.

                        ereport(DEBUG1,
                                        (errmsg("checkpoint record is at %X/%X",
                                                        (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
+                       ereport(DEBUG1,
+                                       (errmsg("checkpoint time is %s", str_time(checkPoint.time))));

The above first debug message displays the LSN of the checkpoint record.
OTOH, the second message displays the time when the checkpoint started
(not the time when checkpoint record was written at the end of checkpoint).
So isn't it confusing to display those inconsistent information together?

Indeed the checkpoint timestamp is from before it determines the REDO LSN. Are
you saying this checkpoint timestamp is inconsistent because it is not near
the point where it saves the RedoRecPtr? If so, let's move checkPoint.time a few
lines below.


No. What I'd like to say is: the checkPointLoc that the first debug message
outputs is the LSN of the checkpoint record, not the checkpoint REDO location.
The checkpoint REDO location is determined at an early stage of
checkpointing. OTOH, the location of the checkpoint record is determined
at the end of checkpointing. They are different.

The checkpoint time that the second debug message you added outputs is
the timestamp determined at the beginning of checkpointing. So it seems
unreasonable to display the location of the checkpoint record and
the checkpoint time together, because they are determined at different times.
Am I missing something?
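
As a side note (not part of the patch), the same distinction is visible
from SQL: pg_control_checkpoint() reports both locations, plus the
checkpoint timestamp recorded in pg_control:

select checkpoint_lsn, redo_lsn, checkpoint_time
from pg_control_checkpoint();
-- checkpoint_lsn is where the checkpoint record ended up (end of the
-- checkpoint), redo_lsn is where recovery must start (determined near the
-- beginning), and checkpoint_time is the timestamp taken at the start.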




     /*
      * Here we update the shared RedoRecPtr for future XLogInsert calls; this
      * must be done while holding all the insertion locks.
      *
      * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
      * pointing past where it really needs to point.  This is okay; the only
      * consequence is that XLogInsert might back up whole buffers that it
      * didn't really need to.  We can't postpone advancing RedoRecPtr because
      * XLogInserts that happen while we are dumping buffers must assume that
      * their buffer changes are not included in the checkpoint.
      */
     RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
     checkPoint.time = (pg_time_t) time(NULL);

I realized that I was using the wrong variable in one of the debug messages.

 > The second patch provides the checkpoint timestamp in the pg_waldump 
output and
 > also when you enable wal_debug parameter. The pg_waldump output looks 
like

+1

+#ifdef FRONTEND
+               strftime(checkpointstr, sizeof(checkpointstr), "%c", 
localtime(_tmp));
+#else
+               pg_strftime(checkpointstr, sizeof(checkpointstr), "%c", 
pg_localtime(_tmp, log_timezone));
+#endif

You can simplify the code by using timestamptz_to_str() here instead, like 
xact_desc_commit() does.

I had the same idea until I realized that checkPoint.time is pg_time_t and not
TimestampTz. [digging the code a bit...] I figured out there is a function that
converts from pg_time_t to TimestampTz: time_t_to_timestamptz(). I removed that
ugly code but had to duplicate this function into compat.c. I don't have a
strong preference but I'm attaching a new patch.


Thanks for updating the patch! At least for me this approach looks better.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: POC: Cleaning up orphaned files using undo logs

2020-11-13 Thread Antonin Houska
Amit Kapila  wrote:

> On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  wrote:
> >
> >
> > No background undo
> > --
> >
> > Reduced complexity of the patch seems to be the priority at the moment. Amit
> > suggested that cleanup of an orphaned relation file is simple enough to be
> > done on foreground and I agree.
> >
> 
> Yeah, I think we should try and see if we can make it work but I
> noticed that there are few places like AbortOutOfAnyTransaction where
> we have the assumption that undo will be executed in the background.
> We need to deal with it.

I think this is o.k. if we always check for unapplied undo during startup.

> > "undo worker" is still there, but it only processes undo requests after 
> > server
> > restart because relation data can only be changed in a transaction - it 
> > seems
> > cleaner to launch a background worker for this than to hack the startup
> > process.
> >
> 
> But, I see there are still multiple undoworkers that are getting
> launched and I am not sure if that works correctly because a
> particular undoworker is connected to a database and then it starts
> processing all the pending undo.

Each undo worker applies only transactions for its own database, see
ProcessExistingUndoRequests():

/* We can only process undo of the database we are connected to. */
if (xact_hdr.dboid != MyDatabaseId)
continue;

Nevertheless, as I've just mentioned in my response to Thomas, I admit that we
should try to live w/o the undo worker altogether.

> > Discarding
> > --
> >
> > There's no discard worker for the URS infrastructure yet. I thought about
> > discarding the undo log during checkpoint, but checkpoint should probably do
> > more straightforward tasks than the calculation of a new discard pointer for
> > each undo log, so a background worker is needed. A few notes on that:
> >
> >   * until the zheap AM gets added, only the transaction that creates the 
> > undo
> > records needs to access them. This assumption should make the discarding
> > algorithm a bit simpler. Note that with zheap, the other transactions 
> > need
> > to look for old versions of tuples, so the concept of 
> > oldestXidHavingUndo
> > variable is needed there.
> >
> >   * it's rather simple to pass pointer the URS pointer to the discard worker
> > when transaction either committed or the undo has been executed.
> >
> 
> Why can't we have a separate discard worker which keeps on scanning
> the undorecords and discard accordingly? Giving the onus of foreground
> process might be tricky because say discard worker is not up to speed
> and we ran out of space to pass such information for each commit/abort
> request.

Sure, there should be a discard worker. The question is how to make its work
efficient. The initial run after restart probably needs to scan everything
between 'discard' and 'insert' pointers, but then it should process only the
parts created by individual transactions.

> >
> > Do not execute the same undo record multiple times
> > --
> >
> > Although I've noticed in the zheap code that it checks whether particular 
> > undo
> > action was already undone, I think this functionality fits better in the URS
> > layer.
> >
> 
> If you want to track at undo record level, then won't it lead to
> performance overhead and probably additional WAL overhead considering
> this action needs to be WAL-logged. I think recording at page-level
> might be a better idea.

I'm not worried about WAL because the undo execution needs to be WAL-logged
anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be
evaluated regarding performance is the (exclusive) locking of the page that
carries the progress information. I'm still not sure whether this info should
be on every page or only in the chunk header. In either case, we have a
problem if there are two or more chunks created by different transactions on
the same page, and if more than one of these transactions needs to perform
undo. I tend to believe that this should happen rarely though.

> > I can spend more time on this project, but need a hint which part I should
> > focus on.
> >
> 
> I can easily imagine that this needs a lot of work and I can try to
> help with this as much as possible from my side. I feel at this stage
> we should try to focus on undo-related work (to start with you can
> look at finishing the undoprocessing work for which I have shared some
> thoughts) and then probably at some point in time we need to rebase
> zheap over this.

I agree, thanks!

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-13 Thread Amit Kapila
On Wed, Nov 11, 2020 at 4:30 PM Ajin Cherian  wrote:
>
> Did some further tests on the problem I saw and I see that it does not
> have anything to do with this patch. I picked code from top of head.
> If I have enough changes in a transaction to initiate streaming, then
> it will also stream the same changes after a commit.
>
> BEGIN;
> INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
> generate_series(1,800) g(i);
> SELECT data FROM pg_logical_slot_get_changes('regression_slot',
> NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
> 'skip-empty-xacts', '1', 'stream-changes', '1');
> ** see streamed output here **
> END;
> SELECT data FROM pg_logical_slot_get_changes('regression_slot',
> NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
> 'skip-empty-xacts', '1', 'stream-changes', '1');
> ** see the same streamed output here **
>
> I think this is because since the transaction has not been committed,
> SnapBuildCommitTxn is not called which is what moves the
> "builder->start_decoding_at", and as a result
> later calls to pg_logical_slot_get_changes will start from the
> previous lsn.
>

No, we always move start_decoding_at after streaming changes. It will
be moved because we have advanced the confirmed_flush location after
streaming all the changes (via LogicalConfirmReceivedLocation()), which
will be used to set 'start_decoding_at' when we create the decoding
context (CreateDecodingContext) the next time. However, we don't advance
'restart_lsn', due to which decoding starts from the same point and
accumulates all the changes for the transaction each time. Now, after the
commit we get an extra record which is ahead of 'start_decoding_at', and
when we try to decode it, it will get all the changes of the transaction.
We might want to update the documentation for pg_logical_slot_get_changes()
to indicate this, but I don't think this is a problem.
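
Just as an illustration (not from the original message), the two positions can
be watched from SQL between calls, which makes this behaviour easy to see:
confirmed_flush_lsn advances after each pg_logical_slot_get_changes() call,
whereas restart_lsn does not move while the transaction is still open:

SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'regression_slot';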

> I did do a quick test in pgoutput using pub/sub and I
> dont see duplication of data there but I haven't
> verified exactly what happens.
>

Yeah, because we always move the WAL locations ahead in that case, unless
the subscriber/publisher is restarted, in which case it should start
from the required location. But still, we can try to see if there is
any bug.

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-13 Thread Ajin Cherian
On Fri, Nov 13, 2020 at 9:44 PM Amit Kapila  wrote:
>
> On Thu, Nov 12, 2020 at 2:28 PM Ajin Cherian  wrote:

> Hmm, introducing an additional boolean variable for this doesn't seem
> like a good idea, nor does the other alternative you suggested. How
> about if we change the comment to make it clear? How about: "If output
> plugin supports two-phase commits and doesn't skip the transaction at
> prepare time then we don't need to decode the transaction data at
> commit prepared time as it would have already been decoded at prepare
> time."?

Yes, that works for me.

regards,
Ajin Cherian
Fujitsu Australia




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-13 Thread Amit Kapila
On Tue, Nov 10, 2020 at 4:19 PM Ajin Cherian  wrote:
>
> I was doing some testing, and I found some issues. Two issues. The
> first one, seems to be a behaviour that might be acceptable, the
> second one not so much.
> I was using test_decoding, not sure how this might behave with the
> pg_output plugin.
>
> Test 1:
> A transaction that is immediately rollbacked after the prepare.
>
> SET synchronous_commit = on;
> SELECT 'init' FROM
> pg_create_logical_replication_slot('regression_slot',
> 'test_decoding');
> CREATE TABLE stream_test(data text);
> -- consume DDL
> SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
>
> BEGIN;
> INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
> generate_series(1, 20) g(i);
> PREPARE TRANSACTION 'test1';
> ROLLBACK PREPARED 'test1';
> SELECT data FROM pg_logical_slot_get_changes('regression_slot',
> NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
> 'skip-empty-xacts', '1', 'stream-changes', '1');
> ==
>
> Here, what is seen is that while the transaction was not decoded at
> all  since it was rollbacked before it could get decoded, the ROLLBACK
> PREPARED is actually decoded.
> The result being that the standby could get a spurious ROLLBACK
> PREPARED. The current code in worker.c does handle this silently. So,
> this might not be an issue.
>

Yeah, this seems okay because it is quite possible that such a
Rollback would have been encountered after processing a few records, in
which case sending the Rollback is required. This can happen when a
rollback has been issued concurrently while we are decoding the prepare.
If the output plugin wants, it can detect that the transaction has not
written any data and ignore the rollback; we already do something similar
in test_decoding. So, I think this should be fine.

-- 
With Regards,
Amit Kapila.




Re: MultiXact\SLRU buffers configuration

2020-11-13 Thread Andrey Borodin



> On 10 Nov 2020, at 23:07, Tomas Vondra wrote:
> 
> On 11/10/20 7:16 AM, Andrey Borodin wrote:
>> 
>> 
>> but this picture was not stable.
>> 
> 
> Seems we haven't made much progress in reproducing the issue :-( I guess
> we'll need to know more about the machine where this happens. Is there
> anything special about the hardware/config? Are you monitoring size of
> the pg_multixact directory?

It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM.
PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit

No, unfortunately we do not have signals for SLRU sizes.
3.5Tb mdadm raid10 over 28 SSD drives, 82% full.

The first incident that triggered the investigation was on 2020-04-19; at that
time the cluster was running PG 10.11. But I think it was happening before that.

I'd say nothing special...

> 
>> How do you collect wait events for aggregation? just insert into some table 
>> with cron?
>> 
> 
> No, I have a simple shell script (attached) sampling data from
> pg_stat_activity regularly. Then I load it into a table and aggregate to
> get a summary.

Thanks!

Best regards, Andrey Borodin.





Re: [PATCH] Combine same ternary types in GIN and TSearch

2020-11-13 Thread Pavel Borisov
>
> GIN is not just for full-text search, so using TSTernaryValue in
> GinScanKeyData is wrong. And it would break existing extensions.
>
> I didn't look much further than that, but I've got a feeling that
> combining those is a bad idea. TSTernaryValue is used in text-search
> code, even when there is no GIN involved. It's a separate concept, even
> though it happens to have the same values.

Probably you are right. But the code already relies on equivalent value
assignments for GinTernaryValue and TSTernaryValue
(in checkcondition_gin()). So my idea was to combine them and use the result
like we use other global data types. We could declare it somewhere outside
both GIN and TSearch. Or just leave it as it is.

Thank you, Heikki for your feedback!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Misc typos

2020-11-13 Thread Daniel Gustafsson
When spellchecking one of my patches for submission, a few other typos fell out
as per the attached.  The one change which isn't in a comment is the object
address class description used for error messages:

-   "extented statistics",
+   "extended statistics",

It's used in an elog for cache lookup failures, so maybe it should be
backported?

cheers ./daniel



typos-misc.diff
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2020-11-13 Thread Antonin Houska
Thomas Munro  wrote:

> On Thu, Nov 12, 2020 at 10:15 PM Antonin Houska  wrote:

> I saw that -- great news! -- and have been meaning to write for a
> while.  I think I am nearly ready to talk about it again.

I'm looking forward to it :-)

> 100% that it's worth trying to do something much simpler than a new
> access manager, and this was the simplest useful feature solving a
> real-world-problem-that-people-actually-have we could come up with
> (based on an idea from Robert).  I think it needs a convincing
> explanation for why there is no scenario where the relfilenode is
> recycled for a new unlucky table before the rollback is executed,
> which might depend on details that you might be working on/changing
> (scenarios where you execute undo twice because you forgot you already
> did it).

Oh, I haven't thought about this problem yet. That might be another reason for
the undo log infrastructure to record the progress somehow.

> > No background undo
> > --
> >
> > Reduced complexity of the patch seems to be the priority at the moment. Amit
> > suggested that cleanup of an orphaned relation file is simple enough to be
> > done on foreground and I agree.
> >
> > "undo worker" is still there, but it only processes undo requests after 
> > server
> > restart because relation data can only be changed in a transaction - it 
> > seems
> > cleaner to launch a background worker for this than to hack the startup
> > process.
> 
> I suppose the simplest useful system would be one does the work at
> startup before allowing connections, and also in regular backends, and
> panics if a backend ever exits while it has pending undo (panic =
> "goto crash recovery").  Then you don't have to deal with undo workers
> running at the same time as regular sessions which might run into
> trouble reacquiring locks (for an AM I mean), or due to OIDs being
> recycled with multiple checkpoints, or undo work that gets deferred
> until the next restart of the server.

I think that zheap can recognize that a page has unapplied undo, so we don't
need to reacquire any page lock on restart. However, I agree that the
background undo might introduce other concurrency issues. At least for now
it's worth trying to move the cleanup into the startup process. We can
reconsider this when implementing more expensive undo actions, especially the
zheap rollback.

> > Since the concept of undo requests is closely related to the undo worker, I
> > removed undorequest.c too. The new (much simpler) undo worker gets the
> > information on incomplete / aborted transactions from the undo log as
> > mentioned above.
> >
> > SMGR enhancement
> > 
> >
> > I used the 0001 patch from [3] rather than [4], although it's more invasive
> > because I noticed somewhere in the discussion that there should be no 
> > reserved
> > database OID for the undo log. (InvalidOid cannot be used because it's 
> > already
> > in use for shared catalogs.)
> 
> I give up thinking about the colour of the BufferTag shed and went
> back to magic database 9, mainly because there seemed to be more
> pressing matters.  I don't even think it's that crazy to store this
> type of system-wide data in pseudo databases, and I know of other
> systems that do similar sorts of things without blinking...

ok

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Detecting File Damage & Inconsistencies

2020-11-13 Thread Simon Riggs
On Fri, 13 Nov 2020 at 00:50, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Simon Riggs 
> > If a rogue user/process is suspected, this would allow you to identify
> > more easily the changes made by specific sessions/users.
>
> Isn't that kind of auditing a job of pgAudit or log_statement = mod?  Or, 
> does "more easily" mean that you find pgAudit complex to use and/or 
> log_statement's overhead is big?

Well, I designed pgaudit, so yes, I think pgaudit is useful.

However, pgaudit works at the statement level, not the data level. So
using pgaudit to locate data rows that have changed is fairly hard.

What I'm proposing is an option to add 16 bytes onto each COMMIT
record, which is considerably less than turning on full auditing in
pgaudit. This option would allow identifying data at the row level, so
you could for example find all rows changed by specific sessions.
Also, because it is stored in WAL it will show updates that might no
longer exist in the database because the changed row versions might
have been vacuumed away. So pgaudit will tell you that happened, but
having extra info in WAL is important also.

So thank you for the question because it has allowed me to explain why
it is useful and important.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-13 Thread Bharath Rupireddy
On Thu, Nov 12, 2020 at 10:06 AM Fujii Masao
 wrote:
>
> Thanks for the analysis! I pushed the patch.
>

Thanks! Since we are replacing custom SIGHUP and SIGTERM handlers with
standard ones, how about doing the same thing in worker_spi.c? I
posted a patch previously [1] in this mail thread. If it makes sense,
please review it.

[1] - 
https://www.postgresql.org/message-id/CALj2ACXDEZhAFOTDcqO9cFSRvrFLyYOnPKrcA1UG4uZn9hUAVg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Online checksums patch - once again

2020-11-13 Thread Heikki Linnakangas

On 12/11/2020 15:17, Daniel Gustafsson wrote:

On 5 Oct 2020, at 14:14, Heikki Linnakangas  wrote:
I would expect the checksums worker to be automatically started at postmaster 
startup. Can we make that happen?


A dynamic background worker has to be registered from a regular backend, so
it's not entirely clear to me where in startup processing that would take
place.  Do you have any good suggestions?


Could you launch it from the startup process, in StartupXLOG()?
Does it have to be dynamic?
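
(Not part of the original message, but to illustrate the non-dynamic
alternative being asked about: an in-core worker can be registered with
RegisterBackgroundWorker() during postmaster startup, so nothing has to
launch it from a regular backend. All names below other than the bgworker
API itself are made up.)

#include "postgres.h"
#include "postmaster/bgworker.h"

static void
ChecksumsWorkerRegister(void)
{
	BackgroundWorker bgw;

	memset(&bgw, 0, sizeof(bgw));
	snprintf(bgw.bgw_name, BGW_MAXLEN, "checksums worker");
	snprintf(bgw.bgw_type, BGW_MAXLEN, "checksums worker");
	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
	/* start once recovery has finished, i.e. after StartupXLOG() is done */
	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
	bgw.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ChecksumsWorkerMain");
	bgw.bgw_main_arg = (Datum) 0;
	bgw.bgw_notify_pid = 0;

	RegisterBackgroundWorker(&bgw);
}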

- Heikki




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-13 Thread Bharath Rupireddy
On Tue, Nov 10, 2020 at 3:47 PM Paul Guo  wrote:
>
> Thanks for doing this. There might be another solution - use raw insert 
> interfaces (i.e. raw_heap_insert()).
> Attached is the test (not formal) patch that verifies this idea. 
> raw_heap_insert() writes the page into the
> table files directly and also write the FPI xlog when the tuples filled up 
> the whole page. This seems be
> more efficient.
>

Thanks. Will the new raw_heap_insert() APIs scale well (i.e. extend
the table in parallel) with parallelism? The existing
table_multi_insert() API scales well, see, for instance, the benefit
with parallel copy[1] and parallel multi inserts in CTAS[2].

[1] - 
https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Combine same ternary types in GIN and TSearch

2020-11-13 Thread Heikki Linnakangas

On 13/11/2020 11:04, Pavel Borisov wrote:

Hi, hackers!

For historical reasons, now we have two differently named but similar 
ternary data types in TSearch and Gin text-related types. Before v13 
there was also Gin's private TS_execute() version, from which we 
eventually shifted to Tsearch's TS_execute().


To make things more even and beautiful I've made a minor refactor to 
combine two left ternary types into one.



typedef char GinTernaryValue
#define GIN_FALSE 0
#define GIN_TRUE 1
#define GIN_MAYBE 2


typedef enum { TS_NO, TS_YES, TS_MAYBE } TSTernaryValue;

The change is simple and most of it is just the text replacement. The 
only thing worth noting is that some code does pointer cast between 
*bool and *TernaryValue so the size of them should coincide. 
(Declaration done in /char/ type because simple enum on most 
architectures will be of /int/ size). There is no actual change in the 
code despite the order of header files inclusion in some modules.


What do you think about this?


GIN is not just for full-text search, so using TSTernaryValue in 
GinScanKeyData is wrong. And it would break existing extensions.


I didn't look much further than that, but I've got a feeling that 
combining those is a bad idea. TSTernaryValue is used in text-search 
code, even when there is no GIN involved. It's a separate concept, even 
though it happens to have the same values.


- Heikki




Re: Supporting = operator in gin/gist_trgm_ops

2020-11-13 Thread Georgios



‐‐‐ Original Message ‐‐‐
On Friday, November 13, 2020 10:50 AM, Julien Rouhaud  
wrote:

> On Wed, Nov 11, 2020 at 8:34 PM Georgios Kokolatos
> gkokola...@protonmail.com wrote:
>
> > The following review has been posted through the commitfest application:
> > make installcheck-world: tested, passed
> > Implements feature: tested, passed
> > Spec compliant: not tested
> > Documentation: not tested
> > Hi,
> > this patch implements a useful and missing feature. Thank you.
> > It includes documentation, which to a non-native speaker as myself seems 
> > appropriate.
> > It includes comprehensive tests that cover the implemented cases.
> > In the thread Alexander has pointed out, quote:
> > "It would be more efficient to generate trigrams for equal operator
> > using generate_trgm() instead of generate_wildcard_trgm()"
> > I will echo the sentiment, though from a slightly different and possibly not
> > as important point of view. The method used to extract trigrams from the 
> > query
> > should match the method used to extract trigrams from the values when they
> > get added to the index. This is gin_extract_value_trgm() and is indeed using
> > generate_trgm().
> > I have no opinion over Alexander's second comment regarding costing.
> > I change the status to 'Waiting on Author', but please feel free to override
> > my opinion if you feel I am wrong and reset it to 'Needs review'.
>
> Thanks for the reminder Georgios! Thanks a lot Alexander for the review!
>
> Indeed, I should have used generate_trgm() rather than
> generate_wildcard_trgm(). IIUC, the rest of the code should still be
> doing the same as [I]LikeStrategyNumber. I attach a v3 with that
> modification.

Great, thanks!

v3 looks good.

>
> For the costing, I tried this naive dataset:
>
> CREATE TABLE t1 AS select md5(random()::text) AS val from
> generate_series(1, 10);
> CREATE INDEX t1_btree ON t1 (val);
> CREATE INDEX t1_gist ON t1 USING gist (val gist_trgm_ops);
>
> Cost are like this (all default configuration, using any random existing 
> entry):
>
> # EXPLAIN ANALYZE SELECT * FROM t1 where val = '8dcf324ce38428e4d27a363953ac1c51';
>                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Index Only Scan using t1_btree on t1  (cost=0.42..4.44 rows=1 width=33) (actual time=0.192..0.194 rows=1 loops=1)
>    Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
>    Heap Fetches: 0
>  Planning Time: 0.133 ms
>  Execution Time: 0.222 ms
> (5 rows)
>
> # EXPLAIN ANALYZE SELECT * FROM t1 where val = '8dcf324ce38428e4d27a363953ac1c51';
>                                                   QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
>  Index Scan using t1_gist on t1  (cost=0.28..8.30 rows=1 width=33) (actual time=0.542..2.359 rows=1 loops=1)
>    Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
>  Planning Time: 0.189 ms
>  Execution Time: 2.382 ms
> (4 rows)
>
> # EXPLAIN ANALYZE SELECT * FROM t1 where val = '8dcf324ce38428e4d27a363953ac1c51';
>                                                     QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on t1  (cost=400.01..404.02 rows=1 width=33) (actual time=2.486..2.488 rows=1 loops=1)
>    Recheck Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
>    Heap Blocks: exact=1
>    ->  Bitmap Index Scan on t1_gin  (cost=0.00..400.01 rows=1 width=0) (actual time=2.474..2.474 rows=1 loops=1)
>          Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
>  Planning Time: 0.206 ms
>  Execution Time: 2.611 ms
>
> So assuming that this dataset is representative enough, costing indeed
> prefers a btree index over a gist/gin index, which should avoid
> regression with this change.

It sounds reasonable, although I would leave it to a bit more cost savvy
people to argue the point.

>
> Gin is however quite off, likely because it's a bitmap index scan
> rather than an index scan, so gist is preferred in this scenario.
> That's not ideal, but I'm not sure that there are many people having
> both gin_trgm_ops and gist_trgm_ops.

Same as above.

In short, I think v3 of the patch looks good to change to 'RFC' status.
Given the possible costing concerns, I will refrain from changing the
status just yet, to give the opportunity to more reviewers to chime in.
If in the next few days there are no more reviews, I will update the
status to 'RFC' to move the patch forward.

Thoughts?

Cheers,
//Georgios




Re: WIP: WAL prefetch (another approach)

2020-11-13 Thread Tomas Vondra
On 11/13/20 3:20 AM, Thomas Munro wrote:
>
> ...
> 
> I'm not really sure what to do about archive restore scripts that
> block.  That seems to be fundamentally incompatible with what I'm
> doing here.
> 

IMHO we can't do much about that, except for documenting it - if the
prefetch can't work because of a blocking restore script, someone has to
fix/improve the script. No way around that, I'm afraid.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-13 Thread Amit Kapila
On Thu, Nov 12, 2020 at 2:28 PM Ajin Cherian  wrote:
>
> On Wed, Nov 11, 2020 at 12:35 AM Amit Kapila  wrote:
>  I have rearranged the code in DecodeCommit/Abort/Prepare so
> > that it does only the required things (like in DecodeCommit was still
> > processing subtxns even when it has to just perform FinishPrepared,
> > also the stats were not updated properly which I have fixed.) and
> > added/edited the comments. Apart from 0001 and 0002, I have not
> > changed anything in the remaining patches.
>
> One small comment on the patch:
>
> - DecodeCommit(ctx, buf, , xid);
> + /*
> + * If we have already decoded this transaction data then
> + * DecodeCommit doesn't need to decode it again. This is
> + * possible iff output plugin supports two-phase commits and
> + * doesn't skip the transaction at prepare time.
> + */
> + if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
> + {
> + already_decoded = !(ctx->callbacks.filter_prepare_cb &&
> + ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
> + }
> +
>
> Just a small nitpick but the way already_decoded is assigned here is a
> bit misleading. It appears that the callbacks determine if the
> transaction is already decoded when in
> reality the callbacks only decide if the transaction should skip two
> phase commits. I think it's better to either move it to the if
> condition or if that is too long then have one more variable
> skip_twophase.
>
> if (info ==  XLOG_XACT_COMMIT_PREPARED && ctx->twophase &&
>  !(ctx->callbacks.filter_prepare_cb &&
> ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid)))
> already_decoded = true;
>
> OR
> bool skip_twophase = false;
> skip_twophase =  !(ctx->callbacks.filter_prepare_cb &&
> ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
> if (info ==  XLOG_XACT_COMMIT_PREPARED && ctx->twophase && skip_twophase)
> already_decoded = true;
>

Hmm, introducing an additional boolean variable for this doesn't seem
like a good idea, nor does the other alternative you suggested. How
about if we change the comment to make it clear? How about: "If output
plugin supports two-phase commits and doesn't skip the transaction at
prepare time then we don't need to decode the transaction data at
commit prepared time as it would have already been decoded at prepare
time."?

-- 
With Regards,
Amit Kapila.




RE: Terminate the idle sessions

2020-11-13 Thread kuroda.hay...@fujitsu.com
Dear Li, 

I read your patch, and I think the documentation is too brief to head off all
the potential problems.
(I think that if some connection pooling is used, the same problem will occur.)
Could you add some explanation to the doc file? I made an example:

```
Note that this value should be set to zero if you use postgres_fdw or some
Connection-pooling software, because connections might be closed unexpectedly. 
```
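
For example (assuming the GUC introduced by the patch is named
idle_session_timeout, which is my assumption here), an installation sitting
behind a connection pooler would keep the feature disabled:

ALTER SYSTEM SET idle_session_timeout = 0;
SELECT pg_reload_conf();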

I will send other comments if I find something.

Hayato Kuroda
FUJITSU LIMITED



Re: Supporting = operator in gin/gist_trgm_ops

2020-11-13 Thread Julien Rouhaud
On Wed, Nov 11, 2020 at 8:34 PM Georgios Kokolatos
 wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:not tested
>
> Hi,
>
> this patch implements a useful and missing feature. Thank you.
>
> It includes documentation, which to a non-native speaker as myself seems 
> appropriate.
> It includes comprehensive tests that cover the implemented cases.
>
> In the thread Alexander has pointed out, quote:
> "It would be more efficient to generate trigrams for equal operator
> using generate_trgm() instead of generate_wildcard_trgm()"
>
> I will echo the sentiment, though from a slightly different and possibly not
> as important point of view. The method used to extract trigrams from the query
> should match the method used to extract trigrams from the values when they
> get added to the index. This is gin_extract_value_trgm() and is indeed using
> generate_trgm().
>
> I have no opinion over Alexander's second comment regarding costing.
>
> I change the status to 'Waiting on Author', but please feel free to override
> my opinion if you feel I am wrong and reset it to 'Needs review'.

Thanks for the reminder Georgios!  Thanks a lot Alexander for the review!

Indeed, I should have used generate_trgm() rather than
generate_wildcard_trgm().  IIUC, the rest of the code should still be
doing the same as [I]LikeStrategyNumber.  I attach a v3 with that
modification.

For the costing, I tried this naive dataset:

CREATE TABLE t1 AS select md5(random()::text) AS val from
generate_series(1, 10);
CREATE INDEX t1_btree ON t1 (val);
CREATE INDEX t1_gist ON t1 USING gist (val gist_trgm_ops);
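
(The GIN index that appears as t1_gin in the third plan below is not shown
being created here; presumably it was built with something along these lines:)

CREATE INDEX t1_gin ON t1 USING gin (val gin_trgm_ops);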

Cost are like this (all default configuration, using any random existing entry):

# EXPLAIN ANALYZE SELECT * FROM t1 where val =
'8dcf324ce38428e4d27a363953ac1c51';
QUERY PLAN
---
 Index Only Scan using t1_btree on t1  (cost=0.42..4.44 rows=1
width=33) (actual time=0.192..0.194 rows=1 loops=1)
   Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
   Heap Fetches: 0
 Planning Time: 0.133 ms
 Execution Time: 0.222 ms
(5 rows)

# EXPLAIN ANALYZE SELECT * FROM t1 where val =
'8dcf324ce38428e4d27a363953ac1c51';
 QUERY PLAN
-
 Index Scan using t1_gist on t1  (cost=0.28..8.30 rows=1 width=33)
(actual time=0.542..2.359 rows=1 loops=1)
   Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
 Planning Time: 0.189 ms
 Execution Time: 2.382 ms
(4 rows)

# EXPLAIN ANALYZE SELECT * FROM t1 where val =
'8dcf324ce38428e4d27a363953ac1c51';
   QUERY PLAN
-
 Bitmap Heap Scan on t1  (cost=400.01..404.02 rows=1 width=33) (actual
time=2.486..2.488 rows=1 loops=1)
   Recheck Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on t1_gin  (cost=0.00..400.01 rows=1 width=0)
(actual time=2.474..2.474 rows=1 loops=1)
 Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
 Planning Time: 0.206 ms
 Execution Time: 2.611 ms

So assuming that this dataset is representative enough, costing indeed
prefers a btree index over a gist/gin index, which should avoid
regression with this change.

Gin is however quite off, likely because it's a bitmap index scan
rather than an index scan, so gist is preferred in this scenario.
That's not ideal, but I'm not sure that there are many people having
both gin_trgm_ops and gist_trgm_ops.


v3-0001-Handle-operator-in-pg_trgm.patch
Description: Binary data


Re: logical streaming of xacts via test_decoding is broken

2020-11-13 Thread Amit Kapila
On Thu, Nov 12, 2020 at 3:10 PM Dilip Kumar  wrote:
>
> On Tue, Nov 10, 2020 at 7:20 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 10, 2020 at 2:25 PM Dilip Kumar  wrote:
>
> > 3. Can you please prepare a separate patch for test case changes so
> > that it would be easier to verify that it fails without the patch and
> > passed after the patch?
>
> Done
>

Few comments:
=
1.
   -- consume DDL
   SELECT data FROM pg_logical_slot_get_changes('isolation_slot',
NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-  CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
'select array_agg(md5(g::text))::text from generate_series(1, 8)
g';
+  CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
'select array_agg(md5(g::text))::text from generate_series(1, 6)
g';
 }


Is there a reason for this change? I think a smaller number of rows is
probably sufficient to serve the purpose of the test, but I am not sure
whether it is related to this patch or there is some other reason behind
this change.

2.
+typedef struct
+{
+ bool xact_wrote_changes;
+ bool stream_wrote_changes;
+} TestDecodingTxnData;
+

I think a comment here explaining why we need this as a separate
structure, and why we need two different members, would be better.

3.
pg_decode_commit_txn()
{
..
- if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ pfree(txndata);
+ txn->output_plugin_private = false;
+

Here, don't we need to set txn->output_plugin_private to NULL, since it is
a pointer and we explicitly test it for NULL at other places?
Also, change the other places where it is set to false.

4.
@@ -592,10 +610,24 @@ pg_decode_stream_start(LogicalDecodingContext *ctx,
 ReorderBufferTXN *txn)
 {
  TestDecodingData *data = ctx->output_plugin_private;
+ TestDecodingTxnData *txndata = txn->output_plugin_private;

- data->xact_wrote_changes = false;
+ /*
+ * If this is the first stream for the txn then allocate the txn plugin
+ * data and set the xact_wrote_changes to false.
+ */
+ if (txndata == NULL)
+ {
+ txndata =

As we are explicitly testing for NULL here, isn't it better to
explicitly initialize 'output_plugin_private' with NULL in
ReorderBufferGetTXN?

5.
@@ -633,8 +666,18 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 XLogRecPtr abort_lsn)
 {
  TestDecodingData *data = ctx->output_plugin_private;
+ ReorderBufferTXN *toptxn = txn->toptxn ? txn->toptxn : txn;
+ TestDecodingTxnData *txndata = toptxn->output_plugin_private;
+ bool xact_wrote_changes = txndata->xact_wrote_changes;

- if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ if (txn->toptxn == NULL)
+ {
+ Assert(txn->output_plugin_private != NULL);
+ pfree(txndata);
+ txn->output_plugin_private = false;
+ }
+

Here, if we are expecting 'output_plugin_private' to be set only for
toptxn, then shouldn't the Assert and reset happen on toptxn? I find the
changes in this function a bit unclear, so adding a comment here could
help.

-- 
With Regards,
Amit Kapila.




pg_rewind copies

2020-11-13 Thread Heikki Linnakangas
If a file is modified and becomes larger in the source system while 
pg_rewind is running, pg_rewind can leave behind a partial copy of the file. 
That's by design, and it's OK for relation files because they're 
replayed from WAL. But it can cause trouble for configuration files.


I ran into this while playing with pg_auto_failover. After failover, 
pg_auto_failover would often launch pg_rewind, and run ALTER SYSTEM on 
the primary while pg_rewind was running. The resulting rewound system 
would fail to start up:


Nov 13 09:24:42 pg-node-a pg_autoctl[2217]: 09:24:42 2220 ERROR 
2020-11-13 09:24:32.547 GMT [2246] LOG:  syntax error in file 
"/data/pgdata/postgresql.auto.conf" line 4, near token "'"
Nov 13 09:24:42 pg-node-a pg_autoctl[2217]: 09:24:42 2220 ERROR 
2020-11-13 09:24:32.547 GMT [2246] FATAL:  configuration file 
"postgresql.auto.conf" contains errors


Attached is a patch to mitigate that. It changes pg_rewind so that when 
it copies a whole file, it ignores the original file size. It's not a 
complete cure: it still believes the original size for files larger than 
1 MB. That limit was just expedient given the way the chunking logic in 
libpq_source.c works, but should be enough for configuration files.


There's another race condition that this doesn't try to fix: If a file 
is modified while it's being copied, you can have a torn file with one 
half of the file from the old version, and one half from the new. That's 
a much more narrow window, though, and pg_basebackup has the same problem.


- Heikki
>From f0207e355c6be2d710d2ac11e4895795c3471ef3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 13 Nov 2020 11:04:29 +0200
Subject: [PATCH 1/1] pg_rewind: Fetch small files according to new size.

There's a race condition if a file changes in the source system after we
have collected the file list. If the file becomes larger, we only fetched
up to its original size. That can easily result in a truncated file.
That's not a problem for relation files, files in pg_xact, etc. because
any actions on them will be replayed from the WAL. However, configuration
files are affected.

This commit mitigates the race condition by fetching small files in
whole, even if they have grown. This is not a full fix: we still believe
the original file size for files larger than 1 MB. That should be enough
for configuration files, and doing more than that would require bigger
changes to the chunking logic in libpq_source.c.

That mitigates the race condition if the file is modified between the
original scan of files and copying the file, but there's still a race
condition if a file is changed while it's being copied. That's a much
smaller window, though, and pg_basebackup has the same issue.

I ran into this while playing with pg_auto_failover, which frequently
uses ALTER SYSTEM, which update postgresql.auto.conf. Often, pg_rewind
would fail, because the postgresql.auto.conf file changed concurrently
and a partial version of it was copied to the target. The partial
file would fail to parse, preventing the server from starting up.
---
 src/bin/pg_rewind/libpq_source.c  | 32 +
 src/bin/pg_rewind/local_source.c  | 76 +++
 src/bin/pg_rewind/pg_rewind.c |  5 +-
 src/bin/pg_rewind/rewind_source.h | 13 ++
 4 files changed, 112 insertions(+), 14 deletions(-)

diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 47beba277a4..f544d9828a2 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -63,6 +63,7 @@ static void process_queued_fetch_requests(libpq_source *src);
 /* public interface functions */
 static void libpq_traverse_files(rewind_source *source,
  process_file_callback_t callback);
+static void libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len);
 static void libpq_queue_fetch_range(rewind_source *source, const char *path,
 	off_t off, size_t len);
 static void libpq_finish_fetch(rewind_source *source);
@@ -88,6 +89,7 @@ init_libpq_source(PGconn *conn)
 
 	src->common.traverse_files = libpq_traverse_files;
 	src->common.fetch_file = libpq_fetch_file;
+	src->common.queue_fetch_file = libpq_queue_fetch_file;
 	src->common.queue_fetch_range = libpq_queue_fetch_range;
 	src->common.finish_fetch = libpq_finish_fetch;
 	src->common.get_current_wal_insert_lsn = libpq_get_current_wal_insert_lsn;
@@ -307,6 +309,36 @@ libpq_traverse_files(rewind_source *source, process_file_callback_t callback)
 	PQclear(res);
 }
 
+/*
+ * Queue up a request to fetch a file from remote system.
+ */
+static void
+libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len)
+{
+	/*
+	 * Truncate the target file immediately, and queue a request to fetch it
+	 * from the source. If the file is small, smaller than MAX_CHUNK_SIZE,
+	 * request fetching a full-sized chunk anyway, so that if the file has
+	 * become larger in the source system, after we scanned the 

Re: [PATCH] remove deprecated v8.2 containment operators

2020-11-13 Thread Magnus Hagander
On Thu, Nov 12, 2020 at 11:28 PM Tom Lane  wrote:
>
> > The changes to the contrib modules appear to be incomplete in some ways.
> >   In cube, hstore, and seg, there are no changes to the extension
> > scripts to remove the operators.  All you're doing is changing the C
> > code to no longer recognize the strategy, but that doesn't explain what
> > will happen if the operator is still used.  In intarray, by contrast,
> > you're editing an existing extension script, but that should be done by
> > an upgrade script instead.
>
> In the contrib modules, I'm afraid what you gotta do is remove the
> SQL operator definitions but leave the opclass code support in place.
> That's because there's no guarantee that users will update the extension's
> SQL version immediately, so a v14 build of the .so might still be used
> with the old SQL definitions.  It's not clear how much window we need
> give for people to do that update, but I don't think "zero" is an
> acceptable answer.

Based on my experience from the field, the answer is "never".

As in, most people have no idea they are even *supposed* to do such an
upgrade, so they don't do it. Until we solve that problem, I think
we're basically stuck with keeping them "forever". (and even if/when
we do, "zero" is probably not going to cut it, no)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Parallel INSERT (INTO ... SELECT ...)

2020-11-13 Thread Greg Nancarrow
On Wed, Nov 4, 2020 at 2:18 PM Amit Kapila  wrote:
>
> Or you might want to consider moving the check related to
> IsModifySupportedInParallelMode() inside
> PrepareParallelModeForModify(). That way the code might look a bit
> cleaner.
>

Posting an updated Parallel SELECT for "INSERT INTO ... SELECT ..."
patch which addresses previously-identified issues and suggestions,
and adds some tests and doc updates.
I won't post an updated Parallel INSERT patch just yet (which just
builds on the 1st patch), because there's at least a couple of issues
in this 1st patch which need to be discussed first.

Firstly, in order to perform parallel-safety checks in the case of
partitions, the patch currently recursively locks/unlocks
(AccessShareLock) each partition during such checks (as each partition
may itself be a partitioned table). Is there a better way of
performing the parallel-safety checks and reducing the locking
requirements?

Secondly, I found that when running "make check-world", the
"partition-concurrent-attach" test fails, because it is expecting a
partition constraint to be violated on insert, while an "alter table
attach partition ..." is concurrently being executed in another
transaction. Because of the partition locking done by the patch's
parallel-safety checking code, the insert blocks on the exclusive lock
held by the "alter table" in the other transaction until the
transaction ends, so the insert ends up successfully completing (and
thus fails the test) when the other transaction ends. To overcome this
test failure, the patch code was updated to instead perform a
conditional lock on the partition, and on failure (i.e. because of an
exclusive lock held somewhere else), just assume it's parallel-unsafe
because the parallel-safety can't be determined without blocking on
the lock. This is not ideal, but I'm not sure of what other approach
could be used and I am somewhat reluctant to change that test. If
anybody is familiar with the "partition-concurrent-attach" test, any
ideas or insights would be appreciated.
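
For clarity (this is only a sketch of the approach described above, with
illustrative names rather than the exact code from the patch), the
conditional-lock fallback looks roughly like this:

	/*
	 * Try to lock the partition without blocking; if someone else holds a
	 * conflicting (e.g. exclusive) lock, give up and treat the insert as
	 * parallel-unsafe rather than waiting.
	 */
	if (!ConditionalLockRelationOid(partitionOid, AccessShareLock))
		return false;		/* can't determine parallel-safety without blocking */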

Regards,
Greg Nancarrow
Fujitsu Australia


v7-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT.patch
Description: Binary data


v7-0002-Parallel-SELECT-for-INSERT-INTO-.-SELECT-tests-and-doc.patch
Description: Binary data


[PATCH] Combine same ternary types in GIN and TSearch

2020-11-13 Thread Pavel Borisov
Hi, hackers!

For historical reasons, now we have two differently named but similar
ternary data types in TSearch and Gin text-related types. Before v13 there
was also Gin's private TS_execute() version, from which we eventually
shifted to Tsearch's TS_execute().

To make things more even and beautiful I've made a minor refactor to
combine two left ternary types into one.


typedef char GinTernaryValue
#define GIN_FALSE 0
#define GIN_TRUE 1
#define GIN_MAYBE 2


typedef enum { TS_NO, TS_YES, TS_MAYBE } TSTernaryValue;

The change is simple and most of it is just text replacement. The only
thing worth noting is that some code casts pointers between *bool and
*TernaryValue, so their sizes must coincide. (The declaration uses the
*char* type because a plain enum will be of *int* size on most
architectures.) There is no actual change in the code apart from the order
of header file inclusion in some modules.

What do you think about this?

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v1-0001-Combine-same-types-GinTernaryValue-and-TSTernaryV.patch
Description: Binary data


Re: Parallel copy

2020-11-13 Thread Amit Kapila
On Wed, Nov 11, 2020 at 10:42 PM vignesh C  wrote:
>
> On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 10, 2020 at 7:12 PM vignesh C  wrote:
> > >
> > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > I have worked to provide a patch for the parallel safety checks. It
> > > checks if parallely copy can be performed, Parallel copy cannot be
> > > performed for the following a) If relation is temporary table b) If
> > > relation is foreign table c) If relation has non parallel safe index
> > > expressions d) If relation has triggers present whose type is of non
> > > before statement trigger type e) If relation has check constraint
> > > which are not parallel safe f) If relation has partition and any
> > > partition has the above type. This patch has the checks for it. This
> > > patch will be used by parallel copy implementation.
> > >
> >
> > How did you ensure that this is sufficient? For parallel-insert's
> > patch we have enabled parallel-mode for Inserts and ran the tests with
> > force_parallel_mode to see if we are not missing anything. Also, it
> > seems there are many common things here w.r.t parallel-insert patch,
> > is it possible to prepare this atop that patch or do you have any
> > reason to keep this separate?
> >
>
> I have done similar testing for copy too, I had set force_parallel
> mode to regress, hardcoded in the code to pick parallel workers for
> copy operation and ran make installcheck-world to verify. Many checks
> in this patch are common between both patches, but I was not sure how
> to handle it as both the projects are in-progress and are being
> updated based on the reviewer's opinion. How to handle this?
> Thoughts?
>

I have not studied the differences in detail but if it is possible to
prepare it on top of that patch then there shouldn't be a problem. To
avoid confusion if you want you can always either post the latest
version of that patch with your patch or point to it.

-- 
With Regards,
Amit Kapila.




Re: Bogus documentation for bogus geometric operators

2020-11-13 Thread Pavel Borisov
>
> 1. The patch removes <^ and >^ from func.sgml, which is fine, but
> shouldn't there be an addition for the new operators?  (I think
>
I fully agree and added "point" as a possible input type for <<| and |>> in
the manual. PFA v5.


> undocumented.  Maybe instead of removing, change the text to be
> "Deprecated, use the equivalent XXX operator instead."  Or we could
> add a footnote similar to what was there for a previous renaming:
>
The problem is that the new <<| is equivalent to <^ only for points. (To
recap: the source of the problem is that the <^ operator has the same name
for points and for boxes, with a different meaning for these two types.)

           point                                              box
<<| |>>    strictly above/below (new)                         strictly above/below
<^  >^     strictly above/below (deprecated, but available)   above/below

So it seems to me that trying to describe the subtle difference between the
deprecated operator and the same-named one for a different data type would
inevitably make things much worse for the reader. For this reason I'd vote
for completely nuking <^ for the point type (but this is not the only way,
so I haven't done this in v5).

What do you think?

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v5-0001-Deprecate-and-replace-and-operators-for-points.patch
Description: Binary data


Re: In-placre persistance change of a relation

2020-11-13 Thread Kyotaro Horiguchi
At Fri, 13 Nov 2020 07:15:41 +, "osumi.takami...@fujitsu.com" 
 wrote in 
> Hello, Tsunakawa-San
> 

Thanks for sharing it!

> > Do you know the reason why data copy was done before?  And, it may be
> > odd for me to ask this, but I think I saw someone referred to the past
> > discussion that eliminating data copy is difficult due to some processing at
> > commit.  I can't find it.
> I can share 2 sources from the hackers thread on why eliminating the data
> copy is difficult.
> 
> Tom's remark and the context to copy relation's data.
> https://www.postgresql.org/message-id/flat/31724.1394163360%40sss.pgh.pa.us#31724.1394163...@sss.pgh.pa.us

https://www.postgresql.org/message-id/CA+Tgmob44LNwwU73N1aJsGQyzQ61SdhKJRC_89wCm0+aLg=x...@mail.gmail.com

> No, not really.  The issue is more around what happens if we crash
> part way through.  At crash recovery time, the system catalogs are not
> available, because the database isn't consistent yet and, anyway, the
> startup process can't be bound to a database, let alone every database
> that might contain unlogged tables.  So the sentinel that's used to
> decide whether to flush the contents of a table or index is the
> presence or absence of an _init fork, which the startup process
> obviously can see just fine.  The _init fork also tells us what to
> stick in the relation when we reset it; for a table, we can just reset
> to an empty file, but that's not legal for indexes, so the _init fork
> contains a pre-initialized empty index that we can just copy over.
> 
> Now, to make an unlogged table logged, you've got to at some stage
> remove those _init forks.  But this is not a transactional operation.
> If you remove the _init forks and then the transaction rolls back,
> you've left the system an inconsistent state.  If you postpone the
> removal until commit time, then you have a problem if it fails,

That's true. Those are the cause of the headache.

> particularly if it works for the first file but fails for the second.
> And if you crash at any point before you've fsync'd the containing
> directory, you have no idea which files will still be on disk after a
> hard reboot.

This is not an issue in this patch *except* in the case where the init fork
fails to be removed but the following removal of the inittmp fork
succeeds.  Another idea is adding a "not-yet-committed" property to a
fork. I added a new fork type for the sake of simplicity of the patch, but
I could go that way if that is an issue.

> Amit-San quoted this thread and mentioned that point in another thread.
> https://www.postgresql.org/message-id/CAA4eK1%2BHDqS%2B1fhs5Jf9o4ZujQT%3DXBZ6sU0kOuEh2hqQAC%2Bt%3Dw%40mail.gmail.com

This sounds like a somewhat different discussion. Making part of a table
UNLOGGED looks far more difficult to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Support for NSS as a libpq TLS backend

2020-11-13 Thread Jacob Champion
On Nov 11, 2020, at 10:57 AM, Jacob Champion  wrote:
> 
> False alarm -- the stderr debugging I'd added in to track down the
> assertion tripped up the "no stderr" tests. Zero failing tests now.

I took a look at the OpenSSL interop problems you mentioned upthread. I
don't see a hang like you did, but I do see a PR_IO_TIMEOUT_ERROR during
connection.

I think pgtls_read() needs to treat PR_IO_TIMEOUT_ERROR as if no bytes
were read, in order to satisfy its API. There was some discussion on
this upthread:

On Oct 27, 2020, at 1:07 PM, Daniel Gustafsson  wrote:
> 
> On 20 Oct 2020, at 21:15, Andres Freund  wrote:
>> 
>>> +   case PR_IO_TIMEOUT_ERROR:
>>> +   break;
>> 
>> What does this mean? We'll return with a 0 errno here, right? When is
>> this case reachable?
> 
> It should, AFAICT, only be reachable when PR_Recv is used with a timeout which
> we don't do.  It mentioned somewhere that it had happened in no-wait calls due
> to a bug, but I fail to find that reference now.  Either way, I've removed it
> to fall into the default error handling which now sets errno correctly as that
> was a paddle short here.

PR_IO_TIMEOUT_ERROR is definitely returned in no-wait calls on my
machine. It doesn't look like the PR_Recv() API has a choice -- if
there's no data, it can't return a positive integer, and returning zero
means that the socket has been disconnected. So -1 with a timeout error
is the only option.

I'm not completely sure why this is exposed so easily with an OpenSSL
server -- I'm guessing the implementation slices up its packets
differently on the wire, causing a read event before NSS is able to
decrypt a full record -- but it's worth noting that this case also shows
up during NSS-to-NSS psql connections, when handling notifications at
the end of every query. PQconsumeInput() reports a hard failure with the
current implementation, but its return value is ignored by
PrintNotifications(). Otherwise this probably would have shown up
earlier.

(What's the best way to test this case? Are there lower-level tests for
the protocol/network layer somewhere that I'm missing?) 

While patching this case, I also noticed that pgtls_read() doesn't call
SOCK_ERRNO_SET() for the disconnection case. That is also in the
attached patch.

--Jacob



nss-handle-timeouts-and-disconnections-in-pgtls_read.patch
Description:  nss-handle-timeouts-and-disconnections-in-pgtls_read.patch