Re: Optimising compactify_tuples()

2020-09-10 Thread Thomas Munro
On Fri, Sep 11, 2020 at 3:53 AM David Rowley  wrote:
> That gets my benchmark down to 60.8 seconds, so 2.2 seconds better than v4b.

. o O ( I wonder if there are opportunities to squeeze some more out
of this with __builtin_prefetch... )

> I've attached v6b and an updated chart showing the results of the 10
> runs I did of it.

One failure seen like this while running check-world (cfbot):

#2 0x0091f93f in ExceptionalCondition
(conditionName=conditionName@entry=0x987284 "nitems > 0",
errorType=errorType@entry=0x97531d "FailedAssertion",
fileName=fileName@entry=0xa9df0d "bufpage.c",
lineNumber=lineNumber@entry=442) at assert.c:67




Re: Optimising compactify_tuples()

2020-09-10 Thread Thomas Munro
On Fri, Sep 11, 2020 at 1:45 AM David Rowley  wrote:
> On Thu, 10 Sep 2020 at 10:40, Thomas Munro  wrote:
> > I wonder if we could also identify a range at the high end that is
> > already correctly sorted and maximally compacted so it doesn't even
> > need to be copied out.
>
> I've experimented quite a bit with this patch today. I think I've
> tested every idea you've mentioned here, so there's quite a lot of
> information to share.
>
> I did write code to skip the copy to the separate buffer for tuples
> that are already in the correct place and with a version of the patch
> which keeps tuples in their traditional insert order (later lineitem's
> tuple located earlier in the page) I generally see a very large
> number of tuples being skipped with this method.  See attached
> v4b_skipped_tuples.png. The vertical axis is the number of
> compactify_tuple() calls during the benchmark where we were able to
> skip that number of tuples. The average number of tuples skipped over
> all calls during recovery was 81, so we get to skip about half the
> tuples in the page doing this on this benchmark.

Excellent.

> > So one question is whether we want to do the order-reversing as part
> > of this patch, or wait for a more joined-up project to make lots of
> > code paths collude on making scan order match memory order
> > (correlation = 1).  Most or all of the gain from your patch would
> > presumably still apply if we did the exact opposite and forced offset
> > order to match reverse-item ID order (correlation = -1), which also
> > happens to be the initial state when you insert tuples today; you'd
> > still tend towards a state that allows nice big memmove/memcpy calls
> > during page compaction.  IIUC currently we start with correlation -1
> > and then tend towards correlation = 0 after many random updates
> > because we can't change the order, so it gets scrambled over time.
> > I'm not sure what I think about that.
>
> So I did lots of benchmarking with both methods and my conclusion is
> that I think we should stick to the traditional INSERT order with this
> patch. But we should come back and revisit that more generally one
> day. The main reason that I'm put off flipping the tuple order is that
> it significantly reduces the number of times we hit the preordered
> case.  We go to all the trouble of reversing the order only to have it
> broken again when we add 1 more tuple to the page.  If we keep this
> the traditional way, then it's much more likely that we'll maintain
> that pre-order and hit the more optimal memmove code path.

Right, that makes sense.  Thanks for looking into it!

> I've also attached another tiny patch that I think is pretty useful
> separate from this. It basically changes:
>
> LOG:  redo done at 0/D518FFD0
>
> into:
>
> LOG:  redo done at 0/D518FFD0 system usage: CPU: user: 58.93 s,
> system: 0.74 s, elapsed: 62.31 s

+1




Re: Fix for parallel BTree initialization bug

2020-09-10 Thread Hunter, James

Nice repro, thanks!
--
James Hunter, Amazon Web Services (AWS)

On 9/10/20 7:37 PM, Justin Pryzby wrote:

Against all odds, I was able to reproduce this.

begin;
CREATE TABLE t AS SELECT generate_series(1,99)i;
ALTER TABLE t SET (parallel_workers=2, autovacuum_enabled=off);
CREATE INDEX ON t(i);
commit;

SET parallel_leader_participation=off; SET min_parallel_table_scan_size=0; SET 
enable_bitmapscan=off; SET enable_indexonlyscan=off; SET enable_seqscan=off;
explain(analyze , verbose on) SELECT COUNT(1) FROM t a WHERE a.i>555 AND i IN ( 
333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666
 ) ORDER BY 1;

Which gives a plan like:
  Sort  (cost=5543.71..5543.72 rows=1 width=8)
Sort Key: (count(1))
->  Finalize Aggregate  (cost=5543.69..5543.70 rows=1 width=8)
  ->  Gather  (cost=5543.48..5543.69 rows=2 width=8)
Workers Planned: 2
->  Partial Aggregate  (cost=4543.48..4543.49 rows=1 width=8)
  ->  Parallel Index Scan using t_i_idx on t a  
(cost=0.42..4204.92 rows=135423 width=0)

I don't get an error, on read-only hot standby.  I do get inconsistent results,
including on primary server.

count | 222
count | 214

This appears to be a bug in commit 569174f1b btree: Support parallel index 
scans.

I've added your patch here:
https://commitfest.postgresql.org/30/2729/

In the course of reproducing this, I also added:
@@ -1972,2 +1975,3 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, 
ScanDirection dir)
 rel = scan->indexRelation;
+   Assert(BlockNumberIsValid(blkno));

--
Justin




RE: Implement UNLOGGED clause for COPY FROM

2020-09-10 Thread tsunakawa.ta...@fujitsu.com
From: Peter Smith 
On Thu, Sep 10, 2020 at 7:16 PM tsunakawa.ta...@fujitsu.com
>  wrote:
> > ALTER TABLE takes long time proportional to the amount of existing data,
> while wal_level = none doesn't.
> 
> Right, but if wal_level=none is considered OK for that table with
> existing data, then why not just create the table UNLOGGED in the
> first place? (or ALTER it to set UNLOGGED just one time and then leave
> it as UNLOGGED).

The target tables sometimes receive updates (for data maintenance and/or 
correction).  They don't want those updates to be lost due to the database 
server crash.  Unlogged tables lose their entire contents during crash recovery.

Please think of it like this: logging is the norm, and unlogged operations are 
exceptions/hacks for particular requirements, whose use the user wants to 
minimize.


Regards
Takayuki Tsunakawa






Re: New statistics for tuning WAL buffer size

2020-09-10 Thread Fujii Masao




On 2020/09/11 12:17, Kyotaro Horiguchi wrote:

Hello.

At Wed, 09 Sep 2020 13:57:37 +0900, Masahiro Ikeda  
wrote in

I checked what function calls XLogBackgroundFlush() which calls
AdvanceXLInsertBuffer() to increment m_wal_buffers_full.

I found that WalSndWaitForWal() calls it, so I added it.
Is it better to move it in WalSndLoop() like the attached patch?


By the way, we are counting some wal-related numbers in
pgWalUsage.(bytes, records, fpi).  Since now that we are going to have
a new view related to WAL statistics, wouldn't it be more useful to
show them together in the view?


Probably yes. But IMO it's better to commit the current patch first, and then 
add those stats into the view after confirming that exposing them is useful.

BTW, to expose the total WAL bytes, I think it's better to just save the LSN at 
the time pg_stat_wal is reset rather than counting pgWalUsage.bytes. If we do 
that, we can easily compute the total WAL bytes by subtracting that LSN from 
the latest LSN. Also, saving the LSN at reset time obviously causes less 
overhead than counting pgWalUsage.bytes.
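
As an aside, a rough sketch of that computation, assuming the reset LSN were 
exposed somewhere (the stats_reset_lsn column name below is made up; 
pg_current_wal_lsn() and pg_wal_lsn_diff() already exist):

-- hypothetical: total WAL bytes generated since the last stats reset
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), stats_reset_lsn) AS total_wal_bytes
FROM pg_stat_wal;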



(Another reason to propose this is that a substantially one-column
  table may look not-great..)


I'm ok with such a "small" view. But if this is really a problem, I'm ok with 
exposing only the functions pg_stat_get_wal_buffers_full() and 
pg_stat_get_wal_stat_reset_time(), without the view, at first.
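
Just for illustration, with only those functions exposed (names as proposed in 
the patch), one could still do something like:

SELECT pg_stat_get_wal_buffers_full() AS wal_buffers_full,
       pg_stat_get_wal_stat_reset_time() AS stats_reset;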

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: New statistics for tuning WAL buffer size

2020-09-10 Thread Kyotaro Horiguchi
Hello.

At Wed, 09 Sep 2020 13:57:37 +0900, Masahiro Ikeda  
wrote in 
> I checked what function calls XLogBackgroundFlush() which calls
> AdvanceXLInsertBuffer() to increment m_wal_buffers_full.
> 
> I found that WalSndWaitForWal() calls it, so I added it.
> Is it better to move it in WalSndLoop() like the attached patch?

By the way, we are counting some wal-related numbers in
pgWalUsage.(bytes, records, fpi).  Since now that we are going to have
a new view related to WAL statistics, wouldn't it be more useful to
show them together in the view?

(Another reason to propose this is that a substantially one-column
 table may look not-great..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-10 Thread Fujii Masao




On 2020/09/11 0:37, Masahiko Sawada wrote:

On Tue, 8 Sep 2020 at 13:00, tsunakawa.ta...@fujitsu.com
 wrote:


From: Amit Kapila 

I intend to say that the global-visibility work can impact this in a
major way and we have analyzed that to some extent during a discussion
on the other thread. So, I think without having a complete
design/solution that addresses both the 2PC and global-visibility, it
is not apparent what is the right way to proceed. It seems to me that
rather than working on individual (or smaller) parts one needs to come
up with a bigger picture (or overall design) and then once we have
figured that out correctly, it would be easier to decide which parts
can go first.


I'm really sorry I've been getting later and later in publishing the revised 
scale-out design wiki to discuss the big picture!  I don't know why I'm taking 
this long; I feel as if I were captive in a time prison (yes, nobody is 
holding me captive; I'm just late.)  Please wait a few days.

But to proceed with the development, let me comment on the atomic commit and 
global visibility.

* We have to hear from Andrey about their check on the possibility that 
Clock-SI could be covered by a Microsoft patent, and whether we can avoid it.

* I have a feeling that we can adopt the algorithm used by Spanner, 
CockroachDB, and YugabyteDB.  That is, 2PC for multi-node atomic commit, Paxos 
or Raft for replica synchronization (in the process of commit) to make 2PC more 
highly available, and the timestamp-based global visibility.  However, the 
timestamp-based approach forces the database instance to shut down when the 
node's clock drifts too far from those of the other nodes.

* Or, maybe we can use the following Commitment ordering that doesn't require 
the timestamp or any other information to be transferred among the cluster 
nodes.  However, this seems to require tracking the order of read and write 
operations among concurrent transactions to ensure the correct commit order, so 
I'm not sure about the performance.  The MVCO paper seems to present the 
information we need, but I haven't understood it well yet (it's difficult.)  
Could anybody kindly interpret this?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co


As for the Sawada-san's 2PC patch, which I find interesting purely as FDW 
enhancement, I raised the following issues to be addressed:

1. Make FDW API implementable by other FDWs than postgres_fdw (this is what 
Amit-san kindly pointed out.)  I think oracle_fdw and jdbc_fdw would be good 
examples to consider, while MySQL may not be good because it exposes the XA 
feature as SQL statements, not C functions as defined in the XA specification.


I agree that we need to verify new FDW APIs will be suitable for other
FDWs than postgres_fdw as well.



2. 2PC processing is queued and serialized in one background worker.  That 
severely limits transaction throughput.  Each backend should perform 2PC.


I'm not sure it's safe for each backend to perform PREPARE and COMMIT
PREPARED, since the current design aims to avoid an inconsistency
between the actual transaction result and the result the user sees.


Can I check my understanding about why the resolver process is necessary?

Firstly, you think that issuing the COMMIT PREPARED command to the foreign 
server can cause an error, for example because of a connection error, OOM, etc. 
On the other hand, merely waiting for another process to issue the command is 
less likely to cause an error. Right?

If an error occurs in the backend process after the commit record is WAL-logged, 
the error would be reported to the client, which may then misunderstand that the 
transaction failed even though the commit record was already flushed. So you 
think that each backend should not issue the COMMIT PREPARED command, to avoid 
that inconsistency. Instead, it's better to make another process, the resolver, 
issue the command and just make each backend wait for that to complete. Right?
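
To make that division of labor concrete, here is a minimal sketch of the 
commands involved on the foreign server (the identifier 'fx_1234' is made up 
for illustration):

-- issued by the backend while preparing the distributed transaction
PREPARE TRANSACTION 'fx_1234';
-- issued afterwards by the resolver process in this design, once the
-- local commit record has been flushed
COMMIT PREPARED 'fx_1234';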

Also, using the resolver process has another merit: when there are unresolved 
foreign transactions but the corresponding backend has exited, the resolver can 
try to resolve them. If something like this automatic resolution is necessary, 
a process like the resolver would be necessary. Right?

On the contrary, if we don't need such automatic resolution (i.e., unresolved 
foreign transactions always need to be resolved manually) and we can prevent 
the code that issues the COMMIT PREPARED command from causing an error (not 
sure if that's possible, though...), we probably don't need the resolver 
process. Right?



But in the future, I think we can have multiple background workers per
database for better performance.


Yes, that's an idea.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-10 Thread Amit Kapila
On Thu, Sep 10, 2020 at 6:42 PM Alvaro Herrera  wrote:
>
> On 2020-Sep-10, Amit Kapila wrote:
>
> > On Thu, Sep 10, 2020 at 1:03 PM Magnus Hagander  wrote:
>
> > The comments already say what you said in the second suggestion:"The
> > caller must rely on timestamp stored in *ts iff the function returns
> > true.". Read iff "as if and only if"
>
> I think "must" should be "may" there, if we're nitpicking.
>

Here, we want to say that "the caller can rely on *ts only if the function
returns true". If we replace 'must' with 'may', then it seems to me we
are saying that the caller can ignore the timestamp value; if so,
why would the caller have called this function in the first place?

-- 
With Regards,
Amit Kapila.




Re: SIGQUIT handling, redux

2020-09-10 Thread Tom Lane
I wrote:
> I'll take a closer look at the idea of using _exit(1) tomorrow,
> but I'd be pretty hesitant to back-patch that.

Here's a patch for that; it passes light testing, including forcing
the backend through the SIGTERM and timeout code paths.  There's
not much to it except for changing the signal handlers, but I did
add a cross-check that no on-exit handlers have been registered
before we get done with using these signal handlers.

I looked briefly at the idea of postponing ProcessStartupPacket
until InitPostgres has set up a fairly normal environment.  It
seems like it'd take a fair amount of refactoring.  I really
doubt it's worth the effort, even though the result would be
arguably cleaner logic than what we have now.

regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 15ad675bc1..081022a206 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4298,6 +4298,8 @@ report_fork_failure_to_client(Port *port, int errnum)
  * returns: nothing.  Will not return at all if there's any failure.
  *
  * Note: this code does not depend on having any access to shared memory.
+ * Indeed, our approach to SIGTERM/timeout handling *requires* that
+ * shared memory not have been touched yet; see comments within.
  * In the EXEC_BACKEND case, we are physically attached to shared memory
  * but have not yet set up most of our local pointers to shmem structures.
  */
@@ -4341,27 +4343,14 @@ BackendInitialize(Port *port)
 	whereToSendOutput = DestRemote; /* now safe to ereport to client */
 
 	/*
-	 * We arrange to do proc_exit(1) if we receive SIGTERM or timeout while
-	 * trying to collect the startup packet; while SIGQUIT results in
-	 * _exit(2).  Otherwise the postmaster cannot shutdown the database FAST
-	 * or IMMED cleanly if a buggy client fails to send the packet promptly.
+	 * We arrange to do _exit(1) if we receive SIGTERM or timeout while trying
+	 * to collect the startup packet; while SIGQUIT results in _exit(2).
+	 * Otherwise the postmaster cannot shutdown the database FAST or IMMED
+	 * cleanly if a buggy client fails to send the packet promptly.
 	 *
-	 * XXX this is pretty dangerous; signal handlers should not call anything
-	 * as complex as proc_exit() directly.  We minimize the hazard by not
-	 * keeping these handlers active for longer than we must.  However, it
-	 * seems necessary to be able to escape out of DNS lookups as well as the
-	 * startup packet reception proper, so we can't narrow the scope further
-	 * than is done here.
-	 *
-	 * XXX it follows that the remainder of this function must tolerate losing
-	 * control at any instant.  Likewise, any pg_on_exit_callback registered
-	 * before or during this function must be prepared to execute at any
-	 * instant between here and the end of this function.  Furthermore,
-	 * affected callbacks execute partially or not at all when a second
-	 * exit-inducing signal arrives after proc_exit_prepare() decrements
-	 * on_proc_exit_index.  (Thanks to that mechanic, callbacks need not
-	 * anticipate more than one call.)  This is fragile; it ought to instead
-	 * follow the norm of handling interrupts at selected, safe opportunities.
+	 * Exiting with _exit(1) is only possible because we have not yet touched
+	 * shared memory; therefore no outside-the-process state needs to get
+	 * cleaned up.
 	 */
 	pqsignal(SIGTERM, process_startup_packet_die);
 	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
@@ -4420,8 +4409,8 @@ BackendInitialize(Port *port)
 		port->remote_hostname = strdup(remote_host);
 
 	/*
-	 * Ready to begin client interaction.  We will give up and proc_exit(1)
-	 * after a time delay, so that a broken client can't hog a connection
+	 * Ready to begin client interaction.  We will give up and _exit(1) after
+	 * a time delay, so that a broken client can't hog a connection
 	 * indefinitely.  PreAuthDelay and any DNS interactions above don't count
 	 * against the time limit.
 	 *
@@ -4449,6 +4438,17 @@ BackendInitialize(Port *port)
 	disable_timeout(STARTUP_PACKET_TIMEOUT, false);
 	PG_SETMASK();
 
+	/*
+	 * As a safety check that nothing in startup has yet performed
+	 * shared-memory modifications that would need to be undone if we had
+	 * exited through SIGTERM or timeout above, check that no on_shmem_exit
+	 * handlers have been registered yet.  (This isn't terribly bulletproof,
+	 * since someone might misuse an on_proc_exit handler for shmem cleanup,
+	 * but it's a cheap and helpful check.  We cannot disallow on_proc_exit
+	 * handlers unfortunately, since pq_init() already registered one.)
+	 */
+	check_on_shmem_exit_lists_are_empty();
+
 	/*
 	 * Stop here if it was bad or a cancel packet.  ProcessStartupPacket
 	 * already did any appropriate error reporting.
@@ -5369,23 +5369,21 @@ sigusr1_handler(SIGNAL_ARGS)
 
 /*
  * SIGTERM while processing startup packet.
- * Clean up and exit(1).
  *
- * 

Re: Fix for parallel BTree initialization bug

2020-09-10 Thread Justin Pryzby
Against all odds, I was able to reproduce this.

begin;
CREATE TABLE t AS SELECT generate_series(1,99)i;
ALTER TABLE t SET (parallel_workers=2, autovacuum_enabled=off);
CREATE INDEX ON t(i);
commit;

SET parallel_leader_participation=off; SET min_parallel_table_scan_size=0; SET 
enable_bitmapscan=off; SET enable_indexonlyscan=off; SET enable_seqscan=off;
explain(analyze , verbose on) SELECT COUNT(1) FROM t a WHERE a.i>555 AND i IN ( 
333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666
 ) ORDER BY 1;

Which gives a plan like:
 Sort  (cost=5543.71..5543.72 rows=1 width=8)
   Sort Key: (count(1))
   ->  Finalize Aggregate  (cost=5543.69..5543.70 rows=1 width=8)
 ->  Gather  (cost=5543.48..5543.69 rows=2 width=8)
   Workers Planned: 2
   ->  Partial Aggregate  (cost=4543.48..4543.49 rows=1 width=8)
 ->  Parallel Index Scan using t_i_idx on t a  
(cost=0.42..4204.92 rows=135423 width=0)

I don't get an error, on read-only hot standby.  I do get inconsistent results,
including on primary server.

count | 222
count | 214

This appears to be a bug in commit 569174f1b btree: Support parallel index 
scans.

I've added your patch here:
https://commitfest.postgresql.org/30/2729/

In the course of reproducing this, I also added:
@@ -1972,2 +1975,3 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, 
ScanDirection dir)
rel = scan->indexRelation;
+   Assert(BlockNumberIsValid(blkno));

-- 
Justin




Re: Implement UNLOGGED clause for COPY FROM

2020-09-10 Thread Peter Smith
On Thu, Sep 10, 2020 at 7:16 PM tsunakawa.ta...@fujitsu.com
 wrote:

> ALTER TABLE takes long time proportional to the amount of existing data, 
> while wal_level = none doesn't.

Right, but if wal_level=none is considered OK for that table with
existing data, then why not just create the table UNLOGGED in the
first place? (or ALTER it to set UNLOGGED just one time and then leave
it as UNLOGGED).
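
For reference, a sketch of that existing alternative (the table name and file 
path are just placeholders):

-- one-time switch (this is the rewrite whose cost was mentioned upthread);
-- the table's contents will not survive crash recovery while it is unlogged
ALTER TABLE my_table SET UNLOGGED;
-- subsequent loads then skip WAL
COPY my_table FROM '/path/to/data.csv' WITH (FORMAT csv);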

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-10 Thread Peter Geoghegan
On Tue, Sep 8, 2020 at 11:28 PM Jeff Davis  wrote:
> Preallocation showed significant gains for HashAgg, and BufFile doesn't
> support sparse writes. So, for HashAgg, it seems like we should just
> update the comment and consider it the price of using BufFile.

> For Sort, we can just disable preallocation.

+1.

I think that you can push sort-no-prealloc.patch without delay. That
looks good to me.

> Right now, it seems nBlocksAllocated means "number of blocks returned
> by ltsGetFreeBlock(), plus nHoleBlocks".

Just to be clear: I'm assuming that we must honor the original intent
of earlier code in my remarks here (in particular, code added by
7ac4a389a7d). This may not be exactly where we end up, but it's a good
starting point.

nHoleBlocks was added by the parallel CREATE INDEX commit in 2018,
which added BufFile/logtape.c concatenation. Whereas nBlocksAllocated
was added by the earlier bugfix commit 7ac4a389a7d in 2017 (the bugfix
I've referenced several times upthread). Clearly nBlocksAllocated
cannot be defined in terms of some other thing that wasn't there when
it was first added (I refer to nHoleBlocks). In general nHoleBlocks
can only be non-zero when the logical tapeset has been unified in a
leader process (for a parallel CREATE INDEX).

nBlocksAllocated is not the same thing as nBlocksWritten, though the
difference is more subtle than you suggest. nBlocksAllocated actually
means (or should mean) "the number of blocks allocated to the file",
which is usually the same thing that a stat() call or tools like "ls"
are expected report for the underlying temp file once the merge phase
of an external sort is reached (assuming that you only need one temp
file for a BufFile, and didn't use parallelism/concatenation, which is
the common case). That's why nBlocksAllocated is what
LogicalTapeSetBlocks() returns (pretty much). At least, the original
post-7ac4a389a7d version of LogicalTapeSetBlocks() was just "return
nBlocksAllocated;". nHoleBlocks was added for parallel CI, but that
was only supposed to compensate for the holes left behind by
concatenation/parallel sort, without changing the logtape.c space
management design in any fundamental way.

Obviously you must be wondering what the difference is, if it's not
just the nHoleBlocks thing. nBlocksAllocated is not necessarily equal
to nBlocksWritten (even when we ignore concatenation/nHoleBlocks), but
it's almost always equal in practice (again, barring nHoleBlocks !=
0). It's possible that a tuplesort will not have flushed the last
block at a point when LogicalTapeSetBlocks() is called -- it will have
allocated the block, but not yet written it to the BufFile. IOW, as
far as tuplesort.c is concerned the data is written to tape, but it
happens to not have been passed through to the OS via write(), or even
passed through to BufFileWrite() -- it happens to still be in one of
the small per-tape write buffers. When this occurs, a small amount of
dirty data in said per-tape buffer is considered written by
tuplesort.c, but from the point of view of logtape.c it is allocated
but not yet "written" (by which I mean not yet passed to buffile.c,
which actually does its own buffering, which it can neglect to flush
immediately in turn).

It's possible that tuplesort.c will need to call
LogicalTapeSetBlocks() at an earlier point after all tuples are
written but before they're "flushed" in logtape.c/buffile.c. We need
to avoid confusion when that happens. We want to insulate tuplesort.c
from implementation details that are private to logtape.c and/or
buffile.c. Bear in mind that nBlocksAllocated was originally only ever
supposed to have a value equal to nBlocksWritten, or the value
nBlocksWritten + 1. It is reasonable to want to hide the buffering
from LogicalTapeSetBlocks() once you realize that this mechanism is
only supposed to smooth-over an edge case involving one extra block
that will be written out in a moment anyway.

What does all of this mean for the new preallocation stuff that
benefits HashAggs that spill? Well, I'm not sure. I was specifically
concerned that somebody would end up misusing the ltsWriteBlock()
allocated-but-not-written thing in this way back in 2017, and said so
at the time -- that's why commit 7ac4a389a7d added comments about all
this to ltsWriteBlock(). For external sorts, that we're agreed won't
be using preallocation anyway, I think that we should go back to
reporting allocated blocks from LogicalTapeSetBlocks() -- very often
this is nBlocksWritten, but occasionally it's nBlocksWritten + 1. I
haven't yet refreshed my memory on the exact details of when you get
one behavior rather than the other, but I know it is possible in
practice with a tuplesort on Postgres 12. It might depend on subtle
issues like the alignment with BufFile segments -- see my test case
from 2017 to get an idea of how to make it easier to reveal problems
in this area:


Re: Logical Replication - detail message with names of missing columns

2020-09-10 Thread Kyotaro Horiguchi
Thanks for the revised version.

At Tue, 8 Sep 2020 22:36:17 +0530, Bharath Rupireddy 
 wrote in 
> Thanks for the comments. Attaching the v3 patch.
> 
> On Tue, Sep 8, 2020 at 8:19 PM Alvaro Herrera  
> wrote:
> >
> > This looks a bit fiddly.  Would it be less cumbersome to use
> > quote_identifier here instead?
> >
> 
> Changed. quote_identifier() adds quotes wherever necessary.
> 
> >
> > Please do use errmsg_plural -- have the new function return the number
> > of missing columns.  Should be pretty straightforward.
> >
> 
> Changed. Now the error message looks as below:

^^;

I don't think logicalrep_find_missing_attrs is worth a separate
function. The detection code would be short enough to be embedded in
the checking loop in logicalrep_rel_open. Note that remoterel doesn't
have missing columns, since they are already removed when it is
constructed.  See fetch_remote_table_info; logicalrep_rel_att_by_name
relies on that fact.  As a result this code could be reduced to
something like the following.

+ /* remoterel doesn't contain dropped attributes, see  */
- found = 0;
+ missing_attrs = bms_add_range(NULL, 0, remoterel->natts - 1);
  for (i = 0; i < desc->natts; i++)
if (attnum >= 0)
- found++;
+ missing_attrs = bms_del_member(missing_attrs, attnum);
- if (found < remoterel->natts)
+ if (!bms_is_empty(missing_attrs))
+ {
+   while ((i = bms_first_member(missing_attrs)) >= 0)
+   {
+  if (not_first) appendStringInfo();
+  appendStringInfo(str, remoterel->attnames[i])
+   }
-   ereport("some attrs missing");
+   ereport(ERROR, );
+ }

> ERROR:  logical replication target relation "public.test_tbl1" is
> missing replicated columns:b1,c1,e1

I think we always double-quote identifiers in error messages. For
example:

./catalog/index.c

Re: PG 13 release notes, first draft

2020-09-10 Thread Tom Lane
Justin Pryzby  writes:
> Rebasing onto 3965de54e718600a4703233936e56a3202caf73f, I'm left with:

Sorry, I hadn't seen that you submitted more updates.  I pushed
these with minor additional wordsmithing.

regards, tom lane




Re: Print logical WAL message content

2020-09-10 Thread Alvaro Herrera
On 2020-Aug-19, Ashutosh Bapat wrote:

> I like this format as well. Done.
> 
> PFA the patch attached with your comments addressed.

Pushed, thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PG 13 release notes, first draft

2020-09-10 Thread Justin Pryzby
On Mon, Sep 07, 2020 at 08:40:26AM -0500, Justin Pryzby wrote:

Rebasing onto 3965de54e718600a4703233936e56a3202caf73f, I'm left with:

diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml
index 8fffc6fe0a..69d143e10c 100644
--- a/doc/src/sgml/release-13.sgml
+++ b/doc/src/sgml/release-13.sgml
@@ -131,7 +131,7 @@ Author: Thomas Munro 
  
 
 
-SELECT round(sum(OLDVALUE / n::float)) FROM 
generate_series(1, OLDVALUE) s(n);
+SELECT round(sum(OLDVALUE / n::float)) AS newvalue 
FROM generate_series(1, OLDVALUE) s(n);
 
 
 
@@ -776,8 +776,8 @@ Author: Noah Misch 
 -->
 

-Allow skipping of WAL for full table writes if WAL writes to be skipped during a transaction
+which creates or rewrites a relation if  is minimal (Kyotaro
 Horiguchi)

@@ -1007,7 +1007,7 @@ Author: Michael Paquier 
 

 Add leader_pid to  to report parallel worker ownership
+linkend="pg-stat-activity-view"/> to report parallel worker's leader 
process
 (Julien Rouhaud)

   
@@ -1262,8 +1262,8 @@ Author: Peter Eisentraut 
 -->
 

-Enable Unix-domain sockets
-support on Windows (Peter Eisentraut)
+Enable support for Unix-domain 
sockets
+on Windows (Peter Eisentraut)

   
 
@@ -1391,8 +1391,8 @@ Author: Fujii Masao 
 -->
 
   
-   Allow WAL recovery to continue even if invalid
-   pages are referenced (Fujii Masao)
+   Allow recovery to continue even if invalid
+   pages are referenced by WAL (Fujii Masao)
   
 
   




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-09-10 Thread Alvaro Herrera
On 2020-Aug-31, Hao Wu wrote:

> I tested the patch provided by Alvaro. It does not seem to work well in REPEATABLE READ.

> -- the tuples from tpart_2 are gone.
> gpadmin=*# select * from tpart;
>  i  | j
> +
>  10 | 10
>  50 | 50
> (2 rows)

Interesting example, thanks.  It seems this can be fixed without
breaking anything else by changing the planner so that it includes
detached partitions when we are in a snapshot-isolation transaction.
Indeed, the results from the detach-partition-concurrently-1.spec
isolation test are more satisfying with this change.

The attached v2, on current master, includes that change, as well as
fixes a couple of silly bugs in the previous one.

(Because of experimenting with git workflow I did not keep the 0001
part split in this one, but that part is unchanged from v1.)
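
For anyone following along, a usage sketch of the syntax added by this patch 
(table names taken from the example upthread; see the alter_table.sgml changes 
below for the precise behavior):

-- detach without blocking concurrent sessions; internally runs in two
-- transactions
ALTER TABLE tpart DETACH PARTITION tpart_2 CONCURRENTLY;
-- if a concurrent detach was interrupted partway, complete it later with
ALTER TABLE tpart DETACH PARTITION tpart_2 FINALIZE;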

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 3ae63eba543990d78c233ebeffa299e25c446f42 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 10 Sep 2020 17:50:41 -0300
Subject: [PATCH v2] ALTER TABLE ...  DETACH CONCURRENTLY

---
 doc/src/sgml/catalogs.sgml|  10 +
 doc/src/sgml/ref/alter_table.sgml |  25 +-
 src/backend/catalog/heap.c|  13 +-
 src/backend/catalog/index.c   |   4 +-
 src/backend/catalog/pg_inherits.c |  51 +-
 src/backend/commands/indexcmds.c  |   5 +-
 src/backend/commands/tablecmds.c  | 510 ++
 src/backend/commands/trigger.c|   7 +-
 src/backend/executor/execPartition.c  |  16 +-
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/optimizer/util/plancat.c  |  11 +-
 src/backend/parser/gram.y |  22 +-
 src/backend/partitioning/partbounds.c |   6 +-
 src/backend/partitioning/partdesc.c   |  21 +-
 src/backend/tcop/utility.c|  19 +
 src/bin/psql/describe.c   |  40 +-
 src/include/catalog/pg_inherits.h |   7 +-
 src/include/nodes/parsenodes.h|   2 +
 src/include/parser/kwlist.h   |   1 +
 src/include/partitioning/partdesc.h   |   5 +-
 .../detach-partition-concurrently-1.out   | 233 
 .../detach-partition-concurrently-2.out   |  66 +++
 src/test/isolation/isolation_schedule |   2 +
 .../detach-partition-concurrently-1.spec  |  76 +++
 .../detach-partition-concurrently-2.spec  |  41 ++
 26 files changed, 1035 insertions(+), 160 deletions(-)
 create mode 100644 src/test/isolation/expected/detach-partition-concurrently-1.out
 create mode 100644 src/test/isolation/expected/detach-partition-concurrently-2.out
 create mode 100644 src/test/isolation/specs/detach-partition-concurrently-1.spec
 create mode 100644 src/test/isolation/specs/detach-partition-concurrently-2.spec

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 508bea3bc6..abef70ad46 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -4481,6 +4481,16 @@ SCRAM-SHA-256$iteration count:
when using declarative partitioning.
   
  
+
+ 
+  
+   inhdetached bool
+  
+  
+   Set to true for a partition that is in the process of being detached;
+   false otherwise.
+  
+ 
 

   
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index e9c6a8a6c1..c35e83f229 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -36,7 +36,9 @@ ALTER TABLE ALL IN TABLESPACE name
 ALTER TABLE [ IF EXISTS ] name
 ATTACH PARTITION partition_name { FOR VALUES partition_bound_spec | DEFAULT }
 ALTER TABLE [ IF EXISTS ] name
-DETACH PARTITION partition_name
+DETACH PARTITION partition_name [ CONCURRENTLY ]
+ALTER TABLE [ IF EXISTS ] name
+DETACH PARTITION partition_name FINALIZE
 
 where action is one of:
 
@@ -938,7 +940,8 @@ WITH ( MODULUS numeric_literal, REM

 

-DETACH PARTITION partition_name
+DETACH PARTITION partition_name [ { CONCURRENTLY | FINALIZE } ]
+
 
  
   This form detaches the specified partition of the target table.  The detached
@@ -947,6 +950,24 @@ WITH ( MODULUS numeric_literal, REM
   attached to the target table's indexes are detached.  Any triggers that
   were created as clones of those in the target table are removed.
  
+ 
+  If CONCURRENTLY is specified, this process runs in two
+  transactions in order to avoid blocking other sessions that might be accessing
+  the partitioned table.  During the first transaction,
+  SHARE UPDATE EXCLUSIVE is taken in both parent table and
+  partition, and the partition is marked detached; at that point, the transaction
+  is committed and all transactions using the partitioned table 

Re: WIP: BRIN multi-range indexes

2020-09-10 Thread John Naylor
Ok, here's an attempt at a somewhat more natural test, to see what
happens after bulk updates and deletes, followed by more inserts. The
short version is that multi-minmax is resilient to a change that
causes a 4x degradation for simple minmax.


shared_buffers = 1GB
random_page_cost = 1.1
effective_cache_size = 4GB
work_mem = 64MB
maintenance_work_mem = 512MB

create unlogged table iot (
id bigint generated by default as identity primary key,
num double precision not null,
create_dt timestamptz not null,
stuff text generated always as (md5(id::text)) stored
)
with (fillfactor = 95);

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-01-01 0:00'::timestamptz +'49000999 seconds'::interval,
  '2 seconds'::interval) x;

INSERT 0 24500500

(01:18s, 2279 MB)

-- done in separate tests so the planner can choose each in turn
create index cd_single on iot using brin(create_dt);
6.7s
create index cd_multi on iot using brin(create_dt timestamptz_minmax_multi_ops);
34s

vacuum analyze;

-- aggregate February
-- single minmax and multi-minmax same plan and same Heap Blocks
below, so only one plan shown
-- query times between the opclasses within noise of variation

explain analyze select date_trunc('day', create_dt), avg(num)
from iot
where create_dt >= '2020-02-01 0:00' and create_dt < '2020-03-01 0:00'
group by 1;


  QUERY PLAN

 HashAggregate  (cost=357664.79..388181.83 rows=1232234 width=16)
(actual time=559.805..561.649 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  (cost=323.74..313622.05 rows=1232234
width=16) (actual time=1.787..368.256 rows=1252800 loops=1)
 Recheck Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Rows Removed by Index Recheck: 15936
 Heap Blocks: lossy=15104
 ->  Bitmap Index Scan on cd_single  (cost=0.00..15.68
rows=1236315 width=0) (actual time=0.933..0.934 rows=151040 loops=1)
   Index Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Planning Time: 0.118 ms
 Execution Time: 568.653 ms
(11 rows)


-- delete first month and hi/lo values to create some holes in the table
delete from iot
where create_dt < '2020-02-01 0:00'::timestamptz;

DELETE 1339200

delete from iot
where num < 0.05
or num > 0.95;

DELETE 2316036

vacuum analyze iot;

-- add back the first month, but with double density (1s step rather
than 2s) so it spills over into other parts of the table, causing more
block ranges to have a lower bound within this month.

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-01-31 23:59'::timestamptz,
  '1 second'::interval) x;

INSERT 0 2678341

vacuum analyze;

-- aggregate February again

explain analyze select date_trunc('day', create_dt), avg(num)
from iot
where create_dt >= '2020-02-01 0:00' and create_dt < '2020-03-01 0:00'
group by 1;

-- simple minmax:

  QUERY PLAN

 HashAggregate  (cost=354453.63..383192.38 rows=1160429 width=16)
(actual time=2375.075..2376.982 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  (cost=305.85..312977.36 rows=1160429
width=16) (actual time=8.162..2201.547 rows=1127668 loops=1)
 Recheck Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Rows Removed by Index Recheck: 12278985
 Heap Blocks: lossy=159616
 ->  Bitmap Index Scan on cd_single  (cost=0.00..15.74
rows=1206496 width=0) (actual time=7.177..7.178 rows=1596160 loops=1)
   Index Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Planning Time: 0.117 ms
 Execution Time: 2383.685 ms
(11 rows)

-- multi minmax:

  QUERY PLAN

 HashAggregate  (cost=354089.57..382932.46 rows=1164634 width=16)
(actual time=535.773..537.731 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  

Re: PG 13 release notes, first draft

2020-09-10 Thread Tom Lane
"Jonathan S. Katz"  writes:
> One thing that did not make it through was this:

> - 2020-XX-XX, CURRENT AS OF 2020-08-09
> + 2020-09-24, CURRENT AS OF 2020-09-09

Yeah, that's a placeholder to recall how far back to look for additional
changes to the relnotes, so unless you already read the git history that
far back and concluded nothing needed documenting, that was premature.

(I've just about finished making those updates and making an editorial
pass over the notes, so I will change it in a little bit.)

regards, tom lane




Re: SIGQUIT handling, redux

2020-09-10 Thread Tom Lane
Robert Haas  writes:
> On Thu, Sep 10, 2020 at 12:56 PM Tom Lane  wrote:
>> Also, man that CHECK_FOR_INTERRUPTS() looks like trouble.
>> Could we take that out?

> Maybe I'm missing something, but why wouldn't that be a horrible idea?
> We do not want to have long waits where we refuse to respond to
> interrupts.

It might be appropriate for some of the callers to do it.  But I don't
see any great argument why ProcWaitForSignal itself has to do it.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-09-10 Thread Alvaro Herrera
On 2020-Sep-10, Tomas Vondra wrote:

> I've spent a bit of time experimenting with this. My idea was to allow
> keeping an "expanded" version of the summary somewhere. As the addValue
> function only receives BrinValues I guess one option would be to just
> add bv_mem_values field. Or do you have a better idea?

Maybe it's okay to pass the BrinMemTuple to the add_value function, and
keep something there.  Or maybe that's pointless and just a new field in
BrinValues is okay.

> Of course, more would need to be done:
> 
> 1) We'd need to also pass the right memory context (bt_context seems
> like the right thing, but that's not something addValue sees now).

You could use GetMemoryChunkContext() for that.

> 2) We'd also need to specify some sort of callback that serializes the
> in-memory value into bt_values. That's not something addValue can do,
> because it doesn't know whether it's the last value in the range etc. I
> guess one option would be to add yet another support proc, but I guess a
> simple callback would be enough.

Hmm.

> I've hacked together an experimental version of this to see how much
> would it help, and it reduces the duration from ~4.6s to ~3.3s. Which is
> nice, but plain minmax is ~1.1s. I suppose there's room for further
> improvements in compare_combine_ranges/reduce_combine_ranges and so on,
> but I still think there'll always be a gap compared to plain minmax.

The main reason I'm talking about desupporting plain minmax is that,
even if it's amazingly fast, it loses quite quickly in real-world cases
because of loss of correlation.  Minmax's build time is pretty much
determined by speed at which you can seqscan the table.  I don't think
we lose much if we add overhead in order to create an index that is 100x
more useful.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2020-09-10 Thread Julien Rouhaud
On Thu, Sep 10, 2020 at 6:52 PM Peter Eisentraut
 wrote:
>
> On 2020-09-08 21:33, Christoph Berg wrote:
> > IOW, I think we should aim at simply tracking the version, and leave
> > it to the admin (with the help of supplied SQL queries) to either
> > rebuild indexes or waive them.
>
> It's certainly safer to track dependency for all indexes and then
> carefully create exceptions afterwards.

Do you mean storing the collation version even when those are not
relevant, and letting client code (or a backend command) deal with it?
This would require storing a dependency per index and column (or at
least whether it's a column or an expression, to avoid bloating the
dependencies too much), as it's otherwise impossible to know whether a
version mismatch can be safely ignored or not.
I'm also wondering how much more complexity it would add for people who
want to actively monitor such mismatches using SQL queries.




Probable documentation errors or improvements

2020-09-10 Thread Yaroslav
Disclaimer: I'm not a native speaker, so not sure those are actually
incorrect, and can't offer non-trivial suggestions.

General ones:
. "boolean" vs "Boolean" --- usage seems to be inconsistent, even in the
same context.

. Transaction isolation levels are sometimes called "transaction isolation
modes", is that correct?

. In https://www.postgresql.org/docs/current/tableam.html, links to source
code
  are also hyperlinks into git, like (from SGML source):

  For details, please refer to the https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/tableam.h;hb=HEAD;>
  src/include/access/tableam.h file.

  Perhaps, other similar links in documentation should also be made into
  hyperlinks?


Specific ones:

--
https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS

4.1.2.4 Dollar-quoted String Constants

While the standard syntax for specifying string constants is usually
convenient, it can be difficult to understand when the desired string
contains
many single quotes or backslashes, since each of those must be doubled.
  -- Not so for backslashes (perhaps, this sentence is from
 pre-standard_conforming_strings ages).

-  --

Notice that inside the dollar-quoted string, single quotes can be used
without
needing to be escaped. Indeed, no characters inside a dollar-quoted string
are
ever escaped: the string content is always written literally. Backslashes
are
not special, and neither are dollar signs, unless they are part of a
sequence
matching the opening tag.
  -- Backslashes, again. Though here it may be justified, not sure.

-  --

$function$
BEGIN
RETURN ($1 ~ $q$[\t\r\n\v\\]$q$);
END;
$function$
  -- While it's understandable what the example is trying to demonstrate,
 single-quoted string would work here, too (so, no actual advantage, in
 this case).
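
   For what it's worth, a sketch of the single-quoted equivalent alluded to
   above (the function name and signature are invented, and this assumes
   standard_conforming_strings = on):

CREATE FUNCTION contains_special_chars(text) RETURNS boolean AS '
BEGIN
    RETURN ($1 ~ ''[\t\r\n\v\\]'');
END;
' LANGUAGE plpgsql;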

-  --

With single-quote syntax, each backslash in the above example would have to
be
written as four backslashes, which would be reduced to two backslashes in
parsing the original string constant, and then to one when the inner string
constant is re-parsed during function execution.
  -- Nothing needs to be changed about backslashes, yet again.


-- https://www.postgresql.org/docs/current/ddl-basics.html
5.1. Table Basics

A table in a relational database is much like a table on paper: It consists
of
rows and columns.
  -- Why "I" in It after ":" is capitalized?

-  --
Some of the frequently used data types are integer for whole numbers,
numeric
for possibly fractional numbers, text for character strings, date for dates,
time for time-of-day values, and timestamp for values containing both date
and
time.
  -- Perhaps, add (or replace with) timestamptz for storing moments in time
(or
 something like that)?


--
https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-EXCLUSION

5.4.6. Exclusion Constraints

Exclusion constraints ensure that if any two rows are compared on the
specified
columns or expressions using the specified operators, at least one of these
operator comparisons will return false or null. The syntax is:

CREATE TABLE circles (
c circle,
EXCLUDE USING gist (c WITH &&)
);
  -- Not only is the definition hard to grasp, but the example doesn't clarify
     much, as it's not explained what it actually achieves (especially given
     that geometric data types and operators are described several chapters
     later).
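
     As an aside, a minimal illustration of what the constraint above achieves
     (the circle values are arbitrary examples):

INSERT INTO circles VALUES ('<(0,0),5>');    -- ok
INSERT INTO circles VALUES ('<(2,0),5>');    -- fails: overlaps (&&) the first circle, violating the constraint
INSERT INTO circles VALUES ('<(20,20),1>');  -- ok: no overlap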


https://www.postgresql.org/docs/current/ddl-alter.html#DDL-ALTER-REMOVING-A-CONSTRAINT

5.6.4. Removing a Constraint

(If you are dealing with a generated constraint name like $2, don't forget
that
you'll need to double-quote it to make it a valid identifier.)
  -- I don't think current releases generate names like that anymore?


-- https://www.postgresql.org/docs/current/ddl-rowsecurity.html

5.8. Row Security Policies
  -- A general note: interaction of row security with search_path is not
 documented at all, but it may be important in some cases, like using
 functions in row security policies.

-  --

We can then see that an administrator connecting over a network will not see
any records, due to the restrictive policy:

=> SELECT current_user;
 current_user
--
 admin
(1 row)

=> select inet_client_addr();
 inet_client_addr
--
 127.0.0.1
(1 row)

=> SELECT current_user;
 current_user
--
 admin
  -- "SELECT current_user;" twice (I guess first one should have been psql
-h).



Re: Improvements in Copy From

2020-09-10 Thread Surafel Temesgen
On Thu, Sep 10, 2020 at 1:17 PM vignesh C  wrote:

>
> >
> > We have a patch for column matching feature [1] that may need a header
> line to be further processed. Even without that I think it is preferable to
> process the header line for nothing than adding those checks to the loop,
> performance-wise.
>
> I had seen that patch, I feel that change to match the header if the
> header is specified can be addressed in this patch if that patch gets
> committed first or vice versa. We are doing a lot of processing for
> the data which we need not do anything. Shouldn't this be skipped if
> not required. Similar check is present in NextCopyFromRawFields also
> to skip header.
>

The existing check is unavoidable, but we are better off without the checks
added by the patch. For very large files the loop may iterate millions of
times, if not billions, and I am sure doing the check that many times will
incur more noticeable performance degradation than further processing a
single line.

regards

Surafel


Re: PG 13 release notes, first draft

2020-09-10 Thread Jonathan S. Katz
On 9/10/20 1:14 PM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> Attached is a proposal for the "major enhancements" section. I borrowed
>> from the press release[1] but tried to stay true to the release notes
>> format and listed out the enhancements that way.
> 
> Pushed with some very minor wording tweaks.

Thanks! The tweaks were minor enough that it took a few readthroughs to
catch them.

One thing that did not make it through was this:

- 2020-XX-XX, CURRENT AS OF 2020-08-09
+ 2020-09-24, CURRENT AS OF 2020-09-09

Is the plan to update that at a later date? Understandable if so, but
wanted to check.

Thanks,

Jonathan





Re: WIP: BRIN multi-range indexes

2020-09-10 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 10:26:00PM +0200, Tomas Vondra wrote:

On Wed, Sep 09, 2020 at 04:53:30PM -0300, Alvaro Herrera wrote:

On 2020-Sep-09, Tomas Vondra wrote:


There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.


Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.



Sure. I was just suggesting it's not something that'd make this very
close to the plain minmax opclass.


The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue, it just swaps
the Datum value and that's it.


Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:


Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.




Well, the patch already has Ranges (memory) and SerializedRanges (disk)
but it's not very clear to me where to stash the in-memory data and
where to make the conversion.



I've spent a bit of time experimenting with this. My idea was to allow
keeping an "expanded" version of the summary somewhere. As the addValue
function only receives BrinValues I guess one option would be to just
add bv_mem_values field. Or do you have a better idea?

Of course, more would need to be done:

1) We'd need to also pass the right memory context (bt_context seems
like the right thing, but that's not something addValue sees now).

2) We'd also need to specify some sort of callback that serializes the
in-memory value into bt_values. That's not something addValue can do,
because it doesn't know whether it's the last value in the range etc. I
guess one option would be to add yet another support proc, but I guess a
simple callback would be enough.

I've hacked together an experimental version of this to see how much
would it help, and it reduces the duration from ~4.6s to ~3.3s. Which is
nice, but plain minmax is ~1.1s. I suppose there's room for further
improvements in compare_combine_ranges/reduce_combine_ranges and so on,
but I still think there'll always be a gap compared to plain minmax.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGQUIT handling, redux

2020-09-10 Thread Robert Haas
On Thu, Sep 10, 2020 at 12:56 PM Tom Lane  wrote:
> Also, man that CHECK_FOR_INTERRUPTS() looks like trouble.
> Could we take that out?

Maybe I'm missing something, but why wouldn't that be a horrible idea?
We do not want to have long waits where we refuse to respond to
interrupts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fix for parallel BTree initialization bug

2020-09-10 Thread Jameson, Hunter 'James'
Answers inline below:

On 9/10/20, 4:58 AM, "Amit Kapila"  wrote:




On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
 wrote:
>
> Hi, I ran across a small (but annoying) bug in initializing parallel 
BTree scans, which causes the parallel-scan state machine to get confused. The 
fix is one line; the description is a bit longer—
>
>
>
> Before, function _bt_first() would exit immediately if the specified scan 
keys could never be satisfied--without notifying other parallel workers, if 
any, that the scan key was done.
>

The first question that comes to mind is how is it possible that for
one of the workers specified scan keys is not satisfied while for
others it is satisfied? I think it is possible when other workers are
still working on the previous scan key and this worker has moved to
the next scan key. If not, then what is the other case?

I think that's right. If I remember correctly, the first to move to the next 
IN-list condition exits early and *locally* moves on to the next-next IN-list 
condition, but doesn't properly advance the global scan key. At that point, "By 
allowing the first worker to move on to the next scan key, in this one case, 
without notifying other workers, the global key ends up < the first worker's 
local key." So the first worker now has a local scan key > the global scan key, 
because it didn't call _bt_parallel_done().

> This moved that particular worker to a scan key beyond what was in the 
shared parallel-query state, so that it would later try to read in 
"InvalidBlockNumber", without recognizing it as a special sentinel value.
>

Now, if it happens as I mentioned then the other workers should not
try to advance their scan because their local scan key will be lesser
than shared key. Basically, they should return from the below
condition:
_bt_parallel_seize()
{
..
if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
{
/* Parallel scan has already advanced to a new set of scankeys. */
status = false;
}
..
}

After this, those workers will also update their scan key and move
forward from there. So, I am not seeing how this could create a
problem.

I think, if I understand my notes on the bug, that the problem is with the 
first worker, not the other workers. So it doesn't matter if the other workers 
aren't confused, because the first worker confuses itself. The first worker has 
moved on, without telling anyone else, basically.

--
With Regards,
Amit Kapila.

Thanks,
James
--
James Hunter, Amazon Web Services (AWS)




Re: Online checksums verification in the backend

2020-09-10 Thread Julien Rouhaud
On Thu, Sep 10, 2020 at 09:47:23AM +0200, Julien Rouhaud wrote:
> On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote:
> > On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> > > 
> > > I'll do some benchmarking and see if I can get some figures, but it'll 
> > > probably
> > > take some time.  In the meantime I'm attaching v11 of the patch that 
> > > should
> > > address all other comments.
> > 
> > I just realized that I forgot to update one of the Makefile when moving the 
> > TAP
> > test folder.  v12 attached should fix that.
> 
> 
> And the cfbot just reported a new error for Windows build.  Attached v13 
> should
> fix that.


I did some benchmarking using the following environment:

- 16MB shared buffers
- 490MB table (10M rows)
- synchronized_seqscan to off
- table in OS cache

I don't have a big machine so I went with a very small shared_buffers and a
small table, to make sure that all data is in the OS cache but the table is more
than an order of magnitude bigger than shared_buffers, to simulate a plausible
environment.

I used a simple read-only query that performs a sequential scan of the table (a
simple SELECT * FROM table), run using 10 concurrent connections, 5 runs of 700
seconds.  I did that first without any other activity, then with the original
pg_check_relation function running under \watch .1, and finally with a modified
version of that function without the optimisation, also under \watch .1.

The TPS is obviously overall extremely bad, but I can see that the submitted
version added an overhead of ~3.9% (average of 5 runs), while the version
without the optimisation added an overhead of ~6.57%.

This is supposed to be a relatively fair benchmark as all the data are cached
on the OS side, so IO done while holding the bufmapping lock aren't too long,
but we can see that we already get a non negligible benefit from this
optimisation.  Should I do additional benchmarking, like dropping the OS cache
and/or adding some write activity?  This would probably only make the
unoptimized version perform even worse.




Re: Fix for parallel BTree initialization bug

2020-09-10 Thread Jameson, Hunter 'James'
Answers inline below; sorry for the formatting -- I'm still trying to get
corporate email to work nicely with this mailing list, thanks.

On 9/9/20, 9:22 PM, "Justin Pryzby"  wrote:


On Tue, Sep 08, 2020 at 06:25:03PM +, Jameson, Hunter 'James' wrote:
> Hi, I ran across a small (but annoying) bug in initializing parallel 
BTree scans, which causes the parallel-scan state machine to get confused. The 
fix is one line; the description is a bit longer—

What postgres version was this ?

We have observed this bug on PostgreSQL versions 11.x and 10.x. I don't believe 
it occurs in PostgreSQL versions 9.x, because 9.x does not have parallel BTree 
scan.

> Before, function _bt_first() would exit immediately if the specified scan 
keys could never be satisfied--without notifying other parallel workers, if 
any, that the scan key was done. This moved that particular worker to a scan 
key beyond what was in the shared parallel-query state, so that it would later 
try to read in "InvalidBlockNumber", without recognizing it as a special 
sentinel value.
>
> The basic bug is that the BTree parallel query state machine assumes that 
a worker process is working on a key <= the global key--a worker process can be 
behind (i.e., hasn't finished its work on a previous key), but never ahead. By 
allowing the first worker to move on to the next scan key, in this one case, 
without notifying other workers, the global key ends up < the first worker's 
local key.
>
> Symptoms of the bug are: on R/O, we get an error saying we can't extend 
the index relation, while on an R/W we just extend the index relation by 1 
block.

What's the exact error ?  Are you able to provide a backtrace ?

I am not able to provide a full backtrace, unfortunately, but the relevant part 
appears to be:

  ReadBuffer (... blockNum=blockNum@entry=4294967295)
 _bt_getbuf (... blkno=4294967295 ...)
 _bt_readnextpage (... blkno=4294967295 ... )
 _bt_steppage (...)
 _bt_next (...)
 btgettuple (...)
 index_getnext_tid (...)
 index_getnext (...)
 IndexNext (...) 

Notice that _bt_steppage() is passing InvalidBlockNumber to ReadBuffer(). That 
is the bug.

> To reproduce, you need a query that:
>
> 1. Executes parallel BTree index scan;
> 2. Has an IN-list of size > 1;

Do you mean you have an index on col1 and a query condition like: col1 IN 
(a,b,c...) ?

Something like that, yes,

> 3. Has an additional index filter that makes it impossible to satisfy the
> first IN-list condition.

.. AND col1::text||'foo' = '';
I think you mean that the "impossible" condition makes it so that a btree
worker exits early.

Specifically, on that worker, _bt_first() sees !so->qual_ok and just returns 
"false". That is the bug. The fix is that the worker must also call 
_bt_parallel_done(scan), as is done everywhere else in _bt_first() where it 
returns "false".

> (We encountered such a query, and therefore the bug, on a production 
instance.)

Could you send the "shape" of the query or its plan, obfuscated and 
redacted as
need be ?

Plan is something like:

Finalize GroupAggregate  ... (... loops=1)
   Group Key: (...)
   ->  Gather Merge  ... (... loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial GroupAggregate  ... (... loops=3)
   Group Key: (...)
   ->  Sort  ... (... loops=3)
 Sort Key: (...)
 Sort Method: quicksort  ...
 ->  Nested Loop ...  (... loops=3)
   ->  Parallel Index Scan using ... (... loops=3)
 Index Cond: (((f ->> 't') >= ... ) AND ((f ->> 
't') < ...) AND (((f -> 'c') ->> 't') = ANY (...)) AND (((f-> 'c') ->> 't') = 
...))
 Filter: (CASE WHEN ... END IS NOT NULL)
 Rows Removed by Filter: ...
   ->  Index Only Scan using ... (... rows=1 loops=...)
 Index Cond: (a = b)
 Heap Fetches: ...

--
Justin

James
--
James Hunter, Amazon Web Services (AWS)





Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-10 Thread Justin Pryzby
On Thu, Sep 10, 2020 at 11:21:02AM -0400, Robert Haas wrote:
> On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma  wrote:
> > Please have a look into the attached patch for the changes and let me know 
> > for any other concerns. Thank you.
> 
> I have committed this version.

Thanks; I marked it as such in the CF app.
https://commitfest.postgresql.org/29/2700/

-- 
Justin




Re: PG 13 release notes, first draft

2020-09-10 Thread Tom Lane
"Jonathan S. Katz"  writes:
> Attached is a proposal for the "major enhancements" section. I borrowed
> from the press release[1] but tried to stay true to the release notes
> format and listed out the enhancements that way.

Pushed with some very minor wording tweaks.

regards, tom lane




Re: PostgreSQL 13 Release Timeline

2020-09-10 Thread Jonathan S. Katz
Hi,

On 9/2/20 2:13 PM, Jonathan S. Katz wrote:

> * PostgreSQL 13 Release Candidate 1 (RC1) will be released on September
> 17, 2020.
> 
> * In absence of any critical issues, PostgreSQL 13 will become generally
> available on September 24, 2020.
> 
> The aim is to have all outstanding open items committed before RC1.
> Please ensure everything is committed by September 16, 2020 AoE. We will
> send out a reminder prior to that date.

As a friendly reminder, RC1 is one week away. I also realized I made an
error in the commit date above for RC1. Sorry :( Corrected date below.

Please ensure you have all of your commits in by **September 13, 2020
AoE** so we can wrap RC1 and get it out the door.

Presuming RC1 is successful, commits for GA must be in by **September
20, 2020 AoE**.

Please try to close out open items[1] before RC1.

Thanks!

Jonathan

[1] https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items






Re: SIGQUIT handling, redux

2020-09-10 Thread Tom Lane
Robert Haas  writes:
> On Wed, Sep 9, 2020 at 10:07 PM Tom Lane  wrote:
>> bgworker_die (SIGTERM)
>> Calls ereport(FATAL).  This is surely not any safer than, say,
>> quickdie().  No, it's worse, because at least that won't try
>> to go out via proc_exit().

> I think bgworker_die() is pretty much a terrible idea.

Yeah.  It'd likely be better to insist that bgworkers handle SIGTERM
the same way backends do, via CHECK_FOR_INTERRUPTS.  Not sure how big
a change that would be.

> I think that the only way this could actually
> be safe is if you have a background worker that never uses ereport()
> itself, so that the ereport() in the signal handler can't be
> interrupting one that's already happening.

That doesn't begin to make it safe, because quite aside from anything
that happens in elog.c, we're then going to go out via proc_exit()
which will invoke who knows what.

>> StandbyDeadLockHandler (from SIGALRM)
>> StandbyTimeoutHandler (ditto)
>> Calls CancelDBBackends, which just for starters tries to acquire
>> an LWLock.  I think the only reason we've gotten away with this
>> for this long is the high probability that by the time either
>> timeout fires, we're going to be blocked on a semaphore.

> Yeah, I'm not sure these are so bad. In fact, in the deadlock case, I
> believe the old coding was designed to make sure we *had to* be
> blocked on a semaphore, but I'm not sure whether that's still true.

Looking at this closer, the only code that could get interrupted
by these timeouts is a call to ProcWaitForSignal, which is

void
ProcWaitForSignal(uint32 wait_event_info)
{
(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
 wait_event_info);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}

Interrupting the latch operations might be safe enough,
although now I'm casting a side-eye at Munro's recent
changes to share WaitEvent state all over the place.
I wonder whether blocking on an LWLock inside the
signal handler will break an interrupted WaitLatch.

Also, man that CHECK_FOR_INTERRUPTS() looks like trouble.
Could we take that out?

regards, tom lane




Re: Collation versioning

2020-09-10 Thread Peter Eisentraut

On 2020-09-08 21:33, Christoph Berg wrote:

IOW, I think we should aim at simply tracking the version, and leave
it to the admin (with the help of supplied SQL queries) to either
rebuild indexes or waive them.


It's certainly safer to track dependency for all indexes and then 
carefully create exceptions afterwards.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: New statistics for tuning WAL buffer size

2020-09-10 Thread Fujii Masao




On 2020/09/09 13:57, Masahiro Ikeda wrote:

On 2020-09-07 16:19, Fujii Masao wrote:

On 2020/09/07 9:58, Masahiro Ikeda wrote:

Thanks for the review and advice!

On 2020-09-03 16:05, Fujii Masao wrote:

On 2020/09/02 18:56, Masahiro Ikeda wrote:

+/* --
+ * Backend types
+ * --

You seem to forget to add "*/" into the above comment.
This issue could cause the following compiler warning.
../../src/include/pgstat.h:761:1: warning: '/*' within block comment [-Wcomment]


Thanks for the comment. I fixed.


Thanks for the fix! But why are those comments necessary?


Sorry about that. This comment is not necessary.
I removed it.


The pg_stat_walwriter is not security restricted now, so ordinary users can 
access it.
It has the same security level as pg_stat_archiver. If you have any comments, 
please let me know.


+   dirty_writes bigint

I guess that the column name "dirty_writes" derived from
the DTrace probe name. Isn't this name confusing? We should
rename it to "wal_buffers_full" or something?


I agree and rename it to "wal_buffers_full".


+/* --
+ * PgStat_MsgWalWriter    Sent by the walwriter to update statistics.

This comment seems not accurate because backends also send it.

+/*
+ * WAL writes statistics counter is updated in XLogWrite function
+ */
+extern PgStat_MsgWalWriter WalWriterStats;

This comment seems not right because the counter is not updated in XLogWrite().


Right. I fixed it to "Sent by each backend and background workers to update WAL 
statistics."
In the future, other statistics will be included, so I removed the function's
name.



+-- There will surely and maximum one record
+select count(*) = 1 as ok from pg_stat_walwriter;

What about changing this comment to "There must be only one record"?


Thanks, I fixed.


+    WalWriterStats.m_xlog_dirty_writes++;
 LWLockRelease(WALWriteLock);

Since WalWriterStats.m_xlog_dirty_writes doesn't need to be protected
with WALWriteLock, isn't it better to increment that after releasing the lock?


Thanks, I fixed.


+CREATE VIEW pg_stat_walwriter AS
+    SELECT
+    pg_stat_get_xlog_dirty_writes() AS dirty_writes,
+    pg_stat_get_walwriter_stat_reset_time() AS stats_reset;
+
 CREATE VIEW pg_stat_progress_vacuum AS

In system_views.sql, the definition of pg_stat_walwriter should be
placed just after that of pg_stat_bgwriter not pg_stat_progress_analyze.


OK, I fixed it.


 }
-
 /*
  * We found an existing collector stats file. Read it and put all the

You seem to accidentally have removed the empty line here.


Sorry about that. I fixed it.


- errhint("Target must be \"archiver\" or \"bgwriter\".")));
+ errhint("Target must be \"archiver\" or \"bgwriter\" or
\"walwriter\".")));

There are two "or" in the message, but the former should be replaced with ","?


Thanks, I fixed.


On 2020-09-05 18:40, Magnus Hagander wrote:

On Fri, Sep 4, 2020 at 5:42 AM Fujii Masao
 wrote:


On 2020/09/04 11:50, tsunakawa.ta...@fujitsu.com wrote:

From: Fujii Masao 

I changed the view name from pg_stat_walwrites to pg_stat_walwriter.
I think it is better to match naming scheme with other views like
pg_stat_bgwriter, which is for bgwriter statistics but it has the
statistics related to backend.


I prefer the view name pg_stat_walwriter for the consistency with
other view names. But we also have pg_stat_wal_receiver. Which
makes me think that maybe pg_stat_wal_writer is better for
the consistency. Thought? IMO either of them works for me.
I'd like to hear more opinons about this.


I think pg_stat_bgwriter is now a misnomer, because it contains
the backends' activity.  Likewise, pg_stat_walwriter leads to
misunderstanding because its information is not limited to WAL
writer.

How about simply pg_stat_wal?  In the future, we may want to
include WAL reads in this view, e.g. reading undo logs in zheap.

Sounds reasonable.


+1.

pg_stat_bgwriter has had the "wrong name" for quite some time now --
it became even more apparent when the checkpointer was split out to
it's own process, and that's not exactly a recent change. And it had
allocs in it from day one...

I think naming it for what the data in it is ("wal") rather than which
process deals with it ("walwriter") is correct, unless the statistics
can be known to only *ever* affect one type of process. (And then
different processes can affect different columns in the view). As a
general rule -- and that's from what I can tell exactly what's being
proposed.


Thanks for your comments. I agree with your opinions.
I changed the view name to "pg_stat_wal".


I fixed the code to send the WAL statistics from not only backend and walwriter
but also checkpointer, walsender and autovacuum worker.


Good point! Thanks for updating the patch!


@@ -604,6 +604,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
  onerel->rd_rel->relisshared,
 

Re: Allow CURRENT_ROLE in GRANTED BY

2020-09-10 Thread Peter Eisentraut

On 2020-09-07 12:02, Asif Rehman wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

The patch applies cleanly and looks fine to me. However wouldn't it be better 
to just map the CURRENT_ROLE to CURRENT_USER in backend grammar?


Existing code treats them differently.  I think, given that the code is 
already written, it is good to preserve what the user wrote.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimising compactify_tuples()

2020-09-10 Thread David Rowley
On Fri, 11 Sep 2020 at 01:45, David Rowley  wrote:
> I've attached v4b (b is for backwards since the traditional backwards
> tuple order is maintained). v4b seems to be able to run my benchmark
> in 63 seconds. I did 10 runs today of yesterday's v3 patch and got an
> average of 72.8 seconds, so quite a big improvement from yesterday.

After reading the patch back again I realised there are a few more
things that can be done to make it a bit faster.

1. When doing the backup buffer, use code to skip over tuples that
don't need to be moved at the end of the page and only memcpy() tuples
earlier than that.
2. The position that's determined in #1 can be used to start the
memcpy() loop at the first tuple that needs to be moved.
3. In the memmove() code for the preorder check, we can do a similar
skip of the tuples at the end of the page that don't need to be moved.
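
As a toy illustration of points 1 and 2 above (a standalone model, not the
patch code): with items ordered by descending offset, count how many tuples at
the high end of the page are already flush against the page end and therefore
need no copying.

#include <stdio.h>

typedef struct
{
    unsigned    off;    /* byte offset of the tuple on the page */
    unsigned    len;    /* MAXALIGNed tuple length */
} ItemPos;

/* items[] is assumed sorted by descending offset (highest on the page first) */
static int
count_compacted_prefix(const ItemPos *items, int nitems, unsigned page_end)
{
    int         n = 0;

    for (int i = 0; i < nitems; i++)
    {
        if (items[i].off + items[i].len != page_end)
            break;              /* first tuple that will actually have to move */
        page_end = items[i].off;
        n++;
    }
    return n;
}

int
main(void)
{
    /* 8192 is the usual page size; only the first two items are still packed */
    ItemPos     items[] = {{8100, 92}, {8008, 92}, {7900, 92}, {7700, 92}};

    printf("%d tuples need no copy\n",
           count_compacted_prefix(items, 4, 8192));
    return 0;
}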

I also ditched the #ifdef'd out code as I'm pretty sure #1 and #2 are
a much better way of doing the backup buffer given how many tuples are
likely to be skipped due to maintaining the traditional tuple order.

That gets my benchmark down to 60.8 seconds, so 2.2 seconds better than v4b.

I've attached v6b and an updated chart showing the results of the 10
runs I did of it.

David


compactify_tuples_dgr_v6b.patch
Description: Binary data


Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-10 Thread Masahiko Sawada
On Tue, 8 Sep 2020 at 13:00, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > I intend to say that the global-visibility work can impact this in a
> > major way and we have analyzed that to some extent during a discussion
> > on the other thread. So, I think without having a complete
> > design/solution that addresses both the 2PC and global-visibility, it
> > is not apparent what is the right way to proceed. It seems to me that
> > rather than working on individual (or smaller) parts one needs to come
> > up with a bigger picture (or overall design) and then once we have
> > figured that out correctly, it would be easier to decide which parts
> > can go first.
>
> I'm really sorry I've been getting later and later in publishing the 
> revised scale-out design wiki to discuss the big picture!  I don't know why 
> I'm taking such a long time; I feel as if I were captive in a time prison (yes, 
> nobody is holding me captive; I'm just late.)  Please wait a few days.
>
> But to proceed with the development, let me comment on the atomic commit and 
> global visibility.
>
> * We have to hear from Andrey about their check on the possibility that 
> Clock-SI could be Microsoft's patent and if we can avoid it.
>
> * I have a feeling that we can adopt the algorithm used by Spanner, 
> CockroachDB, and YugabyteDB.  That is, 2PC for multi-node atomic commit, 
> Paxos or Raft for replica synchronization (in the process of commit) to make 
> 2PC more highly available, and the timestamp-based global visibility.  
> However, the timestamp-based approach makes the database instance shut down 
> when the node's clock is distant from the other nodes.
>
> * Or, maybe we can use the following Commitment ordering that doesn't require 
> the timestamp or any other information to be transferred among the cluster 
> nodes.  However, this seems to have to track the order of read and write 
> operations among concurrent transactions to ensure the correct commit order, 
> so I'm not sure about the performance.  The MVCO paper seems to present the 
> information we need, but I haven't understood it well yet (it's difficult.)  
> Could you anybody kindly interpret this?
>
> Commitment ordering (CO) - yoavraz2
> https://sites.google.com/site/yoavraz2/the_principle_of_co
>
>
> As for the Sawada-san's 2PC patch, which I find interesting purely as FDW 
> enhancement, I raised the following issues to be addressed:
>
> 1. Make FDW API implementable by other FDWs than postgres_fdw (this is what 
> Amit-san kindly pointed out.)  I think oracle_fdw and jdbc_fdw would be good 
> examples to consider, while MySQL may not be good because it exposes the XA 
> feature as SQL statements, not C functions as defined in the XA specification.

I agree that we need to verify new FDW APIs will be suitable for other
FDWs than postgres_fdw as well.

>
> 2. 2PC processing is queued and serialized in one background worker.  That 
> severely subdues transaction throughput.  Each backend should perform 2PC.

I'm not sure it's safe for each backend to perform PREPARE and COMMIT
PREPARED, since the current design aims to avoid an inconsistency
between the actual transaction result and the result the user sees.
But in the future, I think we can have multiple background workers per
database for better performance.

>
> 3. postgres_fdw cannot detect remote updates when the UDF executed on a 
> remote node updates data.

I assume that you mean pushing the UDF down to a foreign server.
If so, I think we can do this by improving postgres_fdw. In the
current patch, registering and unregistering a foreign server with a
2PC group and marking a foreign server as updated are the FDW's
responsibility. So perhaps if we had a way to tell postgres_fdw that the
UDF might update the data on the foreign server, postgres_fdw could
mark the foreign server as updated if the UDF is shippable.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-10 Thread Robert Haas
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma  wrote:
> Please have a look into the attached patch for the changes and let me know 
> for any other concerns. Thank you.

I have committed this version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: SIGQUIT handling, redux

2020-09-10 Thread Robert Haas
On Wed, Sep 9, 2020 at 10:07 PM Tom Lane  wrote:
> bgworker_die (SIGTERM)
>
> Calls ereport(FATAL).  This is surely not any safer than, say,
> quickdie().  No, it's worse, because at least that won't try
> to go out via proc_exit().

I think bgworker_die() is pretty much a terrible idea. Every
background worker I've written has actually needed to use
CHECK_FOR_INTERRUPTS(). I think that the only way this could actually
be safe is if you have a background worker that never uses ereport()
itself, so that the ereport() in the signal handler can't be
interrupting one that's already happening. This seems unlikely to be
the normal case, or anything close to it. Most background workers
probably are shared-memory connected and use a lot of PostgreSQL
infrastructure and thus ereport() all over the place.

Now what to do about it I don't know exactly, but it would be nice to
do something.

> StandbyDeadLockHandler (from SIGALRM)
> StandbyTimeoutHandler (ditto)
>
> Calls CancelDBBackends, which just for starters tries to acquire
> an LWLock.  I think the only reason we've gotten away with this
> for this long is the high probability that by the time either
> timeout fires, we're going to be blocked on a semaphore.

Yeah, I'm not sure these are so bad. In fact, in the deadlock case, I
believe the old coding was designed to make sure we *had to* be
blocked on a semaphore, but I'm not sure whether that's still true.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BUG #15858: could not stat file - over 4GB

2020-09-10 Thread Greg Steiner
I assigned myself as a reviewer for this patch, as I hit this bug today and had 
to perform a workaround.  I have never reviewed a patch before but will try to 
update within the next 5 days.  I intend to perform an "Implements Feature" review.

Re: Global snapshots

2020-09-10 Thread Alexey Kondratov

On 2020-09-09 20:29, Fujii Masao wrote:

On 2020/09/09 2:00, Alexey Kondratov wrote:


According to the Sawada-san's v25 0002 the logic is pretty much the 
same there:


+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / 
FdwXactParticipantEndTransaction happens after 
ProcArrayEndTransaction() in the CommitTransaction(). Thus, I don't 
see many difference between these approach and CallXactCallbacks() 
usage regarding this point.


IIUC the commit logic in Sawada-san's patch looks like

1. PreCommit_FdwXact()
PREPARE TRANSACTION command is issued

2. RecordTransactionCommit()
2-1. WAL-log the commit record
2-2. Update CLOG
2-3. Wait for sync rep
2-4. FdwXactWaitForResolution()
Wait until COMMIT PREPARED commands are issued to the
remote servers and completed.

3. ProcArrayEndTransaction()
4. AtEOXact_FdwXact(true)

So ISTM that the timing of when COMMIT PREPARED is issued
to the remote server is different between the patches.
Am I missing something?



No, you are right, sorry. At first glance I thought that 
AtEOXact_FdwXact is responsible for COMMIT PREPARED as well, but it is 
only calling FdwXactParticipantEndTransaction in the abort case.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [PATCH] Missing links between system catalog documentation pages

2020-09-10 Thread Peter Eisentraut

On 2020-09-03 17:40, Dagfinn Ilmari Mannsåker wrote:

I just noticed that both this and the command link patch modified the
same sentence about CREATE DATABASE and pg_database, so those changes
seem to have been lost in the merge.  Attached is a follow-up patch that
adds them both.


I think in those cases I would leave off the link.  The mentions there 
are just examples of the relationship between a catalog and a command. 
It doesn't mean you are meant to look up the specific catalog and command.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-09-10 Thread Peter Eisentraut

On 2020-09-06 02:24, David Rowley wrote:

I would add DEBUG1 back into the conditional, like

if (__builtin_constant_p(elevel) && ((elevel) >= ERROR || (elevel) <=
DEBUG1) ? \


hmm, but surely we don't want to move all code that's in the same
branch as an elog(DEBUG1) call into a cold area.


Yeah, nevermind that.


The v3 patch just put an unlikely() around the errstart() call if the
level was <= DEBUG1.  That just to move the code that's inside the if
(errstart(...)) in the macro into a cold area.


That could be useful.  Depends on how much effect it has.
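
As a standalone illustration of the branch-hinting trick being discussed here
(assuming GCC/Clang builtins; this is not the actual elog()/ereport() macro):

#include <stdio.h>
#include <stdlib.h>

#define unlikely(x) __builtin_expect((x) != 0, 0)

/* error path marked cold so the compiler lays it out away from the hot code */
static void __attribute__((cold, noinline))
report_error(const char *msg)
{
    fprintf(stderr, "ERROR: %s\n", msg);
    exit(1);
}

static int
parse_positive(int v)
{
    if (unlikely(v <= 0))       /* hinted-cold branch, kept out of the hot path */
        report_error("value must be positive");
    return v * 2;
}

int
main(void)
{
    printf("%d\n", parse_positive(21));
    return 0;
}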

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Missing "Up" navigation link between parts and doc root?

2020-09-10 Thread Peter Eisentraut

On 2020-09-08 21:10, Bruce Momjian wrote:

On Sun, Sep  6, 2020 at 04:59:11PM +0200, Peter Eisentraut wrote:

I have made the analogous changes to the footer as well and committed this.


I see this only applied to master.  Shouldn't this be backpatched?


I wasn't planning to.  It's not a bug fix.

Other thoughts?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration

2020-09-10 Thread Peter Eisentraut

On 2020-09-10 09:59, Michael Paquier wrote:

I notice that the error checking you introduce is different from the checks
for pgbench -t and -T (the latter having no errno checks).  I'm not sure
which is correct, but it's perhaps worth making them the same.

pgbench currently uses atoi() to parse the options of -t and -T.  Are
you suggesting to switch that to strtoXX() as well or perhaps you are
referring to the parsing of the weight in parseScriptWeight()?  FWIW,
the error handling introduced in this patch is similar to what we do
for example in pg_resetwal.  This has its own problems as strtoul()
would not report ERANGE except for values higher than ULONG_MAX, but
the returned results are stored in 32 bits.  We could switch to just
use uint64 where we could of course, but is that really worth it for
such tools?  For example, pg_test_timing could overflow the
total_timing calculated if using a too high value, but nobody would
use such values anyway.  So I'd rather just use uint32 and call it a
day, for simplicity's sake mainly..


The first patch you proposed checks for errno == ERANGE, but pgbench 
code doesn't do that.  So one of them is not correct.
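
For reference, a minimal sketch of the strtoul()-plus-ERANGE pattern being
discussed (illustrative only; this is not the code from pg_test_fsync,
pg_test_timing, or pgbench):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* parse a positive 32-bit duration, rejecting junk and out-of-range values */
static int
parse_secs(const char *arg, uint32_t *result)
{
    char       *endptr;
    unsigned long val;

    errno = 0;
    val = strtoul(arg, &endptr, 10);
    if (errno == ERANGE || val > UINT32_MAX)
        return -1;              /* out of range */
    if (endptr == arg || *endptr != '\0')
        return -1;              /* not a number, or trailing junk */
    if (val == 0)
        return -1;              /* must be positive */
    *result = (uint32_t) val;
    return 0;
}

int
main(int argc, char **argv)
{
    uint32_t    secs;

    if (argc < 2 || parse_secs(argv[1], &secs) != 0)
    {
        fprintf(stderr, "invalid duration\n");
        return 1;
    }
    printf("running for %u seconds\n", secs);
    return 0;
}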


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] Add missing libraries to Libs.private of libpq.pc

2020-09-10 Thread Peter Eisentraut

On 2020-09-04 22:07, Peter Eisentraut wrote:

On 2020-07-10 21:47, Peter Eisentraut wrote:

On 2020-04-08 11:38, Sandro Mani wrote:

The following patch, which we added to build mingw-postgresql on Fedora,
adds some missing libraries to Libs.private of libpq.pc, discovered when
attempting to statically link with libpq:

-lz: is required by -lcrypto


I think the correct fix for that would be to add libssl to libpq's
Requires.private.


For that, I propose the attached patch.


committed

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Boundary value check in lazy_tid_reaped()

2020-09-10 Thread Masahiko Sawada
On Tue, 1 Sep 2020 at 08:01, Peter Geoghegan  wrote:
>
> On Mon, Aug 31, 2020 at 1:56 PM Peter Geoghegan  wrote:
> > I wonder if Roaring bitmaps would work well for this:
> >
> > https://arxiv.org/abs/1709.07821
>
> Alternatively, perhaps we could use a negative cache of heap blocks
> that have no tuples to kill at all. Maybe just a simple array whose
> elements are BlockNumber pairs. Each pair represents a range of blocks
> known to contain heap pages that *don't* have any TIDs for VACUUM to
> kill. The array could be size limited to 8KB, allowing it to be
> accessed in L1 cache throughout vacuuming. When the representation
> cannot fit in 8KB, maybe we disable the negative cache for the rest of
> the VACUUM operation.
>
> This is a bit like Masahiko's min_blkno + max_blkno idea, except: 1).
> It's a negative cache, and 2.) There are perhaps as many as 1024
> min/max pairs -- not just 1.

Interesting idea. Maybe we need one more bsearch() for the min/max
pairs array when a TID should be killed?

>
> It's pretty clear that more than 90% of all calls to lazy_tid_reaped()
> return false unless vacuum is running behind (false means "don't kill
> this TID, it's alive"). But it could be much more than 90%. This may
> be because autovacuum is configured to run more aggressively than the
> default settings allow. But it may also happen when LP_DEAD killing in
> indexes works well enough to do most of the cleanup needed in
> practice. I bet pgbench finds that ~99% of all calls to
> lazy_tid_reaped() that take place during index vacuuming find that the
> TID doesn't need to be killed. So negative caching could really help.

I agree that lazy_tid_reaped() returns false in most checks in the
case autovacuum is not running behind. But I'm concerned a bit that it
instead costs the case where vacuum needs to kill many TIDs and uses
the min/max filter because it needs to call bsearch() twice. I think
that this is the case where autovacuum is running behind and the user
wants to complete the vacuum as soon as possible. Since I expect that
checking the filtering using min/max pairs array should be done in a
very short time it might not be a problem but I think it’s worth
benchmarking the overhead in the worst case. Or I guess we can use a
bloom filter for this purpose instead.

Also, I'm concerned that 1024 min/max pairs might not be enough in
practice, especially for very large tables.
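
For concreteness, here is a toy sketch of such a negative-cache lookup (the
structure and names are assumptions, purely for illustration; not the actual
vacuum code):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int BlockNumber;

typedef struct
{
    BlockNumber min_blk;
    BlockNumber max_blk;
} NoDeadRange;

/* compare a block number against a [min,max] range for bsearch() */
static int
range_cmp(const void *key, const void *elem)
{
    BlockNumber blk = *(const BlockNumber *) key;
    const NoDeadRange *range = (const NoDeadRange *) elem;

    if (blk < range->min_blk)
        return -1;
    if (blk > range->max_blk)
        return 1;
    return 0;
}

/* true if 'blk' falls in a range known to contain no dead tuples */
static bool
block_has_no_dead_tuples(BlockNumber blk, const NoDeadRange *cache, size_t n)
{
    return bsearch(&blk, cache, n, sizeof(NoDeadRange), range_cmp) != NULL;
}

int
main(void)
{
    /* ranges must be sorted and non-overlapping */
    NoDeadRange cache[] = {{0, 99}, {200, 499}, {1000, 4095}};

    printf("%d\n", block_has_no_dead_tuples(250, cache, 3));   /* 1: skip */
    printf("%d\n", block_has_no_dead_tuples(150, cache, 3));   /* 0: check TIDs */
    return 0;
}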

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimising compactify_tuples()

2020-09-10 Thread David Rowley
On Thu, 10 Sep 2020 at 10:40, Thomas Munro  wrote:
>
> I wonder if we could also identify a range at the high end that is
> already correctly sorted and maximally compacted so it doesn't even
> need to be copied out.

I've experimented quite a bit with this patch today. I think I've
tested every idea you've mentioned here, so there's quite a lot of
information to share.

I did write code to skip the copy to the separate buffer for tuples
that are already in the correct place and with a version of the patch
which keeps tuples in their traditional insert order (later lineitem's
tuple located earlier in the page) I see generally a very large
number of tuples being skipped with this method.  See attached
v4b_skipped_tuples.png. The vertical axis is the number of
compactify_tuple() calls during the benchmark where we were able to
skip that number of tuples. The average skipped tuples overall calls
during recovery was 81 tuples, so we get to skip about half the tuples
in the page doing this on this benchmark.

> > With the attached v3, performance is better. The test now runs
> > recovery 65.6 seconds, vs master's 148.5 seconds. So about 2.2x
> > faster.
>
> On my machine I'm seeing 57s, down from 86s on unpatched master, for
> the simple pgbench workload from
> https://github.com/macdice/redo-bench/.  That's not quite what you're
> reporting but it still blows the doors off the faster sorting patch,
> which does it in 74s.

Thanks for running the numbers on that.  I might be seeing a bit more
gain as I dropped the fillfactor down to 85. That seems to cause more
calls to compactify_tuples().

> So one question is whether we want to do the order-reversing as part
> of this patch, or wait for a more joined-up project to make lots of
> code paths collude on making scan order match memory order
> (corellation = 1).  Most or all of the gain from your patch would
> presumably still apply if did the exact opposite and forced offset
> order to match reverse-item ID order (correlation = -1), which also
> happens to be the initial state when you insert tuples today; you'd
> still tend towards a state that allows nice big memmov/memcpy calls
> during page compaction.  IIUC currently we start with correlation -1
> and then tend towards correlation = 0 after many random updates
> because we can't change the order, so it gets scrambled over time.
> I'm not sure what I think about that.

So I did lots of benchmarking with both methods and my conclusion is
that I think we should stick to the traditional INSERT order with this
patch. But we should come back and revisit that more generally one
day. The main reason that I'm put off flipping the tuple order is that
it significantly reduces the number of times we hit the preordered
case.  We go to all the trouble of reversing the order only to have it
broken again when we add 1 more tuple to the page.  If we keep this
the traditional way, then it's much more likely that we'll maintain
that pre-order and hit the more optimal memmove code path.

To put that into numbers, using my benchmark, I see 13.25% of calls to
compactify_tuples() hitting the preordered case when the tuple order is
reversed (i.e. earlier lineitems earlier in the page).  However, if I keep
the lineitems in
their proper order where earlier lineitems are at the end of the page
then I hit the preordered case 60.37% of the time. Hitting the
presorted case that much more often really speeds things up even further.

I've attached some benchmark results as benchmark_table.txt, and
benchmark_chart.png.

The v4 patch implements your copy skipping idea for the prefix of
tuples which are already in the correct location. v4b is that patch
but changed to keep the tuples in the traditional order.  v5 was me
experimenting further by adding a precalculated end of tuple Offset to
save having to calculate it each time by adding itemoff and alignedlen
together.  It's not an improvement, but I just wanted to mention
that I tried it.

If you look the benchmark results, you'll see that v4b is the winner.
The v4b + NOTUSED is me changing the #ifdef NOTUSED part so that we
use the smarter code to populate the backup buffer.  Remember that I
got 60.37% of calls hitting the preordered case in v4b, so less than
40% had to do the backup buffer.  So the slowness of that code is more
prominent when you compare v5 to v5 NOTUSED since the benchmark is
hitting the non-preordered code 86.75% of the time with that version.


> PS You might as well post future patches with .patch endings so that
> the cfbot tests them; it seems pretty clear now that a patch to
> optimise sorting (as useful as it may be for future work) can't beat a
> patch to skip it completely.  I took the liberty of switching the
> author and review names in the commitfest entry to reflect this.

Thank you.

I've attached v4b (b is for backwards since the traditional backwards
tuple order is maintained). v4b seems to be able to run my benchmark
in 63 seconds. I did 10 runs today of 

Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-10 Thread Alvaro Herrera
On 2020-Sep-10, Amit Kapila wrote:

> On Thu, Sep 10, 2020 at 1:03 PM Magnus Hagander  wrote:

> The comments already say what you said in the second suggestion:"The
> caller must rely on timestamp stored in *ts iff the function returns
> true.". Read iff "as if and only if"

I think "must" should be "may" there, if we're nitpicking.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Logical Replication - detail message with names of missing columns

2020-09-10 Thread Bharath Rupireddy
Added this to the commitfest - https://commitfest.postgresql.org/30/2727/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Yet another fast GiST build

2020-09-10 Thread Heikki Linnakangas

On 09/09/2020 19:50, Andrey M. Borodin wrote:

On 9 Sept 2020, at 20:39, Heikki Linnakangas wrote:

On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:

On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:

Come to think of it, the point z-order comparator could benefit a lot
from key abbreviation, too. You could do the point -> zorder conversion
in the abbreviation routine.

That's how it works in PostGIS, only that we moved to more
effecient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Thanks, that's interesting.

I implemented the abbreviated keys for the point opclass, too, and noticed that 
the patch as it was never used it. I reworked the patch so that 
tuplesort_begin_index_gist() is responsible for looking up the sortsupport 
function, like tuplesort_begin_index_btree() does, and uses abbreviation when 
possible.

Wow, abbreviated sort made GiST build for points even 1.5x faster!
BTW, there is a small typo in the arg names in gist_bbox_zorder_cmp_abbrev();
z1,z2 -> a,b


One more patch version attached. I fixed some memory leaks, and fixed 
the abbreviation on 32-bit systems, and a bunch of small comment changes 
and such.


- Heikki
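
For readers unfamiliar with Z-order keys, here is a standalone illustration of
the bit interleaving that the commit message below describes (a generic
textbook version, not the function from the patch):

#include <stdint.h>
#include <stdio.h>

/* spread the bits of x so that they occupy the even bit positions of a uint64 */
static uint64_t
part_bits(uint32_t x)
{
    uint64_t    v = x;

    v = (v | (v << 16)) & 0x0000FFFF0000FFFFULL;
    v = (v | (v << 8)) & 0x00FF00FF00FF00FFULL;
    v = (v | (v << 4)) & 0x0F0F0F0F0F0F0F0FULL;
    v = (v | (v << 2)) & 0x3333333333333333ULL;
    v = (v | (v << 1)) & 0x5555555555555555ULL;
    return v;
}

/* interleave the bits of x and y: x in even positions, y in odd positions */
static uint64_t
morton_code(uint32_t x, uint32_t y)
{
    return part_bits(x) | (part_bits(y) << 1);
}

int
main(void)
{
    /* x = 0b011, y = 0b101 interleave to 0b100111 = 39 */
    printf("%llu\n", (unsigned long long) morton_code(3, 5));
    return 0;
}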
From 79800152b6305e93129293452002cd082daadff4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 10 Sep 2020 15:37:08 +0300
Subject: [PATCH v18 1/1] Add support for building GiST index by sorting.

This adds a new optional support function to the GiST access method:
sortsupport. If it is defined, the GiST index is built by sorting all data
to the order defined by the sortsupport's comparator function, and packing
the tuples in that order to GiST pages. This is similar to how B-tree
index build works, and is much faster than inserting the tuples one by one.
The resulting index is smaller too, because the pages are packed more
tightly, upto 'fillfactor'. The normal build method works by splitting
pages, which tends to lead to more wasted space.

The quality of the resulting index depends on how good the opclass-defined
sort order is. A good order preserves locality of the input data.

As the first user of this facility, add 'sortsupport' function to the
point_ops opclass. It sorts the points in Z-order (aka Morton Code), by
interleaving the bits of the X and Y coordinates.

Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
---
 doc/src/sgml/gist.sgml  |  70 
 src/backend/access/gist/gistbuild.c | 506 
 src/backend/access/gist/gistproc.c  | 229 +++
 src/backend/access/gist/gistutil.c  |  53 ++-
 src/backend/access/gist/gistvalidate.c  |   6 +-
 src/backend/access/transam/xloginsert.c |  57 +++
 src/backend/utils/sort/sortsupport.c|  34 ++
 src/backend/utils/sort/tuplesort.c  |  57 +++
 src/include/access/gist.h   |   3 +-
 src/include/access/gist_private.h   |   3 +
 src/include/access/xloginsert.h |   2 +
 src/include/catalog/catversion.h|   1 +
 src/include/catalog/pg_amproc.dat   |   2 +
 src/include/catalog/pg_proc.dat |   3 +
 src/include/utils/sortsupport.h |   1 +
 src/include/utils/tuplesort.h   |   4 +
 16 files changed, 928 insertions(+), 103 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..7c72a547409 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to
+   speed up building a GiST index.
  
 
  
@@ -1065,6 +1067,74 @@ my_compress(PG_FUNCTION_ARGS)
   
  
 
+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator function preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+   The SQL declaration of the function must look like
+   this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+   The argument is a pointer to a SortSupport
+   struct. At a minimum, the function must fill in its comparator field.
+   The comparator takes three arguments: two Datums to compare, and
+   a pointer to the 

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-09-10 Thread Daniel Gustafsson
> On 12 Jul 2020, at 07:48, Julien Rouhaud  wrote:

> Currently, getTableAttrs() always retrieves info about columns defaults and
> check constraints, while this will never be used if --data-only option if 
> used.
> This seems like a waste of resources, so here's a patch to skip those parts
> when the DDL won't be generated.

Given how unintrusive this optimization is, +1 from me to go ahead with this
patch.  pg_dump tests pass.  Personally I would've updated the nearby comments
to reflect why the check for dataOnly is there, but MMV there.  I'm moving this
patch to Ready for Committer.

I'm fairly sure there is a lot more we can do to improve the performance of
data-only dumps, but this nicely chips away at the problem.

cheers ./daniel



Re: Get memory contexts of an arbitrary backend process

2020-09-10 Thread Kasahara Tatsuhito
Hi,

On Thu, Sep 10, 2020 at 8:53 PM torikoshia  wrote:
>
> On 2020-09-04 21:46, Tomas Vondra wrote:
> > On Fri, Sep 04, 2020 at 11:47:30AM +0900, Kasahara Tatsuhito wrote:
> >> On Fri, Sep 4, 2020 at 2:40 AM Tom Lane  wrote:
> >>> Kasahara Tatsuhito  writes:
> >>> > Yes, but it's not only for future expansion, but also for the
> >>> > usability and the stability of this feature.
> >>> > For example, if you want to read one dumped file multiple times and 
> >>> > analyze it,
> >>> > you will want the ability to just read the dump.
> >>>
> >>> If we design it to make that possible, how are we going to prevent
> >>> disk
> >>> space leaks from never-cleaned-up dump files?
> >> In my thought, with features such as a view that allows us to see a
> >> list of dumped files,
> >> it would be better to have a function that simply deletes the dump
> >> files associated with a specific PID,
> >> or to delete all dump files.
> >> Some files may be dumped with unexpected delays, so I think the
> >> cleaning feature will be necessary.
> >> ( Also, as the pgsql_tmp file, it might better to delete dump files
> >> when PostgreSQL start.)
> >>
> >> Or should we try to delete the dump file as soon as we can read it?
> >>
> >
> > IMO making the cleanup a responsibility of the users (e.g. by exposing
> > the list of dumped files through a view and expecting users to delete
> > them in some way) is rather fragile.
> >
> > I don't quite see what's the point of designing it this way. It was
> > suggested this improves stability and usability of this feature, but
> > surely making it unnecessarily complex contradicts both points?
> >
> > IMHO if the user needs to process the dump repeatedly, what's
> > preventing
> > him/her from storing it in a file, or something like that? At that
> > point
> > it's clear it's up to them to remove the file. So I suggest to keep the
> > feature as simple as possible - hand the dump over and delete.
Yeah, it might be better  to avoid making the user responsible for removal.

I think it's fine to have an interface to delete in an emergency, but
I agree that
users shouldn't be made aware of the existence or deletion of dump
files, basically.

> +1.
> If there are no other objections, I'm going to accept this
> suggestion.
So +1

Best regards,

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: Proposal of new PostgreSQL Extension - PGSpiderExt

2020-09-10 Thread Daniel Gustafsson
> On 10 Sep 2020, at 07:54, Taiga KATAYAMA  wrote:

> We hope to be incorporated this extension into PostgreSQL as one of
> contrib module, and would like to try to propose to Commit Fest.
> Could you kindly advise me and share your opinion?

FWIW I would like to see fewer modules in contrib rather than gaining more, and
(I can't stress this enough) that's not in any way a comment on the quality or
the usefulness of the extension in question here.

The outside-of-core ecosystem for postgres extensions has been discussed in
various threads lately, and improving that seems a more sustainable solution
than bundling more.  That's no doubt a lot easier said than done, but I think
it's extremely important for the community.

cheers ./daniel



Re: Windows regress fails (latest HEAD)

2020-09-10 Thread Ranier Vilela
On Fri, Jun 26, 2020 at 08:21, Ranier Vilela
wrote:

> On Thu, Jun 11, 2020 at 10:28, Ranier Vilela
> wrote:
>
>> On Thu, Jun 11, 2020 at 10:01, Andrew Dunstan <
>> andrew.duns...@2ndquadrant.com> wrote:
>>
>>>
>>> On 6/11/20 8:52 AM, Ranier Vilela wrote:
>>> > Hi,
>>> > Latest HEAD, fails with windows regress tests.
>>> >
>>> >  float8   ... FAILED  517 ms
>>> >  partition_prune  ... FAILED 3085 ms
>>> >
>>> >
>>>
>>> The first thing you should do when you find this is to see if there is a
>>> buildfarm report of the failure. If there isn't then try to work out why.
>>>
>> Sorry, I will have to research the buildfarm, I have no reference to it.
>>
>>
>>>
>>> Also, when making a report like this, it is essential to let us know
>>> things like:
>>>
>>>   * which commit is causing the failure (git bisect is good for finding
>>> this)
>>>
>> Thanks for hit (git bisect).
>>
>>
>>>   * what Windows version you're testing on
>>>
>> Windows 10 (2004)
>>
>>   * which compiler you're using
>>>
>> msvc 2019 (64 bits)
>>
>
> Just for the record, in case anyone else is using msvc 2019:
> I'm using the latest msvc 2019 64 bits (16.6.0).
> Probably this is a compiler optimization bug.
> vcregress check with a DEBUG build passes all 200 tests.
>
With the current HEAD, the float8 regression failure in release mode (msvc 2019
64 bits) is gone.
Maybe it's this commit:
https://github.com/postgres/postgres/commit/0aa8f764088ea0f36620ae2955fa6c54ec736c46

But (partition_prune) persists.
partition_prune  ... FAILED

regards,
Ranier Vilela


Re: Fix for parallel BTree initialization bug

2020-09-10 Thread Amit Kapila
On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
 wrote:
>
> Hi, I ran across a small (but annoying) bug in initializing parallel BTree 
> scans, which causes the parallel-scan state machine to get confused. The fix 
> is one line; the description is a bit longer—
>
>
>
> Before, function _bt_first() would exit immediately if the specified scan 
> keys could never be satisfied--without notifying other parallel workers, if 
> any, that the scan key was done.
>

The first question that comes to mind is how is it possible that for
one of the workers specified scan keys is not satisfied while for
others it is satisfied? I think it is possible when other workers are
still working on the previous scan key and this worker has moved to
the next scan key. If not, then what is the other case?

> This moved that particular worker to a scan key beyond what was in the shared 
> parallel-query state, so that it would later try to read in 
> "InvalidBlockNumber", without recognizing it as a special sentinel value.
>

Now, if it happens as I mentioned then the other workers should not
try to advance their scan because their local scan key will be lesser
than shared key. Basically, they should return from the below
condition:
_bt_parallel_seize()
{
..
if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
{
/* Parallel scan has already advanced to a new set of scankeys. */
status = false;
}
..
}

After this, those workers will also update their scan key and move
forward from there. So, I am not seeing how this could create a
problem.

-- 
With Regards,
Amit Kapila.




Re: Get memory contexts of an arbitrary backend process

2020-09-10 Thread torikoshia

On 2020-09-04 21:46, Tomas Vondra wrote:

On Fri, Sep 04, 2020 at 11:47:30AM +0900, Kasahara Tatsuhito wrote:

On Fri, Sep 4, 2020 at 2:40 AM Tom Lane  wrote:

Kasahara Tatsuhito  writes:
> Yes, but it's not only for future expansion, but also for the
> usability and the stability of this feature.
> For example, if you want to read one dumped file multiple times and analyze it,
> you will want the ability to just read the dump.

If we design it to make that possible, how are we going to prevent disk
space leaks from never-cleaned-up dump files?

In my thought, with features such as a view that allows us to see a
list of dumped files,
it would be better to have a function that simply deletes the dump
files associated with a specific PID,
or to delete all dump files.
Some files may be dumped with unexpected delays, so I think the
cleaning feature will be necessary.
( Also, as the pgsql_tmp file, it might better to delete dump files
when PostgreSQL start.)

Or should we try to delete the dump file as soon as we can read it?



IMO making the cleanup a responsibility of the users (e.g. by exposing
the list of dumped files through a view and expecting users to delete
them in some way) is rather fragile.

I don't quite see what's the point of designing it this way. It was
suggested this improves stability and usability of this feature, but
surely making it unnecessarily complex contradicts both points?

IMHO if the user needs to process the dump repeatedly, what's preventing
him/her from storing it in a file, or something like that? At that point
it's clear it's up to them to remove the file. So I suggest to keep the
feature as simple as possible - hand the dump over and delete.


+1.
If there are no other objections, I'm going to accept this
suggestion.

Regards




Re: Two fsync related performance issues?

2020-09-10 Thread Thomas Munro
On Thu, Sep 3, 2020 at 12:09 PM Thomas Munro  wrote:
> On Tue, May 12, 2020 at 12:43 PM Paul Guo  wrote:
> > RecreateTwoPhaseFile(gxact->xid, buf, len);

> I hadn't previously focused on this second part of your email.  I
> think the fsync() call in RecreateTwoPhaseFile() might be a candidate
> for processing by the checkpoint code through the new facilities in
> sync.c, which effectively does something like what you describe.  Take

I looked at this more closely and realised that I misunderstood; I was
thinking of a problem like the one that was already solved years ago
with commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71.  Sorry for the
noise.




Re: Improvements in Copy From

2020-09-10 Thread vignesh C
On Wed, Sep 9, 2020 at 12:24 PM Peter Smith  wrote:
>
> My basic understanding of first part of your patch is that by
> adjusting the "minread" it now allows it to loop multiple times
> internally within the CopyGetData rather than calling CopyLoadRawBuf
> for every N lines. There doesn't seem to be much change to what other
> code gets executed so the saving is essentially whatever is the cost
> of making 2 x function calls (CopyLoadRawBuff + CopyGetData) x N. Is
> that understanding correct?

Yes, you are right; we avoid the function calls and try to get as
many records as possible from the buffer and insert them into the relation.

> But with that change there seems to be opportunity for yet another
> tiny saving possible. IIUC, now you are processing a lot more data
> within the CopyGetData so it is now very likely that you will also
> gobble the COPY_NEW_FE's 'c' marker. So cstate->reached_eof will be
> set. So this means the calling code of CopyReadLineText may no longer
> need to call the CopyLoadRawBuf one last time just to discover there
> are no more bytes to read - something that it already knows if
> cstate->reached_eof == true.
>
> For example, with your change can't you also modify CopyReadLineText like 
> below:
>
> BEFORE
> if (!CopyLoadRawBuf(cstate))
> hit_eof = true;
>
> AFTER
> if (cstate->reached_eof)
> {
> cstate->raw_buf[0] = '\0';
> cstate->raw_buf_index = cstate->raw_buf_len = 0;
> hit_eof = true;
> }
> else if (!CopyLoadRawBuf(cstate))
> {
> hit_eof = true;
> }
>
> Whether such a micro-optimisation is worth doing is another question.
Yes, what you suggested can also be done, but I have the same
question as you: since it only saves one function call (the EOF check is
right at the start of that function), should we include this or not?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Yet another fast GiST build

2020-09-10 Thread Pavel Borisov
>
> Interesting to see also the size of index, it should be several times less.
>
> I've tested this above in the CF thread and got an ordered GiST index ~1.7 times
smaller than the non-ordered one for a single column of real points.


-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Yet another fast GiST build

2020-09-10 Thread Oleg Bartunov
On Mon, Sep 7, 2020 at 7:50 PM Heikki Linnakangas  wrote:
>
> On 07/09/2020 13:59, Pavel Borisov wrote:
>  I suppose there is a big jump in integer value (whether signed or
>  unsigned) as you cross from positive to negative floats, and then the
>  sort order is reversed.  I have no idea if either of those things is a
>  problem worth fixing.  That made me wonder if there might also be an
> >>
> >> I took a stab at fixing this, see attached patch (applies on top of your
> >> patch v14).
> >>
> >> To evaluate this, I used the other attached patch to expose the zorder
> >> function to SQL, and plotted points around zero with gnuplot. See the
> >> attached two images, one with patch v14, and the other one with this patch.
> >
> > I'd made testing of sorted SpGist build in cases of points distributed only
> > in 2d quadrant and points in all 4 quadrants and it appears that this
> > abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
> > is really nice way to avoid what can be avoided and I'd like it is included
> > together with Andrey's patch.
>
> Thanks! Did you measure the quality of the built index somehow? The
> ordering shouldn't make any difference to the build speed, but it
> affects the shape of the resulting index and the speed of queries
> against it.
>
> I played with some simple queries like this:
>
> explain (analyze, buffers) select count(*) from points_good where p <@
> box(point(50, 50), point(75, 75));
>
> and looking at the "Buffers" line for how many pages were accessed.
> There doesn't seem to be any consistent difference between v14 and my
> fix. So I concur it doesn't seem to matter much.
>
>
> I played some more with plotting the curve. I wrote a little python
> program to make an animation of it, and also simulated how the points
> would be divided into pages, assuming that each GiST page can hold 200
> tuples (I think the real number is around 150 with default page size).
> In the animation, the leaf pages appear as rectangles as it walks
> through the Z-order curve. This is just a simulation by splitting all
> the points into batches of 200 and drawing a bounding box around each
> batch. I haven't checked the actual pages as the GiST creates, but I
> think this gives a good idea of how it works.

Heikki, you may use our gevel extension to visualize the index tree:
http://www.sai.msu.su/~megera/wiki/Gevel
I used it to investigate the rtree index:
http://www.sai.msu.su/~megera/wiki/Rtree_Index


>
> The animation shows that there's quite a lot of overlap between the
> pages. It's not necessarily this patch's business to try to improve
> that, and the non-sorting index build isn't perfect either. But it
> occurs to me that there's maybe one pretty simple trick we could do:
> instead of blindly filling the leaf pages in Z-order, collect tuples
> into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples,
> or something in that ballpark, or maybe go all the way up to work_mem.
> When the buffer fills up, call the picksplit code to divide the buffer
> into the actual pages, and flush them to disk. If you look at the
> animation and imagine that you would take a handful of pages in the
> order they're created, and re-divide the points with the split
> algorithm, there would be much less overlap.

Interesting to see also the size of index, it should be several times less.

>
> - Heikki



-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-10 Thread Fujii Masao




On 2020/09/10 10:13, tsunakawa.ta...@fujitsu.com wrote:

Alexey-san, Sawada-san,
cc: Fujii-san,


From: Fujii Masao 

But if we
implement 2PC as the improvement on FDW independently from PostgreSQL
sharding, I think that it's necessary to support other FDW. And this is our
direction, isn't it?


I understand it the same way as Fujii-san.  2PC FDW is itself useful, so I think we should 
pursue the tidy FDW interface and good performance within the FDW framework.  
"tidy" means that many other FDWs should be able to implement it.  I guess 
XA/JTA is the only material we can use to consider whether the FDW interface is good.


Originally start(), commit() and rollback() are supported as FDW interfaces. 
With his patch, prepare() is supported. What other interfaces need to be 
supported per XA/JTA?

As Sawada-san and I discussed upthread, to support MySQL, another type of start() would be 
necessary to issue the "XA START id" command. end() might also be necessary to issue "XA END 
id", but that command can be issued via prepare() together with "XA PREPARE id".

I'm not familiar with XA/JTA or the XA transaction interfaces of other major DBMSs, 
so I'd like to know what other interfaces would additionally be necessary.





Sawada-san's patch supports that case by implementing some components
for that also in PostgreSQL core. For example, with the patch, all the remote
transactions that participate in the transaction are managed by PostgreSQL
core instead of the postgres_fdw layer.

Therefore, at least regarding the difference 2), I think that Sawada-san's
approach is better. Thought?


I think so.  Sawada-san's patch needs to address the design issues I posed 
before digging into the code for thorough review, though.

BTW, is there something Sawada-san can take from Alexey-san's patch?  I'm 
concerned about the performance for practical use.  Do you two have differences 
in these points, for instance?


IMO Sawada-san's version of 2PC is less performant, but that's because
his patch provides more functionality. For example, with his patch,
WAL is written to automatically complete the unresolved foreign transactions
in the case of failure. OTOH, Alexey's patch introduces no new WAL for 2PC.
Of course, generating more WAL causes more overhead.
But if we need the automatic resolution feature, it's inevitable to introduce
new WAL whichever patch we choose.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Global snapshots

2020-09-10 Thread Fujii Masao




On 2020/09/10 18:01, tsunakawa.ta...@fujitsu.com wrote:

From: Fujii Masao 

But I'm concerned about that it's really hard to say there is no patent risk
around that. I'm not sure who can judge there is no patent risk,
in the community. Maybe no one? Anyway, I was thinking that Google Spanner,
YugabyteDB, etc use the global transaction approach based on the clock
similar to Clock-SI. Since I've never heard they have the patent issues,
I was just thinking Clock-SI doesn't have. No? This type of *guess* is not
safe, though...


Hm, it may be difficult to be sure that the algorithm does not violate a 
patent.  But it may not be difficult to know if the algorithm apparently 
violates a patent or is highly likely (for those who know Clock-SI well.)  At 
least, Andrey-san seems to have felt that it needs careful study, so I guess he 
had some hunch.

I understand this community is sensitive to patents.  After the discussions at 
and after PGCon 2018, the community concluded that it won't accept patented 
technology.  In the distant past, the community released Postgres 8.0 that 
contains an IBM pending patent, ARC, and removed it in 8.0.2.  I wonder how 
this could be detected, and how hard it was to cope with the patent issue.  
Bruce warned that we should be careful not to violate Greenplum's patents.

E.25. Release 8.0.2
https://www.postgresql.org/docs/8.0/release-8-0-2.html
--
New cache management algorithm 2Q replaces ARC (Tom)
This was done to avoid a pending US patent on ARC. The 2Q code might be a few 
percentage points slower than ARC for some work loads. A better cache 
management algorithm will appear in 8.1.
--


I think I'll try to contact the people listed in Clock-SI paper and the 
Microsoft patent to ask about this.


Thanks!



I'm going to have a late summer vacation next week, so this is my summer 
homework?



One alternative is to add only hooks into PostgreSQL core so that we can
implement the global transaction management outside. This idea was
discussed before as the title "eXtensible Transaction Manager API".


Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro 
people said it's not good...


But it may be worth revisiting this idea if we cannot avoid the patent issue.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Bug in logical decoding of in-progress transactions

2020-09-10 Thread Dilip Kumar
On Thu, Sep 10, 2020 at 2:48 PM Amit Kapila  wrote:

> On Thu, Sep 10, 2020 at 12:00 PM Dilip Kumar 
> wrote:
> >
> > On Thu, Sep 10, 2020 at 11:53 AM Dilip Kumar 
> wrote:
> >>
> >>> >
> >>> > I have written a test case to reproduce the same.
> >
>
> Can we write an isolation test for this scenario? See some similar
> tests in contrib/test_decoding/specs. If that is possible then we can
> probably remove the test which failed and instead write an isolation
> test involving three transactions as shown by you. Also, please
> prepare two separate patches (one for test and other for code) if you
> are able to convert existing test to an isolation test as that will
> make it easier to test the fix.
>

I have written the test as an isolation test.  IMHO, we should not try to
merge stream.sql into this isolation test, mainly for two reasons: a) this
isolation test is very specific in that, while we are trying to stream, we
must have incomplete changes, so if we try to add more operations such as
message/truncate/abort it will become unpredictable.  Currently, I have
kept it with just one big tuple, so it is guaranteed that whenever
logical_decoding_work_mem is exceeded there will be partial changes.
b) we could add other operations to the transaction and cover the stream
changes, but those are not very specific to the isolation test.  So I feel
it is better to put only the specific scenario in the isolation test.



>
> >
> > I have removed some comments which are not valid after this patch.
> >
>
> Few comments:
> =
> 1. We need to set xact_wrote_changes in pg_decode_stream_truncate() as
> well along with the APIs in which you have set it.
> 2.
> +static void
> +pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData
> *data, ReorderBufferTXN *txn, bool last_write)
> +{
>   OutputPluginPrepareWrite(ctx, true);
>   if (data->include_xids)
>   appendStringInfo(ctx->out, "opening a streamed block for transaction
> TXN %u", txn->xid);
> @@ -601,16 +610,15 @@ pg_decode_stream_start(LogicalDecodingContext *ctx,
>   OutputPluginWrite(ctx, true);
>
> In this API, we need to use 'last_write' in OutputPluginPrepareWrite()
> and OutputPluginWrite().
>
> The attached patch fixes both these comments.
>

Okay,  there is some change in stream.out so I have included that in the
first patch.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v4-0002-Test-case-to-test-streaming-with-concurrent-empty.patch
Description: Binary data


v4-0001-Skip-printing-empty-stream-in-test-decoding.patch
Description: Binary data


Re: Improvements in Copy From

2020-09-10 Thread vignesh C
On Mon, Sep 7, 2020 at 1:19 PM Surafel Temesgen  wrote:
>
>
> Hi Vignesh
>
> On Wed, Jul 1, 2020 at 3:46 PM vignesh C  wrote:
>>
>> Hi,
>>
>> While reviewing copy from I identified few  improvements for copy from
>> that can be done :
>> a) copy from stdin copies lesser amount of data to buffer even though
>> space is available in buffer because minread was passed as 1 to
>> CopyGetData, Hence it only reads until the data read from libpq is
>> less than minread. This can be fixed by passing the actual space
>> available in buffer, this reduces the unnecessary frequent calls to
>> CopyGetData.
>
>
> why not applying the same optimization on file read ?

For file reads this is already taken care of, as you can see from the code below:

bytesread = fread(databuf, 1, maxread, cstate->copy_file);
if (ferror(cstate->copy_file))
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not read from COPY file: %m")));
if (bytesread == 0)
	cstate->reached_eof = true;
break;

Unlike the stdin case there is no condition that can break out early: we
read up to 1 * maxread bytes of data, so no change is needed there.
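
To illustrate why that matters for the stdin path, here is a small standalone toy (plain C, not PostgreSQL code; all names are invented) showing how passing the free buffer space as minread, instead of 1, reduces the number of buffer refills when the data arrives in small libpq-sized chunks:

#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE   8		/* data arrives in small, libpq-like chunks */
#define RAW_BUF_SIZE 64		/* space available in the raw buffer */

static const char data[] =
	"several lines of COPY input arriving in small chunks";
static size_t pos;

/* Hand out at most CHUNK_SIZE bytes per call, like a network read. */
static int
source_read(char *dst, int maxbytes)
{
	int		n = (int) (sizeof(data) - 1 - pos);

	if (n <= 0)
		return 0;
	if (n > CHUNK_SIZE)
		n = CHUNK_SIZE;
	if (n > maxbytes)
		n = maxbytes;
	memcpy(dst, data + pos, n);
	pos += n;
	return n;
}

/* Keep reading until at least minread bytes were collected (or EOF). */
static int
get_data(char *buf, int minread, int maxread)
{
	int		total = 0;

	while (total < minread)
	{
		int		n = source_read(buf + total, maxread - total);

		if (n == 0)
			break;				/* EOF */
		total += n;
	}
	return total;
}

int
main(void)
{
	char	buf[RAW_BUF_SIZE];
	int		refills;

	/* minread = 1: returns after every small chunk, many refills */
	pos = 0;
	refills = 0;
	while (get_data(buf, 1, RAW_BUF_SIZE) > 0)
		refills++;
	printf("minread = 1:            %d refills\n", refills);

	/* minread = free space: each refill gathers as much as possible */
	pos = 0;
	refills = 0;
	while (get_data(buf, RAW_BUF_SIZE, RAW_BUF_SIZE) > 0)
		refills++;
	printf("minread = buffer space: %d refills\n", refills);
	return 0;
}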

>
>>
>> c) Copy from reads header line and do nothing for the header line, we
>> need not clear EOL & need not convert to server encoding for the
>> header line.
>
>
> We have a patch for column matching feature [1] that may need a header line 
> to be further processed. Even without that I think it is preferable to 
> process the header line for nothing than adding those checks to the loop, 
> performance-wise.

I had seen that patch. I feel the change to match the header (when a
header is specified) can be addressed in this patch if that patch gets
committed first, or vice versa. We are doing a lot of processing on
data we don't need to do anything with; shouldn't this be skipped when
it is not required? A similar check is present in NextCopyFromRawFields
to skip the header.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: SIGQUIT handling, redux

2020-09-10 Thread Kyotaro Horiguchi
At Wed, 09 Sep 2020 10:46:28 -0400, Tom Lane  wrote in 
> > I also think we should
> > back-patch these changes, as the commit mentioned in Horiguchi-san's
> > patch for pgarch_exit() was.
> 
> Agreed; I'll go make it happen.

Thank you for committing this!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: v13: CLUSTER segv with wal_level=minimal and parallel index creation

2020-09-10 Thread Kyotaro Horiguchi
At Wed, 09 Sep 2020 14:40:37 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Sure. relNode is filled with zeros so RelationInitPhysicalAddr() works
> correctly for RelationBuildDesc().
> 
> Both changes look good to me.

Thank you for committing this!

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-10 Thread Andrey V. Lepikhov

On 9/9/20 5:51 PM, Amit Langote wrote:

On Wed, Sep 9, 2020 at 6:42 PM Alexey Kondratov
 wrote:

On 2020-09-09 11:45, Andrey V. Lepikhov wrote:

This does not seem very convenient and will lead to errors in the
future. So, I agree with Amit.


And InitResultRelInfo() may set ri_usesMultiInsert to false by default,
since it's used only by COPY now. Then you won't need this in several
places:

+   resultRelInfo->ri_usesMultiInsert = false;

While the logic of turning multi-insert on with all the validations
required could be factored out of InitResultRelInfo() to a separate
routine.


Interesting idea.  Maybe better to have a separate routine like Alexey says.

Ok. I rewrote patch 0001 following Alexey's suggestion.
Patch 0002... required minor changes (see the new version in the attachment).

Also, I added some optimization (see the 0003 and 0004 patches). Here we 
execute 'COPY .. FROM STDIN' on the foreign server only once, in the 
BeginForeignCopy routine. These are proof-of-concept patches.


Also, I see that error message processing needs to be rewritten. Unlike 
the INSERT operation applied to each row, here we find out about copy errors 
only after sending the END of the copy. Currently, implementations 0002 and 
0004 provide uninformative error messages in some cases.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 2053ac530db87ae4617aa953142c447e0b27e3a2 Mon Sep 17 00:00:00 2001
From: amitlan 
Date: Mon, 24 Aug 2020 15:08:37 +0900
Subject: [PATCH 1/4] Move multi-insert decision logic into executor

When 0d5f05cde introduced support for using multi-insert mode when
copying into partitioned tables, it introduced a single variable of
enum type CopyInsertMethod shared across all potential target
relations (partitions) that, along with some target relation
properties, dictated whether to engage multi-insert mode for a given
target relation.

Move that decision logic into InitResultRelInfo which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly more readability.
---
 src/backend/commands/copy.c  | 190 ++-
 src/backend/executor/execMain.c  |   3 +
 src/backend/executor/execPartition.c |  47 +++
 src/include/executor/execPartition.h |   2 +
 src/include/nodes/execnodes.h|   9 +-
 5 files changed, 131 insertions(+), 120 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..2119db4213 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -85,16 +85,6 @@ typedef enum EolType
 	EOL_CRNL
 } EolType;
 
-/*
- * Represents the heap insert method to be used during COPY FROM.
- */
-typedef enum CopyInsertMethod
-{
-	CIM_SINGLE,	/* use table_tuple_insert or fdw routine */
-	CIM_MULTI,	/* always use table_multi_insert */
-	CIM_MULTI_CONDITIONAL		/* use table_multi_insert only if valid */
-} CopyInsertMethod;
-
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2715,12 +2705,10 @@ CopyFrom(CopyState cstate)
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			ti_options = 0; /* start with default options for insert */
 	BulkInsertState bistate = NULL;
-	CopyInsertMethod insertMethod;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
-	bool		leafpart_use_multi_insert = false;
 
 	Assert(cstate->rel);
 
@@ -2833,6 +2821,58 @@ CopyFrom(CopyState cstate)
 	  0);
 	target_resultRelInfo = resultRelInfo;
 
+	Assert(target_resultRelInfo->ri_usesMultiInsert == false);
+
+	/*
+	 * It's generally more efficient to prepare a bunch of tuples for
+	 * insertion, and insert them in bulk, for example, with one
+	 * table_multi_insert() call than call table_tuple_insert() separately
+	 * for every tuple. However, there are a number of reasons why we might
+	 * not be able to do this.  We check some conditions below while some
+	 * other target relation properties are checked in InitResultRelInfo().
+	 * Partition initialization will use result of this check implicitly as
+	 * the ri_usesMultiInsert value of the parent relation.
+	 */
+	if (!checkMultiInsertMode(target_resultRelInfo, NULL))
+	{
+		/*
+		 * Do nothing. Can't allow multi-insert mode if previous conditions
+		 * checking disallow this.
+		 */
+	}
+	else if (cstate->volatile_defexprs || list_length(cstate->attnumlist) == 0)
+	{
+		/*
+		 * Can't support bufferization of copy into foreign tables without any
+		 * defined columns or if there are any volatile default expressions in the
+		 * table. Similarly to the trigger case above, such expressions may query
+		 * the table we're inserting into.
+		 *
+		 * Note: It does not matter 

Re: Evaluate expression at planning time for two more cases

2020-09-10 Thread Surafel Temesgen
On Tue, Sep 8, 2020 at 12:59 PM Surafel Temesgen 
wrote:

> Hi Tom
>
> On Tue, Sep 8, 2020 at 4:46 AM Tom Lane  wrote:
>
>
>> The "check_null_side" code you're proposing seems really horrid.
>> For one thing, it seems quite out of place for eval_const_expressions
>> to be doing that.  For another, it's wrong in principle because
>> eval_const_expressions doesn't know which part of the query tree
>> it's being invoked on, so it cannot know whether outer-join
>> nullability is an issue.  For another, doing that work over again
>> from scratch every time we see a potentially optimizable NullTest
>> looks expensive.  (I wonder whether you have tried to measure the
>> performance penalty imposed by this patch in cases where it fails
>> to make any proof.)
>>
>>
> I was thinking about collecting data about joins only once at the start of
> eval_const_expressions but I assume most queries don't have NULL check
> expressions and postpone it until we find one. Thinking about it again I
> think it can be done better by storing check_null_side_state into
> eval_const_expressions_context to use it for subsequent evaluation.
>
>

The attached patch does the above and also handles NOT NULL constraint columns.

regards

Surafel
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 750586fceb..5fe4d88b5d 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -20,8 +20,10 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/table.h"
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_class.h"
+#include "catalog/pg_constraint.h"
 #include "catalog/pg_language.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_proc.h"
@@ -39,6 +41,7 @@
 #include "optimizer/plancat.h"
 #include "optimizer/planmain.h"
 #include "parser/analyze.h"
+#include "parser/parsetree.h"
 #include "parser/parse_agg.h"
 #include "parser/parse_coerce.h"
 #include "parser/parse_func.h"
@@ -50,6 +53,7 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
 #include "utils/syscache.h"
 #include "utils/typcache.h"
 
@@ -61,6 +65,14 @@ typedef struct
 	AggClauseCosts *costs;
 } get_agg_clause_costs_context;
 
+typedef struct check_null_side_state
+{
+	Relids		relids;			/* base relids within this subtree */
+	bool		contains_outer; /* does subtree contain outer join(s)? */
+	JoinType	jointype;		/* type of join */
+	List	   *sub_states;		/* List of states for subtree components */
+}			check_null_side_state;
+
 typedef struct
 {
 	ParamListInfo boundParams;
@@ -68,6 +80,7 @@ typedef struct
 	List	   *active_fns;
 	Node	   *case_val;
 	bool		estimate;
+	check_null_side_state *state;
 } eval_const_expressions_context;
 
 typedef struct
@@ -156,6 +169,9 @@ static Query *substitute_actual_srf_parameters(Query *expr,
 			   int nargs, List *args);
 static Node *substitute_actual_srf_parameters_mutator(Node *node,
 	  substitute_actual_srf_parameters_context *context);
+static bool check_null_side(PlannerInfo *root, int relid, eval_const_expressions_context *context);
+static check_null_side_state * collect_jointree_data(Node *jtnode);
+
 
 
 /*
@@ -2296,6 +2312,7 @@ eval_const_expressions(PlannerInfo *root, Node *node)
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = false;	/* safe transformations only */
+	context.state = NULL;
 	return eval_const_expressions_mutator(node, &context);
 }
 
@@ -2626,6 +2643,7 @@ eval_const_expressions_mutator(Node *node,
 	{
 		has_null_input |= ((Const *) lfirst(arg))->constisnull;
 		all_null_input &= ((Const *) lfirst(arg))->constisnull;
+
 	}
 	else
 		has_nonconst_input = true;
@@ -3382,7 +3400,52 @@ eval_const_expressions_mutator(Node *node,
 
 	return makeBoolConst(result, false);
 }
+if (IsA(arg, Var) &&
+	((Var *) arg)->varlevelsup == 0 && context->root)
+{
+	/*
+	 * Evaluate the test if it is on NOT NULL Constraint column and the
+	 * relation is not mentioned on nullable side of outer
+	 * join
+	 */
+	Var		   *var = (Var *) arg;
+	Query	   *parse = context->root->parse;
+	int			relid;
+	RangeTblEntry *rte;
 
+	relid = var->varno;
+	rte = rt_fetch(relid, parse->rtable);
+	if (rte->relkind == RELKIND_RELATION)
+	{
+		Relation	rel;
+		Form_pg_attribute att;
+		rel = table_open(rte->relid, NoLock);
+		att = TupleDescAttr(rel->rd_att, var->varattno - 1);
+
+		if (att->attnotnull && !check_null_side(context->root, relid, context))
+		{
+			bool		result;
+
+			switch (ntest->nulltesttype)
+			{
+case IS_NULL:
+	result = false;
+	break;
+case IS_NOT_NULL:
+	result = true;
+	break;
+

Re: PATCH: Attempt to make dbsize a bit more consistent

2020-09-10 Thread Daniel Gustafsson
>> So what do you think of the patch?
> 
> I would suggest to keep the original logic unless there is a bug.

IIUC, the premise of this patch submission is that this codepath is in fact
buggy as it may lead to incorrect results for non-heap relations?

Since we have introduced the table AM API, I support going through it for all
table accesses, so +1 on the overall idea.

Some comments on the patch:

- * Note that this also behaves sanely if applied to an index or toast table;
+ * Note that this also behaves sanely if applied to a toast table;
  * those won't have attached toast tables, but they can have multiple forks.
This comment reads a bit odd now and should probably be reworded.


-   return size;
+   Assert(size < PG_INT64_MAX);
+
+   return (int64)size;
I assume that this change, and the other ones like that, aim to handle int64
overflow?  Using the extra legroom of uint64 can still lead to an overflow,
however theoretical it may be.  Wouldn't it be better to check for overflow
before adding to size rather than after the fact?  Something along the lines of
checking for headroom left:

  rel_size = table_relation_size(..);
  if (rel_size > (PG_INT64_MAX - total_size))
      < error codepath >
  total_size += rel_size;
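
If it helps, a self-contained rendering of that suggestion (plain C with stdint types; add_rel_size() and the values are invented, this is not the dbsize.c code):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Refuse to add another relation's size once it would push the running
 * total past INT64_MAX, instead of asserting after the overflow happened. */
static int64_t
add_rel_size(int64_t total_size, uint64_t rel_size)
{
	if (rel_size > (uint64_t) (INT64_MAX - total_size))
	{
		fprintf(stderr, "total relation size out of range\n");
		exit(EXIT_FAILURE);
	}
	return total_size + (int64_t) rel_size;
}

int
main(void)
{
	int64_t		total = 0;

	total = add_rel_size(total, 8192);			/* fine */
	total = add_rel_size(total, UINT64_MAX);	/* trips the check */
	printf("%lld\n", (long long) total);		/* not reached */
	return 0;
}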


+   if (rel->rd_rel->relkind != RELKIND_INDEX)
+   {
+   relation_close(rel, AccessShareLock);
+   PG_RETURN_NULL();
+   }
pg_indexes_size is defined as returning the size of the indexes attached to the
specified relation, so this hunk is wrong as it instead requires rel to be an
index?

cheers ./daniel



Re: Strange behavior with polygon and NaN

2020-09-10 Thread Kyotaro Horiguchi
Hello, Georgios.

At Mon, 07 Sep 2020 12:46:50 +, gkokola...@pm.me wrote in 
> ‐‐‐ Original Message ‐‐‐
> On Thursday, 27 August 2020 14:24, Kyotaro Horiguchi 
>  wrote:
> 
> > At Wed, 26 Aug 2020 08:18:49 -0400, Tom Lane t...@sss.pgh.pa.us wrote in
> >
> > > Kyotaro Horiguchi horikyota@gmail.com writes:
> > >
> > > > At Tue, 25 Aug 2020 19:03:50 -0400, Bruce Momjian br...@momjian.us 
> > > > wrote in
> > > >
> > > > > I can confirm that this two-month old email report still produces
> > > > > different results with indexes on/off in git master, which I don't 
> > > > > think
> > > > > is ever correct behavior.
> > >
> > > > I agree to that the behavior is broken.
> > >
> > > Yeah, but ... what is "non broken" in this case? I'm not convinced
> > > that having point_inside() return zero for any case involving NaN
> > > is going to lead to noticeably saner behavior than today.
> >
> > Yes, just doing that leaves many unfixed behavior come from NaNs, but
> > at least it seems to me one of sane definition candidates that a point
> > cannot be inside a polygon when NaN is involved. It's similar to
> > Fpxx() returns false if NaN is involved. As mentioned, I had't fully
> > checked and haven't considered this seriously, but I changed my mind
> > to check all the callers. I started checking all the callers of
> > point_inside, then finally I had to check all functions in geo_ops.c:(
> >
> 
> For what is worth, I agree with this definition.

Thanks.

> > The attached is the result as of now.
> >
> > === Resulting behavior
> >
> > With the attached:
> >
> > -   All boolean functions return false if NaN is involved.
> > -   All float8 functions return NaN if NaN is involved.
> > -   All geometric arithmetics return NaN as output if NaN is involved.
> 
> Agreed! As in both this behaviour conforms to the definition above and the 
> patch provides this behaviour with the exceptions below.
> 
> >
> > With some exceptions:
> >
> > -   line_eq: needs to consider that NaNs are equal each other.
> > -   point_eq/ne (point_eq_pint): ditto
> > -   lseg_eq/ne: ditto
...
> > === pg_hypot mistake?
> >
> > I noticed that pg_hypot returns inf for the parameters (NaN, Inf) but
> > I think NaN is appropriate here since other operators behaves that
> > way. This change causes a change of distance between point(1e+300,Inf)
> > and line (1,-1,0) from infinity to NaN, which I think is correct
> > because the arithmetic generates NaN as an intermediate value.
> >
> > === Infinity annoyances
> >
> > Infinity makes some not-great changes in regresssion results. For example:
> >
> > -   point '(1e+300,Infinity)' <-> path '((10,20))' returns
> > NaN(previously Infinity), but point '(1e+300,Infinity)' <-> path
> > '[(1,2),(3,4)]' returns Infinity. The difference of the two
> > expressions is whether (0 * Inf = NaN) is performed or not. The
> > former performs it but that was concealed by not propagating NaN to
> > upper layer without the patch.
> 
> Although I understand the reasoning for this change. I am not certain I agree 
> with the result. I feel that:
> point '(1e+300,Infinity)' <-> path '((10,20))'
> should return Infinity. Even if I am wrong to think that, the two results 
> should be expected to behave the same. Am I wrong on that too?

No. Actually that's not correct, and it just comes from avoiding
special code paths for Infinity.  I put more thought into
line_interpt_line and found that that issue is "fixed" just by
simplifying the formulas by removing invariants. One more if-block is
needed to make the function work in a symmetrical way, though.

However, still we have a similar issue.

point '(Infinity,1e+300)' <-> line '{-1,0,5}' => Infinity
point '(Infinity,1e+300)' <-> line '{0,-1,5}' => NaN
point '(Infinity,1e+300)' <-> line '{1,1,5}' => NaN

The second should be 1e+300 and the third infinity. This is because
line_closept_point takes the distance between the point and the foot of
the perpendicular dropped from it onto the line. We can fix the second
case by adding special code paths for vertical and horizontal lines,
but the third needs another special code path explicitly identifying
Infinity. That seems like too much.

Finally, I gave up on fixing that and added documentation.

As another issue, (point '(Infinity, 1e+300)' <-> path '((10,20))')
results in NaN. That is "fixed" by adding a special path for the "m ==
0.0" case, but I'm not sure it's worth doing.

By the way, I found that float8_div(, infinity) errors
out with underflow. That is inconsistent with the fact that float8_mul
doesn't error out with overflow when Infinity is given, so I fixed it.

> > -   This is not a difference caused by this patch, but for both patched
> > and unpatched, point '(1e+300,Inf)' <-> line '{3,0,0}' returns NaN,
> > which should be 1e+300. However, the behavior comes from arithmetic
> > reasons and wouldn't be a problem.
> >
> > create_index.out is changed since point(NaN,NaN) <@ polygon 

Re: autovac issue with large number of tables

2020-09-10 Thread Kasahara Tatsuhito
Therefore, we expect this patch [1] to be committed for its original
purpose, as well as to improve autovacuum from v14 onwards.

Hi,

On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
 wrote:
>
> Hi,
>
> On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
>  wrote:
> > > I wonder if we could have table_recheck_autovac do two probes of the stats
> > > data.  First probe the existing stats data, and if it shows the table to
> > > be already vacuumed, return immediately.  If not, *then* force a stats
> > > re-read, and check a second time.
> > Does the above mean that the second and subsequent table_recheck_autovac()
> > will be improved to first check using the previous refreshed statistics?
> > I think that certainly works.
> >
> > If that's correct, I'll try to create a patch for the PoC
>
> I still don't know how to reproduce Jim's troubles, but I was able to 
> reproduce
> what was probably a very similar problem.
>
> This problem seems to be more likely to occur in cases where you have
> a large number of tables,
> i.e., a large amount of stats, and many small tables need VACUUM at
> the same time.
>
> So I followed Tom's advice and created a patch for the PoC.
> This patch will enable a flag in the table_recheck_autovac function to use
> the existing stats next time if VACUUM (or ANALYZE) has already been done
> by another worker on the check after the stats have been updated.
> If the tables continue to require VACUUM after the refresh, then a refresh
> will be required instead of using the existing statistics.
>
> I did simple test with HEAD and HEAD + this PoC patch.
> The tests were conducted in two cases.
> (I changed few configurations. see attached scripts)
>
> 1. Normal VACUUM case
>   - SET autovacuum = off
>   - CREATE tables with 100 rows
>   - DELETE 90 rows for each tables
>   - SET autovacuum = on and restart PostgreSQL
>   - Measure the time it takes for all tables to be VACUUMed
>
> 2. Anti wrap round VACUUM case
>   - CREATE brank tables
>   - SELECT all of these tables (for generate stats)
>   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
>   - Consumes a lot of XIDs by using txid_curent()
>   - Measure the time it takes for all tables to be VACUUMed
>
> For each test case, the following results were obtained by changing
> autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
> Also changing num of tables to 1000, 5000, 1 and 2.
>
> Due to the poor VM environment (2 VCPU/4 GB), the results are a little 
> unstable,
> but I think it's enough to ask for a trend.
>
> ===
> [1.Normal VACUUM case]
>  tables:1000
>   autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
>   autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
>   autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
>   autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
>   autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec
>
>  tables:5000
>   autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
>   autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
>   autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
>   autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
>   autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec
>
>  tables:1
>   autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
>   autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
>   autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
>   autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
>   autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec
>
>  tables:2
>   autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
>   autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
>   autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
>   autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
>   autovacuum_max_workers 10:  (HEAD) 320 sec VS (with patch)  113 sec
>
> [2.Anti wrap round VACUUM case]
>  tables:1000
>   autovacuum_max_workers 1:   (HEAD) 19 sec VS (with patch) 18 sec
>   autovacuum_max_workers 2:   (HEAD) 14 sec VS (with patch) 15 sec
>   autovacuum_max_workers 3:   (HEAD) 14 sec VS (with patch) 14 sec
>   autovacuum_max_workers 5:   (HEAD) 14 sec VS (with patch) 16 sec
>   autovacuum_max_workers 10:  (HEAD) 16 sec VS (with patch) 14 sec
>
>  tables:5000
>   autovacuum_max_workers 1:   (HEAD) 69 sec VS (with patch) 69 sec
>   autovacuum_max_workers 2:   (HEAD) 66 sec VS (with patch) 47 sec
>   autovacuum_max_workers 3:   (HEAD) 59 sec VS (with patch) 37 sec
>   autovacuum_max_workers 5:   (HEAD) 39 sec VS (with patch) 28 sec
>   autovacuum_max_workers 10:  (HEAD) 39 sec VS (with patch) 29 sec
>
>  tables:1
>   autovacuum_max_workers 1:   (HEAD) 139 sec VS (with patch) 138 

Re: Bug in logical decoding of in-progress transactions

2020-09-10 Thread Amit Kapila
On Thu, Sep 10, 2020 at 12:00 PM Dilip Kumar  wrote:
>
> On Thu, Sep 10, 2020 at 11:53 AM Dilip Kumar  wrote:
>>
>>> >
>>> > I have written a test case to reproduce the same.
>

Can we write an isolation test for this scenario? See some similar
tests in contrib/test_decoding/specs. If that is possible then we can
probably remove the test which failed and instead write an isolation
test involving three transactions as shown by you. Also, please
prepare two separate patches (one for test and other for code) if you
are able to convert existing test to an isolation test as that will
make it easier to test the fix.

>
> I have removed some comments which are not valid after this patch.
>

Few comments:
=
1. We need to set xact_wrote_changes in pg_decode_stream_truncate() as
well along with the APIs in which you have set it.
2.
+static void
+pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData
*data, ReorderBufferTXN *txn, bool last_write)
+{
  OutputPluginPrepareWrite(ctx, true);
  if (data->include_xids)
  appendStringInfo(ctx->out, "opening a streamed block for transaction
TXN %u", txn->xid);
@@ -601,16 +610,15 @@ pg_decode_stream_start(LogicalDecodingContext *ctx,
  OutputPluginWrite(ctx, true);

In this API, we need to use 'last_write' in OutputPluginPrepareWrite()
and OutputPluginWrite().

The attached patch fixes both these comments.

-- 
With Regards,
Amit Kapila.


v3-0001-Skip-printing-empty-stream-in-test-decoding.patch
Description: Binary data


RE: Implement UNLOGGED clause for COPY FROM

2020-09-10 Thread tsunakawa.ta...@fujitsu.com
From: Peter Smith 
> Earlier, Osumi-san was rejecting the idea of using ALTER TABLE tbl SET
> UNLOGGED on basis that it is too time consuming for large data to
> switch the table modes [1].

> Doesn't wal_level=none essentially just behave as if every table was
> UNLOGGED; not just the ones we are loading?
> 
> Doesn't wal_level=none come with all the same limitations/requirements
> (full daily backups/restarts etc) that the UNLOGGED TABLE would also
> have?

ALTER TABLE takes time proportional to the amount of existing data, while 
wal_level = none doesn't.


Regards
Takayuki Tsunakawa



Re: Minor cleanup of partbounds.c

2020-09-10 Thread Etsuro Fujita
On Thu, Sep 10, 2020 at 2:05 AM Alvaro Herrera  wrote:
> On 2020-Sep-09, Etsuro Fujita wrote:
> > Here is a patch for minor cleanup of the partbounds.c changes made by
> > commit c8434d64c: 1) removes a useless assignment (in normal builds)
>
> LGTM.
>
> > and 2) improves comments a little.
>
> No objection to changing "bounds" to "range bounds".
>
> I think the point other is to replace the only appearance of "dummy
> relation" to better match the extensive use of "dummy partition" in this
> file.  The concept of a "dummy relation" is well established in the
> planner.  I didn't know if "dummy partition" is itself a concept
> (apparently in the newfangled partition-wise join stuff), or just
> glorified wording to say "a dummy relation that happens to be a
> partition".  Looking at is_dummy_partition, apparently a dummy partition
> is either a dummy relation or a partition that doesn't have a
> RelOptInfo.  So my conclusion is that this wording is okay to change
> too.

Cool!

I pushed the patch.  Thanks for reviewing!

Best regards,
Etsuro Fujita




RE: Global snapshots

2020-09-10 Thread tsunakawa.ta...@fujitsu.com
From: Fujii Masao 
> But I'm concerned about that it's really hard to say there is no patent risk
> around that. I'm not sure who can judge there is no patent risk,
> in the community. Maybe no one? Anyway, I was thinking that Google Spanner,
> YugabyteDB, etc use the global transaction approach based on the clock
> similar to Clock-SI. Since I've never heard they have the patent issues,
> I was just thinking Clock-SI doesn't have. No? This type of *guess* is not
> safe, though...

Hm, it may be difficult to be sure that the algorithm does not violate a 
patent.  But it may not be difficult to know if the algorithm apparently 
violates a patent or is highly likely (for those who know Clock-SI well.)  At 
least, Andrey-san seems to have felt that it needs careful study, so I guess he 
had some hunch.

I understand this community is sensitive to patents.  After the discussions at 
and after PGCon 2018, the community concluded that it won't accept patented 
technology.  In the distant past, the community released Postgres 8.0 that 
contains an IBM pending patent, ARC, and removed it in 8.0.2.  I wonder how 
this could be detected, and how hard it was to cope with the patent issue.  
Bruce warned that we should be careful not to violate Greenplum's patents.

E.25. Release 8.0.2
https://www.postgresql.org/docs/8.0/release-8-0-2.html
--
New cache management algorithm 2Q replaces ARC (Tom)
This was done to avoid a pending US patent on ARC. The 2Q code might be a few 
percentage points slower than ARC for some work loads. A better cache 
management algorithm will appear in 8.1.
--


I think I'll try to contact the people listed in Clock-SI paper and the 
Microsoft patent to ask about this.  I'm going to have a late summer vacation 
next week, so this is my summer homework?


> One alternative is to add only hooks into PostgreSQL core so that we can
> implement the global transaction management outside. This idea was
> discussed before as the title "eXtensible Transaction Manager API".

Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro 
people said it's not good...


Regards
Takayuki Tsunakawa




Re: Global snapshots

2020-09-10 Thread Fujii Masao




On 2020/09/10 10:38, tsunakawa.ta...@fujitsu.com wrote:

Hi Andrey san,

From: Andrey V. Lepikhov > > From: 
tsunakawa.ta...@fujitsu.com 

While Clock-SI seems to be considered the best promising for global

Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.

Microsoft holds this until 2031.  I couldn't find this with the keyword
"Clock-SI.""



US8356007B2 - Distributed transaction management for database systems

with multiversioning - Google Patents

https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?



Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.


I wanted to ask about this after I've published the revised scale-out design 
wiki, but I'm taking too long, so could you share your study results?  I think 
we need to make it clear about the patent before discussing the code.


Yes.

But I'm concerned about that it's really hard to say there is no patent risk
around that. I'm not sure who can judge there is no patent risk,
in the community. Maybe no one? Anyway, I was thinking that Google Spanner,
YugabyteDB, etc use the global transaction approach based on the clock
similar to Clock-SI. Since I've never heard they have the patent issues,
I was just thinking Clock-SI doesn't have. No? This type of *guess* is not
safe, though...



 After we hear your opinion, we also have to check to see if Clock-SI is 
patented or avoid it by modifying part of the algorithm.  Just in case we 
cannot use it, we have to proceed with thinking about alternatives.


One alternative is to add only hooks into PostgreSQL core so that we can
implement the global transaction management outside. This idea was
discussed before as the title "eXtensible Transaction Manager API".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration

2020-09-10 Thread Michael Paquier
On Mon, Sep 07, 2020 at 10:06:57AM +0200, Peter Eisentraut wrote:
> However, I see that in the case of pg_test_fsync you end up in alarm(0),
> which does something different, so it's okay in that case to disallow it.

Yep.

> I notice that the error checking you introduce is different from the checks
> for pgbench -t and -T (the latter having no errno checks).  I'm not sure
> which is correct, but it's perhaps worth making them the same.

pgbench currently uses atoi() to parse the options of -t and -T.  Are
you suggesting switching that to strtoXX() as well, or are you perhaps
referring to the parsing of the weight in parseScriptWeight()?  FWIW,
the error handling introduced in this patch is similar to what we do,
for example, in pg_resetwal.  That has its own problems, as strtoul()
would not report ERANGE except for values higher than ULONG_MAX, while
the returned results are stored in 32 bits.  We could switch to just
using uint64 where we could, of course, but is that really worth it for
such tools?  For example, pg_test_timing could overflow the calculated
total_timing if too high a value is used, but nobody would use such
values anyway.  So I'd rather just use uint32 and call it a day, mainly
for simplicity's sake.
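
For reference, a rough standalone sketch of that style of strtoul() parsing (invented names; not the actual pg_test_fsync/pg_test_timing code), rejecting trailing garbage, ERANGE, zero, and values that don't fit in 32 bits:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t
parse_uint32_option(const char *optval, const char *optname)
{
	char	   *endptr;
	unsigned long val;

	errno = 0;
	val = strtoul(optval, &endptr, 10);
	if (endptr == optval || *endptr != '\0')
	{
		fprintf(stderr, "invalid value \"%s\" for option %s\n",
				optval, optname);
		exit(EXIT_FAILURE);
	}
	if (errno == ERANGE || val == 0 || val > UINT32_MAX)
	{
		fprintf(stderr, "value \"%s\" for option %s is out of range\n",
				optval, optname);
		exit(EXIT_FAILURE);
	}
	return (uint32_t) val;
}

int
main(int argc, char **argv)
{
	uint32_t	secs;

	secs = parse_uint32_option(argc > 1 ? argv[1] : "5", "--secs-per-test");
	printf("running each test for %u seconds\n", (unsigned int) secs);
	return 0;
}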

> (pgbench -t 0, which is also currently not allowed, is a good example of why
> this could be useful, because that would allow checking whether the script
> etc. can be loaded without running an actual test.)

Perhaps.  That looks like a separate item to me though.
--
Michael


signature.asc
Description: PGP signature


Re: Online checksums verification in the backend

2020-09-10 Thread Julien Rouhaud
On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote:
> On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> > 
> > I'll do some benchmarking and see if I can get some figures, but it'll 
> > probably
> > take some time.  In the meantime I'm attaching v11 of the patch that should
> > address all other comments.
> 
> I just realized that I forgot to update one of the Makefile when moving the 
> TAP
> test folder.  v12 attached should fix that.


And the cfbot just reported a new error for Windows build.  Attached v13 should
fix that.
>From 2e08641e25ccd21bf6e4870085eb8b6b741027c1 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v13] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: 
https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  85 +
 doc/src/sgml/func.sgml|  51 +++
 src/backend/postmaster/pgstat.c   |   3 +
 src/backend/storage/page/checksum.c   | 322 +-
 src/backend/utils/adt/Makefile|   1 +
 src/backend/utils/adt/checksumfuncs.c | 218 
 src/backend/utils/init/globals.c  |   7 +
 src/backend/utils/misc/guc.c  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat   |  16 +
 src/include/miscadmin.h   |   7 +
 src/include/pgstat.h  |   3 +-
 src/include/utils/checksumfuncs.h |  31 ++
 src/include/utils/guc_tables.h|   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/check_relation/.gitignore|   2 +
 src/test/modules/check_relation/Makefile  |  14 +
 src/test/modules/check_relation/README|  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 +++
 src/tools/msvc/Mkvcbuild.pm   |   3 +-
 20 files changed, 1098 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/include/utils/checksumfuncs.h
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4ba49ffaf..b7629fde60 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2137,6 +2137,91 @@ include_dir 'conf.d'
  
 
 
+
+ Cost-based Checksum Verification Delay
+
+ 
+  During the execution of 
+  function, the system maintains an internal counter that keeps track of
+  the estimated cost of the various I/O operations that are performed.
+  When the accumulated cost reaches a limit (specified by
+  checksum_cost_limit), the process performing the
+  operation will sleep for a short period of time, as specified by
+  checksum_cost_delay. Then it will reset the counter
+  and continue execution.
+ 
+
+ 
+  This feature is disabled by default. To enable it, set the
+  checksum_cost_delay variable to a nonzero
+  value.
+ 
+
+ 
+  
+   checksum_cost_delay (floating 
point)
+   
+checksum_cost_delay configuration 
parameter
+   
+   
+   
+
+ The amount of time that the process will sleep
+ when the cost limit has been exceeded.
+ If this value is specified without units, it is taken as milliseconds.
+ The default value is zero, which disables the cost-based checksum
+ verification delay feature.  Positive values enable cost-based
+ checksum verification.
+
+   
+  
+
+  
+   checksum_cost_page (integer)
+   
+checksum_cost_page configuration 
parameter
+   
+   
+   
+
+ The estimated cost for verifying a buffer, whether it's found in the
+ shared buffer cache or not. It represents the cost to lock the buffer
+ pool, lookup the shared hash table, read the content of the page from
+ disk and compute its checksum.  The default value is 10.
+
+   
+  
+
+  
+   checksum_cost_limit (integer)
+   
+checksum_cost_limit configuration 
parameter
+   
+   
+   
+
+ The accumulated cost that will cause the verification process to 
sleep.
+ The default value is 200.
+
+   
+  
+ 
+
+   

Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-10 Thread Amit Kapila
On Thu, Sep 10, 2020 at 1:03 PM Magnus Hagander  wrote:
>
> On Thu, Sep 10, 2020 at 9:05 AM Amit Kapila  wrote:
>>
>> On Thu, Sep 10, 2020 at 11:52 AM Masahiko Sawada
>>  wrote:
>> >
>> >
>> > Regarding the v2 patch, I think we should return false in the
>> > following case too:
>> >
>> > default:
>> > ereport(pgStatRunningInCollector ? LOG : WARNING,
>> > (errmsg("corrupted statistics file \"%s\"",
>> > statfile)));
>> > goto done;
>> >
>>
>> makes sense, attached find the updated patch.
>
>
> As a minor nitpick, technically, I think the comment change is wrong, because 
> it says that the caller *must* rely on the timestamp, which it of course 
> doesn't. I think a more proper one is "The caller must not rely on the 
> timestamp in case the function returns false" or "The caller must only rely 
> on the timestamp if the function returns true".
>

The comments already say what you said in the second suggestion: "The
caller must rely on timestamp stored in *ts iff the function returns
true."  Read "iff" as "if and only if".

> +1 on the code parts.
>

BTW, do we want to backpatch this? There is no user-reported bug, and
I am not sure users will encounter any problem. I think it is a minor
improvement, more a matter of code consistency. So, making it a
HEAD-only change should be okay.

-- 
With Regards,
Amit Kapila.




Re: Proposals for making it easier to write correct bgworkers

2020-09-10 Thread Magnus Hagander
On Thu, Sep 10, 2020 at 5:02 AM Craig Ringer  wrote:

> Hi all
>
> As I've gained experience working on background workers, it's become
> increasingly clear that they're a bit too different to normal backends for
> many nontrivial uses.
>

 a lot of proposals I agree with.



PROPOSED GENERALISED WORKER MANAGEMENT
> 
>
> Finally I'm wondering if there's any interest in generalizing the logical
> rep worker management for other bgworkers. I've done a ton of work with
> worker management and it's something I'm sure I could take on but I don't
> want to write it without knowing there's some chance of acceptance.
>
> The general idea is to provide a way for bgworkers to start up managers
> for pools / sets of workers. They launch them and have a function they can
> call in their mainloop that watches their child worker states, invoking
> callbacks when they fail to launch, launch successfully, exit cleanly after
> finishing their work, or die with an error. Workers are tracked in a shmem
> seg where the start of the seg must be a key struct (akin to how the hash
> API works). We would provide calls to look up a worker shmem struct by key,
> signal a worker by key, wait for a worker to exit (up to timeout), etc.
> Like in the logical rep code, access to the worker registration shmem would
> be controlled by LWLock. The extension code can put whatever it wants in
> the worker shmem entries after the key, including various unions or
> whatever - the worker management API won't care.
>
> This abstracts all the low level mess away from bgworker implementations
> and lets them focus on writing the code they want to run.
>
> I'd probably suggest doing so by extracting the logical rep worker
> management, and making the logical rep code use the generalized worker
> management. So it'd be proven, and have in core users.
>

Yes, there is definitely a lot of interest in this.

It would also be good to somehow generalize away the difference between
static bgworkers and dynamic ones. That's something that really annoyed us
with the work on the "online checksums" patch, and I've also run into that
issue in other cases. I think finding a way to launch a dynamic worker out
of the postmaster would be a way to do that -- I haven't looked into the
detail, but if we're looking at generalizing the worker management this is
definitely something we should include in the consideration.

I haven't looked at the different places we could in theory extract the
management out of and reuse, but it makes sense that the logical
replication one would be the most appropriate since it's the newest one (vs
autovacuum which is the other one that can at least do similar things). And
yes, it definitely makes sense to have a generalized set of code for this,
because it's certainly a fairly complicated pattern that we shouldn't be
re-inventing over and over again with slightly different bugs.
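
To make the shape of that proposal a bit more concrete, here is a toy, purely hypothetical sketch (invented names and structures, no shared memory or LWLocks, not an existing or proposed PostgreSQL API) of keyed worker entries, lookup by key, and a manager-loop helper dispatching per-state callbacks:

#include <stdio.h>
#include <string.h>

typedef enum
{
	WS_REGISTERED,
	WS_STARTED,
	WS_FAILED_LAUNCH,
	WS_FINISHED,
	WS_CRASHED
} WorkerState;

typedef struct
{
	char		key[16];		/* key struct at the start of each entry */
	WorkerState state;
} WorkerEntry;

typedef struct
{
	void		(*on_started) (WorkerEntry *w);
	void		(*on_failed_launch) (WorkerEntry *w);
	void		(*on_finished) (WorkerEntry *w);
	void		(*on_crashed) (WorkerEntry *w);
} WorkerCallbacks;

/* In the real thing this array would live in a shmem segment. */
static WorkerEntry workers[] = {
	{"apply_1", WS_STARTED},
	{"apply_2", WS_FAILED_LAUNCH},
	{"sync_1", WS_FINISHED},
	{"sync_2", WS_CRASHED},
};

/* Look up a worker entry by key, as the proposed API would. */
static WorkerEntry *
worker_lookup(const char *key)
{
	for (size_t i = 0; i < sizeof(workers) / sizeof(workers[0]); i++)
		if (strcmp(workers[i].key, key) == 0)
			return &workers[i];
	return NULL;
}

/* One pass of the manager's main-loop helper: dispatch state callbacks. */
static void
worker_manager_poll(const WorkerCallbacks *cb)
{
	for (size_t i = 0; i < sizeof(workers) / sizeof(workers[0]); i++)
	{
		WorkerEntry *w = &workers[i];

		switch (w->state)
		{
			case WS_STARTED:
				cb->on_started(w);
				break;
			case WS_FAILED_LAUNCH:
				cb->on_failed_launch(w);
				break;
			case WS_FINISHED:
				cb->on_finished(w);
				break;
			case WS_CRASHED:
				cb->on_crashed(w);
				break;
			default:
				break;
		}
	}
}

static void
report(WorkerEntry *w)
{
	printf("%s: state %d\n", w->key, (int) w->state);
}

int
main(void)
{
	WorkerCallbacks cb = {report, report, report, report};

	worker_manager_poll(&cb);
	printf("lookup sync_1 -> %s\n", worker_lookup("sync_1") ? "found" : "missing");
	return 0;
}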

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-10 Thread Magnus Hagander
On Thu, Sep 10, 2020 at 9:05 AM Amit Kapila  wrote:

> On Thu, Sep 10, 2020 at 11:52 AM Masahiko Sawada
>  wrote:
> >
> >
> > Regarding the v2 patch, I think we should return false in the
> > following case too:
> >
> > default:
> > ereport(pgStatRunningInCollector ? LOG : WARNING,
> > (errmsg("corrupted statistics file \"%s\"",
> > statfile)));
> > goto done;
> >
>
> makes sense, attached find the updated patch.
>

As a minor nitpick, technically, I think the comment change is wrong,
because it says that the caller *must* rely on the timestamp, which it of
course doesn't. I think a more proper one is "The caller must not rely on
the timestamp in case the function returns false" or "The caller must only
rely on the timestamp if the function returns true".

+1 on the code parts.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: v13: show extended stats target in \d

2020-09-10 Thread Georgios



‐‐‐ Original Message ‐‐‐
On Thursday, 10 September 2020 00:36, Justin Pryzby  
wrote:

> On Tue, Sep 01, 2020 at 12:35:19PM +, Georgios Kokolatos wrote:
>
> > I will humbly disagree with the current review. I shall refrain from 
> > changing the status though because I do not have very strong feeling about 
> > it.
>
> Sorry but this was in my spam, and didn't see until now.

No worries at all. Thank you for replying.

>
> > However the patch contains:
> >
> > -   "  'm' = 
> > any(stxkind) AS mcv_enabled\\n"
> >
> >
> >
> > -   "  'm' = 
> > any(stxkind) AS mcv_enabled,\\n"
> >
> >
> > -   "  %s"
> > "FROM 
> > pg_catalog.pg_statistic_ext stat "
> > "WHERE stxrelid 
> > = '%s'\\n"
> > "ORDER BY 1;",
> >
> >
> > -   (pset.sversion 
> > >= 13 ? "stxstattarget\\n" : "-1\\n"),
> > oid);
> >
> >
> >
> > This seems to be breaking a bit the pattern in describeOneTableDetails().
> > First, it is customary to write the whole query for the version in its own 
> > block. I do find this pattern rather helpful and clean. So in my humble 
> > opinion, the ternary expression should be replaced with a distinct if block 
> > that would implement stxstattarget. Second, setting the value to -1 as an 
> > indication of absence breaks the pattern a bit further. More on that below.
>
> Hm, I did like this using the "hastriggers" code as a template. But you're
> right that everywhere else does it differently.


Thank you for taking my input.

>
> > - if (strcmp(PQgetvalue(result, i, 
> > 8), "-1") != 0)
> >
> >
> > - appendPQExpBuffer(, " 
> > STATISTICS %s",
> >
> >
> > -   
> > PQgetvalue(result, i, 8));
> >
> >
> > -
> >
> > In the same function, one will find a bit of a different pattern for 
> > printing the statistics, e.g.
> > if (strcmp(PQgetvalue(result, i, 7), "t") == 0)
> > I will again hold the opinion that if the query gets crafted a bit 
> > differently, one can inspect if the table has `stxstattarget` and then, go 
> > ahead and print the value.
> > Something in the lines of:
> > if (strcmp(PQgetvalue(result, i, 8), "t") == 0)
> > appendPQExpBuffer(&buf, " STATISTICS %s", PQgetvalue(result, i, 9));
>
> I think what you've written wouldn't give the desired behavior, since it would
> show the stats target even when it's set to the default. I don't see the point
> of having additional, separate, version-specific boolean columns for 1) column
> exists; 2) column is not default, in addition to 3) column value. But I added
> comment about what Peter and I agree is desirable, anyway.

Fair enough. As I said above, I do not have a very strong feeling, so it gets 
my +1 if it is worth anything.
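
For what it's worth, the "whole query per version" convention being referred
to looks roughly like the sketch below. This is written in the style of
describeOneTableDetails() (pset.sversion, printfPQExpBuffer(), the oid
string), but the column list is abbreviated and it is not the code from the
attached patch:

	if (pset.sversion >= 130000)
		printfPQExpBuffer(&buf,
						  "SELECT oid, stxrelid::pg_catalog.regclass, "
						  "stxname, stxstattarget\n"
						  "FROM pg_catalog.pg_statistic_ext stat\n"
						  "WHERE stxrelid = '%s'\n"
						  "ORDER BY 1;",
						  oid);
	else
		printfPQExpBuffer(&buf,
						  "SELECT oid, stxrelid::pg_catalog.regclass, "
						  "stxname\n"
						  "FROM pg_catalog.pg_statistic_ext stat\n"
						  "WHERE stxrelid = '%s'\n"
						  "ORDER BY 1;",
						  oid);

The duplication between the branches is the price for being able to read
each version's complete query at a glance, which seems to be the point of
the convention.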


>
> > Finally, the patch might be more complete if a note is added in the 
> > documentation.
> > Have you considered adding something in doc/src/sgml/ref/psql-ref.sgml? If 
> > no, will you consider it? If yes, why did you discard it?
>
> I don't think the details of psql output are currently documented. This shows
> nothing about column statistics target, nor about MV stats at all.
> https://www.postgresql.org/docs/13/app-psql.html

Yeah, I have noticed that. Hence my question. If anything maybe the 
documentation can be expanded to cover these cases in a different patch. Thank 
you for your answer.

>
> As for the discussion about a separator, I think maybe a comma is enough. I
> doubt anyone is going to think that you can get a valid command by prefixing
> this by "CREATE STATISTICS". Actually, it didn't even occur to me it was valid
> command without the stats target - after all, that's not true of indexes.
>
> -   "public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM ab1, 
> STATISTICS 0
>
> This revision only shows the stats target in verbose mode (slash dee 
> plus).
>
> --
> Justin
>






Re: Inconsistent Japanese name order in v13 contributors list

2020-09-10 Thread Etsuro Fujita
On Wed, Sep 9, 2020 at 9:16 PM Peter Eisentraut
 wrote:
> On 2020-09-09 07:15, Etsuro Fujita wrote:
> > Attached is a patch to standardize Japanese names as given-name-first
> > in the v13 contributors list as before.
>
> Given existing practice, this patch looks okay.

I've applied the patch.  Thanks for the review!

Best regards,
Etsuro Fujita




Re: [PATCH] audo-detect and use -moutline-atomics compilation flag for aarch64

2020-09-10 Thread Zidenberg, Tsahi
On 08/09/2020, 1:01, "Tom Lane"  wrote:

> I wonder what version of gcc you intend this for.  AFAICS, older
> gcc versions lack this flag at all, while newer ones have it on
> by default.


(previously sent private reply, sorry)

The -moutline-atomics flag showed substantial enough improvements
that it has been backported to GCC 9 and GCC 8, and a GCC 7 branch is
in the works.
Ubuntu has integrated this in 20.04 and Amazon Linux 2 supports it,
with other distributions, including Ubuntu 18.04 and Debian, on the way.
All distributions, however, including the upcoming Ubuntu with GCC 10,
have -moutline-atomics turned off by default.



Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-10 Thread Amit Kapila
On Thu, Sep 10, 2020 at 11:52 AM Masahiko Sawada
 wrote:
>
>
> Regarding the v2 patch, I think we should return false in the
> following case too:
>
> default:
> ereport(pgStatRunningInCollector ? LOG : WARNING,
> (errmsg("corrupted statistics file \"%s\"",
> statfile)));
> goto done;
>

makes sense, attached find the updated patch.

-- 
With Regards,
Amit Kapila.


v3-0001-Fix-inconsistency-in-determining-the-timestamp-of.patch
Description: Binary data


Re: doc review for v13

2020-09-10 Thread Michael Paquier
On Wed, Sep 09, 2020 at 09:37:42AM -0500, Justin Pryzby wrote:
> I've added a few more.

I have done an extra round of review on this patch series, and applied
what looked obvious to me (basically the points already discussed
upthread).  Some parts applied down to 9.6 for the docs.
--
Michael


signature.asc
Description: PGP signature


Re: Implement UNLOGGED clause for COPY FROM

2020-09-10 Thread Peter Smith
Hi.

I expect I have some basic misunderstanding because IMO now this
thread seems to have come full circle.

Earlier, Osumi-san rejected the idea of using ALTER TABLE tbl SET
UNLOGGED on the basis that switching the table mode is too
time-consuming for large amounts of data [1].

Now the latest idea is to introduce wal_level=none. But now
apparently full daily backups are OK, and restarting the server daily
before the copies is also OK [2].

~

Doesn't wal_level=none essentially just behave as if every table was
UNLOGGED; not just the ones we are loading?

Doesn't wal_level=none come with all the same limitations/requirements
(full daily backups/restarts etc) that the UNLOGGED TABLE would also
have?

So I don't recognise the difference?

If wal_level=none is judged OK as a fast loading solution, then why
wasn't an initially UNLOGGED table also judged OK by the same
criteria? And if there is no real difference, then why is it necessary
to introduce wal_level=none (instead of using the existing UNLOGGED
feature) in the first place?

Or, if this whole problem is simply due to the quirk that the BI tool
referred to does not support the CREATE UNLOGGED TABLE syntax [3],
then surely some other workaround could be written to handle
that.

What part am I missing?

--
[1] - 
https://www.postgresql.org/message-id/OSBPR01MB48884832932F93DAA953CEB9ED650%40OSBPR01MB4888.jpnprd01.prod.outlook.com
[2] - 
https://www.postgresql.org/message-id/TYAPR01MB299005FC543C43348A4993FDFE550%40TYAPR01MB2990.jpnprd01.prod.outlook.com
[3] - 
https://www.postgresql.org/message-id/OSBPR01MB4888CBD08DDF73721C18D2C0ED790%40OSBPR01MB4888.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Bug in logical decoding of in-progress transactions

2020-09-10 Thread Dilip Kumar
On Thu, Sep 10, 2020 at 11:53 AM Dilip Kumar  wrote:

> On Thu, Sep 10, 2020 at 11:47 AM Amit Kapila 
> wrote:
>
>> On Thu, Sep 10, 2020 at 11:42 AM Dilip Kumar 
>> wrote:
>> >
>> > On Thu, Sep 10, 2020 at 11:29 AM Amit Kapila 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> There is a recent build farm failure [1] in one of the test_decoding
>> >> tests as pointed by Tom Lane [2]. The failure report is shown below:
>> >>
>> >> @@ -71,6 +71,8 @@
>> >> data
>> >>  --
>> >>   opening a streamed block for transaction
>> >> + closing a streamed block for transaction
>> >> + opening a streamed block for transaction
>> >>   streaming change for transaction
>> >>   streaming change for transaction
>> >>   streaming change for transaction
>> >> @@ -83,7 +85,7 @@
>> >>   streaming change for transaction
>> >>   closing a streamed block for transaction
>> >>   committing streamed transaction
>> >> -(13 rows)
>> >> +(15 rows)
>> >>
>> >> Here, the symptoms are quite similar to what we have fixed in commit
>> >> 82a0ba7707 which is that an extra empty transaction is being decoded
>> >> in the test. It can happen even if have instructed the test to 'skip
>> >> empty xacts' for streaming transactions because the test_decoding
>> >> plugin APIs (related to streaming changes for in-progress xacts) makes
>> >> no effort to skip such empty xacts. It was kept intentionally like
>> >> that under the assumption that we would never try to stream empty
>> >> xacts but on closer inspection of the code, it seems to me that
>> >> assumption was not correct. Basically, we can pick to stream a
>> >> transaction that has change messages for
>> >> REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT and we don't send such
>> >> messages to downstream rather they are just to update the internal
>> >> state. So, in this particular failure, it is possible that autovacuum
>> >> transaction has got such a change message added by one of the other
>> >> committed xact and on trying to stream it we get such additional
>> >> messages. The fix is to skip empty xacts when indicated by the user in
>> >> streaming APIs of test_decoding.
>> >>
>> >> Thoughts?
>> >
>> >
>> > Yeah, that's an issue.
>> >
>> >>
>> >>
>> >> [1] -
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2020-09-09+03%3A42%3A19
>> >> [2] -
>> https://www.postgresql.org/message-id/118303.1599691636%40sss.pgh.pa.us
>> >>
>> >
>> > I have written a test case to reproduce the same.  I have also prepared
>> a patch to skip the empty transaction.  And after that, the issue has been
>> fixed.  But the extra side effect will be that it would skip any empty
>> stream even if the transaction is not empty.  As such I don't see any
>> problem with that but this is not what the user has asked for.
>> >
>>
>> Isn't that true for non-streaming xacts as well? Basically
>> skip-empty-xacts option indicates that if there is no change for
>> 'tuple' or 'message', we skip it.
>>
>
> Yeah, that's right.
>

I have removed some comments which are no longer valid after this patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v2-0001-Skip-printing-empty-stream-in-test-decoding.patch
Description: Binary data


Re: Bug in logical decoding of in-progress transactions

2020-09-10 Thread Dilip Kumar
On Thu, Sep 10, 2020 at 11:47 AM Amit Kapila 
wrote:

> On Thu, Sep 10, 2020 at 11:42 AM Dilip Kumar 
> wrote:
> >
> > On Thu, Sep 10, 2020 at 11:29 AM Amit Kapila 
> wrote:
> >>
> >> Hi,
> >>
> >> There is a recent build farm failure [1] in one of the test_decoding
> >> tests as pointed by Tom Lane [2]. The failure report is shown below:
> >>
> >> @@ -71,6 +71,8 @@
> >> data
> >>  --
> >>   opening a streamed block for transaction
> >> + closing a streamed block for transaction
> >> + opening a streamed block for transaction
> >>   streaming change for transaction
> >>   streaming change for transaction
> >>   streaming change for transaction
> >> @@ -83,7 +85,7 @@
> >>   streaming change for transaction
> >>   closing a streamed block for transaction
> >>   committing streamed transaction
> >> -(13 rows)
> >> +(15 rows)
> >>
> >> Here, the symptoms are quite similar to what we have fixed in commit
> >> 82a0ba7707 which is that an extra empty transaction is being decoded
> >> in the test. It can happen even if have instructed the test to 'skip
> >> empty xacts' for streaming transactions because the test_decoding
> >> plugin APIs (related to streaming changes for in-progress xacts) makes
> >> no effort to skip such empty xacts. It was kept intentionally like
> >> that under the assumption that we would never try to stream empty
> >> xacts but on closer inspection of the code, it seems to me that
> >> assumption was not correct. Basically, we can pick to stream a
> >> transaction that has change messages for
> >> REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT and we don't send such
> >> messages to downstream rather they are just to update the internal
> >> state. So, in this particular failure, it is possible that autovacuum
> >> transaction has got such a change message added by one of the other
> >> committed xact and on trying to stream it we get such additional
> >> messages. The fix is to skip empty xacts when indicated by the user in
> >> streaming APIs of test_decoding.
> >>
> >> Thoughts?
> >
> >
> > Yeah, that's an issue.
> >
> >>
> >>
> >> [1] -
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2020-09-09+03%3A42%3A19
> >> [2] -
> https://www.postgresql.org/message-id/118303.1599691636%40sss.pgh.pa.us
> >>
> >
> > I have written a test case to reproduce the same.  I have also prepared
> a patch to skip the empty transaction.  And after that, the issue has been
> fixed.  But the extra side effect will be that it would skip any empty
> stream even if the transaction is not empty.  As such I don't see any
> problem with that but this is not what the user has asked for.
> >
>
> Isn't that true for non-streaming xacts as well? Basically
> skip-empty-xacts option indicates that if there is no change for
> 'tuple' or 'message', we skip it.
>

Yeah, that's right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-10 Thread Masahiko Sawada
On Thu, 10 Sep 2020 at 14:24, Amit Kapila  wrote:
>
> On Wed, Sep 9, 2020 at 9:37 PM Fujii Masao  
> wrote:
> >
> > On 2020/09/09 22:57, Magnus Hagander wrote:
> > > On Wed, Sep 9, 2020 at 3:56 PM Tomas Vondra  wrote:
> > >
> > > On Wed, Sep 09, 2020 at 03:53:40PM +0530, Amit Kapila wrote:
> > >  >On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander  wrote:
> > >  >>
> > >  >> On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila  wrote:
> > >  >>>
> > >  >>
> > >  >> Though in fact the one inconsistent place in the code now is that 
> > > if it is corrupt in the db entry part of the file it returns true and the 
> > > global timestamp, which I would argue is perhaps incorrect and it should 
> > > return false.
> > >  >>
> > >  >
> > >  >Yeah, this is exactly the case I was pointing out where we return 
> > > true
> > >  >before the patch, basically the code below:
> > >  >case 'D':
> > >  >if (fread(, 1, offsetof(PgStat_StatDBEntry, tables),
> > >  >  fpin) != offsetof(PgStat_StatDBEntry, tables))
> > >  >{
> > >  >ereport(pgStatRunningInCollector ? LOG : WARNING,
> > >  >(errmsg("corrupted statistics file \"%s\"",
> > >  >statfile)));
> > >  >goto done;
> > >  >}
> > >  >
> > >  >done:
> > >  >FreeFile(fpin);
> > >  >return true;
> > >  >
> > >  >Now, if we decide to return 'false' here, then surely there is no
> > >  >argument and we should return false in other cases as well. 
> > > Basically,
> > >  >I think we should be consistent in handling the corrupt file case.
> > >  >
> > >
> > > FWIW I do agree with this - we should return false here, to make it
> > > return false like in the other data corruption cases. AFAICS that's 
> > > the
> > > only inconsistency here.
> > >
> > >
> > > +1, I think that's the place to fix, rather than reversing all the other 
> > > places.
> >
> > +1 as I suggested upthread!
> >
>
> Please find the patch attached based on the above discussion. I have
> slightly adjusted the comments.

FWIW reading the discussion, I also agree to fix it to return false in
case of file corruption.

Regarding the v2 patch, I think we should return false in the
following case too:

default:
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
statfile)));
goto done;
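
For illustration only (this is not the attached patch), one shape such a fix
could take is to track the outcome in a local flag so that every corruption
path falls through to a false return, e.g. (declarations and the rest of the
switch omitted):

	bool		success = false;	/* what we will return */

		case 'D':
			if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
					  fpin) != offsetof(PgStat_StatDBEntry, tables))
			{
				ereport(pgStatRunningInCollector ? LOG : WARNING,
						(errmsg("corrupted statistics file \"%s\"",
								statfile)));
				goto done;			/* success stays false */
			}
			/* on finding the requested database:
			 * *ts = dbentry.stats_timestamp; success = true; goto done; */
			break;

		case 'E':
			success = true;			/* reached end of file cleanly */
			goto done;

		default:
			ereport(pgStatRunningInCollector ? LOG : WARNING,
					(errmsg("corrupted statistics file \"%s\"",
							statfile)));
			goto done;				/* success stays false here too */

done:
	FreeFile(fpin);
	return success;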

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Bug in logical decoding of in-progress transactions

2020-09-10 Thread Amit Kapila
On Thu, Sep 10, 2020 at 11:42 AM Dilip Kumar  wrote:
>
> On Thu, Sep 10, 2020 at 11:29 AM Amit Kapila  wrote:
>>
>> Hi,
>>
>> There is a recent build farm failure [1] in one of the test_decoding
>> tests as pointed by Tom Lane [2]. The failure report is shown below:
>>
>> @@ -71,6 +71,8 @@
>> data
>>  --
>>   opening a streamed block for transaction
>> + closing a streamed block for transaction
>> + opening a streamed block for transaction
>>   streaming change for transaction
>>   streaming change for transaction
>>   streaming change for transaction
>> @@ -83,7 +85,7 @@
>>   streaming change for transaction
>>   closing a streamed block for transaction
>>   committing streamed transaction
>> -(13 rows)
>> +(15 rows)
>>
>> Here, the symptoms are quite similar to what we have fixed in commit
>> 82a0ba7707 which is that an extra empty transaction is being decoded
>> in the test. It can happen even if have instructed the test to 'skip
>> empty xacts' for streaming transactions because the test_decoding
>> plugin APIs (related to streaming changes for in-progress xacts) makes
>> no effort to skip such empty xacts. It was kept intentionally like
>> that under the assumption that we would never try to stream empty
>> xacts but on closer inspection of the code, it seems to me that
>> assumption was not correct. Basically, we can pick to stream a
>> transaction that has change messages for
>> REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT and we don't send such
>> messages to downstream rather they are just to update the internal
>> state. So, in this particular failure, it is possible that autovacuum
>> transaction has got such a change message added by one of the other
>> committed xact and on trying to stream it we get such additional
>> messages. The fix is to skip empty xacts when indicated by the user in
>> streaming APIs of test_decoding.
>>
>> Thoughts?
>
>
> Yeah, that's an issue.
>
>>
>>
>> [1] - 
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2020-09-09+03%3A42%3A19
>> [2] - https://www.postgresql.org/message-id/118303.1599691636%40sss.pgh.pa.us
>>
>
> I have written a test case to reproduce the same.  I have also prepared a 
> patch to skip the empty transaction.  And after that, the issue has been 
> fixed.  But the extra side effect will be that it would skip any empty stream 
> even if the transaction is not empty.  As such I don't see any problem with 
> that but this is not what the user has asked for.
>

Isn't that true for non-streaming xacts as well? Basically
skip-empty-xacts option indicates that if there is no change for
'tuple' or 'message', we skip it.

-- 
With Regards,
Amit Kapila.




Re: Bug in logical decoding of in-progress transactions

2020-09-10 Thread Dilip Kumar
On Thu, Sep 10, 2020 at 11:29 AM Amit Kapila 
wrote:

> Hi,
>
> There is a recent build farm failure [1] in one of the test_decoding
> tests as pointed by Tom Lane [2]. The failure report is shown below:
>
> @@ -71,6 +71,8 @@
> data
>  --
>   opening a streamed block for transaction
> + closing a streamed block for transaction
> + opening a streamed block for transaction
>   streaming change for transaction
>   streaming change for transaction
>   streaming change for transaction
> @@ -83,7 +85,7 @@
>   streaming change for transaction
>   closing a streamed block for transaction
>   committing streamed transaction
> -(13 rows)
> +(15 rows)
>
> Here, the symptoms are quite similar to what we have fixed in commit
> 82a0ba7707 which is that an extra empty transaction is being decoded
> in the test. It can happen even if have instructed the test to 'skip
> empty xacts' for streaming transactions because the test_decoding
> plugin APIs (related to streaming changes for in-progress xacts) makes
> no effort to skip such empty xacts. It was kept intentionally like
> that under the assumption that we would never try to stream empty
> xacts but on closer inspection of the code, it seems to me that
> assumption was not correct. Basically, we can pick to stream a
> transaction that has change messages for
> REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT and we don't send such
> messages to downstream rather they are just to update the internal
> state. So, in this particular failure, it is possible that autovacuum
> transaction has got such a change message added by one of the other
> committed xact and on trying to stream it we get such additional
> messages. The fix is to skip empty xacts when indicated by the user in
> streaming APIs of test_decoding.
>
> Thoughts?
>

Yeah, that's an issue.


>
> [1] -
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2020-09-09+03%3A42%3A19
> [2] -
> https://www.postgresql.org/message-id/118303.1599691636%40sss.pgh.pa.us
>
>
I have written a test case to reproduce the issue, and I have also
prepared a patch to skip the empty transaction; with that, the issue is
fixed.  The side effect is that it will skip any empty stream even if
the transaction itself is not empty.  As such I don't see any problem
with that, but it is not exactly what the user has asked for.

logical_decoding_work_mem=64kB
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');

CREATE TABLE stream_test(data text);

-- consume DDL
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1');
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select
array_agg(md5(g::text))::text from generate_series(1, 8) g';

--session1
BEGIN;
CREATE TABLE stream_test1(data text);

--session2
BEGIN;
CREATE TABLE stream_test2(data text);
COMMIT;

--session3
BEGIN;
INSERT INTO stream_test SELECT large_val();
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL,
'include-xids', '1', 'skip-empty-xacts', '1', 'stream-changes', '1');

   data
--
 opening a streamed block for transaction TXN 508
 closing a streamed block for transaction TXN 508
 opening a streamed block for transaction TXN 510
 streaming change for TXN 510
 closing a streamed block for transaction TXN 510
(5 rows)


After patch
   data
--
 opening a streamed block for transaction TXN 510
 streaming change for TXN 510
 closing a streamed block for transaction TXN 510
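
For anyone reading along, the shape of the change is roughly the sketch
below. This is only an illustration, not the attached patch:
pg_output_stream_start() is a hypothetical helper analogous to
pg_output_begin(), and a real fix may well need to track the flag per
transaction rather than in TestDecodingData, since streamed transactions
can interleave:

static void
pg_decode_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
	TestDecodingData *data = ctx->output_plugin_private;

	data->xact_wrote_changes = false;
	if (data->skip_empty_xacts)
		return;					/* defer output until we see a real change */
	pg_output_stream_start(ctx, data, txn, true);
}

static void
pg_decode_stream_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
						Relation relation, ReorderBufferChange *change)
{
	TestDecodingData *data = ctx->output_plugin_private;

	/* emit the deferred "opening a streamed block" on the first real change */
	if (data->skip_empty_xacts && !data->xact_wrote_changes)
		pg_output_stream_start(ctx, data, txn, false);
	data->xact_wrote_changes = true;

	OutputPluginPrepareWrite(ctx, true);
	appendStringInfoString(ctx->out, "streaming change for transaction");
	OutputPluginWrite(ctx, true);
}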


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Skip-printing-empty-stream-in-test-decoding.patch
Description: Binary data


Bug in logical decoding of in-progress transactions

2020-09-10 Thread Amit Kapila
Hi,

There is a recent build farm failure [1] in one of the test_decoding
tests, as pointed out by Tom Lane [2]. The failure report is shown below:

@@ -71,6 +71,8 @@
data
 --
  opening a streamed block for transaction
+ closing a streamed block for transaction
+ opening a streamed block for transaction
  streaming change for transaction
  streaming change for transaction
  streaming change for transaction
@@ -83,7 +85,7 @@
  streaming change for transaction
  closing a streamed block for transaction
  committing streamed transaction
-(13 rows)
+(15 rows)

Here, the symptoms are quite similar to what we fixed in commit
82a0ba7707, which is that an extra empty transaction is being decoded
in the test. It can happen even if we have instructed the test to 'skip
empty xacts' for streaming transactions, because the test_decoding
plugin APIs (related to streaming changes for in-progress xacts) make
no effort to skip such empty xacts. They were kept intentionally like
that under the assumption that we would never try to stream empty
xacts, but on closer inspection of the code, it seems to me that
assumption was not correct. Basically, we can pick to stream a
transaction that has change messages for
REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT, and we don't send such
messages downstream; they are just used to update the internal
state. So, in this particular failure, it is possible that the
autovacuum transaction got such a change message added by one of the
other committed xacts, and on trying to stream it we get these
additional messages. The fix is to skip empty xacts, when indicated by
the user, in the streaming APIs of test_decoding.

Thoughts?

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2020-09-09+03%3A42%3A19
[2] - https://www.postgresql.org/message-id/118303.1599691636%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.