Re: POC: Cleaning up orphaned files using undo logs

2019-09-15 Thread Thomas Munro
On Mon, Sep 16, 2019 at 5:27 AM Kuntal Ghosh  wrote:
> While testing zheap over undo apis, we've found the following
> issues/scenarios that might need some fixes/discussions:

Thanks!

> 1. In UndoLogAllocateInRecovery, when we find the current log number
> from the list of registered blocks, we don't check whether the
> block->in_use flag is true or not. In XLogResetInsertion, we just
> reset the in_use flag without resetting the blocks[]->rnode information.
> So, if we don't check the in_use flag, it's possible that we'll
> consult some block information from the previous WAL record. IMHO,
> just adding an in_use check in UndoLogAllocateInRecovery will solve
> the problem.

Agreed.  I added a line to break out of that loop if !block->in_use.

BTW I am planning to simplify that code considerably, based on a plan
to introduce a new rule: there can be only one undo record and
therefore only one undo allocation per WAL record.

> 2. A transaction inserts one undo record and generates a WAL record
> for the same, say at WAL location 0/2000A000. Next, the undo record
> gets discarded and WAL is generated to update the meta.discard pointer
> at location 0/2000B000. At the same time, an ongoing checkpoint with
> checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
> Then the system crashes.
> Now, the recovery starts from the location 0/2000. When the
> recovery of 0/2000A000 happens, it sees that the undo record it's
> about to insert is already discarded as per meta.discard (flushed by
> the checkpoint). In this case, should we just skip inserting the undo
> record?

I see two options:

1.  We make it so that if you're allocating in recovery and discard >
insert, we'll just set discard = insert so you can proceed.  The code
in undofile_get_segment_file() already copes with missing files during
recovery.

2.  We skip the insert as you said.

I think option 1 is probably best, otherwise you have to cope with
failure to insert by skipping, as you said.

> 3. Currently, we create a backup image of the unlogged part of the
> undo log's metadata only when some backend allocates some space from
> the undo log (in UndoLogAllocate). This helps us restore the unlogged
> meta part after a checkpoint.
> When we perform an undo action, we also update the undo action
> progress and emit a WAL record. The same operation can be performed by
> the undo worker, which doesn't allocate any space from the undo log.
> So, if an undo worker emits a WAL record to update the undo action
> progress after a checkpoint, it won't be able to WAL-log the backup
> image of the unlogged meta part. IMHO, this breaks the recovery logic
> for the unlogged part of the undo meta.

I thought that was OK because those undo data updates don't depend on
the insert pointer.  But I see what you mean: the next modification of
the page that DOES depend on the insert pointer might not log the
meta-data if it's not the first WAL record to touch it after a
checkpoint.  Rats.  I'll have to think about that some more.

-- 
Thomas Munro
https://enterprisedb.com




Re:

2019-09-15 Thread nilsocket
Extremely sorry, I forgot to add the attachment.

On Mon, Sep 16, 2019 at 11:11 AM nilsocket  wrote:

> We want to export data from PG to Kafka.
> We can't rely on an extension we have written, as there could be
> problems we are not aware of and PG might break.
> We don't want our master to go down because of an extension we have
> written.
>
> So we are okay with having a new PG instance, as a slave, whose only job
> is to export data.
>
> What we thought of doing is to pause the recovery (start-up) process on
> any vacuum changes to system catalog tables and resume it once our
> logical decoding has caught up.
> That way, we aren't bloating our master.
>
> Our problem is that we aren't able to identify exactly which WAL records
> are causing vacuum changes; as far as our understanding goes, it is the
> `HEAP_2 (rmgr), CLEAN` and `BTREE (rmgr), VACUUM` records.
>
> In order to see whether our understanding is right or not, the IPC
> (inter-process communication) between the WAL_SENDER and START_UP
> processes is not efficient enough.
>
> I have seen latches, but I'm not sure exactly how to use them; from the
> comments, my understanding is that the START_UP process is not available
> in the PGPROC array.
>
> For efficient inter-process communication, what would be the ideal way to
> communicate between two processes?
>
> I'm new to PostgreSQL and the C world.
>
> What we are trying to achieve is something similar to this:
>
> The START_UP process goes to sleep as soon as it sees any vacuum on the
> catalog.
> The WAL_SENDER process will resume recovery (wake it up) as soon as it
> has caught up, and then go to sleep.
> The START_UP process will wake up when there is something for the
> WAL_SENDER to send.
>
> Basically, IPC between two processes, where one process generates work
> and the other consumes it.
> The producer should go to sleep until the consumer has caught up.
> The consumer should signal the producer of its completion and wake the
> producer up.
> The cycle goes on...
>
> As I have indicated before, I'm new to C and PostgreSQL; I'm familiar
> with GoLang, and I have written a sample program in Go (attached below).
>
> Any suggestions and pointers would be greatly helpful.
>
> --
> Thank you
>


-- 
Thank you
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// just trying to simulate communication
// between goroutines, to test similar one in c with latches
var wg = new(sync.WaitGroup)
var tick = time.Tick(time.Second * 20)

func main() {

	wakeUpT1 := make(chan struct{}, 1)
	wakeUpT2 := make(chan struct{}, 1)
	closeCh := make(chan struct{}, 1)

	wg.Add(2)
	go thread(1, wakeUpT1, wakeUpT2, closeCh)
	go thread(2, wakeUpT2, wakeUpT1, closeCh)

	// boot strapping,
	wakeUpT1 <- struct{}{}

	wg.Wait()

}

func thread(id int, wakeMeUp, wakeHimup, closeCh chan struct{}) {
	defer wg.Done()

end:
	for {
		select {
		case <-wakeMeUp:
			t := time.Millisecond * time.Duration(rand.Intn(1000))
			fmt.Println(id, "woke up and doing work for:", t)
			<-time.Tick(t)
			wakeHimup <- struct{}{}
		case <-closeCh:
			fmt.Println("timeout, quitting", id)
			break end
		case <-tick:
			close(closeCh)
		}
	}
}
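
For reference, a rough C equivalent of the hand-off the Go program above
simulates, using PostgreSQL's latch API, might look like the sketch below.
WaitLatch/SetLatch/ResetLatch are the real primitives from storage/latch.h;
the shared struct, its allocation, and any locking around the flag are
hypothetical and omitted, and each process is assumed to have published its
own MyLatch pointer into that struct at startup.

/*
 * Minimal producer/consumer hand-off sketch using PostgreSQL latches.
 * The IpcDemoShared struct is a made-up stand-in for whatever shared
 * memory a real implementation would use.
 */
#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

typedef struct IpcDemoShared
{
	Latch	   *producer_latch; /* producer's MyLatch; set when consumer caught up */
	Latch	   *consumer_latch; /* consumer's MyLatch; set when work is queued */
	bool		work_pending;	/* toy one-slot "queue"; needs real locking */
} IpcDemoShared;

static IpcDemoShared *demo;		/* assumed to be allocated in shared memory */

/* Producer: hand one unit of work to the consumer, then sleep until woken. */
static void
producer_step(void)
{
	demo->work_pending = true;
	SetLatch(demo->consumer_latch);

	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
					 -1L,		/* no timeout */
					 PG_WAIT_EXTENSION);
	ResetLatch(MyLatch);
	CHECK_FOR_INTERRUPTS();
}

/* Consumer: sleep until work arrives, process it, then wake the producer. */
static void
consumer_step(void)
{
	for (;;)
	{
		ResetLatch(MyLatch);

		if (demo->work_pending)
		{
			demo->work_pending = false; /* ... process the work here ... */
			SetLatch(demo->producer_latch);
			break;
		}

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1L,
						 PG_WAIT_EXTENSION);
		CHECK_FOR_INTERRUPTS();
	}
}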


[no subject]

2019-09-15 Thread nilsocket
We want to export data from PG to Kafka.
We can't rely on an extension we have written, as there could be
problems we are not aware of and PG might break.
We don't want our master to go down because of an extension we have
written.

So we are okay with having a new PG instance, as a slave, whose only job
is to export data.

What we thought of doing is to pause the recovery (start-up) process on
any vacuum changes to system catalog tables and resume it once our
logical decoding has caught up.
That way, we aren't bloating our master.

Our problem is that we aren't able to identify exactly which WAL records
are causing vacuum changes; as far as our understanding goes, it is the
`HEAP_2 (rmgr), CLEAN` and `BTREE (rmgr), VACUUM` records.

In order to see whether our understanding is right or not, the IPC
(inter-process communication) between the WAL_SENDER and START_UP
processes is not efficient enough.

I have seen latches, but I'm not sure exactly how to use them; from the
comments, my understanding is that the START_UP process is not available
in the PGPROC array.

For efficient inter-process communication, what would be the ideal way to
communicate between two processes?

I'm new to PostgreSQL and the C world.

What we are trying to achieve is something similar to this:

The START_UP process goes to sleep as soon as it sees any vacuum on the
catalog.
The WAL_SENDER process will resume recovery (wake it up) as soon as it
has caught up, and then go to sleep.
The START_UP process will wake up when there is something for the
WAL_SENDER to send.

Basically, IPC between two processes, where one process generates work
and the other consumes it.
The producer should go to sleep until the consumer has caught up.
The consumer should signal the producer of its completion and wake the
producer up.
The cycle goes on...

As I have indicated before, I'm new to C and PostgreSQL; I'm familiar
with GoLang, and I have written a sample program in Go (attached below).

Any suggestions and pointers would be greatly helpful.

-- 
Thank you


Re: Leakproofness of texteq()/textne()

2019-09-15 Thread Tom Lane
Stephen Frost  writes:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Thu, Sep 12, 2019 at 5:19 PM Tom Lane  wrote:
>>> However, if there is some character C that makes ICU misbehave like
>>> that, we are going to have problems with indexing strings containing C,
>>> whether we think varstr_cmp is leaky or not.  So I'm not sure that
>>> focusing our attention on leakiness is a helpful way to proceed.

>> That seems like a compelling argument to me.

> Agreed.

So it seems that the consensus is that it's okay to mark these
functions leakproof, because if any of the errors they throw
are truly reachable for other than data-corruption reasons,
we would wish to try to prevent such errors.  (Maybe through
upstream validity checks?  Hard to say how we'd do it exactly,
when we don't have an idea what the problem is.)

My inclination is to do the proleakproof changes in HEAD, but
not touch v12.  The inconsistency in leakproof markings in v12
is annoying but it's not a regression or security hazard, so
I'm thinking it's not worth a late catversion bump to fix.

Thoughts?

regards, tom lane




Re: Leakproofness of texteq()/textne()

2019-09-15 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Sep 12, 2019 at 5:19 PM Tom Lane  wrote:
> > However, if there is some character C that makes ICU misbehave like
> > that, we are going to have problems with indexing strings containing C,
> > whether we think varstr_cmp is leaky or not.  So I'm not sure that
> > focusing our attention on leakiness is a helpful way to proceed.
> 
> That seems like a compelling argument to me.

Agreed.

Thanks,

Stephen




Re: range test for hash index?

2019-09-15 Thread Paul A Jungwirth
On Sat, Sep 14, 2019 at 5:13 AM Amit Kapila  wrote:
> In general, hash_range is covered by some of the existing tests,
> but I don't know which test.  See the code coverage report here:
> https://coverage.postgresql.org/src/backend/utils/adt/rangetypes.c.gcov.html

Thanks! I did some experimenting, and the current test code *only*
calls `hash_range_internal` when we force it like this:

set enable_nestloop=f;
set enable_hashjoin=t;
set enable_mergejoin=f;
select * from numrange_test natural join numrange_test2 order by nr;

But if I create that index as a hash index instead, we also call it
for these inserts and selects (except for the empty ranges):

create table numrange_test2(nr numrange);
create index numrange_test2_hash_idx on numrange_test2 (nr);

INSERT INTO numrange_test2 VALUES('[, 5)');
INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2));
INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2));
INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2,'()'));
INSERT INTO numrange_test2 VALUES('empty');

select * from numrange_test2 where nr = 'empty'::numrange;
select * from numrange_test2 where nr = numrange(1.1, 2.2);
select * from numrange_test2 where nr = numrange(1.1, 2.3);

(None of that is surprising, right? :-)

So that seems like more confirmation that it was always intended to be
a hash index. Would you like a commit for that? Is it a small enough
change for a committer to just do it? The entire change is simply
(also attached as a file):

diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out
index 60d875e898..6fd16bddd1 100644
--- a/src/test/regress/expected/rangetypes.out
+++ b/src/test/regress/expected/rangetypes.out
@@ -519,7 +519,7 @@ select numrange(1.0, 2.0) * numrange(2.5, 3.0);
 (1 row)

 create table numrange_test2(nr numrange);
-create index numrange_test2_hash_idx on numrange_test2 (nr);
+create index numrange_test2_hash_idx on numrange_test2 using hash (nr);
 INSERT INTO numrange_test2 VALUES('[, 5)');
 INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2));
 INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2));
diff --git a/src/test/regress/sql/rangetypes.sql b/src/test/regress/sql/rangetypes.sql
index 9fdb1953df..8960add976 100644
--- a/src/test/regress/sql/rangetypes.sql
+++ b/src/test/regress/sql/rangetypes.sql
@@ -119,7 +119,7 @@ select numrange(1.0, 2.0) * numrange(1.5, 3.0);
 select numrange(1.0, 2.0) * numrange(2.5, 3.0);

 create table numrange_test2(nr numrange);
-create index numrange_test2_hash_idx on numrange_test2 (nr);
+create index numrange_test2_hash_idx on numrange_test2 using hash (nr);

 INSERT INTO numrange_test2 VALUES('[, 5)');
 INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2));

Yours,
Paul


hash_range_test_v0001.patch
Description: Binary data


Re: block-level incremental backup

2019-09-15 Thread Robert Haas
On Thu, Sep 12, 2019 at 9:13 AM Jeevan Chalke
 wrote:
> I had a look over this issue and observed that when the new database is
> created, the catalog files are copied as-is into the new directory
> corresponding to a newly created database. And as they are just copied,
> the LSN on those pages is not changed. Due to this, the incremental backup
> thinks that it's an existing file and thus does not copy the blocks from
> these new files, leading to the failure.

*facepalm*

Well, this shoots a pretty big hole in my design for this feature. I
don't know why I didn't think of this when I wrote out that design
originally. Ugh.

Unless we change the way that CREATE DATABASE and any similar
operations work so that they always stamp pages with new LSNs, I think
we have to give up on the idea of being able to take an incremental
backup by just specifying an LSN. We'll instead need to get a list of
files from the server first, and then request the entirety of any that
we don't have, plus the changed blocks from the ones that we do have.
I guess that will make Stephen happy, since it's more like the design
he wanted originally (and should generalize more simply to parallel
backup).

One question I have is: is there any scenario in which an existing
page gets modified after the full backup and before the incremental
backup but does not end up with an LSN that follows the full backup's
start LSN? If there is, then the whole concept of using LSNs to tell
which blocks have been modified doesn't really work. I can't think of
a way that can happen off-hand, but then, I thought my last design was
good, too.
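
For concreteness, the per-block test an LSN-based incremental backup relies
on is essentially the comparison sketched below; PageGetLSN()/PageIsNew()
are the real accessors from bufpage.h, while the function name and the
policy around it are hypothetical. The CREATE DATABASE problem above is
exactly a case where a block changes without this test noticing.

#include "postgres.h"

#include "access/xlogdefs.h"
#include "storage/bufpage.h"

/*
 * Hypothetical helper: does this block need to be included in an
 * incremental backup taken relative to prior_start_lsn?  Anything stamped
 * with a newer (or equal) LSN has potentially changed since the prior
 * backup began; brand-new all-zero pages carry no LSN and are always sent.
 */
static bool
block_needs_incremental_copy(Page page, XLogRecPtr prior_start_lsn)
{
	if (PageIsNew(page))
		return true;

	return PageGetLSN(page) >= prior_start_lsn;
}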

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-15 Thread Tom Lane
Martijn van Oosterhout  writes:
> On Sat, 14 Sep 2019 at 17:08, Tom Lane  wrote:
>> None of this seems to respond to my point: it looks to me like it would
>> work fine if you simply dropped the patch's additions in PreCommit_Notify
>> and ProcessCompletedNotifies, because there is already enough logic to
>> decide when to call asyncQueueAdvanceTail.

> ...
> However, I guess you're thinking of asyncQueueReadAllNotifications()
> triggering if the queue as a whole was too long. This could in
> principle work but it does mean that at some point all backends
> sending NOTIFY are going to start calling asyncQueueAdvanceTail()
> every time, until the tail gets advanced, and if there are many idle
> listening backends behind this could take a while. The slowest backend
> might receive more signals while it is processing and so end up
> running asyncQueueAdvanceTail() twice. The fact that signals coalesce
> stops the process getting completely out of hand but it does feel a
> little uncontrolled.
> The whole point of this patch is to ensure that at any time only one
> backend is being woken up and calling asyncQueueAdvanceTail() at a
> time.

I spent some more time thinking about this, and I'm still not too
satisfied with this patch's approach.  It seems to me the key insights
we're trying to make use of are:

1. We don't really need to keep the global tail pointer exactly
up to date.  It's bad if it falls way behind, but a few pages back
is fine.

2. When sending notifies, only listening backends connected to our
own database need be awakened immediately.  Backends connected to
other DBs will need to advance their queue pointer sometime, but
again it doesn't need to be right away.

3. It's bad for multiple processes to all be trying to do
asyncQueueAdvanceTail concurrently: they'll contend for exclusive
access to the AsyncQueueLock.  Therefore, having the listeners
do it is really the wrong thing, and instead we should do it on
the sending side.

However, the patch as presented doesn't go all the way on point 3,
instead having listeners maybe-or-maybe-not do asyncQueueAdvanceTail
in asyncQueueReadAllNotifications.  I propose that we should go all
the way and just define tail-advancing as something that happens on
the sending side, and only once every few pages.  I also think we
can simplify the handling of other-database listeners by including
them in the set signaled by SignalBackends, but only if they're
several pages behind.  So that leads me to the attached patch;
what do you think?

BTW, in my hands it seems like point 2 (skip wakening other-database
listeners) is the only really significant win here, and of course
that only wins when the notify traffic is spread across a fair number
of databases.  Which I fear is not the typical use-case.  In single-DB
use-cases, point 2 helps not at all.  I had a really hard time measuring
any benefit from point 3 --- I eventually saw a noticeable savings
when I tried having one notifier and 100 listen-only backends, but
again that doesn't seem like a typical use-case.  I could not replicate
your report of lots of time spent in asyncQueueAdvanceTail's lock
acquisition.  I wonder whether you're using a very large max_connections
setting and we already fixed most of the problem with that in bca6e6435.
Still, this patch doesn't seem to make any cases worse, so I don't mind
if it's just improving unusual use-cases.

regards, tom lane

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index f26269b..7791f78 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -75,8 +75,10 @@
  *	  list of listening backends and send a PROCSIG_NOTIFY_INTERRUPT signal
  *	  to every listening backend (we don't know which backend is listening on
  *	  which channel so we must signal them all). We can exclude backends that
- *	  are already up to date, though.  We don't bother with a self-signal
- *	  either, but just process the queue directly.
+ *	  are already up to date, though, and we can also exclude backends that
+ *	  are in other databases (unless they are way behind and should be kicked
+ *	  to make them advance their pointers).  We don't bother with a
+ *	  self-signal either, but just process the queue directly.
  *
  * 5. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
  *	  sets the process's latch, which triggers the event to be processed
@@ -89,13 +91,14 @@
  *	  Inbound-notify processing consists of reading all of the notifications
  *	  that have arrived since scanning last time. We read every notification
  *	  until we reach either a notification from an uncommitted transaction or
- *	  the head pointer's position. Then we check if we were the laziest
- *	  backend: if our pointer is set to the same position as the global tail
- *	  pointer is set, then we move the global tail pointer ahead to where the
- *	  second-laziest backend is (in general, we take the MIN of the current
- *	

Re: (Re)building index using itself or another index of the same table

2019-09-15 Thread Tomas Vondra

On Thu, Sep 12, 2019 at 11:08:28AM -0400, Tom Lane wrote:

> Arseny Sher  writes:
>> A problem of similar nature can be reproduced with the following
>> stripped-down scenario:
>>
>> CREATE TABLE pears(f1 int primary key, f2 int);
>> INSERT INTO pears SELECT i, i+1 FROM generate_series(1, 100) i;
>> CREATE OR REPLACE FUNCTION pears_f(i int) RETURNS int LANGUAGE SQL IMMUTABLE AS
>> $$
>>   SELECT f1 FROM pears WHERE pears.f2 = 42
>> $$;
>> CREATE index ON pears ((pears_f(f1)));
>
> We've seen complaints about this sort of thing before, and rejected
> them because, as you say, that function is NOT immutable.  When you
> lie to the system like that, you should not be surprised if things
> break.
>
>> There is already a mechanism which prevents usage of indexes during
>> reindex -- ReindexIsProcessingIndex et al. However, to the contrary of
>> what the index.c:3664 comment says, these protect only indexes on system
>> catalogs, not user tables: the only real caller is genam.c.
>> The attached patch extends it: the same check is added to
>> get_relation_info. Also SetReindexProcessing is cocked in index_create
>> to defend from index self-usage during creation, as in the stripped example
>> above. There are some other still unprotected callers of index_build;
>> concurrent index creation doesn't need it because the index is
>> 'not indisvalid' during the build, and in RelationTruncateIndexes the
>> table is empty, so it looks like it can be omitted.
>
> I have exactly no faith that this fixes things enough to make such
> cases supportable.  And I have no interest in opening that can of
> worms anyway.  I'd rather put in some code to reject database
> accesses in immutable functions.

Same here. My hunch is a non-trivial fraction of applications using
this "trick" is silently broken in various subtle ways.

>> One might argue that a function selecting from a table can hardly be called
>> immutable, and immutability is required for index expressions. However,
>> if the user is sure the table contents don't change, why not?
>
> If the table contents never change, why are you doing VACUUM FULL on it?

It's possible the columns referenced by the index expression are not
changing, but some additional columns are updated.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Implementing Incremental View Maintenance

2019-09-15 Thread Paul Draper
As I understand it, the current patch performs immediate IVM using AFTER
STATEMENT trigger transition tables.

However, multiple tables can be modified *before* AFTER STATEMENT triggers
are fired.

CREATE TABLE example1 (a int);
CREATE TABLE example2 (a int);

CREATE INCREMENTAL MATERIALIZED VIEW mv AS
SELECT example1.a AS a1, example2.a AS a2
FROM example1 JOIN example2 ON example1.a = example2.a;

WITH
  insert1 AS (INSERT INTO example1 VALUES (1)),
  insert2 AS (INSERT INTO example2 VALUES (1))
SELECT NULL;

Changes to example1 are visible in an AFTER STATEMENT trigger on example2,
and vice versa. Would this not result in the (1, 1) tuple being
"double-counted"?

IVM needs to either:

(1) Evaluate deltas "serially" (e.g. EACH ROW triggers)

(2) Have simultaneous access to multiple deltas:
delta_mv = example1 x delta_example2 + example2 x delta_example1 -
delta_example1 x delta_example2

This latter method is the "logged" approach that has been discussed for
deferred evaluation.

tl;dr It seems that AFTER STATEMENT triggers require a deferred-like
implementation anyway.
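
To make the potential double-counting concrete (my own gloss, under multiset
semantics, writing example1/example2 for their post-statement contents), the
delta in (2) is:

\[
\Delta mv = (\mathit{example1} \bowtie \Delta\mathit{example2})
          + (\Delta\mathit{example1} \bowtie \mathit{example2})
          - (\Delta\mathit{example1} \bowtie \Delta\mathit{example2})
\]

For the WITH statement above, every table and every delta is just {(1)}, so
each join term contributes one (1, 1) row and the delta is 1 + 1 - 1 = 1 row;
two independent AFTER STATEMENT applications that each see the other table's
already-inserted row would effectively apply only the first two terms and
count (1, 1) twice.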


Re: log spam with postgres_fdw

2019-09-15 Thread Jeff Janes
On Sun, Sep 15, 2019 at 11:14 AM Tom Lane  wrote:

> Jeff Janes  writes:
> > When closing the local session which had used postgres_fdw over an ssl
> > connection, I get log spam on the foreign server saying:
> > LOG:  could not receive data from client: Connection reset by peer
> > It is easy to reproduce, but you must be using ssl to do so.
> > On searching, I see that a lot of people have run into this issue, with
> > considerable confusion, but as far as I can see it has never been
> diagnosed.
>
> In
>
> https://www.postgresql.org/message-id/flat/3DPLMQIC.YU6IFMLY.3PLOWL6W%40FQT5M7HS.IFBAANAE.A7GUPCPM
>
>
Thanks, I had not spotted that one, I guess because the log message itself
was not in the subject so it ranked lower.


> we'd concluded that the issue is probably that postgres_fdw has no
> logic to shut down its external connections when the session closes.
> It's not very clear why the SSL dependency, but we speculated that
> adding an on_proc_exit callback to close the connection(s) would help.
>
>
It is easy to reproduce the ssl dependency without any FDW, just by doing a
kill -9 on psql. Apparently the backend processes for unencrypted connections
are happy to be ghosted, while ssl ones are not, which seems like an odd
distinction to make.  So should this be addressed on both sides (the server
not whining, and the client doing the on_proc_exit anyway)?  I can take a
stab at the client-side one, but I'm in over my head on the ssl connection
handling logic on the server side.
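
For what it's worth, the client-side shape being suggested would presumably
be something along these lines; on_proc_exit(), PQfinish() and the dynahash
calls are real APIs, but the connection cache below is a simplified,
hypothetical stand-in for postgres_fdw's actual cache, not its real layout:

/*
 * Sketch: when the local backend exits, walk the FDW's connection cache
 * and close each remote connection cleanly, so the foreign server sees a
 * Terminate message instead of a connection reset.
 */
#include "postgres.h"

#include "libpq-fe.h"
#include "storage/ipc.h"
#include "utils/hsearch.h"

typedef struct FdwConnEntry
{
	Oid			serverid;		/* hash key (illustrative only) */
	PGconn	   *conn;			/* open libpq connection, or NULL */
} FdwConnEntry;

static HTAB *fdw_conn_cache = NULL; /* stand-in for the real connection hash */

/* Runs during backend exit, before the process goes away. */
static void
fdw_exit_callback(int code, Datum arg)
{
	HASH_SEQ_STATUS scan;
	FdwConnEntry *entry;

	if (fdw_conn_cache == NULL)
		return;

	hash_seq_init(&scan, fdw_conn_cache);
	while ((entry = (FdwConnEntry *) hash_seq_search(&scan)) != NULL)
	{
		if (entry->conn != NULL)
			PQfinish(entry->conn);	/* sends 'X' (Terminate), then closes */
		entry->conn = NULL;
	}
}

/* Register once, e.g. when the first remote connection is opened. */
static void
fdw_register_exit_callback(void)
{
	on_proc_exit(fdw_exit_callback, (Datum) 0);
}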

Cheers,

Jeff


Re: POC: Cleaning up orphaned files using undo logs

2019-09-15 Thread Kuntal Ghosh
Hello Thomas,

While testing zheap over undo apis, we've found the following
issues/scenarios that might need some fixes/discussions:

1. In UndoLogAllocateInRecovery, when we find the current log number
from the list of registered blocks, we don't check whether the
block->in_use flag is true or not. In XLogResetInsertion, we just
reset the in_use flag without resetting the blocks[]->rnode information.
So, if we don't check the in_use flag, it's possible that we'll
consult some block information from the previous WAL record. IMHO,
just adding an in_use check in UndoLogAllocateInRecovery will solve
the problem.

2. A transaction inserts one undo record and generates a WAL record
for the same, say at WAL location 0/2000A000. Next, the undo record
gets discarded and WAL is generated to update the meta.discard pointer
at location 0/2000B000. At the same time, an ongoing checkpoint with
checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
Then the system crashes.
Now, the recovery starts from the location 0/2000. When the
recovery of 0/2000A000 happens, it sees that the undo record it's
about to insert is already discarded as per meta.discard (flushed by
the checkpoint). In this case, should we just skip inserting the undo
record?

3. Currently, we create a backup image of the unlogged part of the
undo log's metadata only when some backend allocates some space from
the undo log (in UndoLogAllocate). This helps us restore the unlogged
meta part after a checkpoint.
When we perform an undo action, we also update the undo action
progress and emit a WAL record. The same operation can be performed by
the undo worker, which doesn't allocate any space from the undo log.
So, if an undo worker emits a WAL record to update the undo action
progress after a checkpoint, it won't be able to WAL-log the backup
image of the unlogged meta part. IMHO, this breaks the recovery logic
for the unlogged part of the undo meta.

Thoughts?

On Mon, Sep 2, 2019 at 9:47 AM Thomas Munro  wrote:
>
> On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh  
> wrote:
> > I'm getting the following assert failure while performing the recovery
> > with the same.
> > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> > File: "undolog.c", Line: 997)"
> >
> > I found that we don't emit a WAL record when we update the
> > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't do that, after
> > crash recovery, some new transaction may use that undo log, which is
> > wrong, IMHO. Am I missing something?
>
> Thanks, right, that status logging is wrong, will fix in next version.
>
> --
> Thomas Munro
> https://enterprisedb.com



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Primary keepalive message not appearing in Logical Streaming Replication

2019-09-15 Thread Tomas Vondra

On Sun, Sep 15, 2019 at 09:44:14AM -0600, Michael Loftis wrote:

> On Sun, Sep 15, 2019 at 08:36 Virendra Negi  wrote:
>
>> Oh I miss the documentation link there you go
>> https://www.postgresql.org/docs/9.5/protocol-replication.html
>>
>> On Sun, Sep 15, 2019 at 8:05 PM Virendra Negi  wrote:
>>
>>> Agreed, but why is there a message specification for it described in the
>>> documentation, and why does it ask the client to reply back if a particular
>>> *bit* is set? (1 means that the client should reply to this message as soon
>>> as possible, to avoid a timeout disconnect; 0 otherwise)
>
> This is unrelated to TCP keepalive. I honestly don't know where the knob is
> to turn these on but the configuration variables you quoted earlier I am
> familiar with and they are not it. Perhaps someone else can chime in with
> how to enable the protocol level keepalive in replication.



Pretty sure it's wal_sender_timeout. Which by default is 60s, but if you
tune it down it should send keepalives more often.

See WalSndKeepaliveIfNecessary in [1]:

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c#L3425

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Primary keepalive message not appearing in Logical Streaming Replication

2019-09-15 Thread Jeff Janes
On Sun, Sep 15, 2019 at 11:44 AM Michael Loftis  wrote:

>
>
> On Sun, Sep 15, 2019 at 08:36 Virendra Negi  wrote:
>
>> Oh I miss the documentation link there you go
>> https://www.postgresql.org/docs/9.5/protocol-replication.html
>>
>> On Sun, Sep 15, 2019 at 8:05 PM Virendra Negi 
>> wrote:
>>
>>> Agreed, but why is there a message specification for it described in the
>>> documentation, and why does it ask the client to reply back if a particular
>>> *bit* is set? (1 means that the client should reply to this message as soon
>>> as possible, to avoid a timeout disconnect; 0 otherwise)
>>>
>>
> This is unrelated to TCP keepalive. I honestly don't know where the knob
> is to turn these on but the configuration variables you quoted earlier I am
> familiar with and they are not it. Perhaps someone else can chime in with
> how to enable the protocol level keepalive in replication.
>

Protocol-level keepalives are governed by "wal_sender_timeout"

Cheers,

Jeff
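
As an aside, the keepalive format quoted earlier in the thread is small
enough that client-side handling is only a few lines. A sketch, assuming the
payload comes from PQgetCopyData() on a START_REPLICATION connection and
with the actual 'r' (standby status update) reply only stubbed out:

/*
 * Handle one CopyData payload from a replication connection.  Layout of
 * the 'k' message (from the protocol docs): Byte1 'k', Int64 current end
 * of WAL, Int64 send time (microseconds since 2000-01-01), Byte1
 * reply-requested flag.  All integers are big-endian on the wire.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

/* read a big-endian 64-bit integer from the payload */
static int64_t
recv_int64(const char *p)
{
	uint32_t	hi;
	uint32_t	lo;

	memcpy(&hi, p, 4);
	memcpy(&lo, p + 4, 4);
	return ((int64_t) ntohl(hi) << 32) | ntohl(lo);
}

/* stub: a real client would send an 'r' message via PQputCopyData() here */
static void
send_standby_status_update(int64_t wal_end)
{
	printf("would reply with a standby status update up to %lld\n",
		   (long long) wal_end);
}

/* buf/len: one CopyData payload received during logical replication */
static void
handle_copydata_message(const char *buf, int len)
{
	if (len >= 18 && buf[0] == 'k')
	{
		int64_t		wal_end = recv_int64(buf + 1);
		int64_t		send_time = recv_int64(buf + 9);
		int			reply_requested = buf[17];

		(void) send_time;		/* microseconds since 2000-01-01, unused here */
		if (reply_requested)
			send_standby_status_update(wal_end);
	}
	else if (len > 0 && buf[0] == 'w')
	{
		/* XLogData: start LSN, end LSN, send time, then the WAL payload */
	}
}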


Re: Efficient output for integer types

2019-09-15 Thread David Fetter
On Sun, Sep 15, 2019 at 02:06:29PM +0500, Andrey Borodin wrote:
> > On 15 Sep 2019, at 12:18, David Fetter  wrote:
> > 
> > Please find attached a couple of patches intended to $subject.
> > 
> > This patch set cut the time to copy ten million rows of randomly sized
> > int8s (10 of them) by about a third, so at least for that case, it's
> > pretty decent.
> 
> Hi! Looks cool.
> 
> Just curious if for any fixed base and square here
> 
> + while(uvalue >= base)
>   {
> + const int i = (uvalue % square) * 2;
> + uvalue /= square;
> + vallen += 2;
> + memcpy(convert + sizeof(convert) - vallen, digits + i, 
> 2);
> + }
> 
> compiler will have a chance to avoid idiv instruction?

That could very well be.  I took the idea (and most of the code) from
the Ryū implementation Andrew Gierth committed for 12.

> Maybe few specialized functions could work better than generic
> algorithm?

Could be.  What do you have in mind?  I'm guessing that the ones for
decimals, that being both the most common case and the least obvious
as to how to optimize, would give the most benefit.
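
Not the patch itself, but for anyone following along, a self-contained
illustration of the two-digits-per-division idea under discussion: because
the divisor 100 is a compile-time constant, compilers generally turn the
modulo/division pair into a multiply-and-shift rather than an idiv.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* lookup table of all two-digit pairs "00".."99" */
static const char DIGIT_PAIRS[201] =
	"00010203040506070809"
	"10111213141516171819"
	"20212223242526272829"
	"30313233343536373839"
	"40414243444546474849"
	"50515253545556575859"
	"60616263646566676869"
	"70717273747576777879"
	"80818283848586878889"
	"90919293949596979899";

/* writes the decimal text of v into out (no NUL terminator), returns length */
static int
print_u64(uint64_t v, char *out)
{
	char		buf[20];
	char	   *p = buf + sizeof(buf);

	while (v >= 100)
	{
		const unsigned pair = (unsigned) (v % 100) * 2;

		v /= 100;
		p -= 2;
		memcpy(p, DIGIT_PAIRS + pair, 2);
	}
	if (v >= 10)
	{
		p -= 2;
		memcpy(p, DIGIT_PAIRS + (unsigned) v * 2, 2);
	}
	else
		*--p = (char) ('0' + v);

	memcpy(out, p, buf + sizeof(buf) - p);
	return (int) (buf + sizeof(buf) - p);
}

int
main(void)
{
	char		s[21];
	int			n = print_u64(9007199254740993ULL, s);

	printf("%.*s\n", n, s);
	return 0;
}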

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Primary keepalive message not appearing in Logical Streaming Replication

2019-09-15 Thread Michael Loftis
On Sun, Sep 15, 2019 at 08:36 Virendra Negi  wrote:

> Oh I miss the documentation link there you go
> https://www.postgresql.org/docs/9.5/protocol-replication.html
>
> On Sun, Sep 15, 2019 at 8:05 PM Virendra Negi 
> wrote:
>
>> Agreed, but why is there a message specification for it described in the
>> documentation, and why does it ask the client to reply back if a particular
>> *bit* is set? (1 means that the client should reply to this message as soon
>> as possible, to avoid a timeout disconnect; 0 otherwise)
>>
>
This is unrelated to TCP keepalive. I honestly don't know where the knob is
to turn these on but the configuration variables you quoted earlier I am
familiar with and they are not it. Perhaps someone else can chime in with
how to enable the protocol level keepalive in replication.


>>
>> Primary keepalive message (B)
>> Byte1('k')
>>
>> Identifies the message as a sender keepalive.
>> Int64
>>
>> The current end of WAL on the server.
>> Int64
>>
>> The server's system clock at the time of transmission, as microseconds
>> since midnight on 2000-01-01.
>> Byte1
>>
>> 1 means that the client should reply to this message as soon as possible,
>> to avoid a timeout disconnect. 0 otherwise.
>>
>> The receiving process can send replies back to the sender at any time,
>> using one of the following message formats (also in the payload of a
>> CopyData message):
>>
>>
>> On Sun, Sep 15, 2019 at 7:39 PM Michael Loftis  wrote:
>>
>>>
>>>
>>> On Fri, Sep 13, 2019 at 07:12 Virendra Negi 
>>> wrote:
>>>
>>>> Implemented the Logical Streaming Replication; things are working fine: I
>>>> see the XLogData messages appearing and I'm able to parse them.
>>>>
>>>> But I haven't seen any "Primary Keepalive message" yet. I had tried
>>>> setting *tcp_keepalive_interval* and *tcp_keepalives_idle*, both as
>>>> client runtime parameters as well as in postgresql.conf, but still no
>>>> clue of it.
>>>>
>>>> Any information around it?

>>>
>>> Both of these options are not in the Pg protocol. They are within the OS
>>> TCP stack and are not visible to the applications at all.
>>>



 --
>>>
>>> "Genius might be described as a supreme capacity for getting its
>>> possessors
>>> into trouble of all kinds."
>>> -- Samuel Butler
>>>
>> --

"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler


Re: log spam with postgres_fdw

2019-09-15 Thread Tom Lane
Jeff Janes  writes:
> When closing the local session which had used postgres_fdw over an ssl
> connection, I get log spam on the foreign server saying:
> LOG:  could not receive data from client: Connection reset by peer
> It is easy to reproduce, but you must be using ssl to do so.
> On searching, I see that a lot of people have run into this issue, with
> considerable confusion, but as far as I can see it has never been diagnosed.

In
https://www.postgresql.org/message-id/flat/3DPLMQIC.YU6IFMLY.3PLOWL6W%40FQT5M7HS.IFBAANAE.A7GUPCPM

we'd concluded that the issue is probably that postgres_fdw has no
logic to shut down its external connections when the session closes.
It's not very clear why the SSL dependency, but we speculated that
adding an on_proc_exit callback to close the connection(s) would help.

I imagine dblink has a similar issue.

regards, tom lane




Re: Primary keepalive message not appearing in Logical Streaming Replication

2019-09-15 Thread Virendra Negi
Oh I miss the documentation link there you go
https://www.postgresql.org/docs/9.5/protocol-replication.html

On Sun, Sep 15, 2019 at 8:05 PM Virendra Negi  wrote:

> Agreed, but why is there a message specification for it described in the
> documentation, and why does it ask the client to reply back if a particular
> *bit* is set? (1 means that the client should reply to this message as soon
> as possible, to avoid a timeout disconnect; 0 otherwise)
>
>
> Primary keepalive message (B)
> Byte1('k')
>
> Identifies the message as a sender keepalive.
> Int64
>
> The current end of WAL on the server.
> Int64
>
> The server's system clock at the time of transmission, as microseconds
> since midnight on 2000-01-01.
> Byte1
>
> 1 means that the client should reply to this message as soon as possible,
> to avoid a timeout disconnect. 0 otherwise.
>
> The receiving process can send replies back to the sender at any time,
> using one of the following message formats (also in the payload of a
> CopyData message):
>
>
> On Sun, Sep 15, 2019 at 7:39 PM Michael Loftis  wrote:
>
>>
>>
>> On Fri, Sep 13, 2019 at 07:12 Virendra Negi 
>> wrote:
>>
>>> Implemented the Logical Streaming Replication; things are working fine: I
>>> see the XLogData messages appearing and I'm able to parse them.
>>>
>>> But I haven't seen any "Primary Keepalive message" yet. I had tried
>>> setting *tcp_keepalive_interval* and *tcp_keepalives_idle*, both as
>>> client runtime parameters as well as in postgresql.conf, but still no
>>> clue of it.
>>>
>>> Any information around it?
>>>
>>
>> Both of these options are not in the Pg protocol. They are within the OS
>> TCP stack and are not visible to the applications at all.
>>
>>>
>>>
>>>
>>> --
>>
>> "Genius might be described as a supreme capacity for getting its
>> possessors
>> into trouble of all kinds."
>> -- Samuel Butler
>>
>


log spam with postgres_fdw

2019-09-15 Thread Jeff Janes
I'm sending this to hackers, because it is not exactly a bug, and it can't
be addressed from userland.  I think it is a coding issue, although I
haven't identified the exact code.

When closing the local session which had used postgres_fdw over an ssl
connection, I get log spam on the foreign server saying:

LOG:  could not receive data from client: Connection reset by peer

It is easy to reproduce, but you must be using ssl to do so.

On searching, I see that a lot of people have run into this issue, with
considerable confusion, but as far as I can see it has never been diagnosed.

Is there anything that can be done about this, other than just learning to
ignore it?

Cheers,

Jeff
-- set up ssl, I won't walk through those steps.
\! pgbench -i
create extension postgres_fdw ;
create server ssl foreign data wrapper postgres_fdw OPTIONS ( host '127.0.0.1');
create schema fgn;
create user MAPPING FOR CURRENT_USER SERVER ssl;
import  foreign schema public from server ssl into fgn;
select count(*) from fgn.pgbench_history;
\q


Fwd: Extending range type operators to cope with elements

2019-09-15 Thread Esteban Zimanyi
> So yes, I've had a need for those operators in the past. What I don't
> know is whether adding these functions will be worth the catalog clutter.

The operators are tested and running within MobilityDB. It concerns lines
231-657 for the C code in file
https://github.com/ULB-CoDE-WIT/MobilityDB/blob/master/src/rangetypes_ext.c

and lines 32-248 for the SQL code in file
https://github.com/ULB-CoDE-WIT/MobilityDB/blob/master/src/sql/07_rangetypes_ext.in.sql


Since you don't really use PR, please let me know whether I can be of
any help.

Regards
Esteban

-- 

Prof. Esteban Zimanyi
Department of Computer & Decision Engineering  (CoDE) CP 165/15
Universite Libre de Bruxelles
Avenue F. D. Roosevelt 50
B-1050 Brussels, Belgium
fax: + 32.2.650.47.13
tel: + 32.2.650.31.85
e-mail: ezima...@ulb.ac.be
Internet: http://code.ulb.ac.be/



Re: Primary keepalive message not appearing in Logical Streaming Replication

2019-09-15 Thread Virendra Negi
Agreed, but why is there a message specification for it described in the
documentation, and why does it ask the client to reply back if a particular
*bit* is set? (1 means that the client should reply to this message as soon
as possible, to avoid a timeout disconnect; 0 otherwise)


Primary keepalive message (B)
Byte1('k')

Identifies the message as a sender keepalive.
Int64

The current end of WAL on the server.
Int64

The server's system clock at the time of transmission, as microseconds
since midnight on 2000-01-01.
Byte1

1 means that the client should reply to this message as soon as possible,
to avoid a timeout disconnect. 0 otherwise.

The receiving process can send replies back to the sender at any time,
using one of the following message formats (also in the payload of a
CopyData message):


On Sun, Sep 15, 2019 at 7:39 PM Michael Loftis  wrote:

>
>
> On Fri, Sep 13, 2019 at 07:12 Virendra Negi  wrote:
>
>> Implemented the Logical Streaming Replication; things are working fine: I
>> see the XLogData messages appearing and I'm able to parse them.
>>
>> But I haven't seen any "Primary Keepalive message" yet. I had tried
>> setting *tcp_keepalive_interval* and *tcp_keepalives_idle*, both as
>> client runtime parameters as well as in postgresql.conf, but still no
>> clue of it.
>>
>> Any information around it?
>>
>
> Both of these options are not in the Pg protocol. They are within the OS
> TCP stack and are not visible to the applications at all.
>
>>
>>
>>
>> --
>
> "Genius might be described as a supreme capacity for getting its possessors
> into trouble of all kinds."
> -- Samuel Butler
>


Re: pg_rewind docs correction

2019-09-15 Thread Michael Paquier
On Sat, Sep 14, 2019 at 07:00:54PM -0400, James Coleman wrote:
> Updated (plus some additional wordsmithing).

+The rewind operation is not expected to result in a consistent data
+directory state either internally to the node or with respect to the rest
+of the cluster. Instead the resulting data directory will only be consistent
+after WAL replay has completed to at least the LSN at which changed blocks
+copied from the source were originally written on the source.

That's not necessarily true.  pg_rewind enforces in the control file
of the target the minimum consistency LSN to be
pg_current_wal_insert_lsn() when using a live source or the last
checkpoint LSN for a stopped source, so while that sounds true from
the point of view of all the blocks copied, the control file may still
cause a complaint that the recovering target has not reached its
consistent point even if all the blocks are already at a position
not-so-far from what has been registered in the control file.

+   the point at which the WAL timelines of the source and target diverged plus
+   the current state on the source of any blocks changed on the target after
+   that divergence. While only changed blocks from existing relation files are

And here we could mention that all the blocks copied from the source
are the ones which are found in the WAL records of the target until
the end of WAL of its timeline.  Still, that's basically what is
mentioned in the first part of "How It Works", which explains things
better.  I honestly don't really see that all this paragraph is an
improvement over the simplicity of the original when it comes to
understand the global idea of what pg_rewind does.

+  
+Because pg_rewind copies configuration files
+entirely from the source, correcting recovery configuration options before
+restarting the server is necessary if you intend to re-introduce the target
+as a replica of the source. If you restart the server after the rewind
+operation has finished but without configuring recovery, the target will
+again diverge from the primary.
+   

No objections regarding that part.  Now it seems to me that we had
better apply that to the last part of "How it works" instead?  I kind
of agree that the last paragraph could provide more details regarding
the risks of overwriting the wanted configuration.  The existing docs
also mention that pg_rewind only creates a backup_label file to start
recovery, perhaps we could mention up to which point recovery happens
in this section?  There is a bit more here than just "apply the WAL".
--
Michael




Re: Primary keepalive message not appearing in Logical Streaming Replication

2019-09-15 Thread Michael Loftis
On Fri, Sep 13, 2019 at 07:12 Virendra Negi  wrote:

> Implemented the Logical Streaming Replication; things are working fine: I
> see the XLogData messages appearing and I'm able to parse them.
>
> But I haven't seen any "Primary Keepalive message" yet. I had tried
> setting *tcp_keepalive_interval* and *tcp_keepalives_idle*, both as
> client runtime parameters as well as in postgresql.conf, but still no
> clue of it.
>
> Any information around it?
>

Both of these options are not in the Pg protocol. They are within the OS
TCP stack and are not visible to the applications at all.

>
>
>
> --

"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler


Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-09-15 Thread Dean Rasheed
On Sun, 15 Sep 2019 at 12:20, Tomas Vondra  wrote:
>
> On Sun, Sep 15, 2019 at 11:27:19AM +0100, Dean Rasheed wrote:
> >On Sun, 15 Sep 2019 at 11:11, Tomas Vondra  
> >wrote:
> >>
> >> On Sun, Sep 15, 2019 at 10:16:30AM +0100, Dean Rasheed wrote:
> >> >
> >> >Ah sorry, I missed this thread before. As author of that commit, it's
> >> >really on me to fix it, and the cause seems pretty clear-cut, so I'll
> >> >aim to get that done today.
> >>
> >> FWIW here is a draft patch that I was going to propose - it simply moves
> >> the table+view into a "tststats" schema. I suppose that's roughly what we
> >> discussed earlier in this thread.
> >>
> >
> >Yes, that matches what I just drafted.
> >Do you want to push it, or shall I?
>
> Please go ahead and push. I'm temporarily without commit access.
>

OK, pushed.

Regards,
Dean




Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-09-15 Thread Tomas Vondra

On Sun, Sep 15, 2019 at 11:27:19AM +0100, Dean Rasheed wrote:

> On Sun, 15 Sep 2019 at 11:11, Tomas Vondra  wrote:
>>
>> On Sun, Sep 15, 2019 at 10:16:30AM +0100, Dean Rasheed wrote:
>>> Ah sorry, I missed this thread before. As author of that commit, it's
>>> really on me to fix it, and the cause seems pretty clear-cut, so I'll
>>> aim to get that done today.
>>
>> FWIW here is a draft patch that I was going to propose - it simply moves
>> the table+view into a "tststats" schema. I suppose that's roughly what we
>> discussed earlier in this thread.
>
> Yes, that matches what I just drafted.
> Do you want to push it, or shall I?

Please go ahead and push. I'm temporarily without commit access.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: pglz performance

2019-09-15 Thread Oleg Bartunov
On Wed, Sep 4, 2019 at 12:22 PM Andrey Borodin  wrote:
>
> Hi, Peter! Thanks for looking into this.
>
> > On 4 Sep 2019, at 14:09, Peter Eisentraut  wrote:
> >
> > On 2019-06-24 10:44, Andrey Borodin wrote:
> >>> On 18 May 2019, at 11:44, Andrey Borodin  wrote:
> >>>
> >> Hi!
> >> Here's rebased version of patches.
> >>
> >> Best regards, Andrey Borodin.
> >
> > I think this is the most recent patch for the CF entry
> > .
> >
> > What about the two patches?  Which one is better?
> On our observations pglz_decompress_hacked.patch is best for most of tested 
> platforms.
> Difference is that pglz_decompress_hacked8.patch will not apply the optimization 
> if decompressed match is not greater than 8 bytes. This optimization was 
> suggested by Tom, that's why we benchmarked it specifically.
>
> > Have you also considered using memmove() to deal with the overlap issue?
> Yes, memmove() resolves ambiguity of copying overlapping regions in a way 
> that is not compatible with pglz. In proposed patch we never copy overlapping 
> regions.
>
> > Benchmarks have been posted in this thread.  Where is the benchmarking
> > tool?  Should we include that in the source somehow?
>
> Benchmarking tool is here [0]. Well, code of the benchmarking tool do not 
> adhere to our standards in some places, we did not consider its inclusion in 
> core.
> However, most questionable part of benchmarking is choice of test data. It's 
> about 100Mb of useless WALs, datafile and valuable Shakespeare writings.

Why not use the 'Silesia compression corpus'
(http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia), which is used by
lzbench (https://github.com/inikep/lzbench)?  Teodor and I remember
that testing on non-English texts could be very important.


>
> Best regards, Andrey Borodin.
>
>
> [0] https://github.com/x4m/test_pglz
>
>
>


-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] [PROPOSAL] Effective storage of duplicates in B-tree index.

2019-09-15 Thread Oleg Bartunov
On Tue, Sep 1, 2015 at 12:33 PM Alexander Korotkov
 wrote:
>
> Hi, Tomas!
>
> On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra  
> wrote:
>>
>> On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
>>>
>>> I'm going to begin work on effective storage of duplicate keys in B-tree
>>> index.
>>> The main idea is to implement posting lists and posting trees for B-tree
>>> index pages as it's already done for GIN.
>>>
>>> In a nutshell, effective storing of duplicates in GIN is organised as
>>> follows.
>>> Index stores single index tuple for each unique key. That index tuple
>>> points to posting list which contains pointers to heap tuples (TIDs). If
>>> too many rows having the same key, multiple pages are allocated for the
>>> TIDs and these constitute so called posting tree.
>>> You can find wonderful detailed descriptions in gin readme
>>> 
>>> and articles .
>>> It also makes possible to apply compression algorithm to posting
>>> list/tree and significantly decrease index size. Read more in
>>> presentation (part 1)
>>> .
>>>
>>> Now new B-tree index tuple must be inserted for each table row that we
>>> index.
>>> It can possibly cause page split. Because of MVCC even unique index
>>> could contain duplicates.
>>> Storing duplicates in posting list/tree helps to avoid superfluous splits.
>>>
>>> So it seems to be very useful improvement. Of course it requires a lot
>>> of changes in B-tree implementation, so I need approval from community.
>>
>>
>> In general, index size is often a serious issue - cases where indexes need 
>> more space than tables are not quite uncommon in my experience. So I think 
>> the efforts to lower space requirements for indexes are good.
>>
>> But if we introduce posting lists into btree indexes, how different are they 
>> from GIN? It seems to me that if I create a GIN index (using btree_gin), I 
>> do get mostly the same thing you propose, no?
>
>
> Yes, In general GIN is a btree with effective duplicates handling + support 
> of splitting single datums into multiple keys.
> This proposal is mostly porting duplicates handling from GIN to btree.

Is it worth making a provision to add an ability to control how
duplicates are sorted?  If we speak about GIN, why not take into
account our experiments with RUM (https://github.com/postgrespro/rum)?

>
>> Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
>
>
> The difference between btree_gin and btree is not only UNIQUE feature.
> 1) There is no gingettuple in GIN. GIN supports only bitmap scans. And it's 
> not feasible to add gingettuple to GIN. At least with same semantics as it is 
> in btree.
> 2) GIN doesn't support multicolumn indexes in the way btree does. Multicolumn 
> GIN is more like set of separate singlecolumn GINs: it doesn't have composite 
> keys.
> 3) btree_gin can't effectively handle range searches. "a < x < b" would be 
> handled as "a < x" intersect "x < b". That is extremely inefficient. It is 
> possible to fix. However, there is no clear proposal how to fit this case 
> into GIN interface, yet.
>
>>
>> but the compression can only be effective when there are duplicate rows. So 
>> either the index is not UNIQUE (so the b-tree feature is not needed), or 
>> there are many updates.
>
>
> From my observations users can use btree_gin only in some cases. They like 
> compression, but can't use btree_gin mostly because of #1.
>
>> Which brings me to the other benefit of btree indexes - they are designed 
>> for high concurrency. How much is this going to be affected by introducing 
>> the posting lists?
>
>
> I'd notice that current duplicates handling in PostgreSQL is hack over 
> original btree. It is designed so in btree access method in PostgreSQL, not 
> btree in general.
> Posting lists shouldn't change concurrency much. Currently, in btree you have 
> to lock one page exclusively when you're inserting new value.
> When posting list is small and fits one page you have to do similar thing: 
> exclusive lock of one page to insert new value.
> When you have posting tree, you have to do exclusive lock on one page of 
> posting tree.
>
> One can say that concurrency would become worse because the index would become 
> smaller and the number of pages would become smaller too. Since the number of pages 
> would be smaller, backends are more likely to contend for the same page. But this 
> argument can be used against any compression and for any bloat.
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company



-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-09-15 Thread Dean Rasheed
On Sun, 15 Sep 2019 at 11:11, Tomas Vondra  wrote:
>
> On Sun, Sep 15, 2019 at 10:16:30AM +0100, Dean Rasheed wrote:
> >
> >Ah sorry, I missed this thread before. As author of that commit, it's
> >really on me to fix it, and the cause seems pretty clear-cut, so I'll
> >aim to get that done today.
>
> FWIW here is a draft patch that I was going to propose - it simply moves
> the table+view into a "tststats" schema. I suppose that's roughly what we
> discussed earlier in this thread.
>

Yes, that matches what I just drafted.
Do you want to push it, or shall I?

Regards,
Dean




Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-15 Thread Martijn van Oosterhout
On Sat, 14 Sep 2019 at 17:08, Tom Lane  wrote:
> Martijn van Oosterhout  writes:
> > On Fri, 13 Sep 2019 at 22:04, Tom Lane  wrote:
> >> But, really ... do we need the backendTryAdvanceTail flag at all?

> None of this seems to respond to my point: it looks to me like it would
> work fine if you simply dropped the patch's additions in PreCommit_Notify
> and ProcessCompletedNotifies, because there is already enough logic to
> decide when to call asyncQueueAdvanceTail.  In particular, the result from
> Signal[MyDB]Backends tells us whether anyone else was awakened, and
> ProcessCompletedNotifies already does asyncQueueAdvanceTail if not.
> As long as we did awaken someone, the ball's now in their court to
> make sure asyncQueueAdvanceTail happens eventually.

Ah, I think I see what you're getting at. As written,
asyncQueueReadAllNotifications() only calls asyncQueueAdvanceTail() if
*it* was a slow backend (advanceTail =
QUEUE_SLOW_BACKEND(MyBackendId)). In a situation where some databases
are regularly using NOTIFY and a few others never (but still
listening) it will lead to the situation where the tail never gets
advanced.

However, I guess you're thinking of asyncQueueReadAllNotifications()
triggering if the queue as a whole was too long. This could in
principle work but it does mean that at some point all backends
sending NOTIFY are going to start calling asyncQueueAdvanceTail()
every time, until the tail gets advanced, and if there are many idle
listening backends behind this could take a while. The slowest backend
might receive more signals while it is processing and so end up
running asyncQueueAdvanceTail() twice. The fact that signals coalesce
stops the process getting completely out of hand but it does feel a
little uncontrolled.

The whole point of this patch is to ensure that at any time only one
backend is being woken up and calling asyncQueueAdvanceTail() at a
time.

But you do point out that the return value of
SignalMyDBBackends() is used wrongly. The fact that no-one got
signalled only meant there were no other listeners on this database
which means nothing in terms of global queue cleanup. What you want to
know is if you're the only listener in the whole system and you can
test for that directly (QUEUE_FIRST_BACKEND == MyBackendId &&
QUEUE_NEXT_BACKEND(MyBackendId) == InvalidBackendId). I can adjust
this in the next version if necessary, it's fairly harmless as is as
it only triggers in the case where a database is only notifying
itself, which probably isn't that common.

I hope I have correctly understood this time.

Have a nice weekend.
-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-09-15 Thread Tomas Vondra

On Sun, Sep 15, 2019 at 10:16:30AM +0100, Dean Rasheed wrote:

On Sat, 14 Sep 2019 at 05:25, Tom Lane  wrote:


Tomas Vondra  writes:
> On Wed, Aug 14, 2019 at 05:24:26PM +1200, Thomas Munro wrote:
>> On Wed, Aug 14, 2019 at 5:06 PM Tom Lane  wrote:
>>> Oh, hmm --- yeah, that should mean it's safe.  Maybe somebody incautiously
>>> changed one of the other tests that run concurrently with "rules"?

>> Looks like stats_ext.sql could be the problem.  It creates and drops
>> priv_test_view, not in a schema.  Adding Dean, author of commit
>> d7f8d26d.

> Yeah, that seems like it might be the cause. I'll take a look at fixing
> this, probably by creating the view in a different schema.

Ping?  We're still getting intermittent failures of this ilk, eg

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet=2019-09-14%2003%3A37%3A03

With v12 release approaching, I'd like to not have failures
like this in a released branch.



Ah sorry, I missed this thread before. As author of that commit, it's
really on me to fix it, and the cause seems pretty clear-cut, so I'll
aim to get that done today.



FWIW here is a draft patch that I was going to propose - it simply moves
the table+view into a "tststats" schema. I suppose that's roughly what we
discussed earlier in this thread.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index e56c75cd33..b2fb260172 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -752,19 +752,21 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 -- the underlying table.
 --
 -- Currently this is only relevant for MCV stats.
-CREATE TABLE priv_test_tbl (
+CREATE schema tststats;
+CREATE TABLE tststats.priv_test_tbl (
 a int,
 b int
 );
-INSERT INTO priv_test_tbl
+INSERT INTO tststats.priv_test_tbl
  SELECT mod(i,5), mod(i,10) FROM generate_series(1,100) s(i);
-CREATE STATISTICS priv_test_stats (mcv) ON a, b
-  FROM priv_test_tbl;
-ANALYZE priv_test_tbl;
+CREATE STATISTICS tststats.priv_test_stats (mcv) ON a, b
+  FROM tststats.priv_test_tbl;
+ANALYZE tststats.priv_test_tbl;
 -- User with no access
 CREATE USER regress_stats_user1;
+GRANT USAGE ON SCHEMA tststats TO regress_stats_user1;
 SET SESSION AUTHORIZATION regress_stats_user1;
-SELECT * FROM priv_test_tbl; -- Permission denied
+SELECT * FROM tststats.priv_test_tbl; -- Permission denied
 ERROR:  permission denied for table priv_test_tbl
 -- Attempt to gain access using a leaky operator
 CREATE FUNCTION op_leak(int, int) RETURNS bool
@@ -772,39 +774,41 @@ CREATE FUNCTION op_leak(int, int) RETURNS bool
 LANGUAGE plpgsql;
 CREATE OPERATOR <<< (procedure = op_leak, leftarg = int, rightarg = int,
  restrict = scalarltsel);
-SELECT * FROM priv_test_tbl WHERE a <<< 0 AND b <<< 0; -- Permission denied
+SELECT * FROM tststats.priv_test_tbl WHERE a <<< 0 AND b <<< 0; -- Permission denied
 ERROR:  permission denied for table priv_test_tbl
-DELETE FROM priv_test_tbl WHERE a <<< 0 AND b <<< 0; -- Permission denied
+DELETE FROM tststats.priv_test_tbl WHERE a <<< 0 AND b <<< 0; -- Permission denied
 ERROR:  permission denied for table priv_test_tbl
 -- Grant access via a security barrier view, but hide all data
 RESET SESSION AUTHORIZATION;
-CREATE VIEW priv_test_view WITH (security_barrier=true)
-AS SELECT * FROM priv_test_tbl WHERE false;
-GRANT SELECT, DELETE ON priv_test_view TO regress_stats_user1;
+CREATE VIEW tststats.priv_test_view WITH (security_barrier=true)
+AS SELECT * FROM tststats.priv_test_tbl WHERE false;
+GRANT SELECT, DELETE ON tststats.priv_test_view TO regress_stats_user1;
 -- Should now have access via the view, but see nothing and leak nothing
 SET SESSION AUTHORIZATION regress_stats_user1;
-SELECT * FROM priv_test_view WHERE a <<< 0 AND b <<< 0; -- Should not leak
+SELECT * FROM tststats.priv_test_view WHERE a <<< 0 AND b <<< 0; -- Should not leak
  a | b 
 ---+---
 (0 rows)
 
-DELETE FROM priv_test_view WHERE a <<< 0 AND b <<< 0; -- Should not leak
+DELETE FROM tststats.priv_test_view WHERE a <<< 0 AND b <<< 0; -- Should not leak
 -- Grant table access, but hide all data with RLS
 RESET SESSION AUTHORIZATION;
-ALTER TABLE priv_test_tbl ENABLE ROW LEVEL SECURITY;
-GRANT SELECT, DELETE ON priv_test_tbl TO regress_stats_user1;
+ALTER TABLE tststats.priv_test_tbl ENABLE ROW LEVEL SECURITY;
+GRANT SELECT, DELETE ON tststats.priv_test_tbl TO regress_stats_user1;
 -- Should now have direct table access, but see nothing and leak nothing
 SET SESSION AUTHORIZATION regress_stats_user1;
-SELECT * FROM priv_test_tbl WHERE a <<< 0 AND b <<< 0; -- Should not leak
+SELECT * FROM tststats.priv_test_tbl WHERE a <<< 0 AND b <<< 0; -- Should not leak
  a | b 
 ---+---
 (0 rows)
 
-DELETE FROM priv_test_tbl WHERE a <<< 0 AND 

Re: Efficient output for integer types

2019-09-15 Thread Andrey Borodin



> On 15 Sep 2019, at 12:18, David Fetter  wrote:
> 
> Please find attached a couple of patches intended to $subject.
> 
> This patch set cut the time to copy ten million rows of randomly sized
> int8s (10 of them) by about a third, so at least for that case, it's
> pretty decent.

Hi! Looks cool.

Just curious: for any fixed base and square here,

+   while(uvalue >= base)
{
+   const int i = (uvalue % square) * 2;
+   uvalue /= square;
+   vallen += 2;
+   memcpy(convert + sizeof(convert) - vallen, digits + i, 
2);
+   }

will the compiler have a chance to avoid the idiv instruction?
Maybe a few specialized functions would work better than the generic algorithm?
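
For what it's worth, here's a standalone sketch of what such a
specialization might look like (my own toy code, not taken from the
patch; the names are made up). With the divisor fixed at compile time,
gcc/clang typically lower the % 100 and / 100 below into a
multiply-and-shift, so no idiv is emitted:

#include <stdio.h>
#include <string.h>

/* 100 two-character pairs: "00" "01" ... "99" */
static const char two_digits[] =
	"0001020304050607080910111213141516171819"
	"2021222324252627282930313233343536373839"
	"4041424344454647484950515253545556575859"
	"6061626364656667686970717273747576777879"
	"8081828384858687888990919293949596979899";

static int
u64_to_decimal(unsigned long long v, char *out)
{
	char		buf[24];
	int			len = 0;

	while (v >= 100)
	{
		/* divisor is a compile-time constant, so no runtime division */
		unsigned int rem = (unsigned int) (v % 100);

		v /= 100;
		len += 2;
		memcpy(buf + sizeof(buf) - len, two_digits + rem * 2, 2);
	}
	if (v >= 10)
	{
		len += 2;
		memcpy(buf + sizeof(buf) - len, two_digits + v * 2, 2);
	}
	else
	{
		len += 1;
		buf[sizeof(buf) - len] = (char) ('0' + v);
	}
	memcpy(out, buf + sizeof(buf) - len, len);
	out[len] = '\0';
	return len;
}

int
main(void)
{
	char		s[24];

	u64_to_decimal(1234567890123ULL, s);
	puts(s);					/* prints 1234567890123 */
	return 0;
}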

Best regards, Andrey Borodin.



Efficient output for integer types

2019-09-15 Thread David Fetter
Folks,

Please find attached a couple of patches intended to $subject.

This patch set cut the time to copy ten million rows of randomly sized
int8s (10 of them) by about a third, so at least for that case, it's
pretty decent.

Thanks to Andrew Gierth for lots of patient help.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 6e8136ece5b01ca9cd16bdb974c4d54e939c92cf Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Tue, 10 Sep 2019 02:06:31 -0700
Subject: [PATCH v1 1/2] Output digits two at a time in sprintf.c
To: hackers
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2.21.0"

This is a multi-part message in MIME format.
--2.21.0
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit


diff --git a/src/port/snprintf.c b/src/port/snprintf.c
index 8fd997553e..fd9d384144 100644
--- a/src/port/snprintf.c
+++ b/src/port/snprintf.c
@@ -1014,9 +1014,60 @@ fmtint(long long value, char type, int forcesign, int leftjust,
 	   PrintfTarget *target)
 {
 	unsigned long long base;
+	unsigned long long square;
 	unsigned long long uvalue;
 	int			dosign;
-	const char *cvt = "0123456789abcdef";
+	/* Maps for octal, decimal, and two flavors of hexadecimal */
+	const char *digits;
+	const char	decimal_digits[200] =
+	/* 10^2 * 2 decimal digits */
+	"0001020304050607080910111213141516171819"
+	"2021222324252627282930313233343536373839"
+	"4041424344454647484950515253545556575859"
+	"6061626364656667686970717273747576777879"
+	"8081828384858687888990919293949596979899";
+	const char	octal_digits[128] =
+	/* 8^2 * 2 octal digits */
+	"00010203040506071011121314151617"
+	"20212223242526273031323334353637"
+	"40414243444546475051525354555657"
+	"60616263646566677071727374757677";
+	/* 16^2 * 2 hex digits */
+	const char	hex_lower_digits[512] =
+	"000102030405060708090a0b0c0d0e0f"
+	"101112131415161718191a1b1c1d1e1f"
+	"202122232425262728292a2b2c2d2e2f"
+	"303132333435363738393a3b3c3d3e3f"
+	"404142434445464748494a4b4c4d4e4f"
+	"505152535455565758595a5b5c5d5e5f"
+	"606162636465666768696a6b6c6d6e6f"
+	"707172737475767778797a7b7c7d7e7f"
+	"808182838485868788898a8b8c8d8e8f"
+	"909192939495969798999a9b9c9d9e9f"
+	"a0a1a2a3a4a5a6a7a8a9aaabacadaeaf"
+	"b0b1b2b3b4b5b6b7b8b9babbbcbdbebf"
+	"c0c1c2c3c4c5c6c7c8c9cacbcccdcecf"
+	"d0d1d2d3d4d5d6d7d8d9dadbdcdddedf"
+	"e0e1e2e3e4e5e6e7e8e9eaebecedeeef"
+	"f0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
+	const char	hex_upper_digits[512] =
+	/* 16^2 * 2 HEX DIGITS */
+	"000102030405060708090A0B0C0D0E0F"
+	"101112131415161718191A1B1C1D1E1F"
+	"202122232425262728292A2B2C2D2E2F"
+	"303132333435363738393A3B3C3D3E3F"
+	"404142434445464748494A4B4C4D4E4F"
+	"505152535455565758595A5B5C5D5E5F"
+	"606162636465666768696A6B6C6D6E6F"
+	"707172737475767778797A7B7C7D7E7F"
+	"808182838485868788898A8B8C8D8E8F"
+	"909192939495969798999A9B9C9D9E9F"
+	"A0A1A2A3A4A5A6A7A8A9AAABACADAEAF"
+	"B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF"
+	"C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF"
+	"D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF"
+	"E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF"
+	"F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF";
 	int			signvalue = 0;
 	char		convert[64];
 	int			vallen = 0;
@@ -1027,23 +1078,27 @@ fmtint(long long value, char type, int forcesign, int leftjust,
 	{
 		case 'd':
 		case 'i':
+			digits = decimal_digits;
 			base = 10;
 			dosign = 1;
 			break;
 		case 'o':
+			digits = octal_digits;
 			base = 8;
 			dosign = 0;
 			break;
 		case 'u':
+			digits = decimal_digits;
 			base = 10;
 			dosign = 0;
 			break;
 		case 'x':
+			digits = hex_lower_digits;
 			base = 16;
 			dosign = 0;
 			break;
 		case 'X':
-			cvt = "0123456789ABCDEF";
+			digits = hex_upper_digits;
 			base = 16;
 			dosign = 0;
 			break;
@@ -1051,6 +1106,8 @@ fmtint(long long value, char type, int forcesign, int leftjust,
 			return;/* keep compiler quiet */
 	}
 
+	square = base * base;
+
 	/* disable MSVC warning about applying unary minus to an unsigned value */
 #if _MSC_VER
 #pragma warning(push)
@@ -1073,12 +1130,20 @@ fmtint(long long value, char type, int forcesign, int leftjust,
 		vallen = 0;
 	else
 	{
-		/* make integer string */
-		do
+		/* make integer string, two digits at a time */
+		while(uvalue >= base)
 		{
-			convert[sizeof(convert) - (++vallen)] = cvt[uvalue % base];
-			uvalue = uvalue / base;
-		} while (uvalue);
+			const int i = (uvalue % square) * 2;
+			uvalue /= square;
+			vallen += 2;
+			memcpy(convert + sizeof(convert) - vallen, digits + i, 2);
+		}
+		/* Account for single digit */
+		if (uvalue > 0 || vallen == 0)
+		{
+			vallen++;
+			memcpy(convert + sizeof(convert) - vallen, digits + uvalue * 2 + 1, 1);
+		}
 	}
 
 	zeropad = Max(0, precision - vallen);

--2.21.0--


>From