Re: setLastTid() and currtid()

2019-04-11 Thread Michael Paquier
On Thu, Apr 11, 2019 at 02:06:13PM -0400, Tom Lane wrote:
> Yeah, if we could simplify the tableam API, that would be sufficient
> reason to remove the stuff in v12, IMO.  But if it is outside of that
> API, I'd counsel waiting till v13.

Yes, I agree that simplifying the table AM interface would be a good
enough reason to delete this code for v12.  If not, v13 sounds better at
this stage.
--
Michael




Re: Adding Unix domain socket path and port to pg_stat_get_wal_senders()

2019-04-11 Thread Michael Paquier
On Thu, Apr 11, 2019 at 10:19:01PM -0300, Euler Taveira wrote:
> application_name. I'm not sure if it solves your complaint but Peter
> committed a patch [1] for v12 that distinguishes replicas on the same
> host via cluster_name.

Let's be honest, this is just a workaround.

> Unix sockets have different semantics from TCP/UDP. We can't add socket
> information into client_addr unless we are prepared to break this view
> (client_addr has type inet and it would be necessary to change it to
> text). It could break a lot of applications.

client_addr does not seem the right place to store this information,
and it has been documented for years that NULL is used when connecting
over a Unix socket.  But I think that we could change *client_hostname*
so that the path name is reported instead of NULL when connecting
through a Unix domain socket, and there is no need to switch the field
type for that.

I agree with Ishii-san that it would be nice to close the gap here.
For pg_stat_wal_receiver, please note that sender_host correctly
reports the domain socket path when connecting locally.
--
Michael




Re: How to include the header files effectively

2019-04-11 Thread Michael Paquier
On Thu, Apr 11, 2019 at 10:22:56PM -0400, Alvaro Herrera wrote:
> What are you trying to do?  Your .c file must include "postgres.h"
> before any other header file.  There should be no other dependencies.

The usual rule when it comes to developing extensions or writing a patch
is to include headers in the following order:
1) postgres.h for backend code and postgres_fe.h for frontend code (use
#ifdef FRONTEND if a file is used in both contexts, see src/common/*.c).
2) System-related headers, such as the standard C library headers.
3) Other PostgreSQL internal headers; in any patch posted to the lists,
keeping these in alphabetical order is warmly welcome.
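
For instance, a backend .c file following that order could start like
this (the header names below are only examples):

#include "postgres.h"          /* always first for backend code */

#include <limits.h>            /* system headers next */
#include <unistd.h>

#include "storage/lock.h"      /* then PostgreSQL headers, alphabetically */
#include "utils/rel.h"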
--
Michael




Re: Attempt to consolidate reading of XLOG page

2019-04-11 Thread Michael Paquier
On Fri, Apr 12, 2019 at 12:27:11PM +0900, Kyotaro HORIGUCHI wrote:
> This patch changes XLogRead to allow using something other than
> BasicOpenFile to open a segment, and uses XLogReaderState.private
> to hold a new struct XLogReadPos for the segment reader. The new
> struct duplicates much of XLogReaderState, and I'm not sure why
> XLogReadPos is needed.
> Any other opinions, or thoughts?

The focus is on the stability of v12 for the next couple of months, so
please make sure to register it to the next CF if you want feedback.

Here are some basic thoughts after a very quick lookup.

+/*
+ * Position in XLOG file while reading it.
+ */
+typedef struct XLogReadPos
+{
+   int segFile;/* segment file descriptor */
+   XLogSegNo   segNo;  /* segment number */
+   uint32 segOff;  /* offset in the segment */
+   TimeLineID tli; /* timeline ID of the currently open file */
+
+   char*dir;   /* directory (only needed by
frontends) */
+} XLogReadPos;
Not sure there is any point in splitting that out from the XLOG reader
state.

+static void fatal_error(const char *fmt,...) pg_attribute_printf(1, 2);
+
+static void
+fatal_error(const char *fmt,...)
This adds more confusion to something which already has enough warts
on HEAD, namely declaring local equivalents of elog() for code in
src/common/.

+/*
+ * This is a front-end counterpart of XLogFileNameP.
+ */
+static char *
+XLogFileNameFE(TimeLineID tli, XLogSegNo segno)
+{
+   char   *result = palloc(MAXFNAMELEN);
+
+   XLogFileName(result, tli, segno, WalSegSz);
+   return result;
+}
We could use a pointer to an allocated area, or even better just a
static variable, as this is only used to store the segment name
temporarily for error messages, in a routine that could perhaps live in
xlogreader.c.
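
For example, something like this (just a sketch of the static-variable
idea, not a worked-out patch):

static const char *
xlog_fname(TimeLineID tli, XLogSegNo segno)
{
	static char fname[MAXFNAMELEN];

	XLogFileName(fname, tli, segno, WalSegSz);
	return fname;
}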
--
Michael




Re: Experimenting with hash join prefetch

2019-04-11 Thread Thomas Munro
On Fri, Apr 12, 2019 at 1:35 AM Robert Haas  wrote:
> It would be interesting to see how this does with moderately-long text
> keys, say 32 or 64 byte strings, and with actually-long text keys, say
> several kB, and then with gigantic text keys, say several MB.  At some
> point the key probably gets large enough that computing the hash value
> for the next key evicts the current key from the relevant CPU cache,
> and if I had to guess, at that point prefetching will become a loser.
> But I don't know where that point is.  If it turns out for example
> that this technique is a winner for pass-by-value datatypes and a
> loser for pass-by-reference datatypes, or that it's a winner always,
> or some sort of simple rule like that, awesome!  But if it turns out
> that there's no simple rule that we can use to know whether it wins or
> loses, then that might make things a little tricky.

Ok, I ran the attached script on an E5-2695 v3 @ 2.30GHz with 32K of
L1, 256K of L2, 30M of L3.  I used shared_buffers=16GB and prewarmed
all tables.

The short version: very mixed results.  For small hash tables it
clearly hurts, for large ones it looks promising.  Much more digging
required to draw any conclusions.

It'd be nice to understand exactly how the hidden parameters work.
Hash bucket array size vs hardware cache sizes must be a factor.
Another is surely timing; there is an optimal time to begin
prefetching (too soon and your line might be replaced by the time you
need it, too late and you won't avoid stalling).  Here, I am doing a
ham-fisted prefetch of t+1's bucket header and not yet trying to
prefetch the tuple itself, but the Intel/CMU paper[1] of course shows
a deeper pipeline and talks about calibrating the prefetch distance
for the hardware.  As far as I can see, that's quite tricky in
general, and complicated by our executor design: the number of cycles
before the node runs again depends on the higher-level plan!
Previously I have heard icache-based arguments for why we should be
able to emit more than one tuple at a time, and I suppose this is a
new argument for that: it makes it actually plausible to calibrate the
prefetch distance for "software-pipelined prefetch" techniques.
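
To make the t+1 idea concrete, here is a toy sketch (nothing like the
actual executor code; all names made up) of a probe loop that prefetches
the next key's bucket header while handling the current one:

#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the hash join bucket header array. */
typedef struct Bucket
{
	uint64_t	header;
} Bucket;

/*
 * Probe hashes[0..n-1] against buckets[], prefetching the bucket header
 * that the *next* probe will touch while working on the current one.
 */
static uint64_t
probe_with_prefetch(const Bucket *buckets, uint64_t bucket_mask,
					const uint64_t *hashes, size_t n)
{
	uint64_t	acc = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (i + 1 < n)
			__builtin_prefetch(&buckets[hashes[i + 1] & bucket_mask]);	/* GCC/Clang builtin */
		acc += buckets[hashes[i] & bucket_mask].header;
	}
	return acc;
}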

Anyway, here are the results for what they are worth:

Table r has 50,000,000 rows.  Table s was tested with 3 different
sizes.  Both tables have the same layout: key (various types and
sizes) + 0, 2 or 4 extra columns.  I ran each test 3 times, and
compared the worst (w) and median (m) times and computed the speed-up
provided by the patch.  The absolute times (not shown) were all in the
range 9-40 seconds depending on the parameters.

s=1,000  s=100,000s=10,000,000
  
int w=-10.86%, m=-11.03% w=-8.56%, m=-9.17%   w=+16.79%, m=+19.63%
 + 2 cols   w=-14.19%, m=-16.97% w=-6.59%, m=-7.89%   w=+15.88%, m=+16.81%
 + 4 cols   w=-17.14%, m=-14.34% w=-10.01%, m=-8.85%  w=+37.38%, m=+24.01%
text(8) w=-12.42%, m=-12.32% w=-4.04%, m=-1.96%   w=+13.52%, m=+16.17%
 + 2 cols   w=-13.50%, m=-14.98% w=-4.29%, m=-3.40%   w=+15.98%, m=+19.45%
 + 4 cols   w=-11.53%, m=-9.61%  w=-3.70%, m=-5.91%   w=+46.66%, m=+51.06%
text(16)w=-11.46%, m=-9.71%  w=-4.85%, m=-3.86%   w=+16.47%, m=+22.10%
 + 2 cols   w=-13.97%, m=-14.55% w=-7.08%, m=-6.07%   w=+20.50%, m=+21.77%
 + 4 cols   w=-9.72%, m=-11.31%  w=-1.03%, m=-2.55%   w=+8.25%, m=+12.21%
text(32)w=-14.86%, m=-15.48% w=-9.36%, m=-9.27%   w=+19.86%, m=+15.34%
 + 2 cols   w=-12.35%, m=-11.71% w=-10.61%, m=-9.87%  w=+98.29%, m=+97.40%
 + 4 cols   w=-10.71%, m=-10.40% w=-2.42%, m=-1.25%   w=-8.34%, m=-10.44%
text(64)w=-9.45%, m=-11.36%  w=-13.94%, m=-11.42% w=+9.44%, m=+9.57%
 + 2 cols   w=-12.47%, m=-13.17% w=-9.60%, m=-6.61%   w=-4.69%, m=+10.06%
 + 4 cols   w=-9.47%, m=-12.20%  w=-5.60%, m=-3.55%   w=-15.91%, m=-23.29%

I'd expect the right-hand column to get a further speed-up (or
slow-down) of around 1/5 the given numbers, if we also prefetch during
the build phase (= s/r).  Maybe a 2-stage pipeline would help, though
I'm starting to see the complexity of organising a perfectly primed
memory pipeline ... ie what this topic is really all about.

Well, that's enough learning-basic-facts-about-computers by trying to
whack PostgreSQL with database literature for now.  Looks like I'll
have to work quite a bit harder to make something useful out of this.
I think I see where some of the problems lie.  I think being able to
store multiple tuples in a slot (via TTS-implementation-specific
means, as we see with the heap scan in my patch, and as I think hash
join would want to do to emit multiple tuples before relinquishing
control) and then look at them via ExecScrollSlot() and perhaps also
consume them via ExecNextSlot() are promising ideas, but I've only
scratched the surface.

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf


--
Thomas Munro

Re: Minor fix in reloptions.c comments

2019-04-11 Thread Michael Paquier
On Fri, Apr 12, 2019 at 02:41:37AM +, Jamison, Kirk wrote:
> I found a minor grammar mistake while reading the reloptions.c code comments.
> Attached is the fix.
> I just changed "affect" to "effect", for both n_distinct and vacuum_truncate.
>   - * values has no affect until the ...
>   + * values has no effect until the ...

A lot of those parameter updates affect processing and still they have
many side effects, as per those paragraphs.

Fixed, thanks!
--
Michael




Re: Attempt to consolidate reading of XLOG page

2019-04-11 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 11 Apr 2019 18:05:42 +0200, Antonin Houska  wrote in 
<14984.1554998...@spoje.net>
> While working on the instance encryption I found it annoying to apply
> decryption of XLOG page to three different functions. Attached is a patch that
> tries to merge them all into one function, XLogRead(). The existing
> implementations differ in the way a new segment is opened. So I added a pointer
> to callback function as a new argument. This callback handles the specific
> ways to determine segment file name and to open the file.
> 
> I can split the patch into multiple diffs to make detailed review easier, but
> first I'd like to hear if anything is seriously wrong about this
> design. Thanks.

This patch changes XLogRead to allow using something other than
BasicOpenFile to open a segment, and uses XLogReaderState.private
to hold a new struct XLogReadPos for the segment reader. The new
struct duplicates much of XLogReaderState, and I'm not sure why
XLogReadPos is needed.

Anyway, in the first place, having two distinct but highly related
callbacks makes things too complex. Heikki said that the log reader
stuff would be better off not using callbacks, and I agree with that.
I did that once on my own, but the code is no longer applicable. It
seems to be time to do it again.

https://www.postgresql.org/message-id/47215279-228d-f30d-35d1-16af695e5...@iki.fi

That would look something like the following. The refactoring separates
the log reader and the page reader.


for(;;)
{
rc = XLogReadRecord(reader, startptr, errormsg);

if (rc == XLREAD_SUCCESS)
{
/* great, got record */
}
    if (rc == XLREAD_INVALID_PAGE || rc == XLREAD_INVALID_RECORD)
{
elog(ERROR, "invalid record");
}
if (rc == XLREAD_NEED_DATA)
{
/*
 * Read a page from disk, and place it into reader->readBuf
 */
XLogPageRead(reader->readPagePtr, /* page to read */
 reader->reqLen   /* # of bytes to read */ );
/*
 * Now that we have read the data that XLogReadRecord()
 * requested, call it again.
 */
continue;
}
}

DecodingContextFindStartpoint(ctx)
  do
  {
 read_local_xlog_page();
 rc = XLogReadRecord(reader);
  } while (rc == XLREAD_NEED_DATA);

I'm going to do that again.


Any other opinions, or thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: REINDEX CONCURRENTLY 2.0

2019-04-11 Thread Michael Paquier
On Thu, Apr 11, 2019 at 09:49:47AM -0400, Alvaro Herrera wrote:
> Hmm, I suppose that makes sense for REINDEX TABLE or anything that
> reindexes more than one index, but if you do REINDEX INDEX surely it
> is reasonable to allow the case?

Yes, we could revisit the REINDEX INDEX portion of the decision; after
sleeping on it, my previous argument makes limited sense for a REINDEX
that processes only one index.  One could note that the header comment
of ReindexRelationConcurrently() kind of implies the same conclusion as
you do, but that's perhaps just an accident.

So...  I am coming up with the patch attached.  I have introduced some
tests using a trick with CIC to have an invalid index to work on.
--
Michael
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index d9d95b20e3..929a326ae7 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -555,10 +555,8 @@ Indexes:
 
 The recommended recovery
 method in such cases is to drop the index and try again to perform
-CREATE INDEX CONCURRENTLY.  (Another possibility is to rebuild
-the index with REINDEX.  However, since REINDEX
-does not support concurrent builds, this option is unlikely to seem
-attractive.)
+CREATE INDEX CONCURRENTLY.  (Another possibility is
+to rebuild the index with REINDEX INDEX CONCURRENTLY.

 

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index e05a76c6d8..303436c89d 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -65,12 +65,11 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR
 
 
  
-  An index build with the CONCURRENTLY option failed, leaving
-  an invalid index. Such indexes are useless but it can be
-  convenient to use REINDEX to rebuild them. Note that
-  REINDEX will not perform a concurrent build on an invalid index. To build the
-  index without interfering with production you should drop the index and
-  reissue the CREATE INDEX CONCURRENTLY command.
+  If an index build fails with the CONCURRENTLY option,
+  this index is left as invalid. Such indexes are useless
+  but it can be convenient to use REINDEX to rebuild
+  them. Note that only REINDEX INDEX is able
+  to perform a concurrent build on an invalid index.
  
 
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 46f32c21f9..a1c91b5fb8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2776,11 +2776,6 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 			}
 		case RELKIND_INDEX:
 			{
-/*
- * For an index simply add its Oid to list. Invalid indexes
- * cannot be included in list.
- */
-Relation	indexRelation = index_open(relationOid, ShareUpdateExclusiveLock);
 Oid			heapId = IndexGetRelation(relationOid, false);
 
 /* A shared relation cannot be reindexed concurrently */
@@ -2801,25 +2796,13 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 /* Track the heap relation of this index for session locks */
 heapRelationIds = list_make1_oid(heapId);
 
+/*
+ * Save the list of relation OIDs in private context.  Note
+ * that invalid indexes are allowed here.
+ */
+indexIds = lappend_oid(indexIds, relationOid);
+
 MemoryContextSwitchTo(oldcontext);
-
-if (!indexRelation->rd_index->indisvalid)
-	ereport(WARNING,
-			(errcode(ERRCODE_INDEX_CORRUPTED),
-			 errmsg("cannot reindex concurrently invalid index \"%s.%s\", skipping",
-	get_namespace_name(get_rel_namespace(relationOid)),
-	get_rel_name(relationOid;
-else
-{
-	/* Save the list of relation OIDs in private context */
-	oldcontext = MemoryContextSwitchTo(private_context);
-
-	indexIds = lappend_oid(indexIds, relationOid);
-
-	MemoryContextSwitchTo(oldcontext);
-}
-
-index_close(indexRelation, NoLock);
 break;
 			}
 		case RELKIND_PARTITIONED_TABLE:
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index f9b4768aee..8bfcf57d5a 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2033,6 +2033,38 @@ Referenced by:
 
 DROP MATERIALIZED VIEW concur_reindex_matview;
 DROP TABLE concur_reindex_tab, concur_reindex_tab2, concur_reindex_tab3;
+-- Check handling of invalid indexes
+CREATE TABLE concur_reindex_tab4 (c1 int);
+INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
+-- This trick creates an invalid index.
+CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+ERROR:  could not create unique index "concur_reindex_ind5"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- And this makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- The 

Minor fix in reloptions.c comments

2019-04-11 Thread Jamison, Kirk
Hi,

I found a minor grammar mistake while reading the reloptions.c code comments.
Attached is the fix.
I just changed "affect" to "effect", for both n_distinct and vacuum_truncate.
  - * values has no affect until the ...
  + * values has no effect until the ...

Regards,
Kirk Jamison


reloptions-comment-fix.patch
Description: reloptions-comment-fix.patch


Re: Should we add GUCs to allow partition pruning to be disabled?

2019-04-11 Thread Justin Pryzby
On Fri, Apr 12, 2019 at 02:01:39PM +1200, David Rowley wrote:
> On Thu, 11 Apr 2019 at 17:40, Justin Pryzby  wrote:
> > I tweaked this patch some more (sorry):
> >  - remove "especially";
> 
> I think that likely needs to be kept for the PG11 version. I was
> hoping it would stop a casual tester from testing a SELECT, seeing that
> it's not so bad, only to find later that UPDATE/DELETE OOMs.

With "especially", it reads as if "excessive memory usage" might happen for
SELECT, and it'll be additionally worse for UPDATE/DELETE.

Without "especially", it makes "excessive RAM use" apply only to UPDATE/DELETE,
which I think is what's intended.

|Larger partition hierarchies may incur long planning time, and [especially] in
|the case of UPDATE and DELETE, excessive
|memory usage.

I think as long as UPDATE/DELETE are specifically mentioned, that would handle
your concern.  If I were to suggest an alternative:

|Larger partition hierarchies may incur long planning time; and, in
|the case of UPDATE and DELETE, may also
|incur excessive memory usage.

..after which I'll stop wrestling with words.

Justin




Re: How to include the header files effectively

2019-04-11 Thread Alvaro Herrera
On 2019-Apr-12, Andy Fan wrote:

> for example,  when I want the LOCKTAG in .c file,   which is defined in
> "storage/lock.h".  then I wrote the code like this:
> 
> #include "storage/lock.h"
> ...
> 
> /../../../src/include/storage/lockdefs.h:50:2: error: unknown type name
>   'TransactionId'
> TransactionId xid;  /* xid of holder of 
> AccessExclusiveLock */

What are you trying to do?  Your .c file must include "postgres.h"
before any other header file.  There should be no other dependencies.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: How to include the header files effectively

2019-04-11 Thread Tom Lane
Andy Fan  writes:
> for example,  when I want the LOCKTAG in .c file,   which is defined in
> "storage/lock.h".  then I wrote the code like this:

> #include "storage/lock.h"
> ...
> LOCKTAG tag;

> compile and get errors.

> In file included from
> .../src/include/storage/lock.h:21:
> /../../../src/include/storage/lockdefs.h:50:2: error: unknown type name
>   'TransactionId'
> TransactionId xid;  /* xid of holder of
> AccessExclusiveLock */

The reason that's failing is that you didn't include postgres.h first.

The general expectation --- and we do mechanically verify this,
periodically --- is that any Postgres header should have enough #include's
that you can include it without further work, so long as you included
postgres.h (or postgres_fe.h, or c.h, depending on context) beforehand.
One of those three headers must be the first inclusion in every Postgres
.c file.  There are portability reasons behind that rule, which you
don't really want to know about ;-) ... just do it like that.

regards, tom lane




How to include the header files effectively

2019-04-11 Thread Andy Fan
I find the dependencies among header files in PG complex.  At the same
time, I see that the existing code can still include header files very
cleanly/alphabetically, so I probably missed some knowledge here.

for example, when I want the LOCKTAG in a .c file, which is defined in
"storage/lock.h", I wrote the code like this:

#include "storage/lock.h"
...

LOCKTAG tag;


compile and get errors.

In file included from
.../src/include/storage/lock.h:21:
/../../../src/include/storage/lockdefs.h:50:2: error: unknown type name
  'TransactionId'
TransactionId xid;  /* xid of holder of
AccessExclusiveLock */

so I HAVE TO
1.  include the header file which contains TransactionId
2.  add it before lock.h.

normally I think we could add the dependency to lock.h directly to
resolve this issue.

so how can I include header files effectively?

Thanks


Re: Should we add GUCs to allow partition pruning to be disabled?

2019-04-11 Thread David Rowley
On Thu, 11 Apr 2019 at 17:40, Justin Pryzby  wrote:
> I tweaked this patch some more (sorry):
>  - remove "currently" since that's not expected to be changed (right?);

Seems like a good idea.  I think the way we exclude inheritance child
relations will never scale well. Other improvements that we'll see
will most likely be as a consequence of speeding up declarative
partitioning. For example the planner improvements in PG12 just did
that for UPDATE/DELETE.

>  - remove "especially";

I think that likely needs to be kept for the PG11 version. I was
hoping it would stop a casual tester from testing a SELECT, seeing that
it's not so bad, only to find later that UPDATE/DELETE OOMs.

>  - refer to "partition hierarchies" not "partitioning hierarchies";

fine

>  - rewrite bit about "When partition pruning is not possible"

fine.

> Also, I noticed awhile ago while grepping for "probably be fixed in future
> releases" that some items under ddl-inherit-caveats are actually possible for
> relkind=p partitions in v11.  I assume those will never be implemented for
> inheritence partitioning, so I propose another update to docs (if preferred,
> I'll bring up on a new thread).

Not sure about that. It may be very simple to implement if we one day
get global indexes. It may just be a matter of pointing all the tables
at the same index and letting the wonders of global indexes handle all
the hard stuff. I'm not that excited about removing that. I'd be
equally excited about adding the text if it wasn't already there and
you were proposing to add it.

>  - unique constraints on parent table;
>  - FK constraints on parent table;
>
> Note that FK constraints *referencing* a partitiond table are possible in v12
> but not in v11.  So if there's any finer-grained update to documentation of 
> the
> individual limitations, it'd need to be tweaked for back branches (v10 and 
> 11).

Don't we just need to remove or update:

 
  
   While primary keys are supported on partitioned tables, foreign
   keys referencing partitioned tables are not supported.  (Foreign key
   references from a partitioned table to some other table are supported.)
  
 

I didn't follow this work, but on testing, I see the foreign key does
not CASCADE when doing DETACH PARTITION; it errors instead. Perhaps
that's worth a mention here.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Switch TAP tests of pg_rewind to use role with only function permissions

2019-04-11 Thread Michael Paquier
On Thu, Apr 11, 2019 at 09:40:36AM +0200, Magnus Hagander wrote:
> If we haven't already (and knowing you it wouldn't surprise me if you had
> :P), we should probably look through the rest of the tests to see if we
> have other similar cases. In general I think any case where "can be run by
> non-superuser with specific permissions or a superuser" is the case, we
> should be testing it with the "non-superuser with permissions". Because,
> well, superusers will never have permission problems (and they will both
> test the functionality).

I am ready to bet that we have other problems lying around.

> I do think it's perfectly reasonable to have that hardcoded in the
> RewindTest.pm module. It doesn't have to be pluggable.

Thanks, I have committed the patch to do so (d4e2a84), after rewording
the comments a bit.  And particular thanks to Peter for mentioning that
having more tests with such properties would be nice.
--
Michael




Re: Issue in ExecCleanupTupleRouting()

2019-04-11 Thread Amit Langote
On 2019/04/11 22:28, David Rowley wrote:
> On Fri, 12 Apr 2019 at 01:06, Etsuro Fujita  
> wrote:
>> +   /*
>> +* Check if this result rel is one belonging to the node's subplans,
>> +* if so, let ExecEndPlan() clean it up.
>> +*/
>> +   if (htab)
>> +   {
>> +   Oid partoid;
>> +   boolfound;
>> +
>> +   partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
>> +
>> +   (void) hash_search(htab, &partoid, HASH_FIND, &found);
>> +   if (found)
>> +   continue;
>> +   }
>>
>> /* Allow any FDWs to shut down if they've been exercised */
>> -   if (resultRelInfo->ri_PartitionReadyForRouting &&
>> -   resultRelInfo->ri_FdwRoutine != NULL &&
>> +   if (resultRelInfo->ri_FdwRoutine != NULL &&
>> resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
>>
>> resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
>>resultRelInfo);
>>
>> This skips subplan resultrels before calling EndForeignInsert() if they
>> are foreign tables, which I think causes an issue: the FDWs would fail
>> to release resources for their foreign insert operations, because
>> ExecEndPlan() and ExecEndModifyTable() don't do anything to allow them
>> to do that.  So I think we should skip subplan resultrels after
>> EndForeignInsert().  Attached is a small patch for that.
> 
> Oops. I had for some reason been under the impression that it was
> nodeModifyTable.c, or whatever the calling code happened to be, that
> handles these ones, but this is not the case, as we call
> ExecInitRoutingInfo() from ExecFindPartition(), which makes the call to
> BeginForeignInsert. If that part is handled by the tuple routing code,
> then the subsequent cleanup should be too, in which case your patch
> looks fine.

That sounds right.

Thanks,
Amit





Re: Adding Unix domain socket path and port to pg_stat_get_wal_senders()

2019-04-11 Thread Euler Taveira
On Thu, Apr 11, 2019 at 21:16, Tatsuo Ishii  wrote:
>
> Currently pg_stat_replication view does not tell useful information
> regarding client connections if UNIX domain sockets are used for
> communication between sender and receiver. So it is not possible to
> tell which row corresponds to which standby server.
>
application_name. I'm not sure if it solves your complaint but Peter
committed a patch [1] for v12 that distinguishes replicas on the same
host via cluster_name.

> test=# select client_addr, client_hostname, client_port,sync_state 
> client_port from pg_stat_replication;
>  client_addr | client_hostname | client_port | client_port
> -+-+-+-
>  | |  -1 | async
>  | |  -1 | async
> (2 rows)
>
> This is because pg_stat_replication is created on top of
> pg_stat_get_activity(), which always returns NULL, NULL, -1 for
> client_addr, client_hostname and client_port.
>
Unix sockets have different semantics from TCP/UDP. We can't add socket
information into client_addr unless we are prepared to break this view
(client_addr has type inet and it would be necessary to change it to
text). It could break a lot of applications.


[1] 
https://www.postgresql.org/message-id/flat/1257eaee-4874-e791-e83a-46720c72c...@2ndquadrant.com


-- 
   Euler Taveira                Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: finding changed blocks using WAL scanning

2019-04-11 Thread Kyotaro HORIGUCHI
At Thu, 11 Apr 2019 10:00:35 -0700, Ashwin Agrawal  wrote 
in 
> On Thu, Apr 11, 2019 at 6:27 AM Robert Haas  wrote:
> 
> > On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
> >  wrote:
> > > I had in mind that you could have different overlapping incremental
> > > backup jobs in existence at the same time.  Maybe a daily one to a
> > > nearby disk and a weekly one to a faraway cloud.  Each one of these
> > > would need a separate replication slot, so that the information that is
> > > required for *that* incremental backup series is preserved between runs.
> > >  So just one reserved replication slot that feeds the block summaries
> > > wouldn't work.  Perhaps what would work is a flag on the replication
> > > slot itself "keep block summaries for this slot".  Then when all the
> > > slots with the block summary flag are past an LSN, you can clean up the
> > > summaries before that LSN.
> >
> > I don't think that quite works.  There are two different LSNs.  One is
> > the LSN of the oldest WAL archive that we need to keep around so that
> > it can be summarized, and the other is the LSN of the oldest summary
> > we need to keep around so it can be used for incremental backup
> > purposes.  You can't keep both of those LSNs in the same slot.
> > Furthermore, the LSN stored in the slot is defined as the amount of
> > WAL we need to keep, not the amount of something else (summaries) that
> > we need to keep.  Reusing that same field to mean something different
> > sounds inadvisable.
> >
> > In other words, I think there are two problems which we need to
> > clearly separate: one is retaining WAL so we can generate summaries,
> > and the other is retaining summaries so we can generate incremental
> > backups.  Even if we solve the second problem by using some kind of
> > replication slot, we still need to solve the first problem somehow.
> 
> Just a thought for the first problem (it may not be simpler): can the
> replication slot be enhanced to define X amount of WAL to retain, and
> after reaching such a limit, collect the summary and let the WAL be
> deleted?

I think Peter is saying that a slot for block summaries doesn't
keep the WAL segments themselves, but instead keeps, say, segmented
block summaries.  n block-summary slots maintain n block summaries,
and the newest block summary is "active", in other words,
continuously updated by the WAL records passing by.  When a backup
tool requests the block summary for, say, the oldest slot, the
active summary is closed and a new summary is opened from the LSN at
that time, which becomes the new LSN of the slot.  Then the
concatenated block summary is sent.  Finally the oldest summary is
removed.
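
A rough structural sketch of what such a slot could carry (purely
hypothetical, nothing like this exists yet):

#include "postgres.h"
#include "access/xlogdefs.h"
#include "nodes/pg_list.h"

/* Hypothetical bookkeeping for one block-summary consumer. */
typedef struct BlockSummarySlot
{
	XLogRecPtr	oldest_summary_lsn;	/* start of the oldest kept summary */
	XLogRecPtr	active_start_lsn;	/* start of the currently-updated summary */
	List	   *closed_summaries;	/* older, immutable summaries, oldest first */
} BlockSummarySlot;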

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Adding Unix domain socket path and port to pg_stat_get_wal_senders()

2019-04-11 Thread Tatsuo Ishii
Currently pg_stat_replication view does not tell useful information
regarding client connections if UNIX domain sockets are used for
communication between sender and receiver. So it is not possible to
tell which row corresponds to which standby server.

test=# select client_addr, client_hostname, client_port,sync_state client_port 
from pg_stat_replication;
 client_addr | client_hostname | client_port | client_port 
-+-+-+-
 | |  -1 | async
 | |  -1 | async
(2 rows)

This is because pg_stat_replication is created on top of
pg_stat_get_activity(), which always returns NULL, NULL, -1 for
client_addr, client_hostname and client_port.

			else if (beentry->st_clientaddr.addr.ss_family == AF_UNIX)
			{
				/*
				 * Unix sockets always reports NULL for host and -1 for
				 * port, so it's possible to tell the difference to
				 * connections we have no permissions to view, or with
				 * errors.
				 */

Changing this behavior would affect existing pg_stat_get_activity
users and I hesitate to do so. I wonder if we could add the receiver's
UNIX domain socket path to pg_stat_get_wal_senders() (which is
called from the pg_stat_replication view) so that the poor UNIX domain
socket users could make their own view or access
pg_stat_get_wal_senders() to get the UNIX domain socket path.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Enable data checksums by default

2019-04-11 Thread Daniel Gustafsson
On Thursday, April 11, 2019 8:56 PM, Andres Freund  wrote:

> On 2019-04-11 18:15:41 +, Daniel Gustafsson wrote:
>
> > On Thursday, April 11, 2019 6:58 PM, Andres Freund and...@anarazel.de wrote:
> >
> > > On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
> > >
> > > > Enabling checksums by default will require anyone using pg_upgrade to
> > > > run initdb to disable checksums before running pg_upgrade, for one
> > > > release. We could add checksums for non-link pg_upgrade runs, but we
> > > > don't have code to do that yet, and most people use link anyway.
> > >
> > > Hm. We could just have pg_upgrade run pg_checksums --enable/disable,
> > > based on the old cluster, and print a warning on mismatches. Not sure if
> > > that's worth it, but ...
> >
> > That would be for link mode, for copy-mode you'd have to initdb with 
> > checksums
> > turned off and run pg_checksums on the new cluster, else the non-destructive
> > nature of copy mode would be lost.
>
> I don't think so? But I think we might just have misunderstood each
> other. What I was suggesting is that we could take the burden of having
> to match the old cluster's checksum enabled/disabled setting when
> initdb'ing the new cluster, by changing the new cluster instead of
> erroring out with:
> if (oldctrl->data_checksum_version == 0 &&
>     newctrl->data_checksum_version != 0)
>     pg_fatal("old cluster does not use data checksums but the new one does\n");
> else if (oldctrl->data_checksum_version != 0 &&
>          newctrl->data_checksum_version == 0)
>     pg_fatal("old cluster uses data checksums but the new one does not\n");
> else if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
>     pg_fatal("old and new cluster pg_controldata checksum versions do not match\n");
>
>
> As the new cluster at that time isn't yet related to the old cluster, I
> don't see why that'd influence the non-destructive nature?

Right, now I see what you mean, and I indeed misunderstood you.  Thanks for
clarifying.

cheers ./daniel




Re: fix memory overflow in ecpg preproc module

2019-04-11 Thread Michael Meskes
Hi,

> I have found a potential memory overflow in ecpg preproc module.
> ... 

Thanks for finding and fixing, committed.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL





Re: Enable data checksums by default

2019-04-11 Thread Andres Freund
Hi,

On 2019-04-11 18:15:41 +, Daniel Gustafsson wrote:
> On Thursday, April 11, 2019 6:58 PM, Andres Freund  wrote:
> 
> > On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
> >
> > > Enabling checksums by default will require anyone using pg_upgrade to
> > > run initdb to disable checksums before running pg_upgrade, for one
> > > release. We could add checksums for non-link pg_upgrade runs, but we
> > > don't have code to do that yet, and most people use link anyway.
> >
> > Hm. We could just have pg_upgrade run pg_checksums --enable/disable,
> > based on the old cluster, and print a warning on mismatches. Not sure if
> > that's worth it, but ...
> 
> That would be for link mode, for copy-mode you'd have to initdb with checksums
> turned off and run pg_checksums on the new cluster, else the non-destructive
> nature of copy mode would be lost.

I don't think so? But I think we might just have misunderstood each
other. What I was suggesting is that we could take the burden of having
to match the old cluster's checksum enabled/disabled setting when
initdb'ing the new cluster, by changing the new cluster instead of
erroring out with:
if (oldctrl->data_checksum_version == 0 &&
    newctrl->data_checksum_version != 0)
    pg_fatal("old cluster does not use data checksums but the new one does\n");
else if (oldctrl->data_checksum_version != 0 &&
         newctrl->data_checksum_version == 0)
    pg_fatal("old cluster uses data checksums but the new one does not\n");
else if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
    pg_fatal("old and new cluster pg_controldata checksum versions do not match\n");


As the new cluster at that time isn't yet related to the old cluster, I
don't see why that'd influence the non-destructive nature?
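
Roughly, something like this instead of the pg_fatal() calls above (a
hand-wavy sketch only, assuming pg_upgrade's usual environment such as
new_cluster and MAXPGPATH; pg_checksums does have --enable/--disable
switches):

if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
{
	char		cmd[MAXPGPATH * 2];

	/* make the new cluster match the old cluster's checksum setting */
	snprintf(cmd, sizeof(cmd), "\"%s/pg_checksums\" %s -D \"%s\"",
			 new_cluster.bindir,
			 oldctrl->data_checksum_version != 0 ? "--enable" : "--disable",
			 new_cluster.pgdata);
	if (system(cmd) != 0)
		pg_fatal("could not adjust data checksums on the new cluster\n");
	pg_log(PG_WARNING,
		   "adjusted data checksums on the new cluster to match the old cluster\n");
}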

Greetings,

Andres Freund




Re: Reducing the runtime of the core regression tests

2019-04-11 Thread Peter Geoghegan
On Thu, Apr 11, 2019 at 11:00 AM Alvaro Herrera
 wrote:
> ./configure --enable-depend --enable-coverage --enable-tap-tests --enable-nls 
> --with-python --with-perl --with-tcl --with-openssl --with-libxml --with-ldap 
> --with-pam >> $LOG 2>&1
>
> make -j4 >> $LOG 2>&1
> make -j4 -C contrib >> $LOG 2>&1
> make check-world PG_TEST_EXTRA="ssl ldap" >> $LOG 2>&1
> make coverage-html >> $LOG 2>&1
>
> There are no environment variables that would affect it.

Could we add "CFLAGS=-O0"? This should prevent the kind of
machine-wise line-counting described here:

https://gcc.gnu.org/onlinedocs/gcc/Gcov-and-Optimization.html#Gcov-and-Optimization

I think that it makes sense to prioritize making it clear which exact
lines were executed in terms of the semantics of C. I might prefer to
have optimizations enabled if I was optimizing my code, but that's not
what the web resource is for, really.

Thanks
-- 
Peter Geoghegan




Re: Enable data checksums by default

2019-04-11 Thread Daniel Gustafsson
On Thursday, April 11, 2019 6:58 PM, Andres Freund  wrote:

> On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
>
> > Enabling checksums by default will require anyone using pg_upgrade to
> > run initdb to disable checksums before running pg_upgrade, for one
> > release. We could add checksums for non-link pg_upgrade runs, but we
> > don't have code to do that yet, and most people use link anyway.
>
> Hm. We could just have pg_upgrade run pg_checksums --enable/disable,
> based on the old cluster, and print a warning on mismatches. Not sure if
> that's worth it, but ...

That would be for link mode; for copy mode you'd have to initdb with checksums
turned off and run pg_checksums on the new cluster, or else the non-destructive
nature of copy mode would be lost.

Another option would be to teach pg_upgrade to checksum the cluster during the
upgrade on the fly.  That would however be a big conceptual change for
pg_upgrade as it's currently not modifying the cluster data.  In Greenplum we
have done this, but it was an easier choice there as we are rewriting all the
pages anyways.  It would also create yet another utility which can checksum an
offline cluster, but I wanted to bring the idea to the table.

cheers ./daniel




Re: setLastTid() and currtid()

2019-04-11 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-11 13:27:03 -0400, Tom Lane wrote:
>> I think removing it after feature freeze is not something to do,
>> but +1 for nuking it as soon as the v13 branch opens.  Unless
>> there's some important reason we need it to be gone in v12?

> No, I don't think there really is. They're bogus and possibly a bit
> dangerous, but that's not really new.

> I was mostly just reminded of this when Heikki asked me to improve the
> documentation for heap_get_latest_tid/table_get_latest_tid() - and I was
> briefly wondering whether we could just nuke the whole functionality.
> But it's still used in nodeTidscan.c:

Yeah, if we could simplify the tableam API, that would be sufficient
reason to remove the stuff in v12, IMO.  But if it is outside of that
API, I'd counsel waiting till v13.

regards, tom lane




Re: Retronym: s/magnetic disk/main data/

2019-04-11 Thread Bruce Momjian
On Thu, Apr  4, 2019 at 05:49:36PM +1300, Thomas Munro wrote:
> Hello,
> 
> I propose that in v13 we redefine "MD" (from md.c) to mean "main data"
> (instead of "magnetic disk").  That's the standard storage layout for
> typical table and index AMs.  As opposed to the proposed undo and SLRU
> SMGRs that provide layouts specialised for different life cycles.

A bigger issue is that our documentation often refers to storage as
"disk" even when that includes SSD storage, which clearly has no disks.
They are "solid state drives", not disks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: setLastTid() and currtid()

2019-04-11 Thread Andres Freund
Hi,

On 2019-04-11 13:52:08 -0400, Alvaro Herrera wrote:
> As I understand, this setLastTid stuff would cause trouble if used
> with a non-core table AM.

I'm not sure there'd actually be trouble. I mean, what it does for heap
is basically meaningless already, so it's not going to be meaningfully
worse for any other table AM.  It's an undocumented odd interface, whose
implementation is also ugly, and that'd be a fair reason on its own to
rip it out though.

Greetings,

Andres Freund




Re: serializable transaction: exclude constraint violation (backed by GIST index) instead of ssi conflict

2019-04-11 Thread Peter Billen
On Thu, Apr 11, 2019 at 6:12 PM Peter Billen  wrote:

>
> I believe we are after multiple issues/improvements:
>
> 1. Could we extend
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fcff8a575198478023ada8a48e13b50f70054766
> by adding support for exclude constraints?
> 2. Fully support gist & constraints in serializable transactions. I did
> not yet test a unique constraint backed by a gist index, which is also
> interesting to test I assume. This test would tell us if there currently is
> a status quo between btree and gist indexes.
>

Regarding the remark in (2), I forgot that a unique constraint cannot be
backed by a gist index, so forget the test I mentioned.


Re: Reducing the runtime of the core regression tests

2019-04-11 Thread Alvaro Herrera
On 2019-Apr-11, Peter Geoghegan wrote:

> I've noticed that the coverage reported on coverage.postgresql.org
> sometimes looks contradictory, which can happen due to compiler
> optimizations. I wonder if that could be addressed in some way,
> because I find the site to be a useful resource. I would at least like
> to know the settings used by its builds.

./configure --enable-depend --enable-coverage --enable-tap-tests --enable-nls 
--with-python --with-perl --with-tcl --with-openssl --with-libxml --with-ldap 
--with-pam >> $LOG 2>&1

make -j4 >> $LOG 2>&1
make -j4 -C contrib >> $LOG 2>&1
make check-world PG_TEST_EXTRA="ssl ldap" >> $LOG 2>&1
make coverage-html >> $LOG 2>&1

There are no environment variables that would affect it.

If you want to propose changes, feel free.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: setLastTid() and currtid()

2019-04-11 Thread Alvaro Herrera
On 2019-Apr-11, Tom Lane wrote:

> Andres Freund  writes:
> > On 2019-03-27 10:01:08 +0900, Inoue, Hiroshi wrote:
> >> The above code remains only for PG servers whose version < 8.2.
> >> Please remove the code around setLastTid().
> 
> > Does anybody else have concerns about removing this interface? Does
> > anybody think we should have a deprecation phase? Should we remove this
> > in 12 or 13?
> 
> I think removing it after feature freeze is not something to do,
> but +1 for nuking it as soon as the v13 branch opens.  Unless
> there's some important reason we need it to be gone in v12?

Umm ... I'm not sure I agree.  We're in feature freeze, not code freeze,
and while we're not expecting to have any new feature patches pushed,
cleanup for features that did make the cut is still fair game.  As I
understand, this setLastTid stuff would cause trouble if used with a
non-core table AM.  Furthermore, if nothing uses it, what's the point of
keeping it?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: setLastTid() and currtid()

2019-04-11 Thread Andres Freund
Hi,

On 2019-04-11 13:27:03 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-03-27 10:01:08 +0900, Inoue, Hiroshi wrote:
> >> The above code remains only for PG servers whose version < 8.2.
> >> Please remove the code around setLastTid().
> 
> > Does anybody else have concerns about removing this interface? Does
> > anybody think we should have a deprecation phase? Should we remove this
> > in 12 or 13?
> 
> I think removing it after feature freeze is not something to do,
> but +1 for nuking it as soon as the v13 branch opens.  Unless
> there's some important reason we need it to be gone in v12?

No, I don't think there really is. They're bogus and possibly a bit
dangerous, but that's not really new.

I was mostly just reminded of this when Heikki asked me to improve the
documentation for heap_get_latest_tid/table_get_latest_tid() - and I was
briefly wondering whether we could just nuke the whole functionality.
But it's still used in nodeTidscan.c:

		/*
		 * For WHERE CURRENT OF, the tuple retrieved from the cursor might
		 * since have been updated; if so, we should fetch the version that is
		 * current according to our snapshot.
		 */
		if (node->tss_isCurrentOf)
			table_get_latest_tid(heapRelation, snapshot, &tid);

If we were able to just get rid of that I think there'd have been a
strong case for removing $subject in v12, to avoid exposing something to
new AMs that we're going to nuke in v13.

The only other reason I can see is that there's literally no use for
them (bogus and only used by pgodbc when targeting <= 8.2), and that
they cost a bit of performance and are the only reason heapam.h is still
included in nodeModifyTable.h (hurting my pride).   But that's probably
not sufficient reason.

Greetings,

Andres Freund




Re: [PATCH v20] GSSAPI encryption support

2019-04-11 Thread Robbie Harwood
Stephen Frost  writes:

> Robbie Harwood (rharw...@redhat.com) wrote:
>> Bruce Momjian  writes:
>>> Magnus Hagander wrote:
 Joe Conway  wrote:

 If it was on the table it might have been better to keep hostgss
 and change the authentication method to gssauth or something, but
 that ship sailed *years* ago.
>>>
>>> Uh, did we consider keeping hostgss and changing the auth part at
>>> the end to "gssauth"?
>> 
>> I think that was implicitly rejected because we'd have to keep the
>> capability to configure "gss" there or else break compatibility.
>
> Right, if we changed the name of the auth method then everyone who is
> using the "gss" auth method would have to update their pg_hba.conf
> files...  That would be very ugly.  Also, it wasn't implicitly
> rejected, it was discussed up-thread (see the comments between Magnus
> and I, specifically, quoted above- "that ship sailed *years* ago") and
> explicitly rejected.

Apologies, you're right of course.  I intended to say why *I* had
rejected it but got bit by the passive voice.

Thanks,
--Robbie




Re: Pluggable Storage - Andres's take

2019-04-11 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
>> + * HEIKKI: A flags bitmask argument would be more readable than 6 
>> booleans

> I honestly don't have strong feelings about it. Not sure that I buy that
> bitmasks would be much more readable

Sure they would be --- how's FLAG_FOR_FOO | FLAG_FOR_BAR not
better than unlabeled "true" and "false"?

> - but perhaps we could just use the
> struct trickery we started to use in

I find that rather ugly really.  If we're doing something other than a
dozen-or-so booleans, maybe it's the only viable option.  But for cases
where a flags argument will serve, that's our longstanding practice and
I don't see a reason to deviate.

regards, tom lane




Re: block-level incremental backup

2019-04-11 Thread Anastasia Lubennikova

09.04.2019 18:48, Robert Haas writes:

Thoughts?

Hi,
Thank you for bringing that up.
In-core support of incremental backups is a long-awaited feature.
Hopefully, this take will end up committed in PG13.

Speaking of UI:
1) I agree that it should be implemented as a new replication command.

2) There should be a command to get only a map of changes without actual 
data.


Most backup tools establish a server connection, so they can use this
protocol to get the list of changed blocks.
Then they can use this information for any purpose: for example, to
distribute files between parallel workers to copy the data,
to estimate the backup size before data is sent, or to store metadata
separately from the data itself.
Most methods (except straightforward LSN comparison) consist of two
steps: get a map of changes and read blocks.

So it won't add much extra work.

example commands:
GET_FILELIST [lsn]
returning json (or whatever) with filenames and maps of changed blocks

Map format is also the subject of discussion.
Now in pg_probackup we reuse code from pg_rewind/datapagemap,
not sure if this format is good for sending data via the protocol, though.
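
For instance, one entry of such a map could look roughly like this
(purely hypothetical, not datapagemap's or pg_probackup's actual format):

#include "postgres.h"
#include "storage/block.h"

/* Hypothetical per-file entry in the changed-block map. */
typedef struct ChangedFileEntry
{
	char		path[MAXPGPATH];	/* path relative to the data directory */
	uint32		nblocks;			/* number of changed blocks */
	BlockNumber	blocks[FLEXIBLE_ARRAY_MEMBER];	/* sorted changed block numbers */
} ChangedFileEntry;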

3) The API should provide functions to request data with a granularity 
of file and block.

It will be useful for parallelism and for various future projects.

example commands:
GET_DATAFILE [filename [map of blocks] ]
GET_DATABLOCK [filename] [blkno]
returning data in some format

4) The algorithm for collecting changed blocks is another topic,
though its API should be discussed here:

Do we want to have multiple implementations?
Personally, I think that it's good to provide several strategies,
since they have different requirements and fit different workloads.

Maybe we can add a hook to allow custom implementations.

Do we want to allow the backup client to tell what block collection 
method to use?

example commands:
GET_FILELIST [lsn] [METHOD lsn | page | ptrack | etc]
Or should it be a server-side cost-based decision?

5) The method based on LSN comparison stands out - it can be done in one 
pass.

So it probably requires special protocol commands.
for example:
GET_DATAFILES [lsn]
GET_DATAFILE [filename] [lsn]

This is pretty simple to implement and pg_basebackup can use this method,
at least until we have something more advanced in-core.

I'll be happy to help with design, code, review, and testing.
Hope that my experience with pg_probackup will be useful.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: setLastTid() and currtid()

2019-04-11 Thread Tom Lane
Andres Freund  writes:
> On 2019-03-27 10:01:08 +0900, Inoue, Hiroshi wrote:
>> The above code remains only for PG servers whose version < 8.2.
>> Please remove the code around setLastTid().

> Does anybody else have concerns about removing this interface? Does
> anybody think we should have a deprecation phase? Should we remove this
> in 12 or 13?

I think removing it after feature freeze is not something to do,
but +1 for nuking it as soon as the v13 branch opens.  Unless
there's some important reason we need it to be gone in v12?

regards, tom lane




Re: cache lookup failed for collation 0

2019-04-11 Thread Tom Lane
Jeevan Chalke  writes:
> Do you mean, the code in get_collation_isdeterministic() should look like
> something like below?

> If colloid = InvalidOid then
>   return TRUE
> ELSE IF tuple is valid then
>   return collisdeterministic from the tuple
> ELSE
>  return FALSE

I think it's appropriate to fail if we don't find a tuple, for any
collation oid other than zero.  Again, if you trace through the
behavior of the longstanding collation check functions like
lc_ctype_is_c(), you'll see that that's what happens (except for
some hardwired OIDs that they have fast paths for).
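
In code terms, the behavior I'm suggesting would look roughly like this
(a sketch in the style of lsyscache.c, not necessarily the exact fix
that gets committed):

#include "postgres.h"

#include "access/htup_details.h"
#include "catalog/pg_collation.h"
#include "utils/syscache.h"

bool
get_collation_isdeterministic(Oid colloid)
{
	HeapTuple	tp;
	Form_pg_collation colltup;
	bool		result;

	/* Treat "no collation" the way the old fast paths do: deterministic. */
	if (colloid == InvalidOid)
		return true;

	tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
	if (!HeapTupleIsValid(tp))
		elog(ERROR, "cache lookup failed for collation %u", colloid);
	colltup = (Form_pg_collation) GETSTRUCT(tp);
	result = colltup->collisdeterministic;
	ReleaseSysCache(tp);
	return result;
}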

regards, tom lane




Re: Pluggable Storage - Andres's take

2019-04-11 Thread Robert Haas
On Thu, Apr 11, 2019 at 12:49 PM Andres Freund  wrote:
> > @@ -179,6 +184,12 @@ typedef struct TableAmRoutine
> >*
> >* if temp_snap is true, the snapshot will need to be deallocated at
> >* scan_end.
> > +  *
> > +  * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
> > +  * a bit surprising for the AM, no? Can it be called when a scan is
> > +  * already in progress?
>
> Yea, it can be called when the scan is in-progress. I think we probably
> should just fix calling code to not need that - it's imo weird that
> nodeBitmapHeapscan.c doesn't just delay starting the scan till it has
> the snapshot. This isn't new code, but it's now going to be exposed to
> more AMs, so I think there's a good argument to fix it now.
>
> Robert: You committed that addition, in
>
> commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9
> Author: Robert Haas 
> Date:   2017-03-08 12:05:43 -0500
>
> Support parallel bitmap heap scans.
>
> do you remember why that's done?

I don't think there was any brilliant idea behind it.  Delaying the
scan start until it has the snapshot seems like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Reducing the runtime of the core regression tests

2019-04-11 Thread Peter Geoghegan
On Thu, Apr 11, 2019 at 9:55 AM Tom Lane  wrote:
> So I concur that indexing.sql's fastpath test
> isn't adding anything useful coverage-wise, and will just nuke it.

Good.

> (It'd be interesting perhaps to check whether the results shown
> by coverage.postgresql.org are similarly unstable.  They might be
> less so, since I believe those are taken over the whole check-world
> suite not just the core regression tests.)

I'm almost certain that they're at least slightly unstable. I mostly
find the report useful because it shows whether or not something gets
hit at all. I don't trust it to be very accurate.

I've noticed that the coverage reported on coverage.postgresql.org
sometimes looks contradictory, which can happen due to compiler
optimizations. I wonder if that could be addressed in some way,
because I find the site to be a useful resource. I would at least like
to know the settings used by its builds.

-- 
Peter Geoghegan




Re: finding changed blocks using WAL scanning

2019-04-11 Thread Ashwin Agrawal
On Thu, Apr 11, 2019 at 6:27 AM Robert Haas  wrote:

> On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
>  wrote:
> > I had in mind that you could have different overlapping incremental
> > backup jobs in existence at the same time.  Maybe a daily one to a
> > nearby disk and a weekly one to a faraway cloud.  Each one of these
> > would need a separate replication slot, so that the information that is
> > required for *that* incremental backup series is preserved between runs.
> >  So just one reserved replication slot that feeds the block summaries
> > wouldn't work.  Perhaps what would work is a flag on the replication
> > slot itself "keep block summaries for this slot".  Then when all the
> > slots with the block summary flag are past an LSN, you can clean up the
> > summaries before that LSN.
>
> I don't think that quite works.  There are two different LSNs.  One is
> the LSN of the oldest WAL archive that we need to keep around so that
> it can be summarized, and the other is the LSN of the oldest summary
> we need to keep around so it can be used for incremental backup
> purposes.  You can't keep both of those LSNs in the same slot.
> Furthermore, the LSN stored in the slot is defined as the amount of
> WAL we need to keep, not the amount of something else (summaries) that
> we need to keep.  Reusing that same field to mean something different
> sounds inadvisable.
>
> In other words, I think there are two problems which we need to
> clearly separate: one is retaining WAL so we can generate summaries,
> and the other is retaining summaries so we can generate incremental
> backups.  Even if we solve the second problem by using some kind of
> replication slot, we still need to solve the first problem somehow.
>

Just a thought for first problem, may not to simpler, can replication slot
be enhanced to define X amount of WAL to retain, after reaching such limit
collect summary and let the WAL be deleted.


Re: Enable data checksums by default

2019-04-11 Thread Andres Freund
Hi,

On 2019-04-09 23:11:03 -0400, Bruce Momjian wrote:
> Enabling checksums by default will require anyone using pg_upgrade to
> run initdb to disable checksums before running pg_upgrade, for one
> release.  We could add checksums for non-link pg_upgrade runs, but we
> don't have code to do that yet, and most people use link anyway.

Hm. We could just have pg_upgrade run pg_checksums --enable/disable,
based on the old cluster, and print a warning on mismatches. Not sure if
that's worth it, but ...

Greetings,

Andres Freund




Re: cache lookup failed for collation 0

2019-04-11 Thread Jeevan Chalke
On Thu, Apr 11, 2019 at 9:07 PM Tom Lane  wrote:

> Jeevan Chalke  writes:
> > The following test sequence causes the error "cache lookup failed for
> > collation 0":
> > postgres:5432 [42106]=# create table foobar(a bytea primary key, b int);
> > CREATE TABLE
> > postgres:5432 [42106]=# insert into foobar
> > values('\x4c835521685c46ee827ab83d376cf028', 1);
> > INSERT 0 1
> > postgres:5432 [42106]=# select * from foobar where a like '%1%';
> > ERROR:  cache lookup failed for collation 0
>
> Good catch!
>
> > The error is coming from get_collation_isdeterministic() when colloid
> > passed is 0. I think like we do in get_collation_name(), we should return
> > false here when such collation oid does not exist.
>
> Considering that e.g. lc_ctype_is_c() doesn't fail for InvalidOid, I agree
> that it's probably a bad idea for get_collation_isdeterministic to fail.
> There's a lot of code that thinks it can check for InvalidOid only in slow
> paths.  However, I'd kind of expect the default result to be "true" not
> "false".  Doing what you suggest would make match_pattern_prefix fail
> entirely, unless we also put a special case there.
>

Do you mean the code in get_collation_isdeterministic() should look
something like the below?

If colloid = InvalidOid then
  return TRUE
ELSE IF tuple is valid then
  return collisdeterministic from the tuple
ELSE
 return FALSE

I think for a non-zero colloid which is not valid we should return false, but
I may be missing your point here.
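
In C, a rough sketch of that logic (just to illustrate the question, not meant
as the final patch) would be:

bool
get_collation_isdeterministic(Oid colloid)
{
	HeapTuple	tp;
	Form_pg_collation colltup;
	bool		result;

	/* Treat "no collation" as deterministic, as suggested above. */
	if (colloid == InvalidOid)
		return true;

	tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
	if (!HeapTupleIsValid(tp))
		return false;			/* non-zero but unknown OID */

	colltup = (Form_pg_collation) GETSTRUCT(tp);
	result = colltup->collisdeterministic;
	ReleaseSysCache(tp);
	return result;
}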


>
> regards, tom lane
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: Reducing the runtime of the core regression tests

2019-04-11 Thread Tom Lane
Peter Geoghegan  writes:
> On Wed, Apr 10, 2019 at 4:56 PM Peter Geoghegan  wrote:
>> The original fastpath tests don't seem particularly effective to me,
>> even without the oversight I mentioned. I suggest that you remove
>> them, since the minimal btree_index.sql fast path test is sufficient.

> To be clear: I propose that you remove the tests entirely, and we
> leave it at that. I don't intend to follow up with my own patch
> because I don't think that there is anything in the original test case
> that is worth salvaging.

I checked into this by dint of comparing "make coverage" output for
"make check" runs with and without indexing.sql's fastpath tests.
There were some differences that seem mostly to be down to whether
or not autovacuum hit particular code paths during the test run.
In total, I found 29 lines that were hit in the first test but not
in the second ... and 141 lines that were hit in the second test
but not the first.  So I concur that indexing.sql's fastpath test
isn't adding anything useful coverage-wise, and will just nuke it.

(It'd be interesting perhaps to check whether the results shown
by coverage.postgresql.org are similarly unstable.  They might be
less so, since I believe those are taken over the whole check-world
suite not just the core regression tests.)

regards, tom lane




Re: finding changed blocks using WAL scanning

2019-04-11 Thread Ashwin Agrawal
On Wed, Apr 10, 2019 at 2:50 PM Robert Haas  wrote:

> Over at
> https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com
> I mentioned parsing the WAL to extract block references so that
> incremental backup could efficiently determine which blocks needed to
> be copied.  Ashwin replied in
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__postgr.es_m_CALfoeitO-2DvkfjubMFQRmgyXghL-2DuUnZLNxbr-3DobrQQsm8kFO4A-40mail.gmail.com=DwIBaQ=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw=gxIaqms7ncm0pvqXLI_xjkgwSStxAET2rnZQpzba2KM=W07oy16p6VEfYKCgfRXQpRz9pfy_of-a_8DAjAg5TGk=YAtoa9HWqQ1PPjt1CGui1Fo_a20j0n95LRonCXucBz4=
> to mention that the same approach could be useful for pg_upgrade:
>

Thank you for initiating this separate thread. Just a typo above: it should be
pg_rewind, not pg_upgrade.

Let me first explain my thoughts on how to leverage this for pg_rewind (or, to
be exact, any type of incremental recovery). Would love to hear thoughts on it.

Currently, incremental recovery of any form, whether a replica goes down and
comes back up or we try to bring back the primary after failover to a replica,
requires *all* the WAL to be present from the point of disconnect. So it is
boolean in those terms: if the WAL is available we can recover incrementally,
otherwise we have to perform a full basebackup. If we come up with this
mechanism to find and store changed blocks from WAL, we can provide an
intermediate level of incremental recovery, which will be better than full
recovery.

WAL allows tuple-level granularity for recovery (if we ignore FPIs for a
moment). Modified-block information extracted from WAL will provide block-level
incremental recovery when the WAL itself is no longer available.

So pg_basebackup (or some other tool, or just an option to it) and pg_rewind
can leverage the changed blocks to perform the recovery when WAL can't be
retained due to space constraints.

pg_rewind can also be optimized: it currently copies from source to target the
blocks that were present in the target's WAL to be rewound, and such blocks
could easily be skipped from being copied again.

Depending on the pattern and volume of changes in the WAL, just copying over
the changed blocks could prove more efficient than replaying all the WAL for
incremental recovery.

It seems to me that there are basically two ways of storing this kind
> of information, plus a bunch of variants.  One way is to store files
> that cover a range of LSNs, and basically contain a synopsis of the
> WAL for those LSNs.  You omit all the actual data and just mention
> which blocks were changed by some record in that part of the WAL.  In
> this type of scheme, the storage required is roughly proportional to
> the volume of WAL for which you wish to retain data.  Pruning old data
> is easy; just remove the files that provide information about LSNs
> that you don't care about any more.  The other way is to store data
> about each block, or each range of blocks, or all the blocks that hash
> onto a certain slot; for each, store the newest LSN that has modified
> that block, or a block in that range, or a block that hashes onto that
> that slot.  In this system, storage is roughly proportional to the
> size of the database cluster, except maybe in the hashing case, but I
> *think* that case will degrade unless you basically expand the map to
> be roughly proportional to the size of the cluster anyway.  I may be
> wrong.
>
> Of these two variants, I am inclined to prefer the version where each
> file is a summary of the block references within some range of LSNs.
> It seems simpler to implement to me.  You just read a bunch of WAL
> files and then when you get tired you stop and emit an output file.
> You need to protect yourself against untimely crashes.  One way is to
> stick a checksum into the output file.  After you finish writing it,
> fsync() it before you start writing the next one.  After a restart,
> read the latest such file and see if the checksum is OK.  If not,
> regenerate it; if not, assume it's good and move on.  Files other than
> the last one can be assumed good.  Another way is to create the file
> with a temporary name, fsync() it, and then rename it into place and
> fsync() again.  The background worker that generates the files can
> have a GUC to remove them when they are older than some threshold
> amount of time, or you can keep them forever and let the user manually
> remove stuff they no longer want based on LSN.  That's pretty much it.
>

+1 for first option. Seems simpler and straight-forward.
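
For reference, the second crash-safety variant described above (write under a
temporary name, fsync, rename into place, fsync the directory) is the usual
durable-rename dance. A minimal sketch using plain POSIX calls (file names and
error handling are illustrative only, not taken from any patch):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write a finished summary file so that a crash leaves either the old
 * state or the complete new file, never a torn one.
 */
static int
write_summary_durably(const char *dir, const char *name,
					  const void *buf, size_t len)
{
	char		tmppath[1024],
				finalpath[1024];
	int			fd,
				dirfd;

	snprintf(tmppath, sizeof(tmppath), "%s/%s.tmp", dir, name);
	snprintf(finalpath, sizeof(finalpath), "%s/%s", dir, name);

	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);

	if (rename(tmppath, finalpath) != 0)
		return -1;

	/* fsync the directory so the rename itself is durable */
	dirfd = open(dir, O_RDONLY);
	if (dirfd < 0)
		return -1;
	if (fsync(dirfd) != 0)
	{
		close(dirfd);
		return -1;
	}
	close(dirfd);
	return 0;
}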


Re: setLastTid() and currtid()

2019-04-11 Thread Andres Freund
Hi,

On 2019-03-27 10:01:08 +0900, Inoue, Hiroshi wrote:
> Hi Andres,
> Sorry for the late reply.

Not late at all. Sorry for *my* late reply :)


> On 2019/03/26 9:44, Andres Freund wrote:
> > Hi,
> > 
> > For the tableam work I'd like to remove heapam.h from
> > nodeModifyTable.c. The only remaining impediment to that is a call to
> > setLastTid(), which is defined in tid.c but declared in heapam.h.
> > 
> > That doesn't seem like a particularly accurate location, it doesn't
> > really have that much to do with heap. It seems more like a general
> > executor facility or something.  Does anybody have a good idea where to
> > put the declaration?
> > 
> > 
> > Looking at how this function is used, lead to some confusion on my part.
> > 
> > 
> > We currently call setLastTid in ExecInsert():
> > 
> > if (canSetTag)
> > {
> > (estate->es_processed)++;
> > setLastTid(>tts_tid);
> > }
> > 
> > And Current_last_tid, the variable setLastTid sets, is only used in
> > currtid_byreloid():
> > 
> > 
> > Datum
> > currtid_byreloid(PG_FUNCTION_ARGS)
> > {
> > Oid reloid = PG_GETARG_OID(0);
> > ItemPointer tid = PG_GETARG_ITEMPOINTER(1);
> > ItemPointer result;
> > Relationrel;
> > AclResult   aclresult;
> > Snapshotsnapshot;
> > 
> > result = (ItemPointer) palloc(sizeof(ItemPointerData));
> > if (!reloid)
> > {
> > *result = Current_last_tid;
> > PG_RETURN_ITEMPOINTER(result);
> > }
> > 
> > I've got to say I'm a bit baffled by this interface. If somebody passes
> > in a 0 reloid, we just ignore the passed in tid, and return the last tid
> > inserted into any table?
> > 
> > I then was even more baffled to find that there's no documentation of
> > this function, nor this special case behaviour, to be found
> > anywhere. Not in the docs (which don't mention the function, nor it's
> > special case behaviour for relation 0), nor in the code.
> > 
> > 
> > It's unfortunately used in psqlobdc:
> > 
> >  else if ((flag & USE_INSERTED_TID) != 0)
> >  printfPQExpBuffer(, "%s where ctid = 
> > (select currtid(0, '(0,0)'))", load_stmt);
> 
> The above code remains only for PG servers whose version < 8.2.
> Please remove the code around setLastTid().

Does anybody else have concerns about removing this interface? Does
anybody think we should have a deprecation phase? Should we remove this
in 12 or 13?

Greetings,

Andres Freund




Re: Pluggable Storage - Andres's take

2019-04-11 Thread Andres Freund
Hi,

On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
> Here is another iteration on the comments. The patch is a mix of
> copy-editing and questions. The questions are marked with "HEIKKI:". I can
> continue the copy-editing, if you can reply to the questions, clarifying the
> intention on some parts of the API. (Or feel free to pick and push any of
> these fixes immediately, if you prefer.)

Thanks!

> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
> index f7f726b5aec..bbcab9ce31a 100644
> --- a/src/backend/utils/misc/guc.c
> +++ b/src/backend/utils/misc/guc.c
> @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
>   {"default_table_access_method", PGC_USERSET, 
> CLIENT_CONN_STATEMENT,
>   gettext_noop("Sets the default table access method for 
> new tables."),
>   NULL,
> - GUC_IS_NAME
> + GUC_NOT_IN_SAMPLE | GUC_IS_NAME
>   },
>   _table_access_method,
>   DEFAULT_TABLE_ACCESS_METHOD,

Hm, I think we should rather add it to the sample file. That's an oversight,
not intentional.


> index 6fbfcb96c98..d4709563e7e 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -91,8 +91,9 @@ typedef enum TM_Result
>   * xmax is the outdating transaction's XID.  If the caller wants to visit the
>   * replacement tuple, it must check that this matches before believing the
>   * replacement is really a match.
> + * HEIKKI: matches what? xmin, but that's specific to the heapam.

It's basically just the old comment moved. I wonder if we can just get
rid of that field, because the logic to follow update chains correctly
is now inside the lock-tuple callback. And as you say, it's not clear
what callers can do with it for the purpose of following chains.  The
counter-argument is that having it makes it a lot less annoying to adapt
external code that wants to change as little as possible and is really
only interested in supporting heap for now.


>   * GetTableAmRoutine() asserts that required callbacks are filled in, 
> remember
>   * to update when adding a callback.
> @@ -179,6 +184,12 @@ typedef struct TableAmRoutine
>*
>* if temp_snap is true, the snapshot will need to be deallocated at
>* scan_end.
> +  *
> +  * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
> +  * a bit surprising for the AM, no? Can it be called when a scan is
> +  * already in progress?

Yea, it can be called when the scan is in-progress. I think we probably
should just fix calling code to not need that - it's imo weird that
nodeBitmapHeapscan.c doesn't just delay starting the scan till it has
the snapshot. This isn't new code, but it's now going to be exposed to
more AMs, so I think there's a good argument to fix it now.

Robert: You committed that addition, in

commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9
Author: Robert Haas 
Date:   2017-03-08 12:05:43 -0500

Support parallel bitmap heap scans.

do you remember why that's done?



> +  * HEIKKI: A flags bitmask argument would be more readable than 6 
> booleans
>*/
>   TableScanDesc (*scan_begin) (Relation rel,
>Snapshot 
> snapshot,

I honestly don't have strong feelings about it. Not sure that I buy that
bitmasks would be much more readable - but perhaps we could just use the
struct trickery we started to use in

commit f831d4accda00b9144bc647ede2e2f848b59f39d
Author: Alvaro Herrera 
Date:   2019-02-01 11:29:42 -0300

Add ArchiveOpts to pass options to ArchiveEntry
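
To make the comparison concrete, the two shapes under discussion would look
roughly like this (purely illustrative, with made-up names, not the actual
tableam declarations):

/* (a) a flags bitmask replacing the six booleans */
#define TSCAN_ALLOW_STRAT		0x01
#define TSCAN_ALLOW_SYNC		0x02
#define TSCAN_ALLOW_PAGEMODE	0x04
#define TSCAN_IS_BITMAPSCAN		0x08
#define TSCAN_IS_SAMPLESCAN		0x10
#define TSCAN_TEMP_SNAP			0x20

	TableScanDesc (*scan_begin) (Relation rel, Snapshot snapshot,
								 int nkeys, struct ScanKeyData *key,
								 uint32 flags);

/* (b) an options struct, in the style of ArchiveOpts */
typedef struct TableScanOpts
{
	bool		allow_strat;
	bool		allow_sync;
	bool		allow_pagemode;
	bool		is_bitmapscan;
	bool		is_samplescan;
	bool		temp_snap;
} TableScanOpts;

	TableScanDesc (*scan_begin) (Relation rel, Snapshot snapshot,
								 int nkeys, struct ScanKeyData *key,
								 const TableScanOpts *opts);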


> @@ -194,6 +205,9 @@ typedef struct TableAmRoutine
>   /*
>* Release resources and deallocate scan. If TableScanDesc.temp_snap,
>* TableScanDesc.rs_snapshot needs to be unregistered.
> +  *
> +  * HEIKKI: I find this 'temp_snap' thing pretty weird. Can't the caller 
> handle
> +  * deregistering it?
>*/
>   void(*scan_end) (TableScanDesc scan);

It's old logic, just newly wrapped.  I think there's some argument that
some of this should be moved to tableam.c rather than the individual
AMs.



> @@ -221,6 +235,11 @@ typedef struct TableAmRoutine
>   /*
>* Estimate the size of shared memory needed for a parallel scan of this
>* relation. The snapshot does not need to be accounted for.
> +  *
> +  * HEIKKI: If this returns X, then the parallelscan_initialize() call
> +  * mustn't use more than X. So this is not just for optimization 
> purposes,
> +  * for example. Not sure how to phrase that, but could use some
> +  * clarification.
>*/
>   Size(*parallelscan_estimate) (Relation rel);

Hm. I thought I'd covered that by adding the note that
parallelscan_initialize() gets an amount of memory sized by
parallelscan_estimate().


>   /*
>

Re: [HACKERS][Proposal] LZ4 Compressed Storage Manager

2019-04-11 Thread Pavel Stehule
čt 11. 4. 2019 v 18:18 odesílatel Bruce Momjian  napsal:

> On Sun, Mar 31, 2019 at 05:25:51PM +0300, Николай Петров wrote:
> > Why should it be implemented at the Storage Manager level instead of using
> > the Pluggable storage API [9]?
> >   - From my perspective, a Storage Manager level implementation
> >     allows focusing on proper I/O operations and compression.
> >     It allows a much simpler implementation, because the
> >     Pluggable storage API forces you to implement more complex
> >     interfaces. To be honest, I am really hesitant about this point,
> >     especially because the Pluggable storage API allows creating an
> >     extension without core code modification, and it potentially allows
> >     using more effective compression algorithms (the Table Access Manager
> >     allows you to get more information about the stored data).
> >
> > I would like to implement a proof of concept
> > and have a couple of questions:
> >   - your opinion about the necessity of this feature
> >     (Compressed Storage Manager)
> >   - Is it a good idea to implement DB compression at the Storage Manager
> >     level? Perhaps it is better to use the Pluggable storage API.
> >   - Is there any reason to reject this proposal?
> >   - Are there any circumstances that would prevent implementing a
> >     Compressed Storage Manager?
>
> Stepping back a bit, there are several levels of compression:
>
> 1.  single field
> 2.  across all fields in a row
> 3.  across rows on a single page
> 4.  across all rows in a table
> 5.  across tables in a database
>

> We currently do #1 with TOAST, and your approach would allow the first
> three.  #4 feels like it is getting near the features of columnar
> storage.  I think it is unclear if adding #2 and #3 produce enough of a
> benefit to warrant special storage, given the complexity and overhead of
> implementing it.
>

Re #4: compression over the columns on a page is probably much more effective.
But there can be a preprocessing stage where rows are transformed to columns.

This doesn't need a real column store, and it can help a lot. A real column
store makes sense when columns are separated into different pages; but for
compression, we can transform rows to columns without real column storage.

An 8kB page is probably too small for this case.
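
Just to illustrate that preprocessing idea, here is a toy sketch (purely
hypothetical, fixed-width attributes only, not from any patch): transpose the
rows of a page into per-column arrays before handing the buffer to the
compressor, so that similar values end up adjacent.

#include <string.h>

static void
transpose_rows_to_columns(const char *rows, char *out,
						  int nrows, int natts, const int *attlen)
{
	int			rowwidth = 0;
	int			attoff = 0;
	char	   *dst = out;

	for (int a = 0; a < natts; a++)
		rowwidth += attlen[a];

	for (int a = 0; a < natts; a++)
	{
		for (int r = 0; r < nrows; r++)
		{
			/* copy attribute a of row r into the column-major buffer */
			memcpy(dst, rows + r * rowwidth + attoff, attlen[a]);
			dst += attlen[a];
		}
		attoff += attlen[a];
	}
}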

Regards

Pavel





> I do think the Pluggable storage API is the right approach, and, if you
> are going to go that route, adding #4 compression seems very worthwhile.
>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>
>
>


Re: [HACKERS][Proposal] LZ4 Compressed Storage Manager

2019-04-11 Thread Bruce Momjian
On Sun, Mar 31, 2019 at 05:25:51PM +0300, Николай Петров wrote:
> Why should it be implemented at the Storage Manager level instead of using
> the Pluggable storage API [9]?
>   - From my perspective, a Storage Manager level implementation
>     allows focusing on proper I/O operations and compression.
>     It allows a much simpler implementation, because the
>     Pluggable storage API forces you to implement more complex
>     interfaces. To be honest, I am really hesitant about this point,
>     especially because the Pluggable storage API allows creating an
>     extension without core code modification, and it potentially allows
>     using more effective compression algorithms (the Table Access Manager
>     allows you to get more information about the stored data).
>
> I would like to implement a proof of concept
> and have a couple of questions:
>   - your opinion about the necessity of this feature
>     (Compressed Storage Manager)
>   - Is it a good idea to implement DB compression at the Storage Manager
>     level? Perhaps it is better to use the Pluggable storage API.
>   - Is there any reason to reject this proposal?
>   - Are there any circumstances that would prevent implementing a
>     Compressed Storage Manager?

Stepping back a bit, there are several levels of compression:

1.  single field
2.  across all fields in a row
3.  across rows on a single page
4.  across all rows in a table
5.  across tables in a database

We currently do #1 with TOAST, and your approach would allow the first
three.  #4 feels like it is getting near the features of columnar
storage.  I think it is unclear if adding #2 and #3 produce enough of a
benefit to warrant special storage, given the complexity and overhead of
implementing it.

I do think the Pluggable storage API is the right approach, and, if you
are going to go that route, adding #4 compression seems very worthwhile.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: serializable transaction: exclude constraint violation (backed by GIST index) instead of ssi conflict

2019-04-11 Thread Peter Billen
On Thu, Apr 11, 2019 at 1:14 AM Thomas Munro  wrote:

> On Thu, Apr 11, 2019 at 10:54 AM Thomas Munro 
> wrote:
> > On Thu, Apr 11, 2019 at 9:43 AM Peter Billen 
> wrote:
> > > I kinda expected/hoped that transaction t2 would get aborted by a
> serialization error, and not an exclude constraint violation. This makes
> the application session bound to transaction t2 failing, as only
> serialization errors are retried.
>
> > Yeah, I agree, the behaviour you are expecting is desirable and we
> > should figure out how to do that.  The basic trick for btree unique
> > constraints was to figure out where the index *would* have written, to
> > give the SSI machinery a chance to object to that before raising the
> > UCV.  I wonder if we can use the same technique here... at first
> > glance, check_exclusion_or_unique_constraint() is raising the error,
> > but is not index AM specific code, and it is somewhat removed from the
> > GIST code that would do the equivalent
> > CheckForSerializableConflictIn() call.  I haven't looked into it
> > properly, but that certainly complicates matters somewhat...  Perhaps
> > the index AM would actually need a new entrypoint that could be called
> > before the error is raised, or perhaps there is an easier way.
>
> Adding Kevin (architect of SSI and reviewer/committer of my UCV
> interception patch) and Shubham (author of GIST SSI support) to the CC
> list in case they have thoughts on this.
>

Thanks Thomas, appreciated!

I was fiddling around some more, and I am experiencing the same behavior with
an exclusion constraint backed by a btree index. I tried the following:

drop table if exists t;
create table t(i int);
alter table t add constraint bla exclude using btree(i with =);

-- t1
begin transaction isolation level serializable;
select * from t where i = 1;
insert into t(i) values(1);

-- t2
begin transaction isolation level serializable;
select * from t where i = 1;
insert into t(i) values(1);

-- t1
commit;

-- t2
ERROR:  conflicting key value violates exclusion constraint "bla"
DETAIL:  Key (i)=(1) conflicts with existing key (i)=(1).

Looking back, I now believe that
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fcff8a575198478023ada8a48e13b50f70054766
was intended only for *unique* constraints, and not for *exclusion*
constraints as well. This is not explicitly mentioned in the commit message,
but only tests for unique constraints were added in that commit.

I believe we are after multiple issues/improvements:

1. Could we extend
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fcff8a575198478023ada8a48e13b50f70054766
by adding support for exclusion constraints?
2. Fully support gist & constraints in serializable transactions. I have not
yet tested a unique constraint backed by a gist index, which I assume is also
interesting to test. That test would tell us whether btree and gist indexes
are currently on equal footing.

Thanks.


Attempt to consolidate reading of XLOG page

2019-04-11 Thread Antonin Houska
While working on instance encryption I found it annoying to apply decryption
of the XLOG page in three different functions. Attached is a patch that tries
to merge them all into one function, XLogRead(). The existing implementations
differ in the way a new segment is opened, so I added a pointer to a callback
function as a new argument. This callback handles the specific way of
determining the segment file name and opening the file.

I can split the patch into multiple diffs to make detailed review easier, but
first I'd like to hear whether anything is seriously wrong with this design.
Thanks.
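
For reference, the segment-open callback passed to XLogRead() has roughly this
shape (inferred from the call site in the patch below; the actual typedef lives
in a header that is not part of this excerpt):

/*
 * Called by XLogRead() whenever it needs to switch to another segment.
 * The implementation opens segment 'nextSegNo' on timeline '*tli' in its
 * own way (e.g. BasicOpenFile() in the backend, plain open() in frontend
 * tools) and records the file descriptor and position in '*pos'.
 */
typedef void (*XLogOpenSegment) (XLogSegNo nextSegNo, TimeLineID *tli,
								 XLogReadPos *pos);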

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f9a4960f8a..444b5bf910 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1369,7 +1369,7 @@ ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
 	bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
 }
 
-
+static	XLogReadPos	*readPos = NULL;
 
 /*
  * Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1386,8 +1386,17 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
 
-	xlogreader = XLogReaderAllocate(wal_segment_size, _local_xlog_page,
-	NULL);
+
+	/* First time through? */
+	if (readPos == NULL)
+		readPos = XLogReadInitPos();
+
+	/*
+	 * read_local_xlog_page() eventually calls XLogRead(), so pass the initial
+	 * position.
+	 */
+	xlogreader = XLogReaderAllocate(wal_segment_size, read_local_xlog_page,
+	readPos);
 	if (!xlogreader)
 		ereport(ERROR,
 (errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 9196aa3aae..7d0fdfba87 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include 
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -26,6 +28,7 @@
 #include "replication/origin.h"
 
 #ifndef FRONTEND
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1005,6 +1008,191 @@ out:
 
 #endif			/* FRONTEND */
 
+/*
+ * Initialize XLOG file position for callers of XLogRead().
+ */
+XLogReadPos *
+XLogReadInitPos(void)
+{
+	XLogReadPos *pos = (XLogReadPos *) palloc(sizeof(XLogReadPos));
+
+	pos->segFile = -1;
+	pos->segNo = 0;
+	pos->segOff = 0;
+	pos->tli = 0;
+	pos->dir = NULL;
+
+	return pos;
+}
+
+#ifdef FRONTEND
+/*
+ * Currently only pg_waldump.c is supposed to set these variables.
+ */
+const char *progname;
+int	WalSegSz;
+
+/*
+ * This is a front-end counterpart of XLogFileNameP.
+ */
+static char *
+XLogFileNameFE(TimeLineID tli, XLogSegNo segno)
+{
+	char	   *result = palloc(MAXFNAMELEN);
+
+	XLogFileName(result, tli, segno, WalSegSz);
+	return result;
+}
+
+/*
+ * XXX pg_waldump.c needs this function. Is there a smart way to put it into
+ * src/common?
+ */
+static void fatal_error(const char *fmt,...) pg_attribute_printf(1, 2);
+
+static void
+fatal_error(const char *fmt,...)
+{
+	va_list		args;
+
+	fflush(stdout);
+
+	fprintf(stderr, _("%s: FATAL:  "), progname);
+	va_start(args, fmt);
+	vfprintf(stderr, _(fmt), args);
+	va_end(args);
+	fputc('\n', stderr);
+
+	exit(EXIT_FAILURE);
+}
+#endif	/* FRONTEND */
+
+/*
+ * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'. If
+ * tli is passed, get the data from timeline *tli. 'pos' is the current
+ * position in the XLOG file and openSegment is a callback that opens the next
+ * segment for reading.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ *
+ * Will open, and keep open, one WAL segment stored in the global file
+ * descriptor sendFile. This means if XLogRead is used once, there will
+ * always be one descriptor left open until the process ends, but never
+ * more than one.
+ */
+void
+XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli, XLogReadPos *pos, XLogOpenSegment openSegment)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			readbytes;
+		int	segsize;
+
+#ifndef FRONTEND
+		segsize = wal_segment_size;
+#else
+		segsize = WalSegSz;
+#endif
+
+		startoff = XLogSegmentOffset(recptr, segsize);
+
+		if (pos->segFile < 0 ||
+			!XLByteInSeg(recptr, pos->segNo, segsize) ||
+			(tli != NULL && *tli != pos->tli))
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (pos->segFile >= 0)
+close(pos->segFile);
+
+			XLByteToSeg(recptr, nextSegNo, segsize);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, tli, pos);
+		}
+
+		/* Need to seek in the file? */
+		if (pos->segOff != 

Re: Cleanup/remove/update references to OID column

2019-04-11 Thread Justin Pryzby
On Thu, Apr 11, 2019 at 08:39:42AM -0700, Andres Freund wrote:
> Yes, I was planning to commit that soon-ish. There still seemed to be
> review / newer versions happening, though, so I was thinking of waiting
> a bit longer.

I don't expect anything new.

I got all that from: git grep '>oid' doc/src/sgml, so perhaps you'd want to
check for any other stragglers.

Thanks,
Justin




Re: Cleanup/remove/update references to OID column

2019-04-11 Thread Andres Freund
Hi,

On 2019-04-11 13:26:38 +0900, Michael Paquier wrote:
> On Wed, Apr 10, 2019 at 11:59:18AM -0500, Justin Pryzby wrote:
> > I found and included fixes for a few more references:
> > 
> >  doc/src/sgml/catalogs.sgml   | 2 +-
> >  doc/src/sgml/ddl.sgml| 3 +--
> >  doc/src/sgml/information_schema.sgml | 4 ++--
> >  doc/src/sgml/ref/create_trigger.sgml | 2 +-
> >  doc/src/sgml/spi.sgml| 2 +-
> 
> Open Item++.

> Andres, as this is a consequence of 578b229, could you look at what is
> proposed here?

Yes, I was planning to commit that soon-ish. There still seemed to be
review / newer versions happening, though, so I was thinking of waiting
a bit longer.

- Andres




Re: cache lookup failed for collation 0

2019-04-11 Thread Tom Lane
Jeevan Chalke  writes:
> The following test sequence causes the error "cache lookup failed for
> collation 0":
> postgres:5432 [42106]=# create table foobar(a bytea primary key, b int);
> CREATE TABLE
> postgres:5432 [42106]=# insert into foobar
> values('\x4c835521685c46ee827ab83d376cf028', 1);
> INSERT 0 1
> postgres:5432 [42106]=# select * from foobar where a like '%1%';
> ERROR:  cache lookup failed for collation 0

Good catch!

> The error is coming from get_collation_isdeterministic() when colloid
> passed is 0. I think like we do in get_collation_name(), we should return
> false here when such collation oid does not exist.

Considering that e.g. lc_ctype_is_c() doesn't fail for InvalidOid, I agree
that it's probably a bad idea for get_collation_isdeterministic to fail.
There's a lot of code that thinks it can check for InvalidOid only in slow
paths.  However, I'd kind of expect the default result to be "true" not
"false".  Doing what you suggest would make match_pattern_prefix fail
entirely, unless we also put a special case there.

regards, tom lane




Re: pg_dump is broken for partition tablespaces

2019-04-11 Thread Alvaro Herrera
On 2019-Apr-10, Alvaro Herrera wrote:

> This is because ruleutils.c attaches a TABLESPACE clause when asked to
> dump an index definition; and tablecmds.c uses ruleutils to deparse the
> index definition into something that can be replayed via CREATE INDEX
> commands (or ALTER TABLE ADD CONSTRAINT UNIQUE/PRIMARY KEY, if that's
> the case.)

Found a "solution" to this -- namely, to set the GUC default_tablespace
to the empty string temporarily, and teach ruleutils.c to attach
TABLESPACE clauses on index/constraint definitions only if they
are not in the database tablespace.  That makes everything work
correctly.  (I did have to patch psql to show tablespace for partitioned
indexes.)

However, because the tablespace to use for an index is determined at
phase 3 execution time (i.e. inside DefineIndex), look what happens in
certain weird cases:

create tablespace foo location '/tmp/foo';
set default_tablespace to foo;
alter table t add unique (b) ;
create index on t (a);

at this point, the indexes for "a" and "b" are in tablespace foo, which
is correct because that's the default tablespace.

However, if we do a type change *and add an index in the same command*,
then that index ends up in the wrong tablespace (namely the database
tablespace instead of default_tablespace):

alter table t alter a type bigint, add unique (c);

I'm not seeing any good way to fix this; I need the default tablespace
reset to only affect the index creations caused by the rewrite, but not
the ones used by different commands.  I suppose forbidding ADD
CONSTRAINT subcommands together with ALTER COLUMN SET DATA TYPE would
not fly very far.  (I'm not sure that's complete either: if you change
datatype so that a toast table is created, perhaps this action will
affect the location of said new toast table, also.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Heikki Linnakangas

On 11/04/2019 17:54, Tom Lane wrote:

Ashwin Agrawal  writes:

Thank you for trying it out. Yes, I noticed that for certain patterns
pg_lzcompress() actually requires much larger output buffers; for one 86-byte
source it required a 2296-byte output buffer. The current zedstore code doesn't
handle this case and errors out. LZ4 works fine for the same patterns, so I
would highly recommend using LZ4 only, as its speed is very fast as well.


You realize of course that *every* compression method has some inputs that
it makes bigger.  If your code assumes that compression always produces a
smaller string, that's a bug in your code, not the compression algorithm.


Of course. The code is not making that assumption, although clearly 
there is a bug there somewhere because it throws that error. It's early 
days..


In practice it's easy to weasel out of that, by storing the data 
uncompressed, if compression would make it longer. Then you need an 
extra flag somewhere to indicate whether it's compressed or not. It 
doesn't break the theoretical limit because the actual stored length is 
then original length + 1 bit, but it's usually not hard to find a place 
for one extra bit.
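
For illustration, a minimal sketch of that pattern (names are made up here,
not taken from the zedstore code; my_compress() stands in for whatever
compressor the AM uses and is assumed to return a negative value when the
output would not be smaller than the input):

typedef struct StoredItem
{
	uint32		size;			/* size of payload, in bytes */
	bool		is_compressed;	/* the "one extra bit" */
	char		payload[FLEXIBLE_ARRAY_MEMBER];
} StoredItem;

static StoredItem *
store_item(const char *data, uint32 len)
{
	char	   *cbuf = palloc(len);		/* never need more than len bytes */
	int32		clen = my_compress(data, len, cbuf);
	StoredItem *item;

	if (clen >= 0 && (uint32) clen < len)
	{
		/* compression helped: keep the compressed payload */
		item = palloc(offsetof(StoredItem, payload) + clen);
		item->size = clen;
		item->is_compressed = true;
		memcpy(item->payload, cbuf, clen);
	}
	else
	{
		/* compression would not shrink the data: store it verbatim */
		item = palloc(offsetof(StoredItem, payload) + len);
		item->size = len;
		item->is_compressed = false;
		memcpy(item->payload, data, len);
	}
	pfree(cbuf);
	return item;
}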


- Heikki




cache lookup failed for collation 0

2019-04-11 Thread Jeevan Chalke
Hello hackers,

The following test sequence causes the error "cache lookup failed for
collation 0":

postgres:5432 [42106]=# create table foobar(a bytea primary key, b int);
CREATE TABLE
postgres:5432 [42106]=# insert into foobar
values('\x4c835521685c46ee827ab83d376cf028', 1);
INSERT 0 1
postgres:5432 [42106]=# \d+ foobar
   Table "public.foobar"
 Column |  Type   | Collation | Nullable | Default | Storage  | Stats
target | Description
+-+---+--+-+--+--+-
 a  | bytea   |   | not null | | extended
|  |
 b  | integer |   |  | | plain
|  |
Indexes:
"foobar_pkey" PRIMARY KEY, btree (a)
Access method: heap

postgres:5432 [42106]=# select * from foobar where a like '%1%';
ERROR:  cache lookup failed for collation 0

---

After debugging it, I have observed that the code in question was added by
commit 5e1963fb764e9cc092e0f7b58b28985c311431d9, which added support for
collations with nondeterministic comparisons.

The error is coming from get_collation_isdeterministic() when colloid
passed is 0. I think like we do in get_collation_name(), we should return
false here when such collation oid does not exist.

Attached patch doing that change and re-arranged the code to look similar
to get_collation_name(). Also, added small testcase.

---

However, I have not fully understood the code changes made by that commit, and
thus the current behavior, i.e. the cache lookup error, might be the expected
one. But if that's the case, I kindly request an explanation of why it is
expected.

Thanks

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index b4f2d0f..69e5d88 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -942,16 +942,19 @@ bool
 get_collation_isdeterministic(Oid colloid)
 {
 	HeapTuple	tp;
-	Form_pg_collation colltup;
-	bool		result;
 
 	tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
-	if (!HeapTupleIsValid(tp))
-		elog(ERROR, "cache lookup failed for collation %u", colloid);
-	colltup = (Form_pg_collation) GETSTRUCT(tp);
-	result = colltup->collisdeterministic;
-	ReleaseSysCache(tp);
-	return result;
+	if (HeapTupleIsValid(tp))
+	{
+		Form_pg_collation colltup = (Form_pg_collation) GETSTRUCT(tp);
+		bool		result;
+
+		result = colltup->collisdeterministic;
+		ReleaseSysCache(tp);
+		return result;
+	}
+	else
+		return false;
 }
 
 /*-- CONSTRAINT CACHE --	 */
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index 0dee7d7..a8e1f6e 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -676,13 +676,19 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
  "C"
 (1 row)
 
+CREATE TABLE byteatable(a bytea primary key, b int);
+SELECT * FROM byteatable WHERE a LIKE '%1%';
+ a | b 
+---+---
+(0 rows)
+
 --
 -- Clean up.  Many of these table names will be re-used if the user is
 -- trying to run any platform-specific collation tests later, so we
 -- must get rid of them.
 --
 DROP SCHEMA collate_tests CASCADE;
-NOTICE:  drop cascades to 17 other objects
+NOTICE:  drop cascades to 18 other objects
 DETAIL:  drop cascades to table collate_test1
 drop cascades to table collate_test_like
 drop cascades to table collate_test2
@@ -700,3 +706,4 @@ drop cascades to table collate_test21
 drop cascades to table collate_test22
 drop cascades to collation mycoll2
 drop cascades to table collate_test23
+drop cascades to table byteatable
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index 89de26a..124daf6 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -259,6 +259,9 @@ SELECT collation for ((SELECT a FROM collate_test1 LIMIT 1)); -- non-collatable
 SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 
 
+CREATE TABLE byteatable(a bytea primary key, b int);
+SELECT * FROM byteatable WHERE a LIKE '%1%';
+
 --
 -- Clean up.  Many of these table names will be re-used if the user is
 -- trying to run any platform-specific collation tests later, so we


Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Tom Lane
Ashwin Agrawal  writes:
> Thank you for trying it out. Yes, I noticed that for certain patterns
> pg_lzcompress() actually requires much larger output buffers; for one 86-byte
> source it required a 2296-byte output buffer. The current zedstore code
> doesn't handle this case and errors out. LZ4 works fine for the same patterns,
> so I would highly recommend using LZ4 only, as its speed is very fast as well.

You realize of course that *every* compression method has some inputs that
it makes bigger.  If your code assumes that compression always produces a
smaller string, that's a bug in your code, not the compression algorithm.

regards, tom lane




Re: creating gist index seems to look at data ignoring transaction?

2019-04-11 Thread Tom Lane
Palle Girgensohn  writes:
> I noticed some bad data where end < begin, so I modified these first, and
> tried to create the index in the same transaction. The index creation does
> not notice the data changes. It seems creating the gist index is not
> transaction safe?

Index creation has to include not-yet-dead tuples in case the index gets
used by some transaction that can still see those tuples.  So in this
case index entries get made for both the original and the updated versions
of the tuples in question.

regards, tom lane




Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Rafia Sabih
On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal  wrote:

> Heikki and I have been hacking recently for few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and link to github branch [1] to
> follow along.
>
> The objective is to gather feedback on design and approach to the
> same. The implementation has core basic pieces working but not close
> to complete.
>
> Big thank you to Andres, Haribabu and team for the table access method
> API's. Leveraged the API's for implementing zedstore, and proves API
> to be in very good shape. Had to enhance the same minimally but
> in-general didn't had to touch executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting subset of columns
>   (reduced IO).
> * Reduced on-disk footprint compared to heap table. Shorter tuple
>   headers and also leveraging compression of similar type data
> * Be first-class citizen in the Postgres architecture (tables data can
>   just independently live in columnar storage)
> * Fully MVCC compliant
> * All Indexes supported
> * Hybrid row-column store, where some columns are stored together, and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored
>   together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate need for separate toast tables
> * Faster add / drop column or changing data type of column by avoiding
>   full rewrite of the table.
>
> High-level Design - B-trees for the win!
> 
>
> To start simple, let's ignore column store aspect for a moment and
> consider it as compressed row store. The column store is natural
> extension of this concept, explained in next section.
>
> The basic on-disk data structure leveraged is a B-tree, indexed by
> TID. BTree being a great data structure, fast and versatile. Note this
> is not referring to existing Btree indexes, but instead net new
> separate BTree for table data storage.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. Having logical TID
> provides flexibility to move the tuples around different pages on page
> splits or page merges can be performed.
>
In my understanding these TIDs will follow the datatype of the current
ones. Then my question is: will TIDs be reusable here, and how will the
reusable range of TIDs be determined? If not, wouldn't that become a hard
limit on the number of insertions performed on a table?

> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have short uncompressed header,
> followed by btree items. Two kinds of items exist:
>
>  - plain item, holds one tuple or one datum, uncompressed payload
>  - a "container item", holds multiple plain items, compressed payload
>
> +--------------------------------------------------------------------
> | Fixed-size page header:
> |
> |   LSN
> |   TID low and hi key (for Lehman & Yao B-tree operations)
> |   left and right page pointers
> |
> | Items:
> |
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   ...
> |
> +--------------------------------------------------------------------
>
> Row store
> -
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, a undo record pointer, and the actual
> tuple data uncompressed.
>
> In uncompressed form, the page can be arbitrarily large. But after
> compression, it must fit into a physical 8k block. If on insert or
> update of a tuple, the page cannot be compressed below 8k anymore, the
> page is split. Note that because TIDs are logical rather than physical
> identifiers, we can freely move tuples from one physical page to
> another during page split. A tuple's TID never changes.
>
> The buffer cache caches compressed blocks. Likewise, WAL-logging,
> full-page images etc. work on compressed blocks. Uncompression is done
> on-the-fly, as and when needed in backend-private memory, when
> reading. For some compressions like rel encoding or delta encoding
> tuples can be constructed directly from compressed data.
>
> Column store
> 
>
> A column store uses the same structure but we have *multiple* B-trees,
> one for each column, all indexed by TID. The B-trees for all columns
> are stored in the same physical file.
>
> A metapage at block 0, has links to the roots of the B-trees. Leaf
> pages look the same, but instead of 

Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Magnus Hagander
On Thu, Apr 11, 2019 at 3:15 PM Heikki Linnakangas  wrote:

> On 11/04/2019 16:12, Rafia Sabih wrote:
> > On Tue, 9 Apr 2019 at 20:29, Robert Haas wrote:
> >
> > BTW, can I express a small measure of disappointment that the name for
> > the thing under discussion on this thread chose to be called
> > "zedstore"?  That seems to invite confusion with "zheap", especially
> > in parts of the world where the last letter of the alphabet is
> > pronounced "zed," where people are going to say zed-heap and
> > zed-store. Brr.
> >
> > +1 on Brr. Looks like Thomas and your thought on having 'z' makes
> > things popular/stylish, etc. is after all true, I was skeptical back
> > then.
>
> BrrStore works for me, too ;-).
>

Also works as a reference to the Finnish climate?

(Sorry, couldn't help myself)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [PATCH v20] GSSAPI encryption support

2019-04-11 Thread Magnus Hagander
On Thu, Apr 11, 2019 at 3:56 PM Robert Haas  wrote:

> On Wed, Apr 10, 2019 at 9:47 PM Stephen Frost  wrote:
> > Right, if we changed the name of the auth method then everyone who is
> > using the "gss" auth method would have to update their pg_hba.conf
> > files...  That would be very ugly.  Also, it wasn't implicitly rejected,
> > it was discussed up-thread (see the comments between Magnus and I,
> > specifically, quoted above- "that ship sailed *years* ago") and
> > explicitly rejected.
>
> Slightly off-topic, but I am not familiar with GSSAPI and don't quite
> understand what the benefits of GSSAPI encryption are as compared with
> OpenSSL encryption.  I am sure there must be some; otherwise, nobody
> would have bothered writing, reviewing, and committing this patch.
> Can somebody enlighten me?
>

You don't need to set up an SSL PKI.

Yes, you need similar keys and stuff set up for GSSAPI, but if you already
*have* those (which you do if you are using gss authentication, for example),
then it's a lot less extra overhead.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [PATCH v20] GSSAPI encryption support

2019-04-11 Thread Robert Haas
On Wed, Apr 10, 2019 at 9:47 PM Stephen Frost  wrote:
> Right, if we changed the name of the auth method then everyone who is
> using the "gss" auth method would have to update their pg_hba.conf
> files...  That would be very ugly.  Also, it wasn't implicitly rejected,
> it was discussed up-thread (see the comments between Magnus and I,
> specifically, quoted above- "that ship sailed *years* ago") and
> explicitly rejected.

Slightly off-topic, but I am not familiar with GSSAPI and don't quite
understand what the benefits of GSSAPI encryption are as compared with
OpenSSL encryption.  I am sure there must be some; otherwise, nobody
would have bothered writing, reviewing, and committing this patch.
Can somebody enlighten me?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Konstantin Knizhnik



On 11.04.2019 16:18, Andreas Karlsson wrote:


On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:


These are my results of compressing pgbench data using different compressors:

Configuration           Size (Gb)   Time (sec)
no compression          15.31        92
zlib (default level)     2.37       284
zlib (best speed)        2.43       191
postgres internal lz     3.89       214
lz4                      4.12        95
snappy                   5.18        99
lzfse                    2.80      1099
(apple)                  2.80      1099
                         1.69       125



You see that zstd provides an almost 2 times better compression ratio at
almost the same speed.



What is "(apple) 2.80 1099"? Was that intended to be zstd?

Andreas


Ugh...
Cut and paste problems.
The whole document can be found here: 
http://garret.ru/PageLevelCompression.pdf


lzfse (apple)   2.80    1099
zstd (facebook)  1.69    125

zstd is a compression algorithm proposed by Facebook:
https://github.com/facebook/zstd

Looks like it provides the best speed/compression ratio result.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: REINDEX CONCURRENTLY 2.0

2019-04-11 Thread Alvaro Herrera
On 2019-Apr-11, Michael Paquier wrote:

> > I was going to just remove the caveat, but then I discovered that
> > REINDEX CONCURRENTLY doesn't work on INVALID indexes (why?).
> 
> This is a wanted choice.  The index built in parallel of the existing
> one during a concurrent reindex is marked invalid during most of the
> operation.  Hence, if the reindex is interrupted or fails, you finish
> with potentially twice the number of original indexes, half being
> invalid and the other half being the ones in use.  If the user decides
> to rinse and repeat the concurrent reindex, and if we were to also
> select invalid indexes for the operation, then you would finish by
> potentially doing twice the amount of work when working on a relation,
> half of it for nothing.

Hmm, I suppose that makes sense for REINDEX TABLE or anything that
reindexes more than one index, but if you do REINDEX INDEX surely it
is reasonable to allow the case?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-04-11 Thread Robert Haas
On Thu, Apr 11, 2019 at 12:22 AM Michael Paquier  wrote:
> incremental page size is reduced in the actual backup.  My preference
> tends toward a block-level approach if we were to do something in this
> area, though I fear that performance will be bad if we begin to scan
> all the relation files to fetch a set of blocks since a past LSN.
> Hence we need some kind of LSN map so as it is possible to skip a
> one block or a group of blocks (say one LSN every 8/16 blocks for
> example) at once for a given relation if the relation is mostly
> read-only.

So, in this thread, I want to focus on the UI and how the incremental
backup is stored on disk.  Making the process of identifying modified
blocks efficient is the subject of
http://postgr.es/m/CA+TgmoahOeuuR4pmDP1W=jnryp4fwhyntosa68bfxjq-qb_...@mail.gmail.com

Over there, the merits of what you are describing here and the
competing approaches are under discussion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Experimenting with hash join prefetch

2019-04-11 Thread Robert Haas
On Wed, Apr 10, 2019 at 2:10 AM Thomas Munro  wrote:
> Here is an example of times for a trivial join on my laptop.  Note
> that this is prefetching only the probing phase, not while building
> which should also be possible.  I didn't get around to trying deeper
> prefetch pipelines as discussed earlier, but those seemed to produce
> diminishing returns with hardcoded tests like in the earlier message.

It would be interesting to see how this does with moderately-long text
keys, say 32 or 64 byte strings, and with actually-long text keys, say
several kB, and then with gigantic text keys, say several MB.  At some
point the key probably gets large enough that computing the hash value
for the next key evicts the current key from the relevant CPU cache,
and if I had to guess, at that point prefetching will become a loser.
But I don't know where that point is.  If it turns out for example
that this technique is a winner for pass-by-value datatypes and a
loser for pass-by-reference datatypes, or that it's a winner always,
or some sort of simple rule like that, awesome!  But if it turns out
that there's no simple rule that we can use to know whether it wins or
loses, then that might make things a little tricky.
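
For context, the kind of software pipelining being discussed looks roughly like
the sketch below. This is not Thomas's actual patch: hash_key(), bucket_for()
and probe_bucket() are placeholders for the real executor machinery, and the
point is only to show where hashing the next key and prefetching its bucket
overlap with probing the current one.

	uint32		next_hash = (ntuples > 0) ? hash_key(tuples[0]) : 0;

	for (int i = 0; i < ntuples; i++)
	{
		uint32		cur_hash = next_hash;

		if (i + 1 < ntuples)
		{
			/* hash the next key and prefetch its bucket before we need it */
			next_hash = hash_key(tuples[i + 1]);
			__builtin_prefetch(bucket_for(hashtable, next_hash));
		}

		probe_bucket(hashtable, cur_hash, tuples[i]);
	}

The concern above is that if hash_key() has to walk a multi-kilobyte key, that
walk may itself evict the current key's cache lines, at which point the overlap
stops paying off.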

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_rewind vs superuser

2019-04-11 Thread Michael Paquier
On Thu, Apr 11, 2019 at 10:33:13AM +0200, Magnus Hagander wrote:
> And I think it would make sense to wait by default, but we could then also
> have a commandline parameter that says "don't wait, instead error out in
> case the checkpoint isn't done".
> 
> Or something like that?

Yes, that would be the idea.  You still need to cover the case where
both instances are on the same timeline, in which case you could wait
for a checkpoint forever, so we'd need to change the current behavior
a bit by making sure that we always throw an error if both nodes are
still on the same timeline after the timeout (see TAP test
005_same_timeline.pl).  I am not sure that you need a separate option
to control the case where you don't want to wait though.  Perhaps we
could have a separate switch, but a user could also just set
--timeout=0 to match that behavior.
--
Michael


signature.asc
Description: PGP signature


Re: Issue in ExecCleanupTupleRouting()

2019-04-11 Thread David Rowley
On Fri, 12 Apr 2019 at 01:06, Etsuro Fujita  wrote:
> +   /*
> +* Check if this result rel is one belonging to the node's subplans,
> +* if so, let ExecEndPlan() clean it up.
> +*/
> +   if (htab)
> +   {
> +   Oid partoid;
> +   boolfound;
> +
> +   partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
> +
> +   (void) hash_search(htab, , HASH_FIND, );
> +   if (found)
> +   continue;
> +   }
>
> /* Allow any FDWs to shut down if they've been exercised */
> -   if (resultRelInfo->ri_PartitionReadyForRouting &&
> -   resultRelInfo->ri_FdwRoutine != NULL &&
> +   if (resultRelInfo->ri_FdwRoutine != NULL &&
> resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
>
> resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
>resultRelInfo);
>
> This skips subplan resultrels before calling EndForeignInsert() if they
> are foreign tables, which I think causes an issue: the FDWs would fail
> to release resources for their foreign insert operations, because
> ExecEndPlan() and ExecEndModifyTable() don't do anything to allow them
> to do that.  So I think we should skip subplan resultrels after
> EndForeignInsert().  Attached is a small patch for that.

Oops. For some reason I had been under the impression that it was
nodeModifyTable.c, or whatever the calling code happened to be, that
handles these ones, but this is not the case, as we call
ExecInitRoutingInfo() from ExecFindPartition(), which makes the call to
BeginForeignInsert. If that part is handled by the tuple routing code,
then the subsequent cleanup should be too, in which case your patch
looks fine.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: REINDEX CONCURRENTLY 2.0

2019-04-11 Thread Michael Paquier
On Thu, Apr 11, 2019 at 11:21:29AM +0100, Dagfinn Ilmari Mannsåker wrote:
> I noticed that the docs for how to recover from a failed CREATE INDEX
> CONCURRENTLY say that «REINDEX does not support concurrent builds»,
> which is no longer true.

Good catch.  I'll apply that in a couple of hours.

> I was going to just remove the caveat, but then I discovered that
> REINDEX CONCURRENTLY doesn't work on INVALID indexes (why?).

This is a wanted choice.  The index built in parallel of the existing
one during a concurrent reindex is marked invalid during most of the
operation.  Hence, if the reindex is interrupted or fails, you finish
with potentially twice the number of original indexes, half being
invalid and the other half being the ones in use.  If the user decides
to rinse and repeat the concurrent reindex, and if we were to also
select invalid indexes for the operation, then you would finish by
potentially doing twice the amount of work when working on a relation,
half of it for nothing.
--
Michael


signature.asc
Description: PGP signature


Re: finding changed blocks using WAL scanning

2019-04-11 Thread Robert Haas
On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
 wrote:
> I had in mind that you could have different overlapping incremental
> backup jobs in existence at the same time.  Maybe a daily one to a
> nearby disk and a weekly one to a faraway cloud.  Each one of these
> would need a separate replication slot, so that the information that is
> required for *that* incremental backup series is preserved between runs.
>  So just one reserved replication slot that feeds the block summaries
> wouldn't work.  Perhaps what would work is a flag on the replication
> slot itself "keep block summaries for this slot".  Then when all the
> slots with the block summary flag are past an LSN, you can clean up the
> summaries before that LSN.

I don't think that quite works.  There are two different LSNs.  One is
the LSN of the oldest WAL archive that we need to keep around so that
it can be summarized, and the other is the LSN of the oldest summary
we need to keep around so it can be used for incremental backup
purposes.  You can't keep both of those LSNs in the same slot.
Furthermore, the LSN stored in the slot is defined as the amount of
WAL we need to keep, not the amount of something else (summaries) that
we need to keep.  Reusing that same field to mean something different
sounds inadvisable.

In other words, I think there are two problems which we need to
clearly separate: one is retaining WAL so we can generate summaries,
and the other is retaining summaries so we can generate incremental
backups.  Even if we solve the second problem by using some kind of
replication slot, we still need to solve the first problem somehow.
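
To make the two horizons concrete, here is a minimal hedged C sketch (the
type and field names are assumptions for illustration, not part of any
proposal in this thread):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* stand-in for PostgreSQL's XLogRecPtr */

typedef struct WalSummaryHorizons
{
    /*
     * WAL at or after this LSN has not been summarized yet, so the
     * segments containing it must be kept until the summarizer catches up.
     */
    XLogRecPtr  oldest_unsummarized_lsn;

    /*
     * Summaries covering WAL at or after this LSN are still needed by at
     * least one incremental-backup consumer, so they must be kept.
     */
    XLogRecPtr  oldest_needed_summary_lsn;
} WalSummaryHorizons;

/* The summarizer only requires WAL back to the first horizon... */
static bool
summarizer_still_needs_wal(const WalSummaryHorizons *h, XLogRecPtr lsn)
{
    return lsn >= h->oldest_unsummarized_lsn;
}

/* ...while summaries are retained against the second, independent one. */
static bool
summary_still_needed(const WalSummaryHorizons *h, XLogRecPtr lsn)
{
    return lsn >= h->oldest_needed_summary_lsn;
}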

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Andreas Karlsson

On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:


These are my results of compressing pgbench data with different compressors:

Configuration            Size (Gb)   Time (sec)
no compression           15.31       92
zlib (default level)     2.37        284
zlib (best speed)        2.43        191
postgres internal lz     3.89        214
lz4                      4.12        95
snappy                   5.18        99
lzfse                    2.80        1099
(apple)                  2.80        1099
                         1.69        125



You see that zstd provides an almost 2 times better compression ratio
at almost the same speed.



What is "(apple) 2.80 1099"? Was that intended to be zstd?

Andreas



Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Heikki Linnakangas

On 11/04/2019 16:12, Rafia Sabih wrote:
On Tue, 9 Apr 2019 at 20:29, Robert Haas wrote:


BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"?  That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.

+1 on Brr. Looks like Thomas's and your thought that having a 'z' makes
things popular/stylish, etc. was true after all; I was skeptical back then.


BrrStore works for me, too ;-).

- Heikki




Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Rafia Sabih
On Tue, 9 Apr 2019 at 20:29, Robert Haas  wrote:

> On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera 
> wrote:
> > This is not surprising, considering that columnar store is precisely the
> > reason for starting the work on table AMs.
> >
> > We should certainly look into integrating some sort of columnar storage
> > in mainline.  Not sure which of zedstore or VOPS is the best candidate,
> > or maybe we'll have some other proposal.  My feeling is that having more
> > than one is not useful; if there are optimizations to one that can be
> > borrowed from the other, let's do that instead of duplicating effort.
>
> I think that conclusion may be premature.  There seem to be a bunch of
> different ways of doing columnar storage, so I don't know how we can
> be sure that one size will fit all, or that the first thing we accept
> will be the best thing.
>
> Of course, we probably do not want to accept a ton of storage manager
> implementations in core.  I think if people propose implementations
> that are poor quality, or missing important features, or don't have
> significantly different use cases from the ones we've already got,
> it's reasonable to reject those.  But I wouldn't be prepared to say
> that if we have two significantly different column stores that are both
> awesome code with a complete feature set and significantly disjoint
> use cases, we should reject the second one just because it is also a
> column store.  I think that won't get out of control because few
> people will be able to produce really high-quality implementations.
>
> This stuff is hard, which I think is also why we only have 6.5 index
> AMs in core after many, many years.  And our standards have gone up
> over the years - not all of those would pass muster if they were
> proposed today.
>
> BTW, can I express a small measure of disappointment that the name for
> the thing under discussion on this thread chose to be called
> "zedstore"?  That seems to invite confusion with "zheap", especially
> in parts of the world where the last letter of the alphabet is
> pronounced "zed," where people are going to say zed-heap and
> zed-store. Brr.
>

+1 on Brr. Looks like Thomas's and your thought that having a 'z' makes things
popular/stylish, etc. was true after all; I was skeptical back then.

-- 
Regards,
Rafia Sabih


Issue in ExecCleanupTupleRouting()

2019-04-11 Thread Etsuro Fujita
Hi,

(added Alvaro, Amit, and David)

While working on an update-tuple-routing bug in postgres_fdw [1], I
noticed this change to ExecCleanupTupleRouting() made by commit
3f2393edefa5ef2b6970a5a2fa2c7e9c55cc10cf:

+   /*
+* Check if this result rel is one belonging to the node's subplans,
+* if so, let ExecEndPlan() clean it up.
+*/
+   if (htab)
+   {
+   Oid partoid;
+   boolfound;
+
+   partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+   (void) hash_search(htab, , HASH_FIND, );
+   if (found)
+   continue;
+   }

/* Allow any FDWs to shut down if they've been exercised */
-   if (resultRelInfo->ri_PartitionReadyForRouting &&
-   resultRelInfo->ri_FdwRoutine != NULL &&
+   if (resultRelInfo->ri_FdwRoutine != NULL &&
resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)

resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
   resultRelInfo);

This skips subplan resultrels before calling EndForeignInsert() if they
are foreign tables, which I think causes an issue: the FDWs would fail
to release resources for their foreign insert operations, because
ExecEndPlan() and ExecEndModifyTable() don't do anything to allow them
to do that.  So I think we should skip subplan resultrels after
EndForeignInsert().  Attached is a small patch for that.

Best regards,
Etsuro Fujita

[1]
https://www.postgresql.org/message-id/21e7eaa4-0d4d-20c2-a1f7-c7e96f4ce440%40lab.ntt.co.jp
*** a/src/backend/executor/execPartition.c
--- b/src/backend/executor/execPartition.c
***
*** 1126,1131  ExecCleanupTupleRouting(ModifyTableState *mtstate,
--- 1126,1137 
  	{
  		ResultRelInfo *resultRelInfo = proute->partitions[i];
  
+ 		/* Allow any FDWs to shut down */
+ 		if (resultRelInfo->ri_FdwRoutine != NULL &&
+ 			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
+ 			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
+ 		   resultRelInfo);
+ 
  		/*
  		 * Check if this result rel is one belonging to the node's subplans,
  		 * if so, let ExecEndPlan() clean it up.
***
*** 1142,1153  ExecCleanupTupleRouting(ModifyTableState *mtstate,
  continue;
  		}
  
- 		/* Allow any FDWs to shut down if they've been exercised */
- 		if (resultRelInfo->ri_FdwRoutine != NULL &&
- 			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
- 			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
- 		   resultRelInfo);
- 
  		ExecCloseIndices(resultRelInfo);
  		table_close(resultRelInfo->ri_RelationDesc, NoLock);
  	}
--- 1148,1153 


Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Rafia Sabih
On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal  wrote:

> Heikki and I have been hacking recently for few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and link to github branch [1] to
> follow along.
>
> The objective is to gather feedback on design and approach to the
> same. The implementation has core basic pieces working but not close
> to complete.
>
> Big thank you to Andres, Haribabu and team for the table access method
> API's. Leveraged the API's for implementing zedstore, and proves API
> to be in very good shape. Had to enhance the same minimally but
> in-general didn't had to touch executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting subset of columns
>   (reduced IO).
> * Reduced on-disk footprint compared to heap table. Shorter tuple
>   headers and also leveraging compression of similar type data
> * Be first-class citizen in the Postgres architecture (tables data can
>   just independently live in columnar storage)
> * Fully MVCC compliant
> * All Indexes supported
> * Hybrid row-column store, where some columns are stored together, and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored
>   together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate need for separate toast tables
> * Faster add / drop column or changing data type of column by avoiding
>   full rewrite of the table.
>
> High-level Design - B-trees for the win!
> 
>
> To start simple, let's ignore column store aspect for a moment and
> consider it as compressed row store. The column store is natural
> extension of this concept, explained in next section.
>
> The basic on-disk data structure leveraged is a B-tree, indexed by
> TID. BTree being a great data structure, fast and versatile. Note this
> is not referring to existing Btree indexes, but instead net new
> separate BTree for table data storage.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. Having logical TID
> provides flexibility to move the tuples around different pages on page
> splits or page merges can be performed.
>
> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have short uncompressed header,
> followed by btree items. Two kinds of items exist:
>
>  - plain item, holds one tuple or one datum, uncompressed payload
>  - a "container item", holds multiple plain items, compressed payload
>
> +-
> | Fixed-size page header:
> |
> |   LSN
> |   TID low and hi key (for Lehman & Yao B-tree operations)
> |   left and right page pointers
> |
> | Items:
> |
> |   TID | size | flags | uncompressed size | lastTID | payload (container
> item)
> |   TID | size | flags | uncompressed size | lastTID | payload (container
> item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   ...
> |
> +
>
> Row store
> -
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, a undo record pointer, and the actual
> tuple data uncompressed.
>
> In uncompressed form, the page can be arbitrarily large. But after
> compression, it must fit into a physical 8k block. If on insert or
> update of a tuple, the page cannot be compressed below 8k anymore, the
> page is split. Note that because TIDs are logical rather than physical
> identifiers, we can freely move tuples from one physical page to
> another during page split. A tuple's TID never changes.
>
> The buffer cache caches compressed blocks. Likewise, WAL-logging,
> full-page images etc. work on compressed blocks. Uncompression is done
> on-the-fly, as and when needed in backend-private memory, when
> reading. For some compressions like rel encoding or delta encoding
> tuples can be constructed directly from compressed data.
>
> Column store
> 
>
> A column store uses the same structure but we have *multiple* B-trees,
> one for each column, all indexed by TID. The B-trees for all columns
> are stored in the same physical file.
>
> A metapage at block 0, has links to the roots of the B-trees. Leaf
> pages look the same, but instead of storing the whole tuple, stores
> just a single attribute. To reconstruct a row with given TID, scan
> descends down the B-trees for all the columns using that TID, and
> fetches all attributes. Likewise, a sequential scan walks all the
> B-trees in lockstep.
>
> So, in 
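
As a reading aid only, here is a rough self-contained C sketch of the two
leaf-item shapes described in the quoted layout above; the field names and
widths are assumptions for illustration, not the actual zedstore structs:

#include <stdint.h>

typedef uint64_t zstid;             /* only the low 48 bits are meaningful */

/* Plain item: one uncompressed tuple/datum plus its undo pointer. */
typedef struct ZSPlainItemSketch
{
    zstid       tid;                /* logical row identifier */
    uint16_t    size;               /* total item size in bytes */
    uint16_t    flags;
    uint64_t    undo_ptr;           /* undo record pointer */
    /* uncompressed payload follows */
} ZSPlainItemSketch;

/* Container item: many plain items stored as one compressed payload. */
typedef struct ZSContainerItemSketch
{
    zstid       tid;                /* first TID covered by the container */
    uint16_t    size;               /* compressed size on the page */
    uint16_t    flags;
    uint32_t    uncompressed_size;
    zstid       last_tid;           /* last TID covered by the container */
    /* compressed payload follows */
} ZSContainerItemSketch;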

creating gist index seems to look at data ignoring transaction?

2019-04-11 Thread Palle Girgensohn
Hi,

I have a table with two dates, timeframe_begin and timeframe_end.

I'd like to use daterange operators on this table, and an easy way would be to 
set up an index using gist on daterange(timeframe_begin, timeframe_end, '[]');

I noticed some bad data where end < begin, so I modified these rows first, and
tried to create the index in the same transaction. The index creation does not
notice the data changes. It seems creating the gist index is not
transaction safe?

db=> begin; 
BEGIN
db=> update group_info set timeframe_begin = timeframe_end where 
timeframe_begin > timeframe_end;
UPDATE 76
db=> create index group_info_timeframe_idx on group_info using gist 
(daterange(timeframe_begin, timeframe_end, '[]'));
ERROR:  range lower bound must be less than or equal to range upper bound
db=> abort; 
ROLLBACK

db=> begin; 
BEGIN
db=> update group_info set timeframe_begin = timeframe_end where 
timeframe_begin > timeframe_end;
UPDATE 76
db=> commit;
COMMIT
db=> begin;
BEGIN
db=> create index group_info_timeframe_idx on group_info using gist 
(daterange(timeframe_begin, timeframe_end, '[]'));
CREATE INDEX
db=> commit;
COMMIT
db=>


I cannot find anything documented about gist indexes not being transaction
safe. It is reproducible on different machines with different datasets. Is this
the correct behaviour?

This is on PostgreSQL-9.6.

Cheers,
Palle





Re: pgbench - add minimal stats on initialization

2019-04-11 Thread Ibrar Ahmed
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

Patch works fine on my machine.

The new status of this patch is: Ready for Committer


Re: Pluggable Storage - Andres's take

2019-04-11 Thread Heikki Linnakangas

On 08/04/2019 20:37, Andres Freund wrote:

Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:

There were a bunch of typos in the comments in tableam.h, see attached. Some
of the comments could use more copy-editing and clarification, I think, but
I stuck to fixing just typos and such for now.

I pushed these after adding three boring changes by pgindent. Thanks for
those!

I'd greatly welcome more feedback on the comments - I've been pretty
deep in this for so long that I don't see all of the issues anymore. And
a mild dyslexia doesn't help...


Here is another iteration on the comments. The patch is a mix of 
copy-editing and questions. The questions are marked with "HEIKKI:". I 
can continue the copy-editing, if you can reply to the questions, 
clarifying the intention on some parts of the API. (Or feel free to pick 
and push any of these fixes immediately, if you prefer.)


- Heikki
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f7f726b5aec..bbcab9ce31a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
 		{"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Sets the default table access method for new tables."),
 			NULL,
-			GUC_IS_NAME
+			GUC_NOT_IN_SAMPLE | GUC_IS_NAME
 		},
 		_table_access_method,
 		DEFAULT_TABLE_ACCESS_METHOD,
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 82de4cdcf2c..8aeeba38ca2 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -57,6 +57,8 @@ typedef struct TableScanDescData *TableScanDesc;
  * pointer to this structure.  The information here must be sufficient to
  * properly initialize each new TableScanDesc as workers join the scan, and it
  * must act as a information what to scan for those workers.
+ *
+ * This is stored in dynamic shared memory, so you can't use pointers here.
  */
 typedef struct ParallelTableScanDescData
 {
@@ -64,6 +66,11 @@ typedef struct ParallelTableScanDescData
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
 	Size		phs_snapshot_off;	/* data for snapshot */
+
+	/*
+	 * Table AM specific data follows. After the table AM specific data
+	 * comes the snapshot data, at 'phs_snapshot_off'.
+	 */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6fbfcb96c98..d4709563e7e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -91,8 +91,9 @@ typedef enum TM_Result
  * xmax is the outdating transaction's XID.  If the caller wants to visit the
  * replacement tuple, it must check that this matches before believing the
  * replacement is really a match.
+ * HEIKKI: matches what? xmin, but that's specific to the heapam.
  *
- * cmax is the outdating command's CID, but only when the failure code is
+ * cmax is the outdating command's CID. Only set when the failure code is
  * TM_SelfModified (i.e., something in the current transaction outdated the
  * tuple); otherwise cmax is zero.  (We make this restriction because
  * HeapTupleHeaderGetCmax doesn't work for tuples outdated in other
@@ -133,7 +134,11 @@ typedef void (*IndexBuildCallback) (Relation index,
  * returned by FormData_pg_am.amhandler.
  *
  * In most cases it's not appropriate to call the callbacks directly, use the
- * table_* wrapper functions instead.
+ * table_* wrapper functions instead. The descriptions of the callbacks here
+ * are written from the AM implementor's point of view. For more information
+ * on how to call them, see the wrapper functions. (If you update the comments
+ * on either a callback or a wrapper function, remember to also update the other
+ * one!)
  *
  * GetTableAmRoutine() asserts that required callbacks are filled in, remember
  * to update when adding a callback.
@@ -179,6 +184,12 @@ typedef struct TableAmRoutine
 	 *
 	 * if temp_snap is true, the snapshot will need to be deallocated at
 	 * scan_end.
+	 *
+	 * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
+	 * a bit surprising for the AM, no? Can it be called when a scan is
+	 * already in progress?
+	 *
+	 * HEIKKI: A flags bitmask argument would be more readable than 6 booleans
 	 */
 	TableScanDesc (*scan_begin) (Relation rel,
  Snapshot snapshot,
@@ -194,6 +205,9 @@ typedef struct TableAmRoutine
 	/*
 	 * Release resources and deallocate scan. If TableScanDesc.temp_snap,
 	 * TableScanDesc.rs_snapshot needs to be unregistered.
+	 *
+	 * HEIKKI: I find this 'temp_snap' thing pretty weird. Can't the caller handle
+	 * deregistering it?
 	 */
 	void		(*scan_end) (TableScanDesc scan);
 
@@ -221,6 +235,11 @@ typedef struct TableAmRoutine
 	/*
 	 * Estimate the size of shared memory needed for a parallel scan of this
 	 

Re: bug in update tuple routing with foreign partitions

2019-04-11 Thread Etsuro Fujita

(2019/04/11 14:33), Amit Langote wrote:

On 2019/04/10 17:38, Etsuro Fujita wrote:

(2019/03/06 18:33), Amit Langote wrote:

I noticed a bug with how UPDATE tuple routing initializes ResultRelInfos
to use for partition routing targets.  Specifically, the bug occurs when
UPDATE targets include a foreign partition that is locally modified (as
opposed to being modified directly on the remove server) and its
ResultRelInfo (called subplan result rel in the source code) is reused for
tuple routing:



The solution is to teach ExecSetupPartitionTupleRouting to avoid using a
subplan result rel if its ri_FdwState is already set.


I'm not sure that is a good idea, because that requires creating a new
ResultRelInfo, which is not free; I think it requires a lot of work.


After considering this a bit more and studying your proposed solution, I
tend to agree.  Besides the performance considerations you point out, which I
too think are valid, I realize that going with my approach would mean
embedding some assumptions in the core code about ri_FdwState; in this
case, assuming that a subplan result rel's ri_FdwState is not to be
re-purposed for tuple routing insert, which is only based on looking at
postgres_fdw's implementation.


Yeah, that assumption might not apply to some other FDWs.


Not to mention the ensuing complexity in
the core tuple routing code -- it now needs to account for the fact that
not all subplan result rels may have been re-purposed for tuple-routing,
something that's too complicated to handle in PG 11.


I think so too.


IOW, let's call this a postgres_fdw bug and fix it there as you propose.


Agreed.


Another solution to avoid that would be to store the fmstate created by
postgresBeginForeignInsert() into the ri_FdwState, not overwrite the
ri_FdwState, like the attached.  This would not need any changes to the
core, and this would not cause any overheads either, IIUC.  What do you
think about that?


+1.  Just one comment:

+/*
+ * If the given resultRelInfo already has PgFdwModifyState set, it means
+ * the foreign table is an UPDATE subplan resultrel; in which case, store
+ * the resulting state into the PgFdwModifyState.
+ */

The last "PgFdwModifyState" should be something else?  Maybe,
sub-PgModifyState or sub_fmstate?


OK, will revise.  My favorite would be the latter.


I've also added a
test in postgres_fdw.sql to exercise this test case.


Thanks again, but the test case you added works well without any fix:


Oops, I should really adopt a habit of adding a test case before code.


+-- Check case where the foreign partition is a subplan target rel and
+-- foreign parttion is locally modified (target table being joined
+-- locally prevents a direct/remote modification plan).
+explain (verbose, costs off)
+update utrtest set a = 1 from (values (1), (2)) s(x) where a = s.x
returning *;
+  QUERY PLAN
+--

+ Update on public.utrtest
+   Output: remp.a, remp.b, "*VALUES*".column1
+   Foreign Update on public.remp
+ Remote SQL: UPDATE public.loct SET a = $2 WHERE ctid = $1 RETURNING
a, b
+   Update on public.locp
+   ->   Hash Join
+ Output: 1, remp.b, remp.ctid, "*VALUES*".*, "*VALUES*".column1
+ Hash Cond: (remp.a = "*VALUES*".column1)
+ ->   Foreign Scan on public.remp
+   Output: remp.b, remp.ctid, remp.a
+   Remote SQL: SELECT a, b, ctid FROM public.loct FOR UPDATE
+ ->   Hash
+   Output: "*VALUES*".*, "*VALUES*".column1
+   ->   Values Scan on "*VALUES*"
+ Output: "*VALUES*".*, "*VALUES*".column1
+   ->   Hash Join
+ Output: 1, locp.b, locp.ctid, "*VALUES*".*, "*VALUES*".column1
+ Hash Cond: (locp.a = "*VALUES*".column1)
+ ->   Seq Scan on public.locp
+   Output: locp.b, locp.ctid, locp.a
+ ->   Hash
+   Output: "*VALUES*".*, "*VALUES*".column1
+   ->   Values Scan on "*VALUES*"
+ Output: "*VALUES*".*, "*VALUES*".column1
+(24 rows)

In this test case, the foreign partition is updated before ri_FdwState is
overwritten by postgresBeginForeignInsert(), which is invoked by the tuple
routing code, so it would work without any fix.  So I modified the test
case accordingly.


Thanks for fixing that.   I hadn't noticed that foreign partition remp is
updated before local partition locp; should've added a locp0 for values
(0) so that remp appears second in the ModifyTable's subplans.  Or, well,
make the foreign table remp appear later by making it a partition for
values in (3) as your patch does.

BTW, have you noticed that the RETURNING clause returns the same row twice?


I noticed this, but I didn't think hard about it.  :(


+update utrtest set a = 3 from (values (2), (3)) s(x) where a = s.x
returning *;
+ a | b | x
+---+---+---
+ 3 | qux triggered !   | 2

Re: Failure in contrib test _int on loach

2019-04-11 Thread Anastasia Lubennikova

10.04.2019 18:25, Heikki Linnakangas writes:

On 09/04/2019 19:11, Anastasia Lubennikova wrote:

05.04.2019 19:41, Anastasia Lubennikova writes:

In attachment, you can find patch with a test that allows to reproduce
the bug not randomly, but on every run.
Now I'm trying to find a way to fix the issue.


The problem was caused by incorrect detection of the page to insert the new
tuple into after a split.
If gistinserttuple() of the tuple formed by gistgetadjusted() had to
split the page, we must go back to the parent and
descend back to the child that's a better fit for the new tuple.

Previously this was handled by the code block with the following 
comment:


* Concurrent split detected. There's no guarantee that the
* downlink for this page is consistent with the tuple we're
* inserting anymore, so go back to parent and rechoose the best
* child.

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added a new flag to detect such splits during index build.


Isn't it possible that the grandparent page is also split, so that 
we'd need to climb further up?



From what I understand, the only reason for a grandparent split during
gistbuild is the insertion of the newtup returned by gistgetadjusted().

After we step up the stack, we will do gistchoose() to choose the new correct
child, adjust the downlink key and insert it into the grandparent. If this
insertion causes a split, we will recursively follow the same codepath
and set stack->retry_from_parent again.

So it is possible, but it doesn't require any extra algorithm changes.
I didn't manage to generate a dataset that reproduces a grandparent split.
Though, I do agree that it's worth checking out. Do you have any ideas?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: REINDEX CONCURRENTLY 2.0

2019-04-11 Thread Dagfinn Ilmari Mannsåker
Peter Eisentraut  writes:

> On 2019-03-28 09:07, Sergei Kornilov wrote:
>> Unfortunately patch does not apply due recent commits. Any chance
>> this can be fixed (and even committed in pg12)?
>
> Committed :)

I noticed that the docs for how to recover from a failed CREATE INDEX
CONCURRENTLY say that «REINDEX does not support concurrent builds»,
which is no longer true.  I was going to just remove the caveat, but
then I discovered that REINDEX CONCURRENTLY doesn't work on INVALID
indexes (why?).

Attached is a patch that instead adjusts the claim to say that REINDEX
does not support rebuilding invalid indexes concurrently.

- ilmari
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen

>From e2de72b348f8a96e24128fc4188bd542eb676610 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Thu, 11 Apr 2019 10:58:47 +0100
Subject: [PATCH] Correct claim about REINDEX CONCURRENTLY in CREATE INDEX
 CONCURRENTLY docs

REINDEX CONCURRENTLY exists, but cannot reindex invalid indexes.
---
 doc/src/sgml/ref/create_index.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index d9d95b20e3..c458f54ef1 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -557,8 +557,8 @@
 method in such cases is to drop the index and try again to perform
 CREATE INDEX CONCURRENTLY.  (Another possibility is to rebuild
 the index with REINDEX.  However, since REINDEX
-does not support concurrent builds, this option is unlikely to seem
-attractive.)
+does not support reindexing invalid indexes concurrently, this option is
+unlikely to seem attractive.)

 

-- 
2.20.1



Re: Commit message / hash in commitfest page.

2019-04-11 Thread Ibrar Ahmed
On Thu, Apr 11, 2019 at 2:44 PM Erikjan Rijkers  wrote:
>
> On 2019-04-11 11:36, Ibrar Ahmed wrote:
> > Hi,
> >
> > Is it possible to have the commit message, or at least the git hash, in the
> > commitfest app? It would make it very easy to track a commit against its
> > commitfest item.
> >
>
> Commitfest items always point to discussion threads. These threads often
> end with a message that says that the patch is pushed.  IMHO, that
> message would be the place to include the commithash.   It would also be
> easily findable via the commitfest application.
>

+1

> Erik Rijkers
>
>
> > --
> > Ibrar Ahmed



-- 
Ibrar Ahmed




Re: Commit message / hash in commitfest page.

2019-04-11 Thread Erikjan Rijkers

On 2019-04-11 11:36, Ibrar Ahmed wrote:

Hi,

Is it possible to have the commit message, or at least the git hash, in the
commitfest app? It would make it very easy to track a commit against its
commitfest item.



Commitfest items always point to discussion threads. These threads often 
end with a message that says that the patch is pushed.  IMHO, that 
message would be the place to include the commithash.   It would also be 
easily findable via the commitfest application.


Erik Rijkers








RE: Qestion about .partial WAL file

2019-04-11 Thread Matsumura, Ryo
Michael-san

Thanks for your advice.

> then a promoted standby which archives WAL segments in the same
> location as the primary

> if the previous primary is still running after the standby

I could not come up with the combination, but I understand now.
Sorry for bothering you.

Regards
Ryo Matsumura





Commit message / hash in commitfest page.

2019-04-11 Thread Ibrar Ahmed
Hi,

Is it possible to have the commit message, or at least the git hash, in the
commitfest app? It would make it very easy to track a commit against its
commitfest item.

-- 
Ibrar Ahmed




Re: Failure in contrib test _int on loach

2019-04-11 Thread Andrey Lepikhov




On 11/04/2019 13:14, Heikki Linnakangas wrote:

On 11/04/2019 09:10, Andrey Lepikhov wrote:

On 10/04/2019 20:25, Heikki Linnakangas wrote:

On 09/04/2019 19:11, Anastasia Lubennikova wrote:

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added a new flag to detect such splits during index build.


Isn't it possible that the grandparent page is also split, so that we'd
need to climb further up?


Based on Anastasia's idea I prepared an alternative solution to fix the bug
(see attachment).
It utilizes the idea of a linearly incrementing LSN/NSN. The WAL write process
is used to set the NSN value to 1 for each block of the index relation.
I hope this can be a fairly clear and safe solution.


That's basically the same idea as always using the "fake LSN" during 
index build, like the original version of this patch did. It's got the 
problem that I mentioned at 
https://www.postgresql.org/message-id/090fb3cb-1ca4-e173-ecf7-47d41ebac...@iki.fi: 



* Using "fake" unlogged LSNs for GiST index build seemed fishy. I 
could not convince myself that it was safe in all corner cases. In a 
recently initdb'd cluster, it's theoretically possible that the fake 
LSN counter overtakes the real LSN value, and that could lead to 
strange behavior. For example, how would the buffer manager behave, if 
there was a dirty page in the buffer cache with an LSN value that's 
greater than the current WAL flush pointer? I think you'd get "ERROR: 
xlog flush request %X/%X is not satisfied --- flushed only to %X/%X".


Perhaps the risk is theoretical; the real WAL begins at XLOG_SEG_SIZE, 
so with defaults WAL segment size, the index build would have to do 
about 16 million page splits. The index would have to be at least 150 GB 
for that. But it seems possible, and with non-default segment and page 
size settings more so.
As I see in bufmgr.c, XLogFlush() can't be called during index build. In
the log_newpage_range() call we can use a mask to set the value of the NSN
(and LSN) to 1.


Perhaps we could start at 1, but instead of using a global counter, 
whenever a page is split, we take the parent's LSN value and increment 
it by one. So different branches of the tree could use the same values, 
which would reduce the consumption of the counter values.


Yet another idea would be to start the counter at 1, but check that it 
doesn't overtake the WAL insert pointer. If it's about to overtake it, 
just generate some dummy WAL.


But it seems best to deal with this in gistdoinsert(). I think 
Anastasia's approach of adding a flag to GISTInsertStack can be made to 
work, if we set the flag somewhere in gistinserttuples() or 
gistplacetopage(), whenever a page is split. That way, if it needs to 
split multiple levels, the flag is set on all of the corresponding 
GISTInsertStack entries.


Yet another trivial fix would be just always start the tree descend from 
the root in gistdoinsert(), if a page is split. Not as efficient, but 
probably negligible in practice.

Agree

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company




RE: Problem during Windows service start

2019-04-11 Thread Higuchi, Daisuke
Hi, 

+   case POSTMASTER_STILL_STARTING:
+   write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for 
server startup\n"));
+   pgwin32_SetServiceStatus(SERVICE_START_PENDING);
+   return;

Could this patch solve the problem from the first post [1]?
I think no one can change the service status to SERVICE_RUNNING once the
timeout has occurred, even if the server starts up properly afterwards.

In 9.6 or earlier, the main case where the problem occurs is when the timeout
fires because of a long recovery. Even if recovery takes a long time and the
timeout occurs, recovery continues in the background. In this case, I want the
status to be set to SERVICE_RUNNING after recovery is completed.

In 10 or later, I understand wait_for_postmaster does not wait until recovery
is completed, so I think this problem rarely occurs in PG 10 or later.

[1] 
https://www.postgresql.org/message-id/99C4246E73ED1B478BBB9671718426203E37F485@G01JPEXMBKW03

Regards, 
Daisuke, Higuchi



Re: Zedstore - compressed in-core columnar storage

2019-04-11 Thread Konstantin Knizhnik



On 11.04.2019 8:03, Ashwin Agrawal wrote:

On Apr 10, 2019, at 9:08 PM, Mark Kirkwood  
wrote:



On 11/04/19 4:01 PM, Mark Kirkwood wrote:

On 9/04/19 12:27 PM, Ashwin Agrawal wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.



Very nice. I realize that it is very early days, but applying this patch I've 
managed to stumble over some compression bugs doing some COPY's:

benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'
USING DELIMITERS ',';
psql: ERROR:  compression failed. what now?
CONTEXT:  COPY dim1, line 458

The log has:

2019-04-11 15:48:43.976 NZST [2006] ERROR:  XX000: compression failed. what now?
2019-04-11 15:48:43.976 NZST [2006] CONTEXT:  COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, 
zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT:  COPY dim1 FROM 
'/data0/dump/dim1.dat'
 USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote
(https://sourceforge.net/projects/benchw/).  The row concerned looks like:

457,457th interesting measure,1th measure 
type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure 
type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg


I'll see if changing to LZ4 makes any different.



The COPY works with LZ4 configured.

Thank you for trying it out. Yes, we noticed that for certain patterns
pg_lzcompress() actually requires much larger output buffers; for one 86-byte
source it required a 2296-byte output buffer. The current zedstore code doesn't
handle this case and errors out. LZ4 works fine for the same patterns, so we
would highly recommend using LZ4 only; its speed is very good as well.




The internal Postgres lz compressor is really very inefficient compared
with other compression algorithms.
But in any case you should never assume that the compressed data
will be smaller than the plain data.
Moreover, if you try to compress already-compressed data, the
result will almost always be larger.
If the compressed data is larger (or even not significantly smaller)
than the raw data, you should store the original data.
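
The usual way to honor that rule is to compress into a worst-case-sized buffer
and keep the original bytes whenever compression does not pay off. Below is a
minimal standalone sketch of that pattern using liblz4; it is an illustration
under my own naming, not code taken from the zedstore patch:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <lz4.h>

/*
 * Compress src into a freshly allocated buffer.  If LZ4 cannot shrink the
 * input, store the raw bytes instead.  Returns the stored length and sets
 * *compressed accordingly.
 */
static int
compress_or_copy(const char *src, int srclen, char **out, bool *compressed)
{
    int     bound = LZ4_compressBound(srclen);  /* worst-case output size */
    char   *buf = malloc(bound);
    int     clen = LZ4_compress_default(src, buf, srclen, bound);

    if (clen > 0 && clen < srclen)
    {
        *out = buf;
        *compressed = true;
        return clen;
    }

    /* Compression failed or did not help: keep the original data. */
    free(buf);
    *out = malloc(srclen);
    memcpy(*out, src, srclen);
    *compressed = false;
    return srclen;
}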


lz4 is actually very fast, but it doesn't provide a good compression ratio.
These are my results of compressing pgbench data with different compressors:

Configuration            Size (Gb)   Time (sec)
no compression           15.31       92
zlib (default level)     2.37        284
zlib (best speed)        2.43        191
postgres internal lz     3.89        214
lz4                      4.12        95
snappy                   5.18        99
lzfse (apple)            2.80        1099
zstd                     1.69        125



You see that zstd provides an almost 2 times better compression ratio at
almost the same speed.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Minimal logical decoding on standbys

2019-04-11 Thread tushar

On 04/10/2019 09:39 PM, Andres Freund wrote:

  Have you reproduced this with Amit's latest version?


Yes - it is very much reproducible.

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: finding changed blocks using WAL scanning

2019-04-11 Thread Peter Eisentraut
On 2019-04-10 23:49, Robert Haas wrote:
> It seems to me that there are basically two ways of storing this kind
> of information, plus a bunch of variants.  One way is to store files
> that cover a range of LSNs, and basically contain a synopsis of the
> WAL for those LSNs.  You omit all the actual data and just mention
> which blocks were changed by some record in that part of the WAL.

That seems better than the other variant.

> Yet another question is how to make sure WAL doesn't get removed
> before we finish scanning it.  Peter mentioned on the other thread
> that we could use a variant replication slot, which immediately made
> me wonder why we'd need a variant.  Actually, the biggest problem I
> see here is that if we use a replication slot, somebody might try to
> drop it or use it for some other purpose, and that would mess things
> up.  I guess we could pull the usual trick of reserving names that
> start with 'pg_' for internal purposes.  Or we could just hard-code
> the LSN that was last scanned for this purpose as a bespoke constraint
> on WAL discard.  Not sure what is best.

The word "variant" was used as a hedge ;-), but now that I think about
it ...

I had in mind that you could have different overlapping incremental
backup jobs in existence at the same time.  Maybe a daily one to a
nearby disk and a weekly one to a faraway cloud.  Each one of these
would need a separate replication slot, so that the information that is
required for *that* incremental backup series is preserved between runs.
 So just one reserved replication slot that feeds the block summaries
wouldn't work.  Perhaps what would work is a flag on the replication
slot itself "keep block summaries for this slot".  Then when all the
slots with the block summary flag are past an LSN, you can clean up the
summaries before that LSN.
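
A tiny standalone C sketch of that cleanup rule; the flag and field names here
are assumptions for illustration, not an actual proposed slot change:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct SlotSketch
{
    bool        in_use;
    bool        keep_block_summaries;   /* hypothetical per-slot flag */
    XLogRecPtr  restart_lsn;
} SlotSketch;

/*
 * Block summaries older than the returned LSN are no longer needed by any
 * summary-keeping slot.  InvalidXLogRecPtr means no such slot exists, and the
 * caller decides whether that means "keep everything" or "keep nothing".
 */
static XLogRecPtr
summary_cleanup_horizon(const SlotSketch *slots, int nslots)
{
    XLogRecPtr  horizon = InvalidXLogRecPtr;

    for (int i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use || !slots[i].keep_block_summaries)
            continue;
        if (horizon == InvalidXLogRecPtr || slots[i].restart_lsn < horizon)
            horizon = slots[i].restart_lsn;
    }

    return horizon;
}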

> I think all of this should be optional functionality.  It's going to
> be really useful for people with large databases, I think, but people
> with small databases may not care, and it won't be entirely free.  If
> it's not enabled, then the functionality that would otherwise exploit
> it can fall back to doing things in a less efficient way; nothing
> needs to break hard.

With the flag on the slot scheme you wouldn't need a separate knob to
turn this on, because it's just enabled when a backup software has
created an appropriate slot.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

2019-04-11 Thread Amit Langote
On 2019/04/11 15:57, Michael Paquier wrote:
> On Wed, Mar 27, 2019 at 11:56:20AM +0900, Michael Paquier wrote:
>> Adding it to the section for older bugs sounds fine to me.  Thanks for
>> doing so.
> 
> I have begun looking at this issue.  Hopefully I'll be able to provide
> an update soon.

Great, thanks.

Regards,
Amit






RE: Protect syscache from bloating with negative cache entries

2019-04-11 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Does this result show that the hard-limit size option with memory accounting
>doesn't harm usual users who disable the hard-limit size option?

Hi, 

I've implemented a relation cache size limitation with an LRU list and the
built-in memory context size accounting.
I'll share some results here, together with a quick recap of the catcache
work, so that we can resume the discussion if needed. Relation cache bloat was
also discussed in this thread, but right now it's pending and the catcache
feature is not settled, so I believe a variety of information could be useful.

Regarding the catcache, Horiguchi-san's recent posts show pretty detailed
stats, including a comparison of the LRU overhead against a full scan of the
hash table. According to those results the LRU overhead seems small, but for
simplicity this thread goes without LRU.
https://www.postgresql.org/message-id/20190404.215255.09756748.horiguchi.kyotaro%40lab.ntt.co.jp

When the catcache had a hard limit, there was built-in memory context size
accounting machinery. I checked the overhead of that memory accounting: when
repeating palloc and pfree of an 800-byte area many times it was 4% slower,
while for a 32768-byte area there was no measurable overhead.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F44564E%40G01JPEXMBKW04
 


Regarding the relcache hard limit (relation_cache_max_size), most of the
architecture is similar to the catcache one with an LRU list, except for the
memory accounting.
Relcaches are managed by an LRU list. To prune the LRU cache, we need to know
the overall relcache size, including objects pointed to by a relcache entry,
such as the 'index info'.
So in this patch relcache objects are allocated under RelCacheMemoryContext,
which is a child of CacheMemoryContext, and objects pointed to by a relcache
entry are allocated under a child context of RelCacheMemoryContext.
With the built-in size accounting, if a memory context is set to collect the
"group (family) size", you can easily get the context size including its
children.

I ran two experiments:
A) One is pgbench using the script Tomas posted a while ago, which randomly
selects 1 from many tables.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F426207%40G01JPEXMBKW04

B) The other checks the memory context accounting overhead using the same
method as before.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F44564E%40G01JPEXMBKW04
 

A) randomly select 1 from many tables
Results are average of 5 times each.

number of tables| 100  |1000|1
---
TPS (master)|11105  |10815 |8915
TPS (patch; limit feature off)  |11254 (+1%) |11176 (+3%) |9242 (+4%)
TPS (patch: limit on with 1MB)  |11317 (+2%) |10491 (-3%) |7380 (-17%)

The results are noisy, but the overhead of the LRU list and memory accounting
seems small when the relcache limit feature is turned off.
When the limit feature is turned on, TPS drops 17% after the limit is
exceeded, which is no surprise.


B) Repeat palloc/pfree
"With group accounting" means that account test context and its child context 
with built-in accounting using "palloc_bench_family()".
The other one is that using palloc_bench(). Please see palloc_bench.gz.

[Size=32768, iter=1,000,000]
Master                    | 59.97 ms
Master with group account | 59.57 ms
patched                   | 67.23 ms
patched with family       | 68.81 ms

The overhead seems large with this patch, so this area needs more inspection.


regards,
Takeshi Ideriha




palloc_bench.gz
Description: palloc_bench.gz


v15-0001-4-Add-dlist_move_tail.patch
Description: v15-0001-4-Add-dlist_move_tail.patch


binj9L3vPCahv.bin
Description: v15-0002-4-ideriha-Memory-consumption-report-reature-of-memorycontext.patch


v15-0004-4-ideriha-Remove-RelCache-Entries.patch
Description: v15-0004-4-ideriha-Remove-RelCache-Entries.patch


v15-0003-4-ideriha-Remove-CatCache-Entries.patch
Description: v15-0003-4-ideriha-Remove-CatCache-Entries.patch


Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

2019-04-11 Thread Noah Misch
On Thu, Apr 11, 2019 at 12:48:35PM +1200, Thomas Munro wrote:
> On Mon, Apr 8, 2019 at 6:42 PM Noah Misch  wrote:
> > - AIX animals failed two ways.  First, I missed a "use" statement such that
> >   poll_start() would fail if it needed more than one attempt.  Second, I
> >   assumed $pid would be gone as soon as kill(9, $pid) returned[1].
> 
> > [1] POSIX says "sig or at least one pending unblocked signal shall be
> > delivered to the sending thread before kill() returns."  I doubt the
> > postmaster had another signal pending often enough to explain the failures, 
> > so
> > AIX probably doesn't follow POSIX in this respect.
> 
> It looks like you fixed this, but I was curious about this observation
> as someone interested in learning more about kernel stuff and
> portability... Maybe I misunderstood, but isn't POSIX referring to
> kill(sig, $YOUR_OWN_PID) there?  That is, if you signal *yourself*,
> and no other thread exists that could handle the signal, it will be
> handled by the sending thread, and in the case of SIGKILL it will
> therefore never return.  But here, you were talking about a perl
> script that kills the postmaster, no?  If so, that passage doesn't
> seem to apply.

You're right.  I revoke the footnote.

> In any case, regardless of whether the signal handler
> has run to completion when kill() returns, doesn't the pid have to
> continue to exist in the process table until it is reaped by its
> parent (possibly in response to SIGCHLD), with one of the wait*()
> family of system calls?

True.  I'll add that to the code comment.
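
As a small standalone illustration of that point (not part of the test code):
after kill(SIGKILL) returns, the child's pid remains in the process table as a
zombie until the parent reaps it with waitpid().

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    pid_t       pid = fork();

    if (pid < 0)
        return 1;
    if (pid == 0)
    {
        pause();                /* child: wait to be killed */
        _exit(0);
    }

    kill(pid, SIGKILL);

    /* The pid still exists (running or zombie), so probing it succeeds. */
    if (kill(pid, 0) == 0)
        printf("pid %d still in the process table after SIGKILL\n", (int) pid);

    waitpid(pid, NULL, 0);      /* reap the child */

    /* Now the pid is gone (barring reuse); probing normally fails with ESRCH. */
    if (kill(pid, 0) != 0)
        printf("pid %d reaped; no longer signalable\n", (int) pid);

    return 0;
}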




Re: Failure in contrib test _int on loach

2019-04-11 Thread Andrey Lepikhov



On 10/04/2019 20:25, Heikki Linnakangas wrote:

On 09/04/2019 19:11, Anastasia Lubennikova wrote:

05.04.2019 19:41, Anastasia Lubennikova writes:

In attachment, you can find patch with a test that allows to reproduce
the bug not randomly, but on every run.
Now I'm trying to find a way to fix the issue.


The problem was caused by incorrect detection of the page to insert the new
tuple into after a split.
If gistinserttuple() of the tuple formed by gistgetadjusted() had to
split the page, we must go back to the parent and
descend back to the child that's a better fit for the new tuple.

Previously this was handled by the code block with the following comment:

* Concurrent split detected. There's no guarantee that the
* downlink for this page is consistent with the tuple we're
* inserting anymore, so go back to parent and rechoose the best
* child.

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added a new flag to detect such splits during index build.


Isn't it possible that the grandparent page is also split, so that we'd 
need to climb further up?
Based on Anastasia's idea I prepared an alternative solution to fix the bug
(see attachment).
It utilizes the idea of a linearly incrementing LSN/NSN. The WAL write process
is used to set the NSN value to 1 for each block of the index relation.

I hope this can be a fairly clear and safe solution.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
>From 59e1519a0a48b879777820ff68116c68ed31e684 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Thu, 11 Apr 2019 10:52:39 +0500
Subject: [PATCH] Alt fix for gist_optimized_wal_intarray_test bug

---
 src/backend/access/gin/gininsert.c  |  2 +-
 src/backend/access/gist/gist.c  |  4 ++--
 src/backend/access/gist/gistbuild.c | 17 +++--
 src/backend/access/spgist/spginsert.c   |  2 +-
 src/backend/access/transam/xloginsert.c |  4 +++-
 src/include/access/gist.h   |  6 --
 src/include/access/gist_private.h   |  2 ++
 src/include/access/xloginsert.h |  6 +-
 8 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab14617..a63f33b429 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -413,7 +413,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	{
 		log_newpage_range(index, MAIN_FORKNUM,
 		  0, RelationGetNumberOfBlocks(index),
-		  true);
+		  true, NULL);
 	}
 
 	/*
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2db790c840..56f6ce04db 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -492,7 +492,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * we don't need to be able to detect concurrent splits yet.)
 		 */
 		if (is_build)
-			recptr = GistBuildLSN;
+			recptr = gistBuildLSN++;
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -559,7 +559,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			MarkBufferDirty(leftchildbuf);
 
 		if (is_build)
-			recptr = GistBuildLSN;
+			recptr = gistBuildLSN++;
 		else
 		{
 			if (RelationNeedsWAL(rel))
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 8e81eda517..31118b54cf 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -76,6 +76,13 @@ typedef struct
 	GistBufferingMode bufferingMode;
 } GISTBuildState;
 
+/*
+ * A bogus LSN / NSN value used during index build. Must be smaller than any
+ * real or fake unlogged LSN, so that after an index build finishes, all the
+ * splits are considered completed.
+ */
+XLogRecPtr gistBuildLSN = 0;
+
 /* prototypes for private functions */
 static void gistInitBuffering(GISTBuildState *buildstate);
 static int	calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep);
@@ -107,6 +114,12 @@ static void gistMemorizeParent(GISTBuildState *buildstate, BlockNumber child,
 static void gistMemorizeAllDownlinks(GISTBuildState *buildstate, Buffer parent);
 static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
+static void
+gistbuild_log_mask(char *page)
+{
+	GistPageSetNSN((Page) page, (uint64) 1);
+}
+
 /*
  * Main entry point to GiST index build. Initially calls insert over and over,
  * but switches to more efficient buffering build algorithm after a certain
@@ -180,7 +193,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	GISTInitBuffer(buffer, F_LEAF);
 
 	MarkBufferDirty(buffer);
-	PageSetLSN(page, GistBuildLSN);
+	PageSetLSN(page, gistBuildLSN++);
 
 	UnlockReleaseBuffer(buffer);
 
@@ -222,7 +235,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	{
 		log_newpage_range(index, MAIN_FORKNUM,
 		  0, RelationGetNumberOfBlocks(index),
-		  true);
+		  true, gistbuild_log_mask);