Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Thomas Munro
On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
 wrote:
> On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian  wrote:
>> Uh, are you sure it fixes our use-case?  From the email description it
>> sounded like it only reported fsync errors for every open file
>> descriptor at the time of the failure, but the checkpoint process might
>> open the file _after_ the failure and try to fsync a write that happened
>> _before_ the failure.
>
> I'm not sure of anything.  I can see that it's designed to report
> errors since the last fsync() of the *file* (presumably via any fd),
> which sounds like the desired behaviour:
>
> [..]

Scratch that.  Whenever you open a file descriptor you can't see any
preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen
files at will from any process, and in particular to fsync() from a
different process that didn't inherit the fd but instead opened it
later.  But it looks like that might be able to eat errors that
occurred during asynchronous writeback (when there was nobody to
report them to), before you opened the file?

If so I'm not sure how that can possibly be considered to be an
implementation of _POSIX_SYNCHRONIZED_IO:  "the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state."  Note "the file", not "this file descriptor + copies", and
without reference to when you opened it.
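To illustrate the semantics, here is a rough userspace model of that
errseq-style sampling (just a toy sketch with made-up names, not the
kernel's actual code):

#include <stdio.h>

/* Toy model of the per-file (per-mapping) writeback error state. */
typedef struct { unsigned seq; int err; } toy_errseq;

static toy_errseq wb_err;   /* lives with the inode/mapping, not the fd */

/* asynchronous writeback fails; nobody has the file open */
static void writeback_fails(int err) { wb_err.seq++; wb_err.err = err; }

/* open(): sample the current sequence, so earlier errors are skipped */
static unsigned toy_open(void) { return wb_err.seq; }

/* fsync(): report an error only if one arrived after our sample */
static int toy_fsync(unsigned sampled) { return wb_err.seq != sampled ? wb_err.err : 0; }

int main(void)
{
    writeback_fails(5);                 /* say, EIO during background writeback */
    unsigned fd = toy_open();           /* checkpointer opens the file later */
    printf("fsync() reports: %d\n", toy_fsync(fd));   /* prints 0: the error is invisible */
    return 0;
}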

> But I'm not sure what the lifetime of the passed-in "file" and more
> importantly "file->f_wb_err" is.

It's really inode->i_mapping->wb_err's lifetime that I should have
been asking about there, not file->f_wb_err, but I see now that that
question is irrelevant due to the above.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Runtime Partition Pruning

2018-04-03 Thread Beena Emerson
Hi David,

On Wed, Apr 4, 2018 at 7:57 AM, David Rowley
 wrote:
> On 4 April 2018 at 05:50, Beena Emerson  wrote:
>> With Amit's v46 patch,  the following query in partition_prune was
>> crashing during make check.
>> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 
>> 3);
>
> Hi Beena,
>
> Thanks for looking.
>
> Does it work correctly if you apply [1] to Amit's v46 patch before
> patching with v18 run-time partition pruning?
>
> [1] 
> https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com
>

Thanks for working on it. make check passes when the patch [1] is also applied.

--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

2018-04-03 Thread David Rowley
On 4 April 2018 at 16:00, Tom Lane  wrote:
> David Rowley  writes:
>> It's true that the const simplification code will generally rewrite
>> most NOT(clause) to use the negator operator, but if the operator does
>> not have a negator it can't do this.
>> ...
>> At the moment pruning does not work for this case at all. Perhaps it should?
>
> It's hard to see why we'd expend extra effort to optimize such situations.
> The right answer would invariably be to fix the inadequate operator
> definition, because missing the negator link would hobble many other
> cases besides this.
>
> Now if you can show a case where the extra smarts would be useful
> without presuming a badly-written opclass, it's a different matter.

Okay, well that certainly sounds like less work.

In that case, the comment which claims we handle the NOT clauses needs
to be updated to mention that we only handle boolean NOT clauses and
don't optimize the remainder.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

2018-04-03 Thread Tom Lane
David Rowley  writes:
> It's true that the const simplification code will generally rewrite
> most NOT(clause) to use the negator operator, but if the operator does
> not have a negator it can't do this.
> ...
> At the moment pruning does not work for this case at all. Perhaps it should?

It's hard to see why we'd expend extra effort to optimize such situations.
The right answer would invariably be to fix the inadequate operator
definition, because missing the negator link would hobble many other
cases besides this.

Now if you can show a case where the extra smarts would be useful
without presuming a badly-written opclass, it's a different matter.

regards, tom lane



Re: WIP: Covering + unique indexes.

2018-04-03 Thread Peter Geoghegan
On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov
 wrote:
> Great, I'm looking forward your feedback.

I took a look at V11 (0001-Covering-core-v11.patch,
0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch,
0004-Covering-natts-v11.patch) today.

* What's a pivot tuple?

This is the same thing as what I call a "separator key", I think --
you're talking about the set of IndexTuples including all high keys
(including leaf level high keys), as well as internal items
(downlinks). I think that it's a good idea to have a standard word
that describes this set of keys, to formalize the two categories
(pivot tuples vs. tuples that point to the heap itself). Your word is
just as good as mine, so we can go with that.

Let's put this somewhere central. Maybe in the nbtree README, and/or
nbtree.h. Also, verify_nbtree.c should probably get some small
explanation of pivot tuples. offset_is_negative_infinity() is a nice
place to mention pivot tuples, since that already has a bit of
high-level commentary about them.
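For instance, something short along these lines could go at the top of
nbtree.h or the README (just a draft, wording to taste):

/*
 * Pivot tuples are the index tuples that only delimit key space rather
 * than point into the heap: all high keys (including leaf level high
 * keys), plus all downlinks in internal pages.  Every other index tuple
 * is a regular leaf tuple that points to a heap tuple.
 */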

* Compiler warning:

/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:
In function ‘index_create’:
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:476:45:
warning: ‘opclassTup’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
   if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID)
 ^
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:332:19:
note: ‘opclassTup’ was declared here
   Form_pg_opclass opclassTup;
   ^

* Your new amcheck tests should definitely use the new
"heapallindexed" option. There were a number of bugs I can remember
seeing in earlier versions of this patch that that would catch
(probably not during regression tests, but let's at least do that
much).

* The modified amcheck contrib regression tests don't actually pass. I
see these unexpected errors:

10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.
10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.

* I see that we use "- 1" with attribute number, like this:

> +/* Get number of attributes in B-tree index tuple */
> +#define BtreeTupGetNAtts(itup, index)  \
> +   ( \
> +   (itup)->t_info & INDEX_ALT_TID_MASK ? \
> +   ( \
> +   AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
> +   ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK - 1 \
> +   ) \
> +   : \
> +   IndexRelationGetNumberOfAttributes(index) \
> +   )

Is this left behind from before you decided to adopt
INDEX_ALT_TID_MASK? Is it your intention here to encode
InvalidOffsetNumber() without tripping up assertions? Or is it
something else?

Maybe we should follow the example of GinItemPointerGetOffsetNumber(),
and use ItemPointerGetOffsetNumberNoCheck() instead of
ItemPointerGetOffsetNumber(). What do you think? That would allow us
to get rid of the -1 thing, which might be nice. Just because we use
ItemPointerGetOffsetNumberNoCheck() in places that use an alternative
offset representation does not mean we need to use it in existing
places. If existing places had a regression test failure because of
this, that would probably be due to a real bug. No?
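To be concrete, with the NoCheck variant and without the "- 1"
adjustment the macro might end up looking something like this (sketch
only, untested):

/* Sketch: number of attributes in a B-tree index tuple, no "- 1" */
#define BtreeTupGetNAtts(itup, index) \
    ( \
        (itup)->t_info & INDEX_ALT_TID_MASK ? \
        ( \
            AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & \
                         BT_RESERVED_OFFSET_MASK) == 0), \
            ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
        ) \
        : \
        IndexRelationGetNumberOfAttributes(index) \
    )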

* ISTM that the "points to index tid=(289,2)" part of the message just
shown would be a bit clearer if I didn't have to know that 2 actually
means 1 when we talk about the pointed-to offset (yeah, it will
probably become unclear in the future when we start using the reserved
offset status bits, but why not make the low bits of the offset work in
a simple/logical way?). Your new amcheck error message should spell it
out (it should say the number of attributes indicated by the offset,
if any) -- regardless of what we do about the "must apply - 1 to
offset" question.

* "Minus infinity" items do not have the new status bit
INDEX_ALT_TID_MASK set in at least some cases. They should.

* _bt_sortaddtup() should not do "trunctuple.t_info =
sizeof(IndexTupleData)", since that destroys useful information. Maybe
that's the reason for the last bug?

* Ditto for _bt_pgaddtup().

* Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call
it? The only reason we have _bt_sortaddtup() is because we cannot
trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context
of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real
change needed, because _bt_sortaddtup() knows that it's inserting on a
non-rightmost page both without this patch, and when this patch needs
to truncate and then add the high key back.

It's clear that you can 

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-04-03 Thread Amit Kapila
On Wed, Apr 4, 2018 at 4:31 AM, Andres Freund  wrote:
> On 2018-03-06 19:57:03 +0530, Amit Kapila wrote:
>> On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund  wrote:
>> > Hi,
>> >
>> >> diff --git a/src/backend/executor/nodeLockRows.c 
>> >> b/src/backend/executor/nodeLockRows.c
>> >> index 7961b4be6a..b07b7092de 100644
>> >> --- a/src/backend/executor/nodeLockRows.c
>> >> +++ b/src/backend/executor/nodeLockRows.c
>> >> @@ -218,6 +218,11 @@ lnext:
>> >>   ereport(ERROR,
>> >>   
>> >> (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>> >>errmsg("could not 
>> >> serialize access due to concurrent update")));
>> >> + if 
>> >> (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid
>> >> + ereport(ERROR,
>> >> + 
>> >> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> >> +  errmsg("tuple to 
>> >> be locked was already moved to another partition due to concurrent 
>> >> update")));
>> >> +
>> >
>> > Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
>> > ERRCODE_T_R_SERIALIZATION_FAILURE?  A lot of frameworks have builtin
>> > logic to retry serialization failures, and this kind of thing is going
>> > to resolved by retrying, no?
>> >
>>
>> I think it depends, in some cases retry can help in deleting the
>> required tuple, but in other cases like when the user tries to perform
>> delete on a particular partition table, it won't be successful as the
>> tuple would have been moved.
>
> So? In that case the retry will not find the tuple, which'll also
> resolve the issue. Preventing frameworks from dealing with this seems
> like a way worse issue than that.
>

The idea was just that the apps should get an error so that they can
take appropriate action (either retry or whatever they want); we don't
want to silently make it a no-delete op.  The error the current patch
throws appears similar to what we already do in delete/update
operations, with the difference that here we are trying to delete a
moved tuple.

heap_delete()
{
..
if (result == HeapTupleInvisible)
{
UnlockReleaseBuffer(buffer);
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("attempted to delete invisible tuple")));
}
..
}

I think if we want users to always retry on this operation, then
ERRCODE_T_R_SERIALIZATION_FAILURE is a better error code.
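That is, the check in nodeLockRows.c would simply reuse the
serialization failure code -- something like this sketch of the
alternative being discussed:

if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
    ereport(ERROR,
            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
             errmsg("tuple to be locked was already moved to another partition due to concurrent update")));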

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [PATCH] Logical decoding of TRUNCATE

2018-04-03 Thread Peter Eisentraut
Here is an updated patch that addresses some of your concerns.

I've split it up into the decoding support and the pgoutput replication
support.

The problem I see now is that when we WAL-log a truncate, we include all
the relids in one record.  That seems useful.  But during decoding we
split this into one change per relid.  So at the receiving end, truncate
can only process one relation at a time, which will fail if there are
foreign keys involved.  The solution that had been proposed here was to
ignore foreign keys on the subscriber, which is clearly problematic.

I wonder why this approach was chosen.  If we go through the trouble of
WAL-logging all the relations together, why not present them to the
output plugin as one so that they can be applied together?  This will
clearly make questions of filtering and replication set membership and
so on more difficult, but that's just the nature of the thing.  If you
are connecting tables by foreign keys and only replicate some of them,
then that's going to have confusing effects no matter what you do.

(That's perhaps another reason why having the option of replicating
truncate separately from delete could be useful.)

I'm going to try to hack up an alternative approach and see how it works
out.
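For the sake of discussion, a truncate callback that hands the output
plugin all the affected relations in a single call might look something
like this (hypothetical signature, names not final):

typedef void (*LogicalDecodeTruncateCB) (struct LogicalDecodingContext *ctx,
                                         ReorderBufferTXN *txn,
                                         int nrelations,
                                         Relation relations[],
                                         ReorderBufferChange *change);

The plugin (and ultimately the subscriber) could then apply the whole
set together, the same way the original TRUNCATE did.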


On 4/1/18 16:01, Andres Freund wrote:
> On 2018-01-25 14:21:15 +0100, Marco Nenciarini wrote:
>> +if (SessionReplicationRole != SESSION_REPLICATION_ROLE_REPLICA)
>> +{
>> + 
>> +/*
>> + * Check foreign key references.  In CASCADE mode, this should 
>> be
>> + * unnecessary since we just pulled in all the references; but 
>> as a
>> + * cross-check, do it anyway if in an Assert-enabled build.
>> + */
>>   #ifdef USE_ASSERT_CHECKING
>>  heap_truncate_check_FKs(rels, false);
>> + #else
>> +if (stmt->behavior == DROP_RESTRICT)
>> +heap_truncate_check_FKs(rels, false);
>>   #endif
>> +}
> 
> That *can't* be right.

This is actually existing code that just looks funny in the diff after
being indented.

But I'd rather skip this patch altogether and find a different solution;
see above.

> I know this has been discussed in the thread already, but it really
> strikes me as wrong to basically do some mini DDL replication feature
> via per-command WAL records.

I think TRUNCATE is special in some ways and it's reasonable to treat it
specially.  Real DDL is being worked on in the 2PC decoding thread.
What we are discussing here isn't going to be applicable there and vice
versa, I think.  In fact, one of the reasons for this effort is that in
BDR TRUNCATE is currently handled like a DDL command, which doesn't work
well.

>> +  
>> +TRUNCATE is treated as a form of
>> +DELETE for the purpose of deciding whether
>> +to publish, or not.
>> +  
>>   
>>  
>> 
> 
> Why is this a good idea?

I have changed this by making this a separate property.

> Hm, it seems logicaldecoding.sgml hasn't been updated?

I re-checked but didn't find anything suitable to update.

>> + void
>> + ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged,
>> +DropBehavior behavior, 
>> bool restart_seqs)
>> + {
>> +List   *rels = list_copy(explicit_rels);
> 
> Why is this copied?

Because it is overwritten later.  I have moved it down a bit to make
that a bit clearer.

>> + * Write a WAL record to allow this set of actions to be logically 
>> decoded.
>> + * We could optimize this away when !RelationIsLogicallyLogged(rel)
>> + * but that doesn't save much space or time.
> 
> What you're saying isn't that you're not logging anything, but that
> you're allocating the header regardless? Because this certainly sounds
> like you unconditionally log a WAL record.

> I'm confused. Why do we need the resizing here, when we know the max
> upfront?

I have rewritten this a bit and removed the logging of the sequence
relids, which isn't used anywhere afterwards, to make the code a bit simpler.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 72577e567725526543e5493e235c1059610f3467 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 3 Apr 2018 22:12:00 -0400
Subject: [PATCH v18 1/2] Logical decoding of TRUNCATE

---
 contrib/test_decoding/expected/ddl.out| 10 ++-
 contrib/test_decoding/sql/ddl.sql |  3 +
 contrib/test_decoding/test_decoding.c | 14 +++
 src/backend/access/heap/heapam.c  |  7 ++
 src/backend/access/rmgrdesc/heapdesc.c| 14 +++
 src/backend/commands/tablecmds.c  | 87 +--
 src/backend/replication/logical/decode.c  | 46 ++
 .../replication/logical/reorderbuffer.c   | 10 +++
 src/include/access/heapam_xlog.h  | 23 -
 

Re: [HACKERS] Runtime Partition Pruning

2018-04-03 Thread David Rowley
On 4 April 2018 at 05:50, Beena Emerson  wrote:
> With Amit's v46 patch,  the following query in partition_prune was
> crashing during make check.
> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);

Hi Beena,

Thanks for looking.

Does it work correctly if you apply [1] to Amit's v46 patch before
patching with v18 run-time partition pruning?

[1] 
https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Bruce Momjian
On Wed, Apr  4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian  wrote:
> > There has been a lot of focus in this thread on the workflow:
> >
> > write() -> blocks remain in kernel memory -> fsync() -> panic?
> >
> > But what happens in this workflow:
> >
> > write() -> kernel syncs blocks to storage -> fsync()
> >
> > Is fsync() going to see a "kernel syncs blocks to storage" failure?
> >
> > There was already discussion that if the fsync() causes the "syncs
> > blocks to storage", fsync() will only report the failure once, but will
> > it see any failure in the second workflow?  There is indication that a
> > failed write to storage reports back an error once and clears the dirty
> > flag, but do we know it keeps things around long enough to report an
> > error to a future fsync()?
> >
> > You would think it does, but I have to ask since our fsync() assumptions
> > have been wrong for so long.
> 
> I believe there were some problems of that nature (with various
> twists, based on other concurrent activity and possibly different
> fds), and those problems were fixed by the errseq_t system developed
> by Jeff Layton in Linux 4.13.  Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no
workaround Postgres can implement?  Wow.

> The second issue is that the pages are marked clean after the error
> is reported, so further attempts to fsync() the data (in our case for
> a new attempt to checkpoint) will be futile but appear successful.
> Call that "bug #2", with the proviso that some people apparently think
> it's reasonable behaviour and not a bug.  At least there is a
> plausible workaround for that: namely the nuclear option proposed by
> Craig.

Yes, that one I understood.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Thomas Munro
On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian  wrote:
> There has been a lot of focus in this thread on the workflow:
>
> write() -> blocks remain in kernel memory -> fsync() -> panic?
>
> But what happens in this workflow:
>
> write() -> kernel syncs blocks to storage -> fsync()
>
> Is fsync() going to see a "kernel syncs blocks to storage" failure?
>
> There was already discussion that if the fsync() causes the "syncs
> blocks to storage", fsync() will only report the failure once, but will
> it see any failure in the second workflow?  There is indication that a
> failed write to storage reports back an error once and clears the dirty
> flag, but do we know it keeps things around long enough to report an
> error to a future fsync()?
>
> You would think it does, but I have to ask since our fsync() assumptions
> have been wrong for so long.

I believe there were some problems of that nature (with various
twists, based on other concurrent activity and possibly different
fds), and those problems were fixed by the errseq_t system developed
by Jeff Layton in Linux 4.13.  Call that "bug #1".

The second issue is that the pages are marked clean after the error
is reported, so further attempts to fsync() the data (in our case for
a new attempt to checkpoint) will be futile but appear successful.
Call that "bug #2", with the proviso that some people apparently think
it's reasonable behaviour and not a bug.  At least there is a
plausible workaround for that: namely the nuclear option proposed by
Craig.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

2018-04-03 Thread David Rowley
On 4 April 2018 at 11:22, David Rowley  wrote:
> On 4 April 2018 at 09:47, David Rowley  wrote:
>> I think it would be better to just have special handling in
>> get_matching_list_bound so that it knows it's performing <>
>> elimination. I'd thought about passing some other opstrategy but the
>> only safe one I thought to use was InvalidStrategy, which is already
>> used by NULL handling.
>
> I'm currently working up a patch to do this the way I think is best.
>
> I'll submit it soon and we can review and get your thoughts on it.

I've attached a rough cut version of what I think is a good solution
for this. It's based on v46, not your latest v47, sorry.

This makes get_matching_list_bounds() aware, via the opstrategy, that
it's performing not-equal pruning, which allows it to not return all
partitions when no values are available. Instead, we return just the
NULL partition, so that we later invert that and return everything
apart from the NULL partition. A strict clause allows us that much,
even if we can't get at the actual value being compared against at the
time.
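In other words, the inversion step boils down to something like this
(sketch only, variable names made up):

/*
 * Strict <> clause with no usable value: the lookup returned just the
 * NULL partition, so scan everything except it, since a strict clause
 * can never be true for NULL.
 */
Bitmapset  *result = bms_add_range(NULL, 0, nparts - 1);

if (null_partition_index >= 0)
    result = bms_del_member(result, null_partition_index);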

There's also a bunch of other changes in there:

1. Adding missing step_id in copyfuncs.c
2. Simplified including the default partition in a bunch of cases.
3. Made it so scan_default and scan_null are only ever set to true if
there's a partition for that.
4. Changed get_matching_*_bounds to return the entire result struct
rather than returning the Bitmapset and passing the remaining bool
values back through params. I didn't really like how you'd changed this
to pass all the bool flags back as params. There's a perfectly good
struct there to provide the entire result in a single return value. I
know you've disagreed with this already, so it would be nice to get a
3rd opinion.
5. Rename get_matching_hash_bound to get_matching_hash_bounds. The
LIST and RANGE version of this function both had a plural name. I
didn't see any reason for the hash case to be different.

Let me know what you think.

I've patched the run-time pruning v18 against this and it now passes regression.

I need to do a bit more testing on this to ensure it works for all
cases, but thought I'd send now as I suspect you're currently around
to look.

There might be another issue with the patch too, but I'll send a
separate email about that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v46_fixes_drowley.patch
Description: Binary data


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Bruce Momjian
On Tue, Apr  3, 2018 at 05:47:01PM -0400, Robert Haas wrote:
> Well, in PostgreSQL, we have a background process called the
> checkpointer which is the process that normally does all of the
> fsync() calls but only a subset of the write() calls.  The
> checkpointer does not, however, necessarily have every file open all
> the time, so these fixes aren't sufficient to make sure that the
> checkpointer ever sees an fsync() failure.

There has been a lot of focus in this thread on the workflow:

write() -> blocks remain in kernel memory -> fsync() -> panic?

But what happens in this workflow:

write() -> kernel syncs blocks to storage -> fsync()

Is fsync() going to see a "kernel syncs blocks to storage" failure?

There was already discussion that if the fsync() causes the "syncs
blocks to storage", fsync() will only report the failure once, but will
it see any failure in the second workflow?  There is indication that a
failed write to storage reports back an error once and clears the dirty
flag, but do we know it keeps things around long enough to report an
error to a future fsync()?

You would think it does, but I have to ask since our fsync() assumptions
have been wrong for so long.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Running Installcheck remotely

2018-04-03 Thread Michael Paquier
On Tue, Apr 03, 2018 at 09:12:48AM -0700, Kapil Sharma wrote:
> So this means that the host running the test should have capability to SSH
> to the DB Instance host ?

That's not what it means as the whole test will run through libpq.
Please make sure that the files used for the tests match the version of
the server and that all the libraries needed by the tests are installed
on the host where the database is running. 
--
Michael




Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")

2018-04-03 Thread David Rowley
On 4 April 2018 at 12:31, Amit Langote  wrote:
> On 2018/04/04 9:27, David Rowley wrote:
>> Yeah, this will no doubt be due to the fact that we still build
>> RelOptInfos in the planner for all partitions, so we're still doing
>> get_relation_info() and estimate_rel_size() for each of those. It
>> might be possible to delay that call until we know we're going to need
>> the partition.  I imagine we wouldn't need to know the size of the
>> relation until after set_rel_sizes, but I've not checked.
>
> Yeah, one of the earliest patches on the "faster partition pruning" had
> tried to solve this issue, but we decided it was better to come to it
> after we're done dealing with just making the pruning faster.

Yeah, it'll need to be a PG12 project now.

It would be nice to not create the RelOptInfos at all until we've
decided we need them. Perhaps we can put NULL placeholders in
simple_rel_array... the major problem with that at the moment is that
the boundinfo stuff is stored in each relation and not the parent
partitioned table, so I think pruning could only take place after each
RelOptInfo has been built.  To fix that we'd need to store relation
Oids of partitioned tables along with their boundinfo in the parent's
RelOptInfo... Anyway, no doubt we'll get a chance to think harder on
this once PG11 is out the way.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")

2018-04-03 Thread Amit Langote
On 2018/04/04 9:27, David Rowley wrote:
> On 4 April 2018 at 07:46, Justin Pryzby  wrote:
>> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
>> backing every table being queried, even for those partitions eventually
>> "excluded".
> 
> Yeah, this will no doubt be due to the fact that we still build
> RelOptInfos in the planner for all partitions, so we're still doing
> get_relation_info() and estimate_rel_size() for each of those. It
> might be possible to delay that call until we know we're going to need
> the partition.  I imagine we wouldn't need to know the size of the
> relation until after set_rel_sizes, but I've not checked.

Yeah, one of the earliest patches on the "faster partition pruning" had
tried to solve this issue, but we decided it was better to come to it
after we're done dealing with just making the pruning faster.

Thanks,
Amit




Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")

2018-04-03 Thread Amit Langote
Hi Justin.

On 2018/04/04 4:46, Justin Pryzby wrote:
> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
> backing every table being queried, even for those partitions eventually
> "excluded".

That's expected.  The faster pruning patch doesn't change the behavior
with respect to when the partitions' files are open()'d, which at this
point is still *before* the pruning occurs.  It just switches the method
of pruning to a faster one, whereby instead of pruning each partition
one-by-one using constraint exclusion, we only look at the parent's
partition descriptor to exclude partitions that don't satisfy the query.

Thanks,
Amit




Re: open/lseek overhead with many partitions (including with "faster partitioning pruning")

2018-04-03 Thread David Rowley
On 4 April 2018 at 07:46, Justin Pryzby  wrote:
> TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
> backing every table being queried, even for those partitions eventually
> "excluded".

Yeah, this will no doubt be due to the fact that we still build
RelOptInfos in the planner for all partitions, so we're still doing
get_relation_info() and estimate_rel_size() for each of those. It
might be possible to delay that call until we know we're going to need
the partition.  I imagine we wouldn't need to know the size of the
relation until after set_rel_sizes, but I've not checked.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Thomas Munro
On Tue, Apr 3, 2018 at 1:29 PM, Thomas Munro
 wrote:
> Interestingly, there don't seem to be many operating systems that can
> report ENOSPC from fsync(), based on a quick scan through some
> documentation:
>
> POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
> Illumos/Solaris, Linux, macOS: yes

Oops, reading comprehension fail.  POSIX yes (since issue 5), via the
note that read() and write()'s error conditions can also be returned.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

2018-04-03 Thread David Rowley
On 4 April 2018 at 09:47, David Rowley  wrote:
> On 4 April 2018 at 00:02, Amit Langote  wrote:
>> But actually, the presence of only Params in the pruning steps should
>> result in the pruning not being invoked at all (at least for the static
>> pruning case), thus selecting all partitions containing non-null data.  It
>> is better to implement that instead of a workaround like scan_all_nonnulls
>> side-channel I was talking about.
>
> I don't think this is quite true. Since we're only using strict
> clauses, a list of quals with just Params still means that NULLs can't
> match. If you skip the step altogether then won't you have lost
> the chance at pruning away any NULL-only partition?
>
> I think it would be better to just have special handling in
> get_matching_list_bound so that it knows it's performing <>
> elimination. I'd thought about passing some other opstrategy but the
> only safe one I thought to use was InvalidStrategy, which is already
> used by NULL handling.

I'm currently working up a patch to do this the way I think is best.

I'll submit it soon and we can review and get your thoughts on it.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-04-03 Thread Andres Freund
On 2018-03-06 19:57:03 +0530, Amit Kapila wrote:
> On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund  wrote:
> > Hi,
> >
> >> diff --git a/src/backend/executor/nodeLockRows.c 
> >> b/src/backend/executor/nodeLockRows.c
> >> index 7961b4be6a..b07b7092de 100644
> >> --- a/src/backend/executor/nodeLockRows.c
> >> +++ b/src/backend/executor/nodeLockRows.c
> >> @@ -218,6 +218,11 @@ lnext:
> >>   ereport(ERROR,
> >>   
> >> (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
> >>errmsg("could not 
> >> serialize access due to concurrent update")));
> >> + if 
> >> (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid
> >> + ereport(ERROR,
> >> + 
> >> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >> +  errmsg("tuple to be 
> >> locked was already moved to another partition due to concurrent update")));
> >> +
> >
> > Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
> > ERRCODE_T_R_SERIALIZATION_FAILURE?  A lot of frameworks have builtin
> > logic to retry serialization failures, and this kind of thing is going
> > to resolved by retrying, no?
> >
> 
> I think it depends, in some cases retry can help in deleting the
> required tuple, but in other cases like when the user tries to perform
> delete on a particular partition table, it won't be successful as the
> tuple would have been moved.

So? In that case the retry will not find the tuple, which'll also
resolve the issue. Preventing frameworks from dealing with this seems
like a way worse issue than that.

Greetings,

Andres Freund



Re: Online enabling of checksums

2018-04-03 Thread Tomas Vondra
On 04/03/2018 02:05 PM, Magnus Hagander wrote:
> On Sun, Apr 1, 2018 at 2:04 PM, Magnus Hagander wrote:
> 
> On Sat, Mar 31, 2018 at 5:38 PM, Tomas Vondra wrote:
> 
> On 03/31/2018 05:05 PM, Magnus Hagander wrote:
> > On Sat, Mar 31, 2018 at 4:21 PM, Tomas Vondra wrote:
> >
> > ...
> >
> >     I do think just waiting for all running transactions to 
> complete is
> >     fine, and it's not the first place where we use it - CREATE 
> SUBSCRIPTION
> >     does pretty much exactly the same thing (and CREATE INDEX 
> CONCURRENTLY
> >     too, to some extent). So we have a precedent / working code we 
> can copy.
> >
> >
> > Thinking again, I don't think it should be done as part of
> > BuildRelationList(). We should just do it once in the launcher 
> before
> > starting, that'll be both easier and cleaner. Anything started after
> > that will have checksums on it, so we should be fine.
> >
> > PFA one that does this.
> >
> 
> Seems fine to me. I'd however log waitforxid, not the oldest one. If
> you're a DBA and you want to make the checksumming to proceed,
> knowing
> the oldest running XID is useless for that. If we log
> waitforxid, it can
> be used to query pg_stat_activity and interrupt the sessions
> somehow.
> 
> 
> Yeah, makes sense. Updated.
> 
>  
> 
> >     >     And if you try this with a temporary table (not
> hidden in transaction,
> >     >     so the bgworker can see it), the worker will fail
> with this:
> >     >
> >     >       ERROR:  cannot access temporary tables of other
> sessions
> >     >
> >     >     But of course, this is just another way how to crash
> without updating
> >     >     the result for the launcher, so checksums may end up
> being enabled
> >     >     anyway.
> >     >
> >     >
> >     > Yeah, there will be plenty of side-effect issues from that
> >     > crash-with-wrong-status case. Fixing that will at least
> make things
> >     > safer -- in that checksums won't be enabled when not put
> on all pages. 
> >     >
> >
> >     Sure, the outcome with checksums enabled incorrectly is a
> consequence of
> >     bogus status, and fixing that will prevent that. But that
> wasn't my main
> >     point here - not articulated very clearly, though.
> >
> >     The bigger question is how to handle temporary tables
> gracefully, so
> >     that it does not terminate the bgworker like this at all.
> This might be
> >     even bigger issue than dropped relations, considering that
> temporary
> >     tables are pretty common part of applications (and it also
> includes
> >     CREATE/DROP).
> >
> >     For some clusters it might mean the online checksum
> enabling would
> >     crash+restart infinitely (well, until reaching MAX_ATTEMPTS).
> >
> >     Unfortunately, try_relation_open() won't fix this, as the
> error comes
> >     from ReadBufferExtended. And it's not a matter of simply
> creating a
> >     ReadBuffer variant without that error check, because
> temporary tables
> >     use local buffers.
> >
> >     I wonder if we could just go and set the checksums anyway,
> ignoring the
> >     local buffers. If the other session does some changes,
> it'll overwrite
> >     our changes, this time with the correct checksums. But it
> seems pretty
> >     dangerous (I mean, what if they're writing stuff while
> we're updating
> >     the checksums? Considering the various short-cuts for
> temporary tables,
> >     I suspect that would be a boon for race conditions.)
> >
> >     Another option would be to do something similar to running
> transactions,
> >     i.e. wait until all temporary tables (that we've seen at
> the beginning)
> >     disappear. But we're starting to wait on more and more stuff.
> >
> >     If we do this, we should clearly log which backends we're
> waiting for,
> >     so that the admins can go and interrupt them manually.
> >
> >
> >
> > Yeah, 

Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-03 Thread Thomas Munro
On Wed, Apr 4, 2018 at 4:38 AM, Heikki Linnakangas  wrote:
> [armv8-crc32c-2.patch]

This tests OK on my Debian buster aarch64 system (the machine that
runs "eelpout" in the buildfarm), configure tests as expected and I
confirmed that pg_comp_crc32c_armv8 is reached at runtime.

I hope we can figure out a more portable way to detect the
instructions, or failing that a way to detect them on FreeBSD in
userspace.  I'll try to send a patch for PG12 if I get a chance.

No opinion on the unaligned memory access question.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Andrew Dunstan
On Wed, Apr 4, 2018 at 12:25 AM, Tom Lane  wrote:
> Tomas Vondra  writes:
>> On 04/03/2018 04:37 PM, Alvaro Herrera wrote:
>>> Tomas Vondra wrote:
 Yes, that is a good point actually - we need to test that replication
 between PG10 and PG11 works correctly, i.e. that the protocol version is
 correctly negotiated, and features are disabled/enabled accordingly etc.
>
>>> Maybe it'd be good to have a buildfarm animal to specifically test for
>>> that?
>
>> Not sure a buildfarm supports running two clusters with different
>> versions easily?
>
> You'd need some specialized buildfarm infrastructure like --- maybe the
> same as --- the infrastructure for testing cross-version pg_upgrade.
> Andrew could speak to the details better than I.
>


It's quite possible. The cross-version upgrade module saves out each
built version. See


Since this occupies a significant amount of disk space we'd probably
want to leverage it rather than have another module do the same thing.
Perhaps the "save" part of it needs to be factored out.

In any case, it's quite doable. I can work on that after this gets committed.

Currently we seem to have only two machines doing the cross-version
upgrade checks, which might make it easier to rearrange anything if
necessary.

cheers

andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Foreign keys and partitioned tables

2018-04-03 Thread Alvaro Herrera
Peter Eisentraut wrote:

> > 0002 is a fixup for a bug in the row triggers patch: I had a restriction
> > earlier that triggers declared internal were not cloned, and I seem to
> > have lost it in rebase.  Reinstate it.
> 
> Hmm, doesn't cause any test changes?

Here's a test case:

create table t (a int) partition by range (a);
create table t1 partition of t for values from (0) to (1000);
alter table t add constraint uniq unique (a) deferrable;
create table t2 partition of t for values from (1000) to (2000);
create table t3 partition of t for values from (2000) to (3000) partition by 
range (a);
create table t33 partition of t3 for values from (2000) to (2100);

Tables t and t1 have one trigger; tables t2 and t3 have two triggers;
table t33 has three triggers:

alvherre=# select tgrelid::regclass, count(*) from pg_trigger where 
tgrelid::regclass in ('t', 't1', 't2', 't3', 't33') group by tgrelid;
 tgrelid │ count 
─┼───
 t   │ 1
 t1  │ 1
 t2  │ 2
 t3  │ 2
 t33 │ 3
(5 filas)

These triggers probably all do the same thing, so there is no
correctness issue -- only speed.  I suppose it's not impossible to
construct a case that shows some actual breakage -- I just don't know
how.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] path toward faster partition pruning

2018-04-03 Thread David Rowley
On 4 April 2018 at 00:02, Amit Langote  wrote:
> But actually, the presence of only Params in the pruning steps should
> result in the pruning not being invoked at all (at least for the static
> pruning case), thus selecting all partitions containing non-null data.  It
> is better to implement that instead of a workaround like scan_all_nonnulls
> side-channel I was talking about.

I don't think this is quite true. Since we're only using strict
clauses, a list of quals with just Params still means that NULLs can't
match. If you skip the step altogether then won't you have lost
the chance at pruning away any NULL-only partition?

I think it would be better to just have special handling in
get_matching_list_bound so that it knows it's performing <>
elimination. I'd thought about passing some other opstrategy but the
only safe one I thought to use was InvalidStrategy, which is already
used by NULL handling.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Robert Haas
On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos  wrote:
>> Like other people here, I think this is 100% unreasonable, starting
>> with "the dirty pages which cannot been written out are practically
>> thrown away".  Who decided that was OK, and on the basis of what
>> wording in what specification?  I think it's always unreasonable to
>
> If you insist on strict conformance to POSIX, indeed the linux
> glibc configuration and associated manpage are probably wrong in
> stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
> matches that of the flexibility allowed by not supporting SIO.
> There's a long history of brokenness between linux and posix,
> and I think there was never an intention of conforming to the
> standard.

Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD,
POSIX.1-2001, which on the first system I tested, it did.  Also, the
summary should be changed from the current "fsync, fdatasync -
synchronize a file's in-core state with storage device" by adding ",
possibly by randomly undoing some of the changes you think you made to
the file".

> I believe (as tried to explain earlier) there is a certain assumption
> being made that the writer and original owner of data is responsible
> for dealing with potential errors in order to avoid data loss (which
> should be only of interest to the original writer anyway). It would
> be very questionable for the interface to persist the error while
> subsequent writes and fsyncs to different offsets may as well go through.

No, that's not questionable at all.  fsync() doesn't take any argument
saying which part of the file you care about, so the kernel is
entirely not entitled to assume it knows to which writes a given
fsync() call was intended to apply.

> Another process may need to write into the file and fsync, while being
> unaware of those newly introduced semantics is now faced with EIO
> because some unrelated previous process failed some earlier writes
> and did not bother to clear the error for those writes. In a similar
> scenario where the second process is aware of the new semantics, it would
> naturally go ahead and clear the global error in order to proceed
> with its own write()+fsync(), which would essentially amount to the
> same problematic semantics you have now.

I don't deny that it's possible that somebody could have an
application which is utterly indifferent to the fact that earlier
modifications to a file failed due to I/O errors, but is A-OK with
that as long as later modifications can be flushed to disk, but I
don't think that's a normal thing to want.

>> Also, this really does make it impossible to write reliable programs.
>> Imagine that, while the server is running, somebody runs a program
>> which opens a file in the data directory, calls fsync() on it, and
>> closes it.  If the fsync() fails, postgres is now borked and has no
>> way of being aware of the problem.  If we knew, we could PANIC, but
>> we'll never find out, because the unrelated process ate the error.
>> This is exactly the sort of ill-considered behavior that makes fcntl()
>> locking nearly useless.
>
> Fully agree, and the errseq_t fixes have dealt exactly with the issue
> of making sure that the error is reported to all file descriptors that
> *happen to be open at the time of error*.

Well, in PostgreSQL, we have a background process called the
checkpointer which is the process that normally does all of the
fsync() calls but only a subset of the write() calls.  The
checkpointer does not, however, necessarily have every file open all
the time, so these fixes aren't sufficient to make sure that the
checkpointer ever sees an fsync() failure.  What you have (or someone
has) basically done here is made an undocumented assumption about
which file descriptors might care about a particular error, but it
just so happens that PostgreSQL has never conformed to that
assumption.  You can keep on saying the problem is with our
assumptions, but it doesn't seem like a very good guess to me to
suppose that we're the only program that has ever made them.  The
documentation for fsync() gives zero indication that it's
edge-triggered, and so complaining that people wouldn't like it if it
became level-triggered seems like an ex post facto justification for a
poorly-chosen behavior: they probably think (as we did prior to a week
ago) that it already is.

> Not sure I understand this case. The application may indeed re-write
> a bunch of pages that have failed and proceed with fsync(). The kernel
> will deal with combining the writeback of all the re-written pages. But
> further the necessity of combining for performance really depends on
> the exact storage medium. At the point you start caring about
> write-combining, the kernel community will naturally redirect you to
> use DIRECT_IO.

Well, the way PostgreSQL works today, we typically run with say 8GB of
shared_buffers even if the system memory is, say, 200GB.  As pages 

Re: pgsql: Validate page level checksums in base backups

2018-04-03 Thread David Steele

On 4/3/18 4:48 PM, Michael Banck wrote:

> Attached is a patch which does that hopefully:
> 
> 1. creates two user tables, one large enough for at least 6 blocks
> (around 360kb), the other just one block.
> 
> 2. stops the cluster before scribbling over its data and starts it
> afterwards.
> 
> 3. uses the blocksize (and the page header size) to determine offsets
> for scribbling.

This patch looks reasonable to me.

> I've tested it with blocksizes 8 and 32 now, the latter should make sure
> that the first table is indeed large enough, but maybe something less
> arbitrary than "1 integers" should be used?

It might be quicker to just stop the cluster and then write out an
arbitrary number of zero pages.  Zero pages are always considered valid
so you can then corrupt whichever pages you want for testing.


--
-David
da...@pgmasters.net



Re: pgsql: Validate page level checksums in base backups

2018-04-03 Thread Michael Banck
Hi,

On Tue, Apr 03, 2018 at 08:48:08PM +0200, Magnus Hagander wrote:
> On Tue, Apr 3, 2018 at 8:29 PM, Tom Lane  wrote:
> I'd bet a good lunch that nondefault BLCKSZ would break it, as well,
> > since the way in which the corruption is induced is just guessing
> > as to where page boundaries are.
> 
> Yeah, that might be a problem. Those should be calculated from the block
> size.
> 
> Also, scribbling on tables as sensitive as pg_class is just asking for
> > trouble IMO.  I don't see anything in this test, for example, that
> > prevents autovacuum from running and causing a PANIC before the test
> > can complete.  Even with AV off, there's a good chance that clobber-
> > cache-always animals will fall over because they do so many more
> > physical accesses to the system catalogs.  I'd suggest inducing the
> > corruption in some user table(s) that we can more tightly constrain
> > the source server's accesses to.
> 
> Yeah, that seems like a good idea. And probably also shut the server down
> while writing the corruption, just in case.
> 
> Will stick looking into that on my todo for when I'm back, unless beaten to
> it. Michael, you want a stab at it?

Attached is a patch which does that hopefully:

1. creates two user tables, one large enough for at least 6 blocks
(around 360kb), the other just one block.

2. stops the cluster before scribbling over its data and starts it
afterwards.

3. uses the blocksize (and the page header size) to determine offsets
for scribbling.

I've tested it with blocksizes 8 and 32 now, the latter should make sure
that the first table is indeed large enough, but maybe something less
arbitrary than "1 integers" should be used?

Anyway, sorry for the hassle.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3162cdcd01..3fe49f68a7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -406,19 +406,25 @@ like(
 my $checksum = $node->safe_psql('postgres', 'SHOW data_checksums;');
 is($checksum, 'on', 'checksums are enabled');
 
-# get relfilenodes of relations to corrupt
-my $pg_class = $node->safe_psql('postgres',
-	q{SELECT pg_relation_filepath('pg_class')}
+# create tables to corrupt and get their relfilenodes
+my $file_corrupt1 = $node->safe_psql('postgres',
+q{SELECT a INTO corrupt1 FROM generate_series(1,1) AS a; SELECT pg_relation_filepath('corrupt1')}
 );
-my $pg_index = $node->safe_psql('postgres',
-	q{SELECT pg_relation_filepath('pg_index')}
+my $file_corrupt2 = $node->safe_psql('postgres',
+q{SELECT b INTO corrupt2 FROM generate_series(1,2) AS b; SELECT pg_relation_filepath('corrupt2')}
 );
 
+# set page header and block sizes
+my $pageheader_size = 24;
+my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');
+
 # induce corruption
-open $file, '+<', "$pgdata/$pg_class";
-seek($file, 4000, 0);
+system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
+open $file, '+<', "$pgdata/$file_corrupt1";
+seek($file, $pageheader_size, 0);
 syswrite($file, '\0\0\0\0\0\0\0\0\0');
 close $file;
+system_or_bail 'pg_ctl', '-D', $pgdata, 'start';
 
 $node->command_checks_all([ 'pg_basebackup', '-D', "$tempdir/backup_corrupt"],
 	1,
@@ -428,13 +434,15 @@ $node->command_checks_all([ 'pg_basebackup', '-D', "$tempdir/backup_corrupt"],
 );
 
 # induce further corruption in 5 more blocks
-open $file, '+<', "$pgdata/$pg_class";
-my @offsets = (12192, 20384, 28576, 36768, 44960);
-foreach my $offset (@offsets) {
+system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
+open $file, '+<', "$pgdata/$file_corrupt1";
+for my $i ( 1..5 ) {
+  my $offset = $pageheader_size + $i * $block_size;
   seek($file, $offset, 0);
   syswrite($file, '\0\0\0\0\0\0\0\0\0');
 }
 close $file;
+system_or_bail 'pg_ctl', '-D', $pgdata, 'start';
 
 $node->command_checks_all([ 'pg_basebackup', '-D', "$tempdir/backup_corrupt2"],
 1,
@@ -444,10 +452,12 @@ $node->command_checks_all([ 'pg_basebackup', '-D', "$tempdir/backup_corrupt2"],
 );
 
 # induce corruption in a second file
-open $file, '+<', "$pgdata/$pg_index";
+system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
+open $file, '+<', "$pgdata/$file_corrupt2";
 seek($file, 4000, 0);
 syswrite($file, '\0\0\0\0\0\0\0\0\0');
 close $file;
+system_or_bail 'pg_ctl', '-D', $pgdata, 'start';
 
 $node->command_checks_all([ 'pg_basebackup', '-D', "$tempdir/backup_corrupt3"],
 1,
@@ -460,3 +470,6 @@ $node->command_checks_all([ 'pg_basebackup', '-D', "$tempdir/backup_corrupt3"],
 $node->command_ok(
 	[   'pg_basebackup', '-D', "$tempdir/backup_corrupt4", '-k' ],
 	'pg_basebackup with -k does not report 

Re: Foreign keys and partitioned tables

2018-04-03 Thread Alvaro Herrera
Alvaro Herrera wrote:
> While adding some more tests for the "action" part (i.e. updates and
> deletes on the referenced table) I came across a bug that was causing
> the server to crash ... but it's actually a preexisting bug in an
> assert.  The fix is in 0001.

Yeah, it's a live bug that only manifests on Assert-enabled builds.
Here's an example:

create table pk (a int, b int, c int, d int primary key);
create table fk (d int references pk);
insert into pk values (1, 2, 3, 4);
insert into fk values (4);
delete from pk;

Will fix

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



open/lseek overhead with many partitions (including with "faster partitioning pruning")

2018-04-03 Thread Justin Pryzby
TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
backing every table being queried, even for those partitions eventually
"excluded".

One of our customers (with the largest number of child tables, currently of
relkind='r') takes a long time to plan queries, and I wondered whether we'd
benefit from ongoing work, if I should plan to convert to relkind='p', and if I
could start testing that for our use cases.

I found by stracing that the backend is open()ing + lseek(SEEK_END) each file
in each relation being UNIONed in this view,  including relations which were
excluded by constraints.  Their tables are partitioned by day (some are still
40+GB), so there's only 10-20 tables being included in the above query, total
size perhaps ~50GB and under 50 backing files, but accessing a LOT more files
than needed:

cold backend:
[pryzbyj@database ~]$ grep open /tmp/strace-pg12-0|cut -d/ -f1,2 |sort |uniq -c 
|sort -nr |head
  26593 open("pg_tblspc/16400
   6894 open("pg_tblspc/16401
482 open("base/16402

warm backend:
[pryzbyj@database ~]$ grep open /tmp/strace-pg12-1|cut -d/ -f1,2 |sort |uniq -c 
|sort -nr |head
   6545 open("pg_tblspc/16401
   1338 open("pg_tblspc/16400
254 open("base/16402

I was curious if that was improved in pg11dev, and if it applied to relkind='r'
or only to relkind='p'.  I applied v47 patches for "path toward faster
partition pruning", and was able to quickly create synthetic "huge table" using
truncate (and verified that smgr logic actually opens all 1e5 files).

$ psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t(i int) PARTITION BY 
range(i)" 
$ for a in `seq 1 999`; do psql -d postgres -h /tmp --port 5678 -c "CREATE 
TABLE t$a PARTITION OF t FOR VALUES FROM ($a)TO($((1+a)))"; done
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; truncate -s 
1G $a; done # make existing, empty files 1GB..
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; for b in 
`seq 1 99`; do truncate -s 1G $a.$b; done; done # make new 1GB tables with .n 
extension

postgres=# explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;
 Limit  (cost=0.00..0.01 rows=1 width=8)
   ->  Aggregate  (cost=0.00..0.01 rows=1 width=8)
 ->  Result  (cost=0.00..0.00 rows=0 width=0)
   One-Time Filter: false
Time: 63268.474 ms (01:03.268)

=> zfs/fuse is certainly exacerbating the issue here due to context switches; 
but..

[pryzbyj@database postgresql]$ time cut -d\( -f1 
/tmp/strace-pg11dev-faster-15|sort |uniq -c |sort -nr |head
 100899 open
  99900 lseek
  98918 close
118 brk

postgres still does open()+lseek() on 100k files, most of which are for relations
which were clearly excluded:

(gdb) bt
#0  0x0032b2adb580 in __open_nocancel () from /lib64/libc.so.6
#1  0x00711827 in BasicOpenFilePerm (fileName=0x1e2c9d0 
"base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:965
#2  0x007121ea in PathNameOpenFilePerm (fileName=0x1e2c9d0 
"base/13236/16759.10", fileFlags=2, fileMode=384) at fd.c:1394
#3  0x00733f6e in _mdfd_openseg (reln=0x1e14d98, forknum=MAIN_FORKNUM, 
segno=10, oflags=0) at md.c:1783
#4  0x00734026 in mdnblocks (reln=0x1e14d98, forknum=MAIN_FORKNUM) at 
md.c:918
#5  0x006b26bc in estimate_rel_size (rel=0x7f3f707f4b98, 
attr_widths=0x1e2cd24, pages=, tuples=0x1e2cb98, 
allvisfrac=) at plancat.c:946
#6  0x006b3a47 in get_relation_info (root=0x1bd9030, 
relationObjectId=16759, inhparent=false, rel=0x1e2cae0) at plancat.c:144
#7  0x006b78fa in build_simple_rel (root=0x1bd9030, relid=126, 
parent=0x1bda578) at relnode.c:185
#8  0x006b7990 in build_simple_rel (root=0x1bd9030, relid=1, 
parent=0x0) at relnode.c:251
#9  0x00691be3 in add_base_rels_to_query (root=0x1bd9030, jtnode=) at initsplan.c:121
#10 0x006924af in query_planner (root=, 
tlist=0x1d34fd8, qp_callback=0x693940 , 
qp_extra=0x7ffe0a077290) at planmain.c:138
#11 0x0069777e in grouping_planner (root=, 
inheritance_update=false, tuple_fraction=) at 
planner.c:1892
#12 0x00698ef7 in subquery_planner (glob=, 
parse=, parent_root=, 
hasRecursion=, tuple_fraction=0) at planner.c:966
#13 0x00699d97 in standard_planner (parse=0x1bd9778, cursorOptions=256, 
boundParams=) at planner.c:405
#14 0x00739d2a in pg_plan_query (querytree=, 
cursorOptions=, boundParams=) at 
postgres.c:808
#15 0x005a0bd6 in ExplainOneQuery (query=0x1bd9778, 
cursorOptions=, into=0x0, es=0x1bd8f58, 
queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", 
params=0x0, queryEnv=0x0)
at explain.c:365
#16 0x005a0e5d in ExplainQuery (pstate=0x1bd8d80, stmt=0x1ae1fa0, 
queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", 
params=0x0, queryEnv=0x0, dest=0x1bd8e90) at explain.c:254
#17 0x007408f2 in standard_ProcessUtility (pstmt=0x1ae2050, 
queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", 

Re: Foreign keys and partitioned tables

2018-04-03 Thread Alvaro Herrera
While adding some more tests for the "action" part (i.e. updates and
deletes on the referenced table) I came across a bug that was causing
the server to crash ... but it's actually a preexisting bug in an
assert.  The fix is in 0001.

0002 I already posted: don't clone row triggers that are declared
internal.  As you noted, there is no test that changes because of this.
I haven't tried yet; the only case that comes to mind is doing something
with a deferred unique constraint.  Anyway, it was clear to me from the
beginning that cloning internal triggers was not going to work for the
FK constraints.

0003 is the main patch, which is a bit changed from v4, notably to cover
your review comments:

Peter Eisentraut wrote:

> > -  tables and permanent tables.
> > +  tables and permanent tables.  Also note that while it is permitted to
> > +  define a foreign key on a partitioned table, declaring a foreign key
> > +  that references a partitioned table is not allowed.
> >   
> 
> Maybe use "possible" or "supported" instead of "allowed" and "permitted"
> in this and similar cases.

Fixed.

> @@ -7226,12 +7235,23 @@ ATAddForeignKeyConstraint(AlteredTableInfo *tab,
> Relation rel,
>  * numbers)
>  */
> if (pkrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> +   {
> +   /* fix recursion in ATExecValidateConstraint to enable this case */
> +   if (fkconstraint->skip_validation && !fkconstraint->initially_valid)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +errmsg("cannot add NOT VALID foreign key to
> relation \"%s\"",
> +   RelationGetRelationName(pkrel;
> +   }
> 
> Maybe this error message should be more along the lines of "is not
> supported yet".

I added errdetail("This feature is not yet supported on partitioned tables.")))

> Also, this restriction should perhaps be mentioned in
> the documentation additions that we have been discussing.

Added a note in ALTER TABLE.

> 
> The first few hunks in ri_triggers.c (the ones that refer to the
> pktable) are apparently for the postponed part of the feature, foreign
> keys referencing partitioned tables.  So I think those hunks should be
> dropped from this patch.

Yeah, reverted.

> The removal of the ONLY for the foreign key queries also affects
> inherited tables, in undesirable ways.  For example, consider this
> script: [...]

> With the patch, this will have deleted the (111, 1) record in test2a,
> which it did not do before.
> 
> I think the trigger functions need to look up whether the table is a
> partitioned table and decide whether the use ONLY based on that.
> (This will probably also apply to the PK side, when that is
> implemented.)

Updated this.  I added your test script to inherit.sql.

> 
> In pg_dump, maybe this can be refined:
> 
> +   /*
> +* For partitioned tables, it doesn't work to emit constraints
> as not
> +* inherited.
> +*/
> +   if (tbinfo->relkind == RELKIND_PARTITIONED_TABLE)
> +   only = "";
> +   else
> +   only = "ONLY ";
> 
> We use ONLY for foreign keys on inherited tables, but foreign keys are
> not inherited anyway, so this is at best future-proofing.  We could
> just drop the ONLY unconditionally.  Or at least explain this more.

Yeah, I admit this is a bit weird.  I expanded the comment but didn't
change the code:

/*
 * Foreign keys on partitioned tables are always declared as 
inheriting
 * to partitions; for all other cases, emit them as applying 
ONLY
 * directly to the named table, because that's how they work for
 * regular inherited tables.
 */

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From d5313eea2e0196631d3269f453eb3bad86e5eca5 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 3 Apr 2018 13:58:49 -0300
Subject: [PATCH v5 1/3] Ancient bug fix

---
 src/backend/utils/adt/ri_triggers.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/ri_triggers.c 
b/src/backend/utils/adt/ri_triggers.c
index 3bb708f863..d0fe65cea9 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -514,7 +514,7 @@ ri_Check_Pk_Match(Relation pk_rel, Relation fk_rel,
boolresult;
 
/* Only called for non-null rows */
-   Assert(ri_NullCheck(RelationGetDescr(fk_rel), old_row, riinfo, true) == 
RI_KEYS_NONE_NULL);
+   Assert(ri_NullCheck(RelationGetDescr(pk_rel), old_row, riinfo, true) == 
RI_KEYS_NONE_NULL);
 
if (SPI_connect() != SPI_OK_CONNECT)
elog(ERROR, "SPI_connect failed");
-- 
2.11.0

>From ed65fe82a8395e086ee0ceeea6243da10c26ac2b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: 

Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-03 Thread Daniel Gustafsson
> On 03 Apr 2018, at 18:38, Heikki Linnakangas  wrote:
> 
> On 03/04/18 19:09, Andres Freund wrote:
>> Hi,
>> On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:
>>> On 01/04/18 20:32, Andres Freund wrote:
 On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
> * I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
> that I had available (not an emulator, but a VM on a shared ARM64 server).
 
 Have you seen actual postgres performance benefits with the patch?
>>> 
>>> I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
>>> ran with the Slicing-by-8 and Intel SSE patches, when we added those
>>> (https://www.postgresql.org/message-id/20141119155811.GA32492%40toroid.org).
>>> I ran pgbench, with scale factor 5, until it had generated about 1 GB of
>>> WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
>>> about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
>>> that the CRC computation took about 30% of the CPU time before, and about
>>> 12% after, which sounds about right. That's not as big a speedup as we saw
>>> with the corresponding Intel SSE instructions back in 2014, but still quite
>>> worthwhile.
>> Cool.  Based on a skim the patch looks reasonable.
> 
> Thanks.
> 
> I bikeshedded with myself on the naming of things, and decided to use "ARMv8" 
> in the variable and file names, instead of ARM64 or ARMCE or ARM64CE. The CRC 
> instructions were introduced in ARM v8 (as an optional feature), it's not 
> really related to the 64-bitness, even though the 64-bit instruction set was 
> also introduced in ARM v8. Other than that, and some comment fixes, this is 
> the same as the previous patch version.

Noticed two minor documentation issues in a quick skim of the attached patch:

The following line should say SSE and not SSSE (and the same typo is in
src/include/pg_config.h.in and src/include/pg_config.h.win32).  While not
introduced in this patch, it’s adjacent to the patched codepath and on topic so
may well be fixed while in there.

 AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use 
Intel SSSE 4.2 CRC instructions with a runtime check.])

The documentation in configure.in uses “runtime” rather than “run time” in all
lines except this one:

+# uses the CRC instructions, compile both, and select at run time.

cheers ./daniel


Re: pgsql: Validate page level checksums in base backups

2018-04-03 Thread Tom Lane
Magnus Hagander  writes:
> On Tue, Apr 3, 2018 at 8:29 PM, Tom Lane  wrote:
>> It's scribbling on the source cluster's disk files and assuming that that
>> translates one-for-one to what gets sent to the slave server --- but what
>> if some of the blocks that it modifies on-disk are resident in the
>> source's shared buffers?  I think you'd have to shut down the source and
>> then apply the corruption if you want stable results.

> It doesn't actually use a slave server as part of the tests.
> And basebackups don't read from the source's shared buffers, but it *does*
> read from the kernel buffers.

Right, so the failure case occurs when the source server writes back a
dirty page from its shared buffers to disk, in between the test script's
corrupting that page on-disk and then trying to read it.

Again, you'd have several orders of magnitude better chance of getting
reproducible behavior if you were targeting something that wasn't a
system catalog.

regards, tom lane



Re: pgsql: Validate page level checksums in base backups

2018-04-03 Thread Magnus Hagander
On Tue, Apr 3, 2018 at 8:29 PM, Tom Lane  wrote:

> Magnus Hagander  writes:
> > Yeah, there's clearly a second problem here.
>
> I think this test script is broken in many ways.
>
> It's scribbling on the source cluster's disk files and assuming that that
> translates one-for-one to what gets sent to the slave server --- but what
> if some of the blocks that it modifies on-disk are resident in the
> source's shared buffers?  I think you'd have to shut down the source and
> then apply the corruption if you want stable results.
>

It doesn't actually use a slave server as part of the tests.

And basebackups don't read from the source's shared buffers, but it *does*
read from the kernel buffers.


> I'd bet a good lunch that nondefault BLCKSZ would break it, as well,
> since the way in which the corruption is induced is just guessing
> as to where page boundaries are.
>

Yeah, that might be a problem. Those should be calculated from the block
size.


> Also, scribbling on tables as sensitive as pg_class is just asking for
> trouble IMO.  I don't see anything in this test, for example, that
> prevents autovacuum from running and causing a PANIC before the test
> can complete.  Even with AV off, there's a good chance that clobber-
> cache-always animals will fall over because they do so many more
> physical accesses to the system catalogs.  I'd suggest inducing the
> corruption in some user table(s) that we can more tightly constrain
> the source server's accesses to.
>

Yeah, that seems like a good idea. And probably also shut the server down
while writing the corruption, just in case.

Will put looking into that on my todo list for when I'm back, unless beaten to
it. Michael, do you want a stab at it?

//Magnus


Re: pgsql: Validate page level checksums in base backups

2018-04-03 Thread Peter Geoghegan
On Tue, Apr 3, 2018 at 11:29 AM, Tom Lane  wrote:
> Also, scribbling on tables as sensitive as pg_class is just asking for
> trouble IMO.  I don't see anything in this test, for example, that
> prevents autovacuum from running and causing a PANIC before the test
> can complete.

+1

> Even with AV off, there's a good chance that clobber-
> cache-always animals will fall over because they do so many more
> physical accesses to the system catalogs.  I'd suggest inducing the
> corruption in some user table(s) that we can more tightly constrain
> the source server's accesses to.

I've simulated quite a few novel corruption scenarios using pg_hexedit
in the past year. It would be nice if pg_prewarm offered an easy way
to evict from shared_buffers for repeated testing. Obviously you can
accomplish the same thing in other ways, but that isn't ideal, and
particularly hinders automated testing.

-- 
Peter Geoghegan



Re: pgsql: Validate page level checksums in base backups

2018-04-03 Thread Tom Lane
Magnus Hagander  writes:
> Yeah, there's clearly a second problem here.

I think this test script is broken in many ways.

It's scribbling on the source cluster's disk files and assuming that that
translates one-for-one to what gets sent to the slave server --- but what
if some of the blocks that it modifies on-disk are resident in the
source's shared buffers?  I think you'd have to shut down the source and
then apply the corruption if you want stable results.

I'd bet a good lunch that nondefault BLCKSZ would break it, as well,
since the way in which the corruption is induced is just guessing
as to where page boundaries are.

Also, scribbling on tables as sensitive as pg_class is just asking for
trouble IMO.  I don't see anything in this test, for example, that
prevents autovacuum from running and causing a PANIC before the test
can complete.  Even with AV off, there's a good chance that clobber-
cache-always animals will fall over because they do so many more
physical accesses to the system catalogs.  I'd suggest inducing the
corruption in some user table(s) that we can more tightly constrain
the source server's accesses to.

regards, tom lane



Re: Comment update in BuildTupleFromCStrings()

2018-04-03 Thread Bruce Momjian
On Fri, Mar 23, 2018 at 03:00:37PM +0530, Ashutosh Bapat wrote:
> Hi,
> BuildTupleFromCStrings() has comment "/* Call the "in" function for
> each non-dropped attribute */". It then calls the in function even
> when it's going to set that attribute to NULL.
> 1189 if (!TupleDescAttr(tupdesc, i)->attisdropped)
> 1190 {
> 1191 /* Non-dropped attributes */
> 1192 dvalues[i] = InputFunctionCall(&attinmeta->attinfuncs[i],
> 1193values[i],
> 1194attinmeta->attioparams[i],
> 1195attinmeta->atttypmods[i]);
> 1196 if (values[i] != NULL)
> 1197 nulls[i] = false;
> 1198 else
> 1199 nulls[i] = true;
> 1200 }
> 
>  If we are setting isnull to true, i.e. it's a NULL value, the dvalues
> entry doesn't matter, but we still invoke the corresponding in function,
> which looks strange and the comment doesn't help. But there's code in
> make_tuple_from_result_row() which does the same thing and explains why
> we need to invoke the in() function even on NULL values. I thought
> the same comment applies here. Here's a patch to update the comment in
> BuildTupleFromCStrings().

Patch applied.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [HACKERS] Runtime Partition Pruning

2018-04-03 Thread Beena Emerson
Hello,

On Tue, Apr 3, 2018 at 11:14 PM, Jesper Pedersen
 wrote:
> Hi David,
>
> On 03/31/2018 09:52 AM, David Rowley wrote:
>>
>> I've attached a new version of the patch.  I'm now at v18 after having
>> some versions of the patch that I didn't release which were based on
>> various versions of Amit's faster partition pruning patch.
>>
>
> Thank you for the updated patch set !
>
> I have tested this together with Amit's v46 patch.
>
>
> Also, I'm seeing a regression for check-world in
> src/test/regress/results/inherit.out
>

With Amit's v46 patch,  the following query in partition_prune was
crashing during make check.
explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-03 Thread Tom Lane
Heikki Linnakangas  writes:
> I was just about to commit this, when I started to wonder: Do we need to 
> worry about alignment? As the patch stands, it will merrily do unaligned 
> 8-byte loads. Is that OK on ARM? It seems to work on the system I've 
> been testing on, but I don't know. And even if it's OK, would it perform 
> better if we did 1-byte loads in the beginning, until we reach the 
> 8-byte boundary?

I'm pretty sure that some ARM platforms emulate unaligned access through
kernel trap handlers, which would certainly make this a lot slower than
handling the unaligned bytes manually.  Maybe that doesn't apply to any
ARM CPU that has this instruction ... but as you said, it'd be better
to consider the presence of the instruction as orthogonal to other
CPU features.

regards, tom lane



Re: [HACKERS] Runtime Partition Pruning

2018-04-03 Thread Jesper Pedersen

Hi David,

On 03/31/2018 09:52 AM, David Rowley wrote:

I've attached a new version of the patch.  I'm now at v18 after having
some versions of the patch that I didn't release which were based on
various versions of Amit's faster partition pruning patch.



Thank you for the updated patch set !

I have tested this together with Amit's v46 patch.

The attached case doesn't trigger a generic plan, so basically all time 
is spent in GetCachedPlan.


The standard table case (std.sql) gives:

 generic_cost = 8.4525
 avg_custom_cost = 13.4525
 total_custom_cost = 67.2625

whereas the 64 hash partition case (hash.sql) gives:

 generic_cost = 540.32
 avg_custom_cost = 175.9425
 total_custom_cost = 879.7125

I tested with pgbench -M prepared -f select.sql.
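
For context, those numbers drive the cached-plan choice roughly as follows; this
is a simplified sketch of choose_custom_plan() in plancache.c, not the exact code:

    #include <stdbool.h>

    /*
     * Sketch: keep building custom plans while the generic plan's estimated
     * cost stays above the running average cost of the custom plans.  With
     * the hash-partitioned case above (540.32 > 175.94) that never flips,
     * so every execution replans, and the time shows up in GetCachedPlan.
     */
    static bool
    prefers_custom_plan(int num_custom_plans,
                        double generic_cost, double avg_custom_cost)
    {
        if (num_custom_plans < 5)   /* always try a few custom plans first */
            return true;
        return generic_cost > avg_custom_cost;
    }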

Also, I'm seeing a regression for check-world in 
src/test/regress/results/inherit.out


***
*** 642,648 
  -+---+---+-
   mlparted_tab_part1  | 1 | a |
   mlparted_tab_part2a | 2 | a |
!  mlparted_tab_part2b | 2 | b | xxx
   mlparted_tab_part3  | 3 | a | xxx
  (4 rows)

--- 642,648 
  -+---+---+-
   mlparted_tab_part1  | 1 | a |
   mlparted_tab_part2a | 2 | a |
!  mlparted_tab_part2b | 2 | b |
   mlparted_tab_part3  | 3 | a | xxx
  (4 rows)

I'll spend some more time tomorrow.

Thanks for working on this !

Best regards,
 Jesper


hash.sql
Description: application/sql


select.sql
Description: application/sql


std.sql
Description: application/sql


Re: BRIN FSM vacuuming questions

2018-04-03 Thread Tom Lane
I wrote:
> [ assorted complaining about BRIN FSM management ]

Here's a proposed patch for this.  Changes:

* Do RecordPageWithFreeSpace unconditionally in brin_page_cleanup,
and do FreeSpaceMapVacuum unconditionally in brin_vacuum_scan.
Because of the FSM's roundoff behavior, the old complications here
weren't really buying any savings.

* Elsewhere, we are trying to update the FSM for just a single-page
status update, so use FreeSpaceMapVacuumRange() to limit how much
of the upper FSM gets traversed.

* Fix a couple of places that neglected to do the upper-page
vacuuming at all after recording new free space.  If the policy
is to be that BRIN should do that, it should do it everywhere.

* Remove premature RecordPageWithFreeSpace call in brin_getinsertbuffer
where it's about to return an extended page to the caller.  The caller
should do that, instead, after it's inserted its new tuple.  Fix the
one caller that forgot to do so.

* Simplify logic in brin_doupdate's same-page-update case by
postponing brin_initialize_empty_new_buffer to after the critical
section; I see little point in doing it before.

* Avoid repeat calls of RelationGetNumberOfBlocks in brin_vacuum_scan.

* Avoid duplicate BufferGetBlockNumber and BufferGetPage calls in
a couple of places where we already had the right values.

* Move a BRIN_elog call out of a critical section; that's pretty
unsafe and I don't think it buys us anything to not wait till
after the critical section.

* I moved the "*extended = false" step in brin_getinsertbuffer into
the routine's main loop.  There's no actual bug there, since the loop
can't iterate with *extended still true, but it doesn't seem very
future-proof as coded; and it's certainly not documented as a loop
invariant.

* Assorted comment improvements.

The use of FreeSpaceMapVacuumRange makes this a HEAD-only patch.
I'm not sure if any of the other changes are worth back-patching.

regards, tom lane

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 0e5849e..6ed115f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1121,16 +1121,22 @@ initialize_brin_buildstate(Relation idxRel, BrinRevmap *revmap,
 static void
 terminate_brin_buildstate(BrinBuildState *state)
 {
-	/* release the last index buffer used */
+	/*
+	 * Release the last index buffer used.  We might as well ensure that
+	 * whatever free space remains in that page is available in FSM, too.
+	 */
 	if (!BufferIsInvalid(state->bs_currentInsertBuf))
 	{
 		Page		page;
+		Size		freespace;
+		BlockNumber blk;
 
 		page = BufferGetPage(state->bs_currentInsertBuf);
-		RecordPageWithFreeSpace(state->bs_irel,
-BufferGetBlockNumber(state->bs_currentInsertBuf),
-PageGetFreeSpace(page));
+		freespace = PageGetFreeSpace(page);
+		blk = BufferGetBlockNumber(state->bs_currentInsertBuf);
 		ReleaseBuffer(state->bs_currentInsertBuf);
+		RecordPageWithFreeSpace(state->bs_irel, blk, freespace);
+		FreeSpaceMapVacuumRange(state->bs_irel, blk, blk + 1);
 	}
 
 	brin_free_desc(state->bs_bdesc);
@@ -1445,14 +1451,15 @@ union_tuples(BrinDesc *bdesc, BrinMemTuple *a, BrinTuple *b)
 static void
 brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy)
 {
-	bool		vacuum_fsm = false;
+	BlockNumber nblocks;
 	BlockNumber blkno;
 
 	/*
 	 * Scan the index in physical order, and clean up any possible mess in
 	 * each page.
 	 */
-	for (blkno = 0; blkno < RelationGetNumberOfBlocks(idxrel); blkno++)
+	nblocks = RelationGetNumberOfBlocks(idxrel);
+	for (blkno = 0; blkno < nblocks; blkno++)
 	{
 		Buffer		buf;
 
@@ -1461,15 +1468,15 @@ brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy)
 		buf = ReadBufferExtended(idxrel, MAIN_FORKNUM, blkno,
  RBM_NORMAL, strategy);
 
-		vacuum_fsm |= brin_page_cleanup(idxrel, buf);
+		brin_page_cleanup(idxrel, buf);
 
 		ReleaseBuffer(buf);
 	}
 
 	/*
-	 * If we made any change to the FSM, make sure the new info is visible all
-	 * the way to the top.
+	 * Update all upper pages in the index's FSM, as well.  This ensures not
+	 * only that we propagate leaf-page FSM updates made by brin_page_cleanup,
+	 * but also that any pre-existing damage or out-of-dateness is repaired.
 	 */
-	if (vacuum_fsm)
-		FreeSpaceMapVacuum(idxrel);
+	FreeSpaceMapVacuum(idxrel);
 }
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 60a7025..040cb62 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -64,6 +64,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	BrinTuple  *oldtup;
 	Size		oldsz;
 	Buffer		newbuf;
+	BlockNumber newblk = InvalidBlockNumber;
 	bool		extended;
 
 	Assert(newsz == MAXALIGN(newsz));
@@ -101,6 +102,8 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 			Assert(!extended);
 			newbuf = InvalidBuffer;
 		}
+		else
+			newblk = BufferGetBlockNumber(newbuf);
 	}
 	else
 	{
@@ -136,7 

Re: BUG #15122: can't import data if table has a constraint with a function calling another function

2018-04-03 Thread Bruce Momjian
On Tue, Mar 20, 2018 at 12:11:45PM +, Andrew Gierth wrote:
> > "Asier" == Asier Lostalé  writes:
> 
>  Asier> Thanks Andrew for your quick response and clear explanation.
> 
>  Asier> Can I understand from your explanation this is not considered as
>  Asier> a bug?
> 
> I would call it a misfeature rather than a bug.
> 
>  Asier> Although adding a qualified reference works around the
>  Asier> problem, it forces me to write pl code that is aware of the schema
>  Asier> it is going to be imported into. How could I write this code to be
>  Asier> schema-agnostic, so I can import it into any schema without
>  Asier> modifying it?
> 
> For plpgsql (and other pl/* languages, but not LANGUAGE SQL) the best
> way is probably to do this:
> 
> SET search_path = public;   -- or whatever schema
> 
> CREATE OR REPLACE FUNCTION is_even_positive(integer)
>   RETURNS boolean
>   LANGUAGE plpgsql
>   IMMUTABLE
>   SET SEARCH_PATH FROM CURRENT-- ** this is the important bit
>   AS $$
> begin
> return is_even($1) and $1 >= 0;
> end;
> $$;
> 
> Some caveats:
> 
> 1) The default search_path is "$user",public. Using SET SEARCH_PATH FROM
>CURRENT doesn't interact well with this (arguably this part _is_ a
>bug), so either ensure that the search_path is set to something that
>doesn't exclude $user, or (if you need something that works in a
>script) you can canonicalize it first using this query:
> 
>SELECT set_config('search_path',
>  string_agg(quote_ident(s),','),
>  false)  -- change to true for equivalent of SET LOCAL
>  FROM unnest(current_schemas(false)) s;
> 
> 2) This doesn't work well for LANGUAGE SQL functions since it would
>block inlining, which is usually the primary reason for using
>LANGUAGE SQL in the first place. I don't know of any good workaround
>for those except to explicitly use the schema in the function body
>(possibly via text substitution).

[Hackers email list added.]

Uh, you might want to look at this thread too:


https://www.postgresql.org/message-id/flat/152106914669.1223.5104148605998271987%40wrigleys.postgresql.org#152106914669.1223.5104148605998271...@wrigleys.postgresql.org

In that thread there is discussion of how function body extensions can
reference other extension objects when the function doesn't know what
schema it is being installed in.  It uses SQL functions in an index, and
it is inlining that is causing the restore failure.

I had forgotten that SET SEARCH_PATH FROM CURRENT can be used during
function creation, but that forces the function to trust the schema it
is being installed in, which we don't want because it opens us up to the
same problems the removal of public was trying to avoid.

So, the crux of the problem is how to install a function in a
non-predefined schema, e.g. public, and reference other extension
objects in a safe and transparent way.  This is true of non-extension
objects as well if they are assuming they can reference other objects
that were created earlier, and the schema is not predefined.
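
For the extension-script case, the text-substitution approach mentioned above can
use @extschema@, which is only replaced inside an extension's install script and
only when the extension is marked non-relocatable; a hedged sketch (function names
reuse the example above):

    -- In the extension's SQL script; @extschema@ is substituted at CREATE
    -- EXTENSION time with the schema the extension is being installed into.
    CREATE FUNCTION is_even_positive(integer) RETURNS boolean
        LANGUAGE sql IMMUTABLE
        AS $$ SELECT @extschema@.is_even($1) AND $1 >= 0 $$;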

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: pgsql: Validate page level checksums in base backups

2018-04-03 Thread Magnus Hagander
On Tue, Apr 3, 2018 at 7:13 PM, Andres Freund  wrote:

> On 2018-04-03 11:52:26 +, Magnus Hagander wrote:
> > Validate page level checksums in base backups
> >
> > When base backups are run over the replication protocol (for example
> > using pg_basebackup), verify the checksums of all data blocks if
> > checksums are enabled. If checksum failures are encountered, log them
> > as warnings but don't abort the backup.
> >
> > This becomes the default behaviour in pg_basebackup (provided checksums
> > are enabled on the server), so add a switch (-k) to disable the checks
> > if necessary.
> >
> > Author: Michael Banck
> > Reviewed-By: Magnus Hagander, David Steele
> > Discussion: https://postgr.es/m/20180228180856.GE13784@nighthawk.caipicrew.dd-dns.de
>
> Hm, something isn't quite right with the test:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2018-04-03%2016%3A10%3A01
>
> # WARNING:  6 total checksum verification failures
> # pg_basebackup: checksum error occured
> # '
> # doesn't match '(?^s:^WARNING.*7 total checksum verification
> failures)'
>

Yeah, there's clearly a second problem here.

I'm stuck in meetings most of tonight, so I might not get to it right away.
If someone else figures it out meanwhile, please do; otherwise I'll get to
it as soon as I can.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pgsql: Validate page level checksums in base backups

2018-04-03 Thread Andres Freund
On 2018-04-03 11:52:26 +, Magnus Hagander wrote:
> Validate page level checksums in base backups
> 
> When base backups are run over the replication protocol (for example
> using pg_basebackup), verify the checksums of all data blocks if
> checksums are enabled. If checksum failures are encountered, log them
> as warnings but don't abort the backup.
> 
> This becomes the default behaviour in pg_basebackup (provided checksums
> are enabled on the server), so add a switch (-k) to disable the checks
> if necessary.
> 
> Author: Michael Banck
> Reviewed-By: Magnus Hagander, David Steele
> Discussion: 
> https://postgr.es/m/20180228180856.ge13...@nighthawk.caipicrew.dd-dns.de

Hm, something isn't quite right with the test:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pogona&dt=2018-04-03%2016%3A10%3A01

# WARNING:  6 total checksum verification failures
# pg_basebackup: checksum error occured
# '
# doesn't match '(?^s:^WARNING.*7 total checksum verification failures)'

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Anthony Iliopoulos
On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:
> On 3 April 2018 at 14:36, Anthony Iliopoulos  wrote:
> 
> > If EIO persists between invocations until explicitly cleared, a process
> > cannot possibly make any decision as to if it should clear the error
> 
> I still don't understand what "clear the error" means here. The writes
> still haven't been written out. We don't care about tracking errors,
> we just care whether all the writes to the file have been flushed to
> disk. By "clear the error" you mean throw away the dirty pages and
> revert part of the file to some old data? Why would anyone ever want
> that?

It means that the responsibility of recovering the data is passed
back to the application. The writes may never be able to be written
out. How would a kernel deal with that? Either discard the data
(and have the writer acknowledge) or buffer the data until reboot
and simply risk going OOM. It's not what someone would want, but
rather *need* to deal with, one way or the other. At least on the
application-level there's a fighting chance for restoring to a
consistent state. The kernel does not have that opportunity.

> > But instead of deconstructing and debating the semantics of the
> > current mechanism, why not come up with the ideal desired form of
> > error reporting/tracking granularity etc., and see how this may be
> > fitted into kernels as a new interface.
> 
> Because Postgres is portable software that won't be able to use some
> Linux-specific interface. And doesn't really need any granular error

I don't really follow this argument; Pg is admittedly using non-portable
interfaces (e.g. sync_file_range()). While it's nice to avoid platform
specific hacks, expecting that the POSIX semantics will be consistent
across systems is simply a 90's pipe dream. While it would be lovely
to have really consistent interfaces for application writers, this is
simply not going to happen any time soon.

And since those problematic semantics of fsync() appear to be prevalent
in other systems as well, and are not likely to be changed, you cannot
rely on the preconception that once buffers are handed over to the kernel
you have a guarantee that they will eventually be persisted no matter what.
(Why even bother having fsync() in that case? The kernel would eventually
evict and writeback dirty pages anyway. The point of reporting the error
back to the application is to give it a chance to recover - the kernel
could repeat "fsync()" itself internally if this would solve anything).
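
To make the disputed behaviour concrete, a minimal standalone sketch
(illustrative C, not PostgreSQL code; the path is made up and the outcome
depends on the kernel in question):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char    buf[8192];
        int     fd = open("/mnt/flaky/datafile", O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
            return 1;
        memset(buf, 'x', sizeof(buf));
        (void) write(fd, buf, sizeof(buf));

        if (fsync(fd) != 0)
            perror("fsync");    /* writeback failed, EIO reported once */

        /*
         * On the kernels under discussion the dirty pages may now be marked
         * clean, so a retry can return 0 even though nothing reached disk.
         */
        if (fsync(fd) == 0)
            printf("second fsync reported success; data may still be lost\n");

        close(fd);
        return 0;
    }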

> reporting system anyways. It just needs to know when all writes have
> been synced to disk.

Well, it does know when *some* writes have *not* been synced to disk,
exactly because the responsibility is passed back to the application.
I do realize this puts more of a burden back on the application, but what
would a viable alternative be? Would you rather have a kernel that
risks periodically going OOM due to this design decision?

Best regards,
Anthony



Re: Prefix operator for text and spgist support

2018-04-03 Thread Teodor Sigaev
Thank you, pushed with some editorialization, renaming text_startswith to
starts_with.
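
For anyone following along, a quick usage sketch with the committed names (my
illustration, not taken from the patch's tests; the EXPLAIN behaviour is just
what one would expect from the SP-GiST support discussed below):

    SELECT starts_with('alphabet', 'alph');   -- true
    SELECT 'alphabet' ^@ 'alph';              -- the same test, operator form

    -- With an SP-GiST index on the column, the prefix operator can use it:
    CREATE TABLE words (w text);
    CREATE INDEX words_spgist_idx ON words USING spgist (w);
    SET enable_seqscan = off;
    EXPLAIN SELECT * FROM words WHERE w ^@ 'alph';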


Ildus Kurbangaliev wrote:

On Fri, 23 Mar 2018 11:45:33 +0300
Alexander Korotkov  wrote:


On Thu, Mar 22, 2018 at 7:09 PM, Teodor Sigaev 
wrote:


Patch looks reasonable, but I see some room for improvement:
spg_text_leaf_consistent() only needs to check with
text_startswith() if the reconstructed value passed to leaf consistent is
shorter than the given prefix. For example, if level >= length of the
prefix then we guarantee that the fully reconstructed value matches too.
But do not miss that you may need to return a value for an index-only
scan; consult the returnData field

In attachment rebased and minorly edited version of your patch.



I took a look at this patch.  In addition to Teodor's comments I can
note following.

* This line looks strange for me.

-   if (strategy > 10)
+   if (strategy > 10 && strategy !=
RTPrefixStrategyNumber)

It's not because we added the strategy != RTPrefixStrategyNumber condition
there.
It's because we already used a magic number here and now have a mix of
magic number and macro constant in one line.  Since we're touching
this place anyway, could we get rid of the magic numbers here?

* I'm a little concerned about the operator name.  We're going to choose the
@^ operator for prefix search without any preliminary discussion.  However,
personally I don't have better ideas :)


Teodor, Alexander, thanks for the review. In the new version I have added the
optimization in spgist using the level variable and also got rid of the magic
numbers.

About the operator: it's actually ^@ (not @^ :)). I thought about it and
don't really have any idea what operator could be used instead.

Attached version 5 of the patch.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-03 Thread Andres Freund
Hi,

On 2018-04-03 19:38:42 +0300, Heikki Linnakangas wrote:
> I was just about to commit this, when I started to wonder: Do we need to
> worry about alignment? As the patch stands, it will merrily do unaligned
> 8-byte loads. Is that OK on ARM? It seems to work on the system I've been
> testing on, but I don't know. And even if it's OK, would it perform better
> if we did 1-byte loads in the beginning, until we reach the 8-byte boundary?

Architecture manual time?  They're available freely IIRC and should
answer this.

- Andres



Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-03 Thread Heikki Linnakangas

On 03/04/18 19:09, Andres Freund wrote:

Hi,

On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:

On 01/04/18 20:32, Andres Freund wrote:

On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:

* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).


Have you seen actual postgres performance benefits with the patch?


I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
ran with the Slicing-by-8 and Intel SSE patches, when we added those
(https://www.postgresql.org/message-id/20141119155811.GA32492%40toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
that the CRC computation took about 30% of the CPU time before, and about
12% after, which sounds about right. That's not as big a speedup as we saw
with the corresponding Intel SSE instructions back in 2014, but still quite
worthwhile.


Cool.  Based on a skim the patch looks reasonable.


Thanks.

I bikeshedded with myself on the naming of things, and decided to use 
"ARMv8" in the variable and file names, instead of ARM64 or ARMCE or 
ARM64CE. The CRC instructions were introduced in ARM v8 (as an optional 
feature), it's not really related to the 64-bitness, even though the 
64-bit instruction set was also introduced in ARM v8. Other than that, 
and some comment fixes, this is the same as the previous patch version.


I was just about to commit this, when I started to wonder: Do we need to 
worry about alignment? As the patch stands, it will merrily do unaligned 
8-byte loads. Is that OK on ARM? It seems to work on the system I've 
been testing on, but I don't know. And even if it's OK, would it perform 
better if we did 1-byte loads in the beginning, until we reach the 
8-byte boundary?
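
If the byte-wise lead-in turns out to be worthwhile, it could look roughly like
the sketch below (my illustration, not the committed code; assumes <arm_acle.h>
and, if needed, -march=armv8-a+crc as in the configure test):

    #include <arm_acle.h>
    #include <stddef.h>
    #include <stdint.h>

    static uint32_t
    crc32c_armv8_sketch(uint32_t crc, const void *data, size_t len)
    {
        const unsigned char *p = data;
        const unsigned char *pend = p + len;

        /* consume single bytes until the pointer is 8-byte aligned */
        while (p < pend && ((uintptr_t) p & 7) != 0)
            crc = __crc32cb(crc, *p++);

        /* main loop: the loads below are now aligned */
        while (pend - p >= 8)
        {
            crc = __crc32cd(crc, *(const uint64_t *) p);
            p += 8;
        }

        /* trailing bytes */
        while (p < pend)
            crc = __crc32cb(crc, *p++);

        return crc;
    }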


- Heikki
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index b518851441..ba5c40db01 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -667,3 +667,37 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
+
+
+# PGAC_ARMV8_CRC32C_INTRINSICS
+# ---
+# Check if the compiler supports the CRC32C instructions using the __crc32cb,
+# __crc32ch, __crc32cw, and __crc32cd intrinsic functions. These instructions
+# were first introduced in ARMv8 in the optional CRC Extension, and became
+# mandatory in ARMv8.1.
+#
+# An optional compiler flag can be passed as argument (e.g.
+# -march=armv8-a+crc). If the intrinsics are supported, sets
+# pgac_armv8_crc32c_intrinsics, and CFLAGS_ARMV8_CRC32C.
+AC_DEFUN([PGAC_ARMV8_CRC32C_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_armv8_crc32c_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [unsigned int crc = 0;
+   crc = __crc32cb(crc, 0);
+   crc = __crc32ch(crc, 0);
+   crc = __crc32cw(crc, 0);
+   crc = __crc32cd(crc, 0);
+   /* return computed value, to prevent the above being optimized away */
+   return crc == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_ARMV8_CRC32C="$1"
+  pgac_armv8_crc32c_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARMV8_CRC32C_INTRINSICS
diff --git a/configure b/configure
index 5c56f21282..561b5c419e 100755
--- a/configure
+++ b/configure
@@ -646,6 +646,7 @@ MSGMERGE
 MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
+CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 have_win32_dbghelp
 HAVE_IPV6
@@ -17254,28 +17255,175 @@ if ac_fn_c_try_compile "$LINENO"; then :
 fi
 rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
 
+# Check for ARMv8 CRC Extension intrinsics to do CRC calculations.
+#
+# First check if __crc32c* intrinsics can be used with the default compiler
+# flags. If not, check if adding -march=armv8-a+crc flag helps.
+# CFLAGS_ARMV8_CRC32C is set if the extra flag is required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=" >&5
+$as_echo_n "checking for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=... " >&6; }
+if ${pgac_cv_armv8_crc32c_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include 
+int
+main ()
+{
+unsigned int crc = 0;
+   crc = __crc32cb(crc, 0);
+   crc = __crc32ch(crc, 0);
+   crc = __crc32cw(crc, 0);
+   crc = __crc32cd(crc, 0);
+   /* return computed value, to prevent the above being optimized away */
+   return crc == 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-04-03 Thread Andres Freund
On 2018-04-03 09:56:24 -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > But let's go back to why we're considering this. The idea was to 
> > optimize this block:
> > ...
> > One trick that we could do is to replace that with a 128-bit atomic 
> > compare-and-swap instruction. Modern 64-bit Intel systems have that, 
> > it's called CMPXCHG16B. Don't know about other architectures. An atomic 
> > fetch-and-add, as envisioned in the comment above, would presumably be 
> > better, but I suspect that a compare-and-swap would be good enough to 
> > move the bottleneck elsewhere again.
> 
> +1 for taking a look at that.  A bit of experimentation shows that
> recent gcc and clang can generate that instruction using
> __sync_bool_compare_and_swap or __sync_val_compare_and_swap
> on an __int128 value.

The problem will presumably be that early Opteron AMD64s lacked that
instruction. I'm not sure which distributions still target them (Windows
dropped support a few years ago), but we should make sure both that the
necessary dispatch code isn't going to add so much overhead that it eats
into our margin, and that the generated code doesn't SIGILL on such
platforms.
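
A minimal sketch of the experiment Tom describes (my illustration; on x86-64,
gcc typically needs -mcx16 for this to compile down to CMPXCHG16B rather than
falling back to a library call):

    #include <stdbool.h>

    typedef unsigned __int128 uint128;

    static volatile uint128 shared_state;   /* e.g. insert position + link packed together */

    static bool
    cas128(volatile uint128 *ptr, uint128 expected, uint128 newval)
    {
        return __sync_bool_compare_and_swap(ptr, expected, newval);
    }

    int
    main(void)
    {
        uint128     old = shared_state;

        return cas128(&shared_state, old, old + 1) ? 0 : 1;
    }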

Greetings,

Andres Freund



Re: Running Installcheck remotely

2018-04-03 Thread Kapil Sharma
So this means that the host running the test should have the capability to
SSH to the DB instance host?

Thanks,
Kapil.


On Sat, Mar 24, 2018 at 7:21 AM, Tom Lane  wrote:

> Kapil Sharma  writes:
> > Is it possible to run installcheck (pg_regress) tests from a remote host
> ?
>
> I think if you set PGHOST and other relevant libpq environment variables,
> an installcheck run will connect where they specify.  The hard part would
> be making sure that (a) your test files match the remote server's version
> and (b) the various .so files loaded by the tests exist on the remote host
> at the same filesystem locations expected by the tests.
>
> regards, tom lane
>


Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-03 Thread Andres Freund
Hi,

On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:
> On 01/04/18 20:32, Andres Freund wrote:
> > On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
> > > * I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
> > > that I had available (not an emulator, but a VM on a shared ARM64 server).
> > 
> > Have you seen actual postgres performance benefits with the patch?
> 
> I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
> ran with the Slicing-by-8 and Intel SSE patches, when we added those
> (https://www.postgresql.org/message-id/20141119155811.GA32492%40toroid.org).
> I ran pgbench, with scale factor 5, until it had generated about 1 GB of
> WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
> about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
> that the CRC computation took about 30% of the CPU time before, and about
> 12% after, which sounds about right. That's not as big a speedup as we saw
> with the corresponding Intel SSE instructions back in 2014, but still quite
> worthwhile.

Cool.  Based on a skim the patch looks reasonable. It's a bit sad that
it's effecively linux specific. But I'm not sure we can do anything
about that atm, given the state of the "discovery" APIs on various
platforms.

Greetings,

Andres Freund



Re: Optimize Arm64 crc32c implementation in Postgresql

2018-04-03 Thread Heikki Linnakangas

On 01/04/18 20:32, Andres Freund wrote:

On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:

* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).


Have you seen actual postgres performance benefits with the patch?


I just ran a small test with pg_waldump, similar to what Abhijit 
Menon-Sen ran with the Slicing-by-8 and Intel SSE patches, when we added 
those 
(https://www.postgresql.org/message-id/20141119155811.GA32492%40toroid.org). 
I ran pgbench, with scale factor 5, until it had generated about 1 GB of 
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it 
took about 7 s, and with the special CPU instructions, about 5 s. 'perf' 
showed that the CRC computation took about 30% of the CPU time before, 
and about 12% after, which sounds about right. That's not as big a 
speedup as we saw with the corresponding Intel SSE instructions back in 
2014, but still quite worthwhile.


- Heikki



Re: [HACKERS] GUC for cleanup indexes threshold.

2018-04-03 Thread Masahiko Sawada
Sorry for the late response.

On Tue, Mar 27, 2018 at 8:01 PM, Alexander Korotkov
 wrote:
> On Tue, Mar 27, 2018 at 11:55 AM, Masahiko Sawada 
> wrote:
>>
>> On Thu, Mar 22, 2018 at 9:24 PM, Alexander Korotkov
>>  wrote:
>> > However, I see that you are comparing relative change of num_heap_tuples
>> > before and after vacuum to vacuum_cleanup_index_scale_factor.
>> > The problem is that if vacuum is very aggressive (and that is likely for
>> > append only case if this patch is committed), then num_heap_tuples
>> > changes very slightly every vacuum cycle.  So, cleanup would never
>> > happen and statistics could stall indefinitely.
>>
>> Good catch. I think we need to store the number of tuples at the time
>> the whole-index scan is performed (bulkdelete or cleanup), as your
>> patch does. So it would also need the page-upgrading code. Since
>> that value would be helpful for other types of indexes too, it might be
>> better to store it in a system catalog.
>>
>> >
>> > Another issue is that append only workloads are frequently combined
>> > with rare periodical bulk deletes of data.  Assuming that in this patch
>> > you don't consider deleted pages to trigger index cleanup, on such
>> > workload large amounts of deleted pages may reside non-recycled until
>> > next bulk delete cycle.
>>
>> Perhaps using that new value we can trigger the cleanup if the current
>> number of tuples has increased or decreased by
>> vacuum_index_scale_factor% from n_tup_last_cleanup.
>
>
> Yes, that might work.  However, a decrease in the number of tuples could be
> an inaccurate measure of the number of deleted pages.  Depending on the
> distribution of tuples per page, the same number of deleted tuples can lead
> to very different numbers of deleted pages (in corner cases it can range
> from zero to very large amounts).  If we want to skip scanning of deleted
> pages then it would be better to store their exact count as known by the
> previous bulk delete or cleanup.
>
>> >
>> > So, despite I generally like idea of storing epoch of deleted xid in the
>> > page, I think my version of patch is closer to committable shape.
>> >
>>
>> Agreed, but as I mentioned before, I'm concerned that your version of the
>> patch will become a restriction on future improvement. If the community
>> decides this feature works only for mostly append-only workloads, I think
>> your version of the patch is the best approach for now.
>
>
> At first I would like to note that all valid optimizations presented in the
> thread optimize the append case.  Thus they do give serious benefit
> on mostly append-only workloads, since the patches were about
> skipping/reducing the cleanup stage, which does serious work only in
> the append-only case.
>
> It could be possible in future to also optimize cases where only a small
> fraction of the table's tuples were deleted.  Those tuples could be deleted
> not by a full index scan, but by individual index lookups.  But I think
> such optimization is rather harder than everything mentioned in
> this thread, and it should be considered separately.

Agreed.

>
> The thing which could be improved in future is making btree able
> to skip cleanup when some deleted pages are pending to be recycled.
> But I'm not sure that this problem deserves a lot of time investment
> right now, because I think that page deletion in our btree is unfortunately
> quite a rare situation.  In real life our btree is more likely to end up
> bloated with a bad fillfactor of pages than to get many pages deleted.

Agreed. In your version of the patch we clean up recyclable pages whenever
there might be some, whereas my version could leave recyclable pages behind.
The thing I'm concerned about is that the patch would unnecessarily narrow
its effect. That is, we might be able to update the btree meta page
even during bulkdelete.

In your version of the patch we have to scan the whole index at least twice
(bulkdelete and cleanup) if even one tuple is deleted. But if the deletion
of index tuples didn't lead to any index page deletions (which is a common
case), the extra scan is really unnecessary. So I guess we can update the
meta page only when we deleted pages in the index vacuum scan. If no page
deletion happened, we don't need to update the oldest btpo.xact, so we don't
update the meta page, and of course don't WAL-log it. This can be a separate
patch though. At least the current approach is safer.
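
A rough sketch of that idea (illustrative only; it assumes the metapage-update
helper added by the cleanup-skip patch series and omits all the surrounding
vacuum bookkeeping):

    #include "postgres.h"
    #include "access/genam.h"
    #include "access/nbtree.h"

    static void
    maybe_update_cleanup_info(IndexVacuumInfo *info,
                              IndexBulkDeleteResult *stats,
                              TransactionId oldestBtpoXact)
    {
        /*
         * Only touch the metapage (and emit WAL for it) when this vacuum
         * pass actually deleted index pages; otherwise the oldest btpo.xact
         * cannot have moved and the update is unnecessary.
         */
        if (stats->pages_deleted == 0)
            return;

        _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
                                     info->num_heap_tuples);
    }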

>
> So, I would like to clarify why could my patch block future improvements
> in this area?  For instance, if we would decide to make btree able to skip
> cleanup when some delete pages are pending for recycle, we can add
> it in future.
>

Anyway, regarding the approach for this feature, I agree with your version
of the patch, and it looks good to me as the first step of this feature.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: pgsql: New files for MERGE

2018-04-03 Thread Andres Freund
Hi,

On 2018-04-03 09:24:12 +, Simon Riggs wrote:
> New files for MERGE
> src/backend/executor/nodeMerge.c   |  575 +++
> src/backend/parser/parse_merge.c   |  660 
> src/include/executor/nodeMerge.h   |   22 +
> src/include/parser/parse_merge.h   |   19 +

Getting a bit grumpy here.  So you pushed this, without responding in
any way to the objections I made in
http://archives.postgresql.org/message-id/20180403021800.b5nsgiclzanobiup%40alap3.anarazel.de
and did it in a manner that doesn't even compile?

Greetings,

Andres Freund



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Nikhil Sontakke
Hi Tomas,

>> Unfortunately, this does segfault for me in `make check` almost
>> immediately. Try

This is due to the new ERROR handling code that I added today for the
lock/unlock APIs. Will fix.

>> Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected
>> flags to be zero, so this way logical replication between postgres-10 and
>> postgres-with-2pc-decoding will be broken.

Good point. Will also test pg-10 to pg-11 logical replication to
ensure that there are no issues.

>> So I think we need a subscription parameter to enable/disable this,
>> defaulting to 'disabled'.

OK, will add it to CREATE SUBSCRIPTION. BTW, we should have allowed
storing options in array form for a subscription. We might add more
options in the future, and adding fields one by one doesn't seem that
extensible.


> 1) twophase.c
> -
>
> I think this comment is slightly inaccurate:
>
>  /*
>  * Coordinate with logical decoding backends that may be already
>  * decoding this prepared transaction. When aborting a transaction,
>  * we need to wait for all of them to leave the decoding group. If
>  * committing, we simply remove all members from the group.
>  */
>
> Strictly speaking, we're not waiting for the workers to leave the
> decoding group, but to set decodeLocked=false. That is, we may proceed
> when there still are members, but they must be in unlocked state.
>

Agreed. Will modify it to mention that it will wait only if some of
the backends are in locked state.

>
> 2) reorderbuffer.c
> --
>
> I've already said it before, I find the "flags" bitmask and rbtxn_*
> macros way less readable than individual boolean flags. It was claimed
> this was done on Andres' request, but I don't see that in the thread. I
> admit it's rather subjective, though.
>

Yeah, this is a little subjective.

> I see ReorderBuffer only does the lock/unlock around apply_change and
> RelationIdGetRelation. That seems insufficient - RelidByRelfilenode can
> do heap_open on pg_class, for example. And I guess we need to protect
> rb->message too, because who knows what the plugin does in the callback?
>
> Also, we should not allocate gid[GIDSIZE] for every transaction. For
> example subxacts never need it, and it seems rather wasteful to allocate
> 200B when the rest of the struct has only ~100B. This is particularly
> problematic considering ReorderBufferTXN is not spilled to disk when
> reaching the memory limit. It needs to be allocated ad-hoc only when
> actually needed.
>

OK, will look at allocating GID only when needed.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Tomas Vondra
FWIW, a couple of additional comments based on eyeballing the diffs:

1) twophase.c
-

I think this comment is slightly inaccurate:

 /*
 * Coordinate with logical decoding backends that may be already
 * decoding this prepared transaction. When aborting a transaction,
 * we need to wait for all of them to leave the decoding group. If
 * committing, we simply remove all members from the group.
 */

Strictly speaking, we're not waiting for the workers to leave the
decoding group, but to set decodeLocked=false. That is, we may proceed
when there still are members, but they must be in unlocked state.


2) reorderbuffer.c
--

I've already said it before, I find the "flags" bitmask and rbtxn_*
macros way less readable than individual boolean flags. It was claimed
this was done on Andres' request, but I don't see that in the thread. I
admit it's rather subjective, though.

I see ReorderBuffer only does the lock/unlock around apply_change and
RelationIdGetRelation. That seems insufficient - RelidByRelfilenode can
do heap_open on pg_class, for example. And I guess we need to protect
rb->message too, because who knows what the plugin does in the callback?

Also, we should not allocate gid[GIDSIZE] for every transaction. For
example subxacts never need it, and it seems rather wasteful to allocate
200B when the rest of the struct has only ~100B. This is particularly
problematic considering ReorderBufferTXN is not spilled to disk when
reaching the memory limit. It needs to be allocated ad-hoc only when
actually needed.
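
One possible shape of that ad-hoc allocation (a sketch only; the gid pointer
field and helper name are assumptions, not the patch's actual code):

    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    /* allocate and remember the GID only for transactions that need one */
    static void
    reorder_buffer_set_gid(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           const char *gid)
    {
        Assert(txn->gid == NULL);
        txn->gid = MemoryContextStrdup(rb->context, gid);
    }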

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Tom Lane
Tomas Vondra  writes:
> On 04/03/2018 04:37 PM, Alvaro Herrera wrote:
>> Tomas Vondra wrote:
>>> Yes, that is a good point actually - we need to test that replication
>>> between PG10 and PG11 works correctly, i.e. that the protocol version is
>>> correctly negotiated, and features are disabled/enabled accordingly etc.

>> Maybe it'd be good to have a buildfarm animal to specifically test for
>> that?

> Not sure a buildfarm supports running two clusters with different
> versions easily?

You'd need some specialized buildfarm infrastructure like --- maybe the
same as --- the infrastructure for testing cross-version pg_upgrade.
Andrew could speak to the details better than I.

regards, tom lane



Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Magnus Hagander
On Tue, Apr 3, 2018 at 4:32 PM, Tom Lane  wrote:

> I wrote:
> > Magnus Hagander  writes:
> >> Unless.. %ld is the wrong thing to print:
> >> static int64 total_checksum_failures;
> >> We should perhaps be using something other than %ld to print that?
>
> > INT64_FORMAT.
>
> BTW, don't just stick INT64_FORMAT into the message-to-be-translated,
> or you'll break things for translation.  Good practice is to sprintf
> into a local char array with INT64_FORMAT, then include the number
> into the displayed message with %s.  You can find examples easily
> by grepping for INT64_FORMAT.
>

Thanks for the hint. I've pushed a fix along this line; let's see if it clears
things up.
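
For reference, the pattern looks roughly like this (a sketch of the approach,
reusing the total_checksum_failures variable quoted above; not necessarily the
exact committed hunk):

    char        buf[64];

    snprintf(buf, sizeof(buf), INT64_FORMAT, total_checksum_failures);
    ereport(WARNING,
            (errmsg("%s total checksum verification failures", buf)));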


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: jsonb nesting level edge case

2018-04-03 Thread Bruce Momjian
On Sun, Mar 18, 2018 at 09:42:14PM +0100, Dmitry Dolgov wrote:
> Hi,
> 
> I've just realized, that looks like there is one edge-case in the current 
> jsonb
> implementation, that can be quite confusing, and I couldn't find any related
> discussion about it. From what I see there is no limit for how many nested
> levels can be in a jsonb field, and e.g. when a jsonb is created from a string
> it basically means that we're limited only by `check_stack_depth` (in the
> following case it's about recursive `parse_object`/`parse_object_field`). So
> you can create a jsonb with quite many nesting levels:
> 
> =# insert into test_jsonb values((
> (select '{' || string_agg('"field' || s.id || '"', ': {')
> from generate_series(1, 1) as s(id))
> || ': "some_value"' ||
> (select string_agg('}', '') from generate_series(1, 1)))::jsonb);
> 
> INSERT 0 1
> Time: 29.129 ms
> 
> But at the same time `jsonb_set` apparently has a different recursion schema,
> and reaches max_stack_depth faster (in this case it's about recursive
> `setPath`/`setPathObject`). It means that you can create a document, but you
> can't update its value using function, that was specified for that (so you
> probably need to override the entire jsonb to actually update something):
> 
> =# update test_jsonb set data = jsonb_set(data,
> (select array_agg('field' || s.id) from generate_series(1,
> 1) as s(id)),
> '"other_value"');
> 
> ERROR:  54001: stack depth limit exceeded
> HINT:  Increase the configuration parameter "max_stack_depth"
> (currently 2048kB), after ensuring the platform's stack depth limit is
> adequate.
> LOCATION:  check_stack_depth, postgres.c:3163
> Time: 17.143 ms
> 
> Is it something significant enough to worry about? Just to mention, in some
> other databases there is just a limit for number of nested levels for a
> document (e.g. in MongoDB Bson, MySQL binary json it's exactly 100).

I think our current behavior is the best we can do.  We are limited
only by configured stack memory, which people can increase.  I don't see
how we could improve it unless we reduced jsonb_set()'s stack memory
use.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Tomas Vondra
On 04/03/2018 04:37 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> Yes, that is a good point actually - we need to test that replication
>> between PG10 and PG11 works correctly, i.e. that the protocol version is
>> correctly negotiated, and features are disabled/enabled accordingly etc.
> 
> Maybe it'd be good to have a buildfarm animal to specifically test for
> that?
> 

Not sure a buildfarm supports running two clusters with different
versions easily?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Greg Stark
On 3 April 2018 at 14:36, Anthony Iliopoulos  wrote:

> If EIO persists between invocations until explicitly cleared, a process
> cannot possibly make any decision as to if it should clear the error

I still don't understand what "clear the error" means here. The writes
still haven't been written out. We don't care about tracking errors,
we just care whether all the writes to the file have been flushed to
disk. By "clear the error" you mean throw away the dirty pages and
revert part of the file to some old data? Why would anyone ever want
that?

> But instead of deconstructing and debating the semantics of the
> current mechanism, why not come up with the ideal desired form of
> error reporting/tracking granularity etc., and see how this may be
> fitted into kernels as a new interface.

Because Postgres is portable software that won't be able to use some
Linux-specific interface. And doesn't really need any granular error
reporting system anyways. It just needs to know when all writes have
been synced to disk.

-- 
greg



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Alvaro Herrera
Tomas Vondra wrote:

> Yes, that is a good point actually - we need to test that replication
> between PG10 and PG11 works correctly, i.e. that the protocol version is
> correctly negotiated, and features are disabled/enabled accordingly etc.

Maybe it'd be good to have a buildfarm animal to specifically test for
that?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Tomas Vondra


On 04/03/2018 04:07 PM, Stas Kelvich wrote:
> 
> 
>> On 3 Apr 2018, at 16:56, Tomas Vondra  wrote:
>>
>>
>> So I think we need a subscription parameter to enable/disable this,
>> defaulting to 'disabled’.
> 
> +1
> 
> Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected
> flags to be zero, so this way logical replication between postgres-10 and
> postgres-with-2pc-decoding will be broken. So ISTM it’s better to set
> LOGICALREP_IS_COMMIT to zero and change flags checking rules to accommodate 
> that.
> 

Yes, that is a good point actually - we need to test that replication
between PG10 and PG11 works correctly, i.e. that the protocol version is
correctly negotiated, and features are disabled/enabled accordingly etc.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Tom Lane
I wrote:
> Magnus Hagander  writes:
>> Unless.. %ld is the wrong thing to print:
>> static int64 total_checksum_failures;
>> We should perhaps be using something other than %ld to print that?

> INT64_FORMAT.

BTW, don't just stick INT64_FORMAT into the message-to-be-translated,
or you'll break things for translation.  Good practice is to sprintf
into a local char array with INT64_FORMAT, then include the number
into the displayed message with %s.  You can find examples easily
by grepping for INT64_FORMAT.
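
A minimal sketch of that pattern, applied to the counter from the patch
under discussion (illustrative, not the committed fix):

	char		buf[32];

	snprintf(buf, sizeof(buf), INT64_FORMAT, total_checksum_failures);
	ereport(WARNING,
			(errmsg("%s total checksum verification failures", buf)));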

regards, tom lane



Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Tom Lane
Magnus Hagander  writes:
> Unless.. %ld is the wrong thing to print:
> static int64 total_checksum_failures;
> We should perhaps be using something other than %ld to print that?

INT64_FORMAT.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Craig Ringer
On 3 April 2018 at 10:54, Robert Haas  wrote:


> I think it's always unreasonable to
> throw away the user's data.


Well, we do that. If a txn aborts, all writes in the txn are discarded.

I think that's perfectly reasonable. Though we also promise an all or
nothing effect, we make exceptions even there.

The FS doesn't offer transactional semantics, but the fsync behaviour can
be interpreted kind of similarly.

I don't *agree* with it, but I don't think it's as wholly unreasonable as
all that. I think leaving it undocumented is absolutely gobsmacking, and
it's dubious at best, but it's not totally insane.


> If the writes are going to fail, then let
> them keep on failing every time.


Like we do, where we require an explicit rollback.

But POSIX may pose issues there, it doesn't really define any interface for
that AFAIK. Unless you expect the app to close() and re-open() the file.
Replacing one nonstandard issue with another may not be a win.


> *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.
>

Yep. That's what I expected to happen on unrecoverable I/O errors. Because,
y'know, unrecoverable.

I was stunned to learn it's not so. And I'm even more amazed to learn that
ext4's errors=remount-ro apparently doesn't concern itself with mere user
data, and may exhibit the same behaviour - I need to rerun my test case on
it tomorrow.


> Also, this really does make it impossible to write reliable programs.
>

In the presence of multiple apps interacting on the same file, yes. I think
that's a little bit of a stretch though.

For a single app, you can recover by remembering and redoing all the writes
you did.

Sucks if your app wants to have multiple processes working together on a
file without some kind of journal or WAL, relying on fsync() alone, mind
you. But at least we have WAL.

Hrm. I wonder how this interacts with wal_level=minimal.


> Even leaving that aside, a PANIC means a prolonged outage on a
> prolonged system - it could easily take tens of minutes or longer to
> run recovery.  So saying "oh, just do that" is not really an answer.
> Sure, we can do it, but it's like trying to lose weight by
> intentionally eating a tapeworm.  Now, it's possible to shorten the
> checkpoint_timeout so that recovery runs faster, but then performance
> drops because data has to be fsync()'d more often instead of getting
> buffered in the OS cache for the maximum possible time.


It's also spikier. Users have more issues with latency with short, frequent
checkpoints.


>   We could also
> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds.  Then
> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page.  Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.
>

Our double-caching is already plenty bad enough anyway, as well.

(Ideally I want to be able to swap buffers between shared_buffers and the
OS buffer-cache. Almost like a 2nd level of buffer pinning. When we write
out a block, we *transfer* ownership to the OS.  Yeah, I'm dreaming. But
we'd sure need to be able to trust the OS not to just forget the block
then!)


> > The error reporting is thus consistent with the intended semantics (which
> > are sadly not properly documented). Repeated calls to fsync() simply do
> not
> > imply that the kernel will retry to writeback the previously-failed
> pages,
> > so the application needs to be aware of that. Persisting the error at the
> > fsync() level would essentially mean moving application policy into the
> > kernel.
>
> I might accept this argument if I accepted that it was OK to decide
> that an fsync() failure means you can forget that the write() ever
> happened in the first place, but it's hard to imagine an application
> that wants that behavior.  If the application didn't care about
> whether the bytes really got to disk or not, it would not have called
> fsync() in the first place.  If it does care, reporting the error only
> once is never an improvement.
>

Many RDBMSes do just that. It's hardly behaviour unique to the kernel. They
report an ERROR on a statement in a txn then go on with life, merrily
forgetting that anything was ever wrong.

I agree with PostgreSQL's stance that this is wrong. We require an explicit
rollback (or ROLLBACK TO SAVEPOINT) to  restore the session to a usable
state. This is good.

But we're the odd one out there. Almost everyone else does much like what
fsync() does on Linux, report the error and forget it.

In any case, we're not going to get anyone to backpatch a fix for this into
all kernels, so we're stuck working around it.

I'll do some testing with ENOSPC 

Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Magnus Hagander
On Tue, Apr 3, 2018 at 4:25 PM, Tom Lane  wrote:

> Magnus Hagander  writes:
> > Seems the tests are failing on prairiedog:
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prairiedog&dt=2018-04-03%2012%3A15%3A27&stg=pg_basebackup-check
> > I don't have time to dive in right now, but that seems interesting --
> it's
> > reporting the failures, but it's then reporting the total number of
> > failures as 0...
> > Note that prairedog is a PowerPC machine -- I bet that has something to
> do
> > with it.
>
> endianness issue?  I can look closer if you can point me to where to look.
>

I think the problem comes from:
	if (total_checksum_failures > 1)
		ereport(WARNING,
				(errmsg("%ld total checksum verification failures",
						total_checksum_failures)));

That one actually logs a zero in the text, which should not possibly ever
print "0 total checksum verification failures".

Unless.. %ld is the wrong thing to print:
static int64 total_checksum_failures;

We should perhaps be using something other than %ld to print that?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Tom Lane
Magnus Hagander  writes:
> Seems the tests are failing on prairiedog:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prairiedog&dt=2018-04-03%2012%3A15%3A27&stg=pg_basebackup-check
> I don't have time to dive in right now, but that seems interesting -- it's
> reporting the failures, but it's then reporting the total number of
> failures as 0...
> Note that prairedog is a PowerPC machine -- I bet that has something to do
> with it.

endianness issue?  I can look closer if you can point me to where to look.

regards, tom lane



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
On Tue, Apr 3, 2018 at 11:09 AM, Claudio Freire  wrote:
> On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire  
> wrote:
>> I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
>> patches a while ago just because I noticed they didn't apply anymore in 
>> cputube, and they still seem to apply.
>
> Sorry, that is false.
>
> They appear green in cputube, so I was confident they did apply, but I
> just double-checked on a recent pull and they don't. I'll rebase them
> shortly.


Ok, rebased patches attached
From 712aff9a856c2bae1d87555057e6546d029ecc47 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  72 ++-
 src/test/regress/sql/vacuum.sql  |  40 +++-
 3 files changed, 438 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index d2a0066..2f82dc6 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -101,6 +114,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -112,6 +133,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated). Keep it first to simplify
+		 * binary searches */
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer dt_tids;		/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dt_segments;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -132,14 +174,20 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is 
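
To illustrate the two-level lookup the patch comment describes (a binary
search over segments by their last stored TID, then a binary search within
the chosen segment), here is a standalone sketch with simplified stand-in
types; it is not the patch's actual code:

#include <stdbool.h>

/* Simplified stand-in for a dead-tuple segment */
typedef struct Segment
{
	long	   *tids;			/* sorted dead-tuple identifiers */
	int			ntids;
	long		last_tid;		/* == tids[ntids - 1] */
} Segment;

static bool
multiarray_contains(const Segment *segs, int nsegs, long key)
{
	int			lo = 0,
				hi = nsegs - 1,
				seg = -1;

	/* first level: find the first segment whose last_tid >= key */
	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (segs[mid].last_tid >= key)
		{
			seg = mid;
			hi = mid - 1;
		}
		else
			lo = mid + 1;
	}
	if (seg < 0)
		return false;			/* key is beyond every stored TID */

	/* second level: ordinary binary search inside that segment */
	lo = 0;
	hi = segs[seg].ntids - 1;
	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (segs[seg].tids[mid] == key)
			return true;
		if (segs[seg].tids[mid] < key)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return false;
}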

Re: new function for tsquery creartion

2018-04-03 Thread Aleksandr Parfenov
On Tue, 03 Apr 2018 14:28:37 +0300
Dmitry Ivanov  wrote:
> I'm sorry, I totally forgot to fix a few more things, the patch is 
> attached below.

The patch looks good to me except two things.

I'm not sure about the different result for these queries:
SELECT websearch_to_tsquery('simple', 'cat or ');
 websearch_to_tsquery 
----------------------
 'cat'
(1 row)
SELECT websearch_to_tsquery('simple', 'cat or');
 websearch_to_tsquery 
----------------------
 'cat' & 'or'
(1 row)

But I don't have strong opinion about these queries, since input in
both of them looks broken in terms of operator usage.


I found an odd behavior of the query creation function in case:
SELECT websearch_to_tsquery('english', '"pg_class pg"');
websearch_to_tsquery 
-----------------------------
 ( 'pg' & 'class' ) <-> 'pg'
(1 row)

This query means that lexemes 'pg' and 'class' should be at the same
distance from the last 'pg'. In other words, they should have the same
position. But the default parser will interpret pg_class as two separate
words during text parsing/vector creation.

The bug wasn't introduced in the patch and can be found in current
master. During the discussion of the patch with Dmitry, he noticed that
the to_tsquery() function shares the same odd behavior:
select to_tsquery('english', ' pg_class <-> pg');
 to_tsquery  
-----------------------------
 ( 'pg' & 'class' ) <-> 'pg'
(1 row)

This oddity is caused by the implementation of makepol. In makepol, each
token (parsed by the query parser) is sent to the FTS parser, and if the
token is split further, the operator selected in to_tsquery and friends is
used; so it doesn't change at runtime.

I see two different ways to solve it:
1) Use the same operator inside the parentheses. This would mean parsing
it as several parts of one word.
2) Remove the parentheses. This would mean parsing it as several separate
words.

I prefer the second way, since in some languages words can be separated
by a special symbol, or not separated by any symbol at all, and should be
extracted by a special FTS parser. It also allows us to parse such words
as one by using the special parser (as is done for hyphenated words).

But in the example with websearch_to_tsquery, I think it should use
the same operator for the quoted part of the query. For example, we can
update the operator in makepol before sending it to pushval
(pushval_morph) to do so.

It looks like there should be two separate patches, one for
websearch_to_tsquery and another one for fixing the odd behavior of the
query construction. But since the first one may depend on the bugfix to
solve the case with quotes, I will mark it as Waiting on Author.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire  wrote:
> I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
> patches a while ago just because I noticed they didn't apply anymore in 
> cputube, and they still seem to apply.

Sorry, that is false.

They appear green in cputube, so I was confident they did apply, but I
just double-checked on a recent pull and they don't. I'll rebase them
shortly.



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Stas Kelvich


> On 3 Apr 2018, at 16:56, Tomas Vondra  wrote:
> 
> 
> So I think we need a subscription parameter to enable/disable this,
> defaulting to 'disabled’.

+1

Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected
flags to be zero, so this way logical replication between postgres-10 and
postgres-with-2pc-decoding will be broken. So ISTM it’s better to set
LOGICALREP_IS_COMMIT to zero and change flags checking rules to accommodate 
that.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
patches a while ago just because I noticed they didn't apply anymore in 
cputube, and they still seem to apply.

Patch number 2 was committed a long while ago, that's why it's missing. It was 
a simple patch, it landed rewritten as commit 
7e26e02eec90370dd222f35f00042f8188488ac4

Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Magnus Hagander
On Tue, Apr 3, 2018 at 4:00 PM, Michael Banck 
wrote:

> Hi Magnus,
>
> On 03.04.2018 13:59, Magnus Hagander wrote:
>
> And of course I forgot that particular part in the first push, so I've
> pushed it as a separate commit.
>
>
> Many thanks for cleaning up the patch and committing it!
>
>
Seems the tests are failing on prairiedog:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prairiedog&dt=2018-04-03%2012%3A15%3A27&stg=pg_basebackup-check

I don't have time to dive in right now, but that seems interesting -- it's
reporting the failures, but it's then reporting the total number of
failures as 0...

Note that prairedog is a PowerPC machine -- I bet that has something to do
with it.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: WIP: Covering + unique indexes.

2018-04-03 Thread Alexander Korotkov
On Tue, Apr 3, 2018 at 7:02 AM, Peter Geoghegan  wrote:

> On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov
>  wrote:
> > I thought about that another time and I decided that it would be safer
> > to use the 13th bit in index tuple flags.  There is already an attempt to
> > use the whole 6 bytes of tid for non-heap-pointer information [1].  Thus, it
> > would be safe to use the 13th bit for indicating alternative offset meaning
> > in pivot tuples, because it wouldn't block further work.  Revised
> patchset
> > in the attachment implements it.
>
> This is definitely not the only time someone has talked about this
> 13th bit -- it's quite coveted. It also came up with UPSERT, and with
> WARM. That's just the cases that I can personally remember.
>
> I'm glad that you found a way to make this work, that will keep things
> flexible for future patches, and make testing easier. I think that we
> can find a flexible representation that makes almost everyone happy.
>

OK, good.

> I didn't have time to look at this properly today, but I will try to
> do so tomorrow.
>

Great, I'm looking forward to your feedback.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Michael Banck
Hi Magnus,

On 03.04.2018 13:59, Magnus Hagander wrote:

> And of course I forgot that particular part in the first push, so I've
> pushed it as a separate commit.

Many thanks for cleaning up the patch and committing it!


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Re: Transform for pl/perl

2018-04-03 Thread Peter Eisentraut
On 3/15/18 03:46, Pavel Stehule wrote:
> It looks well
> 
> the patch has tests and doc,
> there are not any warnings or compilation issues
> all tests passed
> 
> I'll mark this patch as ready for committer

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical decoding of two-phase transactions

2018-04-03 Thread Tomas Vondra
On 04/03/2018 12:40 PM, Nikhil Sontakke wrote:
> Hi,
> 
>>> It's certainly a nice future goal to have it all happen automatically,
>>> but we don't know what the plugin will do.
>>
>> No, fighting too complicated APIs is not unreasonable. And we've found
>> an alternative.
>>
> 
> PFA, latest patch set.
> 
> The LogicalLockTransaction/LogicalUnlockTransaction API implementation
> using decode groups now has proper cleanup handling in case there's an
> ERROR while holding the logical lock.
> 
> Rest of the patches are the same as yesterday.
> 

Unfortunately, this does segfault for me in `make check` almost
immediately. Try

./configure --enable-debug --enable-cassert CFLAGS="-O0 -ggdb3
-DRANDOMIZE_ALLOCATED_MEMORY" && make -s clean && make -s -j4 check

and you should get an assert failure right away. Examples of backtraces
attached, not sure what exactly is the issue.

Also, I get this compiler warning:

proc.c: In function ‘AssignDecodeGroupLeader’:
proc.c:1975:8: warning: variable ‘pid’ set but not used
[-Wunused-but-set-variable]
  int   pid;
^~~
All of PostgreSQL successfully made. Ready to install.

which suggests we don't really need the pid variable.

> Other than this, we would want to have pgoutput support for 2PC
> decoding to be made optional? In that case we could add an option to
> "CREATE SUBSCRIPTION". This will mean adding a new
> Anum_pg_subscription_subenable_twophase attribute to Subscription
> struct and related processing. Should we go down this route?
> 

I'd say yes, we need to make it opt-in (assuming we want pgoutput to
support the 2PC decoding at all).

The trouble is that while it may improve replication of two-phase
transactions, it may also require config changes on the subscriber (to
support enough prepared transactions) and furthermore the GID is going
to be copied to the subscriber.

Which means that if the publisher/subscriber (at the instance level) are
already part of the same 2PC transaction, it can't possibly proceed
because the subscriber won't be able to do PREPARE TRANSACTION.

So I think we need a subscription parameter to enable/disable this,
defaulting to 'disabled'.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
#0  0x74a570c468df in raise () from /lib64/libc.so.6
#1  0x74a570c484da in abort () from /lib64/libc.so.6
#2  0x00a14d33 in ExceptionalCondition (conditionName=0xc29608 
"!(MyProc->backendId != (-1))", errorType=0xc2878a "FailedAssertion", 
fileName=0xc28783 "lock.c", lineNumber=4249) at assert.c:54
#3  0x00893180 in VirtualXactLockTableCleanup () at lock.c:4249
#4  0x0088f113 in LockReleaseAll (lockmethodid=1, allLocks=false) at 
lock.c:2042
#5  0x00894c7f in ProcReleaseLocks (isCommit=true) at proc.c:786
#6  0x00a52950 in ResourceOwnerReleaseInternal (owner=0x121bf60, 
phase=RESOURCE_RELEASE_LOCKS, isCommit=true, isTopLevel=true) at resowner.c:564
#7  0x00a5276d in ResourceOwnerRelease (owner=0x121bf60, 
phase=RESOURCE_RELEASE_LOCKS, isCommit=true, isTopLevel=true) at resowner.c:480
#8  0x005401f3 in CommitTransaction () at xact.c:2103
#9  0x00540d76 in CommitTransactionCommand () at xact.c:2768
#10 0x00a29e12 in InitPostgres (in_dbname=0x1216748 "regression", 
dboid=0, username=0x1216728 "user", useroid=0, out_dbname=0x0) at 
postinit.c:1046
#11 0x008b027f in PostgresMain (argc=1, argv=0x12169c0, 
dbname=0x1216748 "regression", username=0x1216728 "user") at postgres.c:3778
#12 0x0080cb9f in BackendRun (port=0x120d580) at postmaster.c:4409
#13 0x0080c2fd in BackendStartup (port=0x120d580) at postmaster.c:4081
#14 0x008088af in ServerLoop () at postmaster.c:1754
#15 0x00807ed8 in PostmasterMain (argc=8, argv=0x11e6420) at 
postmaster.c:1362
#16 0x0073ad8c in main (argc=8, argv=0x11e6420) at main.c:228


#0  0x74a570c468df in raise () from /lib64/libc.so.6
#1  0x74a570c484da in abort () from /lib64/libc.so.6
#2  0x00a14d33 in ExceptionalCondition (conditionName=0xc25af8 
"!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))", 
errorType=0xc25ae8 "FailedAssertion", fileName=0xc25ab9 "procarray.c", 
lineNumber=408) at assert.c:54
#3  0x0087f166 in ProcArrayEndTransaction (proc=0x74a569e07700, 
latestXid=1376) at procarray.c:408
#4  0x00540186 in CommitTransaction () at xact.c:2061
#5  0x00540d76 in CommitTransactionCommand () at xact.c:2768
#6  0x008ae7fc in finish_xact_command () at postgres.c:2499
#7  0x008ac4cf in exec_simple_query (query_string=0x11eb9e8 "CREATE 
TABLE target (tid integer, balance integer);") at postgres.c:1146
#8  0x008b0798 in PostgresMain (argc=1, argv=0x12169a0, 
dbname=0x1216748 "regression", username=0x1216728 "user") at postgres.c:4149
#9  0x0080cb9f in BackendRun (port=0x120d580) at postmaster.c:4409
#10 

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-04-03 Thread Tom Lane
Heikki Linnakangas  writes:
> But let's go back to why we're considering this. The idea was to 
> optimize this block:
> ...
> One trick that we could do is to replace that with a 128-bit atomic 
> compare-and-swap instruction. Modern 64-bit Intel systems have that, 
> it's called CMPXCHG16B. Don't know about other architectures. An atomic 
> fetch-and-add, as envisioned in the comment above, would presumably be 
> better, but I suspect that a compare-and-swap would be good enough to 
> move the bottleneck elsewhere again.

+1 for taking a look at that.  A bit of experimentation shows that
recent gcc and clang can generate that instruction using
__sync_bool_compare_and_swap or __sync_val_compare_and_swap
on an __int128 value.
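
For reference, a standalone sketch of a 16-byte compare-and-swap through
that builtin (assumes x86-64 with cmpxchg16b; older gcc may need -mcx16,
and some toolchains route it through libatomic):

#include <stdio.h>

typedef unsigned __int128 uint128;

int
main(void)
{
	uint128		val = ((uint128) 0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;
	uint128		expected = val;
	uint128		newval = expected + 1;
	uint128		old;

	/* atomically replace val with newval iff it still equals expected */
	old = __sync_val_compare_and_swap(&val, expected, newval);

	printf("swap %s\n", old == expected ? "succeeded" : "failed");
	return 0;
}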

regards, tom lane



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-04-03 Thread Claudio Freire
On Tue, Apr 3, 2018 at 10:19 AM, Claudio Freire  wrote:
> On Thu, Mar 29, 2018 at 2:09 PM, Tom Lane  wrote:
>> Claudio Freire  writes:
>>> On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane  wrote:
 After 0001,
 there's no reason to assume that vacuum is particularly likely to get
 cancelled between having made cleanups and having updated the upper FSM
 levels.  (Maybe the odds are a bit more for the no-indexes case, but
 that doesn't seem like it's common enough to justify a special mechanism
 either.)
>>
>>> Why not?
>>
>>> Any kind of DDL (even those that don't rewrite the heap) would cancel
>>> autovacuum.
>>
>>> You might think DDL isn't common enough to worry about, but I've seen
>>> cases where regular reindex were required to keep index bloat in check
>>> (and were cron'd), and those cancel autovacuum all the time.
>>
>> If you've got a situation where every vacuum gets canceled partway
>> through, you've got bloat problems regardless, because the heap tuples are
>> never getting removed in the first place; worrying about whether the FSM
>> is up to date is pretty pointless.  The 0001 patch basically updates the
>> FSM as soon as it can after the tuples are actually deleted, so I think
>> we've made the window as small as practical, and I don't really see a need
>> to do extra work (and add substantial extra complexity) to deal with
>> missed cleanup a bit sooner.
>>
>> People who are dealing with this sort of scenario a lot might be well
>> advised to reduce autovacuum's maintenance_work_mem, so that the cleanup
>> cycles happen more often.  That's kind of costly in terms of the number
>> of index scans, but it reduces the amount of cleanup work that can be
>> lost to a cancel.
>>
>> (I'd also argue that a setup such as you describe is very possibly making
>> things worse not better.  Perhaps the 0001 patch will go some way towards
>> making it less necessary to do that.)
>
> Alright, so we just drop 2.

So, that's it then.

Thanks



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Anthony Iliopoulos
On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:
> On 3 April 2018 at 11:35, Anthony Iliopoulos  wrote:
> > Hi Robert,
> >
> > Fully agree, and the errseq_t fixes have dealt exactly with the issue
> > of making sure that the error is reported to all file descriptors that
> > *happen to be open at the time of error*. But I think one would have a
> > hard time defending a modification to the kernel where this is further
> > extended to cover cases where:
> >
> > process A does write() on some file offset which fails writeback,
> > fsync() gets EIO and exit()s.
> >
> > process B does write() on some other offset which succeeds writeback,
> > but fsync() gets EIO due to (uncleared) failures of earlier process.
> 
> 
> Surely that's exactly what process B would want? If it calls fsync and
> gets a success and later finds out that the file is corrupt and didn't
> match what was in memory it's not going to be happy.

You can't possibly make this assumption. Process B may be reading
and writing to completely disjoint regions from those of process A,
and as such not really caring about earlier failures, only wanting
to ensure its own writes go all the way through. But even if it did
care, the file interfaces make no transactional guarantees. Even
without fsync() there is nothing preventing process B from reading
dirty pages from process A, and based on their content proceed
to its own business and write/persist new data, while process A
further modifies the not-yet-flushed pages in-memory before flushing.
In this case you'd need explicit synchronization/locking between
the processes anyway, so why would fsync() be an exception?

> This seems like an attempt to co-opt fsync for a new and different
> purpose for which it's poorly designed. It's not an async error
> reporting mechanism for writes. It would be useless as that as any
> process could come along and open your file and eat the errors for
> writes you performed. An async error reporting mechanism would have to
> document which writes it was giving errors for and give you ways to
> control that.

The errseq_t fixes deal with that; errors will be reported to any
process that has an open fd, irrespective to who is the actual caller
of the fsync() that may have induced errors. This is anyway required
as the kernel may evict dirty pages on its own by doing writeback and
as such there needs to be a way to report errors on all open fds.

> The semantics described here are useless for everyone. For a program
> needing to know the error status of the writes it executed, it doesn't
> know which writes are included in which fsync call. For a program

If EIO persists between invocations until explicitly cleared, a process
cannot possibly make any decision as to if it should clear the error
and proceed or some other process will need to leverage that without
coordination, or which writes actually failed for that matter.
We would be back to the case of requiring explicit synchronization
between processes that care about this, in which case the processes
may as well synchronize over calling fsync() in the first place.

Having an opt-in persisting EIO per-fd would practically be a form
of "contract" between "cooperating" processes anyway.

But instead of deconstructing and debating the semantics of the
current mechanism, why not come up with the ideal desired form of
error reporting/tracking granularity etc., and see how this may be
fitted into kernels as a new interface.

Best regards,
Anthony



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-04-03 Thread Robert Haas
On Thu, Mar 29, 2018 at 5:18 PM, Tom Lane  wrote:
> Tomas Vondra  writes:
>> If each WAL record has xl_curr, then we know to which position the
>> record belongs (after verifying the checksum). And we do know the size
>> of each WAL record, so we should be able to deduce if two records are
>> immediately after each other.
>
> Per my point earlier, XLOG_SWITCH is sufficient to defeat that argument.
> Also consider a timeline fork.  It's really hard to be sure which record
> in the old timeline is the direct ancestor of the first one in the new
> if you lack xl_prev:
>
> A1 -> B1 -> C1 -> D1
> \
>   B2 -> C2 -> D2
>
> If you happened to get confused and think that C2 is the first in its
> timeline, diverging off the old line after B1 not A1, there would be
> nothing about C2 to disabuse you of your error.

But, as Simon correctly points out, if xl_prev is the only thing
that's saving us from disaster, that's rather fragile.  All it's
cross-checking is the length of the previous WAL record, and that
could easily match by accident: there aren't *that* many bits of
entropy in the length of a WAL record.  As he also points out, I think
also correctly, if we really want a strong check that the chain of WAL
records is continuous and unbroken, we ought to be including the CRC
from the previous WAL record, not just the length.
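
A sketch of what that chaining might look like (purely hypothetical field
layout for illustration; this is not PostgreSQL's actual XLogRecord):

typedef struct HypotheticalXLogRecordHeader
{
	uint32		xl_tot_len;		/* total length of this record */
	uint32		xl_prev_crc;	/* CRC of the previous record, so continuity
								 * can't be faked by a matching length alone */
	uint32		xl_crc;			/* CRC of this record's contents */
	/* ... remaining fields as today ... */
} HypotheticalXLogRecordHeader;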

Another thing that's bothering me is this: surely there could be (if
there isn't already) something *else* that tells us whether we've
switched timelines at the wrong place.  I mean, if nothing else, we
could dictate that the first WAL record after a timeline switch must
be XLOG_TIMELINE_SWITCH or something like that, and then if you try to
change timelines and don't find that record (with the correct previous
TLI), you know something's messed up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-04-03 Thread Claudio Freire
On Thu, Mar 29, 2018 at 7:55 PM, Tom Lane  wrote:
> I wrote:
>> I have to go do something
>> else right now, but I'll take a closer look at 0004 later.
>
> OK, so after studying 0004, it seems to me that we could do it more
> simply as attached; that is, move the IndexFreeSpaceMapVacuum calls
> into btvacuumscan/spgvacuumscan, do them only if we found any recyclable
> pages, and drop the calls in btvacuumcleanup/spgvacuumcleanup altogether.
>
> The reason I think this is sufficient is that the scans find and record
> every reusable page in the index, no matter whether they were recorded
> before or not.  Therefore, if we don't find any such pages, there's
> nothing useful in the FSM and no particular urgency about making its
> upper pages up-to-date.  It's true that if the FSM is actually corrupt,
> leaving that to be fixed retail by searches is probably less efficient
> than doing an IndexFreeSpaceMapVacuum call would be --- but *only* if
> you assume that the problem is just in the upper pages and the leaf
> pages are all fine.  That doesn't seem to be a case we should optimize
> for.

+1, LGTM.



Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-03 Thread Etsuro Fujita

(2018/04/03 13:59), Amit Langote wrote:

On 2018/04/02 21:29, Etsuro Fujita wrote:

(2018/04/02 18:49), Amit Langote wrote:

I looked at the new patch.  It looks good overall, although I have one
question -- IIUC, BeginForeignInsert() performs actions that are
equivalent of performing PlanForeignModify() + BeginForeignModify() for an
INSERT as if it was directly executed on a given foreign table/partition.
Did you face any problems in doing the latter itself in the core code?
Doing it that way would mean no changes to a FDW itself will be required
(and hence no need for additional APIs), but I may be missing something.


The biggest issue in performing PlanForeignModify() plus
BeginForeignModify() instead of the new FDW API would be: can the core
code create a valid-looking planner state passed to PlanForeignModify()
such as the ModifyTable plan node or the query tree stored in PlannerInfo?


Hmm, I can see the point.  Passing mostly-dummy (garbage) PlannerInfo and
query tree from the core code to FDW seems bad.  By defining the new API
with a clean interface (receiving fully valid ModifyTableState), we're not
required to do that across the interface, but only inside the FDW's
implementation.


I really think so too.


I was just slightly concerned that the new FDW function
would have to implement what would normally be carried out across multiple
special purpose callbacks, but maybe that's OK as long as it's clearly
documented what its job is.


OK


Noticed a couple of things in the patch:

+
+  When this is called by a COPY FROM command, the
+  plan-related global data in mtstate is not provided
+  and the planSlot parameter of
+  ExecForeignInsert called for each inserted tuple is

How about s/called/subsequently called/g?


Done.


+NULL, wether the foreign table is the partition
chosen

Typo: s/wether/whether/g


Done.

Thanks again, Amit!

Best regards,
Etsuro Fujita



Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-03 Thread Etsuro Fujita

(2018/04/03 13:32), Amit Langote wrote:

On 2018/04/02 21:26, Etsuro Fujita wrote:

We wouldn't need that for foreign partitions (When DO NOTHING with an
inference specification or DO UPDATE on a partitioned table containing
foreign partitions, the planner would throw an error before we get to
ExecInitPartitionInfo).


Actually, as you might know, since it is not possible to create an index
on a partitioned table that has a foreign partition, there is no
possibility of supporting any form of ON CONFLICT that requires an
inference specification.


Right.


The reason why I updated the patch that way is
just to make the patch simple, but on reflection I don't think that's a
good idea; I think we should delay the map-creation step as well as the
FDW-initialization step for UPDATE subplan partitions as before for
improved efficiency for UPDATE-of-partition-key.  However, I don't think
it'd still be a good idea to delay those steps for partitions created by
ExecInitPartitionInfo the same way as for the subplan partitions, because
in that case it'd be better to perform CheckValidResultRel against a
partition right after we do InitResultRelInfo for the partition in
ExecInitPartitionInfo, as that would save cycles in cases when the
partition is considered nonvalid by that check.


It seems like a good idea to perform CheckValidResultRel right after the
InitResultRelInfo in ExecInitPartitionInfo.  However, if the partition is
considered non-valid, that results in an error afaik, so I don't
understand the point about saving cycles.


I think that we could abort the query without doing the remaining work 
after the check in ExecInitPartitionInfo in that case.



So, What I'm thinking is:
a) for the subplan partitions, we delay those steps as before, and b) for
the partitions created by ExecInitPartitionInfo, we do that check for a
partition right after InitResultRelInfo in that function, (and if valid,
we create a map and initialize the FDW for the partition in that function).


Sounds good to me.  I'm assuming that you're talking about delaying the
is-valid-for-insert-routing check (using CheckValidResultRel) and
parent-to-child map creation for a sub-plan result relation until
ExecPrepareTupleRouting().


That's exactly what I have in mind.  I modified the patch that way.


On a related note, I wonder if it'd be a good idea to rename the flag
ri_PartitionIsValid to something that signifies that we only need it to be
true if we want to do tuple routing (aka insert) using it.  Maybe,
ri_PartitionReadyForRouting or ri_PartitionReadyForInsert.  I'm afraid
that ri_PartitionIsValid is a bit ambiguous, although I'm not saying the
name options I'm suggesting are the best.


That's a good idea!  I like the first one, so I changed the name to that.

Thanks for the review!

Attached is an updated version of the patch.  Patch 
foreign-routing-fdwapi-3.patch is created on top of patch 
postgres-fdw-refactoring-3.patch and the bug-fix patch [1].


Other changes:
* Fixed/revised docs as you pointed out in another post.
* Added docs to update.sgml
* Revised some code/comments a little bit
* Revised regression tests
* Rebased against the latest HEAD

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/5aba4074.1090...@lab.ntt.co.jp
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***
*** 376,387  static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
--- 376,396 
  static void create_cursor(ForeignScanState *node);
  static void fetch_more_data(ForeignScanState *node);
  static void close_cursor(PGconn *conn, unsigned int cursor_number);
+ static PgFdwModifyState *create_foreign_modify(EState *estate,
+ 	  ResultRelInfo *resultRelInfo,
+ 	  CmdType operation,
+ 	  Plan *subplan,
+ 	  char *query,
+ 	  List *target_attrs,
+ 	  bool has_returning,
+ 	  List *retrieved_attrs);
  static void prepare_foreign_modify(PgFdwModifyState *fmstate);
  static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
  		 ItemPointer tupleid,
  		 TupleTableSlot *slot);
  static void store_returning_result(PgFdwModifyState *fmstate,
  	   TupleTableSlot *slot, PGresult *res);
+ static void finish_foreign_modify(PgFdwModifyState *fmstate);
  static List *build_remote_returning(Index rtindex, Relation rel,
  	   List *returningList);
  static void rebuild_fdw_scan_tlist(ForeignScan *fscan, List *tlist);
***
*** 1681,1698  postgresBeginForeignModify(ModifyTableState *mtstate,
  		   int eflags)
  {
  	PgFdwModifyState *fmstate;
! 	EState	   *estate = mtstate->ps.state;
! 	CmdType		operation = mtstate->operation;
! 	Relation	rel = resultRelInfo->ri_RelationDesc;
! 	RangeTblEntry *rte;
! 	Oid			userid;
! 	ForeignTable *table;
! 	UserMapping *user;
! 	AttrNumber	n_params;
! 	Oid			typefnoid;
! 	bool		isvarlena;
! 	ListCell   *lc;
! 	TupleDesc	tupdesc = RelationGetDescr(rel);
  
  	/*
  	 

Re: Online enabling of checksums

2018-04-03 Thread Magnus Hagander
On Sun, Apr 1, 2018 at 2:04 PM, Magnus Hagander  wrote:

> On Sat, Mar 31, 2018 at 5:38 PM, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>> On 03/31/2018 05:05 PM, Magnus Hagander wrote:
>> > On Sat, Mar 31, 2018 at 4:21 PM, Tomas Vondra
>> > >
>> wrote:
>> >
>> > ...
>> >
>> > I do think just waiting for all running transactions to complete is
>> > fine, and it's not the first place where we use it - CREATE
>> SUBSCRIPTION
>> > does pretty much exactly the same thing (and CREATE INDEX
>> CONCURRENTLY
>> > too, to some extent). So we have a precedent / working code we can
>> copy.
>> >
>> >
>> > Thinking again, I don't think it should be done as part of
>> > BuildRelationList(). We should just do it once in the launcher before
>> > starting, that'll be both easier and cleaner. Anything started after
>> > that will have checksums on it, so we should be fine.
>> >
>> > PFA one that does this.
>> >
>>
>> Seems fine to me. I'd however log waitforxid, not the oldest one. If
>> you're a DBA and you want to make the checksumming to proceed, knowing
>> the oldest running XID is useless for that. If we log waitforxid, it can
>> be used to query pg_stat_activity and interrupt the sessions somehow.
>>
>
> Yeah, makes sense. Updated.
>
>
>
>> > > And if you try this with a temporary table (not hidden in
>> transaction,
>> > > so the bgworker can see it), the worker will fail with this:
>> > >
>> > >   ERROR:  cannot access temporary tables of other sessions
>> > >
>> > > But of course, this is just another way how to crash without
>> updating
>> > > the result for the launcher, so checksums may end up being
>> enabled
>> > > anyway.
>> > >
>> > >
>> > > Yeah, there will be plenty of side-effect issues from that
>> > > crash-with-wrong-status case. Fixing that will at least make
>> things
>> > > safer -- in that checksums won't be enabled when not put on all
>> pages.
>> > >
>> >
>> > Sure, the outcome with checksums enabled incorrectly is a
>> consequence of
>> > bogus status, and fixing that will prevent that. But that wasn't my
>> main
>> > point here - not articulated very clearly, though.
>> >
>> > The bigger question is how to handle temporary tables gracefully, so
>> > that it does not terminate the bgworker like this at all. This
>> might be
>> > even bigger issue than dropped relations, considering that temporary
>> > tables are pretty common part of applications (and it also includes
>> > CREATE/DROP).
>> >
>> > For some clusters it might mean the online checksum enabling would
>> > crash+restart infinitely (well, until reaching MAX_ATTEMPTS).
>> >
>> > Unfortunately, try_relation_open() won't fix this, as the error
>> comes
>> > from ReadBufferExtended. And it's not a matter of simply creating a
>> > ReadBuffer variant without that error check, because temporary
>> tables
>> > use local buffers.
>> >
>> > I wonder if we could just go and set the checksums anyway, ignoring
>> the
>> > local buffers. If the other session does some changes, it'll
>> overwrite
>> > our changes, this time with the correct checksums. But it seems
>> pretty
>> > dangerous (I mean, what if they're writing stuff while we're
>> updating
>> > the checksums? Considering the various short-cuts for temporary
>> tables,
>> > I suspect that would be a boon for race conditions.)
>> >
>> > Another option would be to do something similar to running
>> transactions,
>> > i.e. wait until all temporary tables (that we've seen at the
>> beginning)
>> > disappear. But we're starting to wait on more and more stuff.
>> >
>> > If we do this, we should clearly log which backends we're waiting
>> for,
>> > so that the admins can go and interrupt them manually.
>> >
>> >
>> >
>> > Yeah, waiting for all transactions at the beginning is pretty simple.
>> >
>> > Making the worker simply ignore temporary tables would also be easy.
>> >
>> > One of the bigger issues here is temporary tables are *session* scope
>> > and not transaction, so we'd actually need the other session to finish,
>> > not just the transaction.
>> >
>> > I guess what we could do is something like this:
>> >
>> > 1. Don't process temporary tables in the checksumworker, period.
>> > Instead, build a list of any temporary tables that existed when the
>> > worker started in this particular database (basically anything that we
>> > got in our scan). Once we have processed the complete database, keep
>> > re-scanning pg_class until those particular tables are gone (search by
>> oid).
>> >
>> > That means that any temporary tables that are created *while* we are
>> > processing a database are ignored, but they should already be receiving
>> > checksums.
>> >
>> > It definitely leads to a potential issue 

Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Magnus Hagander
On Tue, Apr 3, 2018 at 1:52 PM, Magnus Hagander  wrote:

>
>
> On Mon, Apr 2, 2018 at 2:48 PM, Michael Banck 
> wrote:
>
>> Hi!
>>
>> On Sun, Apr 01, 2018 at 05:27:21PM +0200, Magnus Hagander wrote:
>> > It might be a micro-optimisation, but ISTM that we shouldn't do the
>> > basename(palloc(fn)) in is_heap_file() unless we actually need it -- so
>> not
>> > at the top of the function. (And surely "." and ".." should not occur
>> here?
>> > I think that's a result of copy/paste from the online checksum patch?
>> >
>> > We also do exactly the same basename(palloc(fn)) in sendFile().  Can we
>> > find a way to reuse that duplication? Perhaps we want to split it into
>> the
>> > two pieces already out in sendFile() and pass it in as separate
>> parameters?
>>
>> I've done the latter now, although it looks a bit weird that the second
>> argument data is a subset of the first.  I couldn't find another way to
>> not have it done twice, though.
>>
>
> I agree, but I think it's still cleaner.
>
> On further look, there is actually no need to pstrdup() at all -- we never
> used the modified part of the string anyway, because we don't care about
> the oid (unlike pg_verify_checksums).
>
> So I adjusted the patch by that.
>
>
> > If not then this second one should at least only be done inside the if
>> > (verify_checksums).
>>
>> We can't have both, as we need to call the is_heap_file() function in
>> order to determine whether we should verify the checksums.
>>
>
> Right. I realize that -- thus the "if not". But I guess I was not clear in
> what I meant -- see attached file for it.
>
>
> > There is a bigger problem next to that though -- I believe  basename()
>> does
>> > not exist on Win32. I haven't tested it, but there is zero
>> documentation of
>> > it existing, which usually indicates it doesn't. That's the part that
>> > definitely needs to get fixed.
>> >
>> > I think you need to look into the functionality in port/path.c, in
>> > particular last_dir_separator()?
>>
>> Thanks for the pointer, I've used that now; I mentioned before that
>> basename() might be a portability hazard, but couldn't find a good
>> substitute myself.
>>
>
> Yeah, I have a recollection somewhere of running into this before, but I
> couldn't find any references. But the complete lack of docs about it on
> msdn.microsoft.com is a clear red flag :)
>
>
>
>>
>> V6 of the patch is attached.
>>
>>
> Excellent. I've done some mangling on it:
>
> * Changed the is_heap_file to is_checksummed_file (and the associated
> struct name), because this is really what it's about (we for example verify
> checksums on indexes, which are clearly not heaps)
> * Moved the list of files to the top of the file next to the other lists
> of files/directories
> * Added missing function prototype at the top, and changed the parameter
> names to be a bit more clear
> * Added some comments
> * Changed the logic around the checksum-check to avoid the pstrdup() and
> to not call the path functions unless necessary (per comment above)
> * "filen" -> "file" in message
> * xlog.h does not need to be included
> * pgindent
>
> Remaining question:
>
> The check for (cnt % BLCKSZ != 0) currently does "continue", which means
> that this block of data isn't actually sent to the client at all, which
> seems completely wrong. We only want to prevent checksum validations.
>
> I have moved the check up a bit, and refactored it so it continues to do
> the actual transmission of the file if this path is hit.
>
>
And of course I forgot that particular part in the first push, so I've
pushed it as a separate commit.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [PROPOSAL] Shared Ispell dictionaries

2018-04-03 Thread Arthur Zakirov
On Thu, Mar 29, 2018 at 02:03:07AM +0300, Arthur Zakirov wrote:
> Here is the new version of the patch.

Please find the attached new version of the patch.

I removed refcnt because it is useless, it doesn't guarantee that a hash
table entry will be removed.

I fixed a bug, dsm_unpin_segment() can be called twice if a transaction
which called it was aborted and another transaction calls
ts_dict_shmem_release(). I added segment_ispinned to fix it.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index b9fdd77e19..e071994523 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -78,6 +78,8 @@
 #define tmpalloc(sz)  MemoryContextAlloc(Conf->buildCxt, (sz))
 #define tmpalloc0(sz)  MemoryContextAllocZero(Conf->buildCxt, (sz))
 
+#define tmpstrdup(str) MemoryContextStrdup(Conf->buildCxt, (str))
+
 /*
  * Prepare for constructing an ISpell dictionary.
  *
@@ -498,7 +500,7 @@ NIAddSpell(IspellDict *Conf, const char *word, const char 
*flag)
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + 
strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0')
-   ? cpstrdup(Conf, flag) : VoidString;
+   ? tmpstrdup(flag) : VoidString;
Conf->nspell++;
 }
 
@@ -1040,7 +1042,7 @@ setCompoundAffixFlagValue(IspellDict *Conf, 
CompoundAffixFlag *entry,
entry->flag.i = i;
}
else
-   entry->flag.s = cpstrdup(Conf, s);
+   entry->flag.s = tmpstrdup(s);
 
entry->flagMode = Conf->flagMode;
entry->value = val;
@@ -1536,6 +1538,9 @@ nextline:
return;
 
 isnewformat:
+   pfree(recoded);
+   pfree(pstr);
+
if (oldformat)
ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/contrib/dict_int/dict_int.c b/contrib/dict_int/dict_int.c
index 56ede37089..8dd4959028 100644
--- a/contrib/dict_int/dict_int.c
+++ b/contrib/dict_int/dict_int.c
@@ -30,7 +30,7 @@ PG_FUNCTION_INFO_V1(dintdict_lexize);
 Datum
 dintdict_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
DictInt*d;
ListCell   *l;
 
@@ -38,7 +38,7 @@ dintdict_init(PG_FUNCTION_ARGS)
d->maxlen = 6;
d->rejectlong = false;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c
index a79ece240c..0b8a32d459 100644
--- a/contrib/dict_xsyn/dict_xsyn.c
+++ b/contrib/dict_xsyn/dict_xsyn.c
@@ -140,7 +140,7 @@ read_dictionary(DictSyn *d, const char *filename)
 Datum
 dxsyn_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
DictSyn*d;
ListCell   *l;
char   *filename = NULL;
@@ -153,7 +153,7 @@ dxsyn_init(PG_FUNCTION_ARGS)
d->matchsynonyms = false;
d->keepsynonyms = true;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index 247c202755..2a2fbee5fa 100644
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -267,12 +267,12 @@ PG_FUNCTION_INFO_V1(unaccent_init);
 Datum
 unaccent_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
TrieChar   *rootTrie = NULL;
boolfileloaded = false;
ListCell   *l;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/src/backend/commands/tsearchcmds.c 
b/src/backend/commands/tsearchcmds.c
index 3a843512d1..3753e32b2c 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -386,17 +386,25 @@ verify_dictoptions(Oid tmplId, List *dictoptions)
}
else
{
+   DictInitData init_data;
+
/*
 * Copy the options just in case init method thinks it can 
scribble on
 * them ...
 */
dictoptions = copyObject(dictoptions);
 
+   init_data.dict_options = dictoptions;
+   init_data.dict.id = InvalidOid;
+   init_data.dict.xmin = InvalidTransactionId;
+   init_data.dict.xmax = InvalidTransactionId;
+   ItemPointerSetInvalid(&init_data.dict.tid);
+
/*
 

Re: [PATCH] Verify Checksums during Basebackups

2018-04-03 Thread Magnus Hagander
On Mon, Apr 2, 2018 at 2:48 PM, Michael Banck 
wrote:

> Hi!
>
> On Sun, Apr 01, 2018 at 05:27:21PM +0200, Magnus Hagander wrote:
> > It might be a micro-optimisation, but ISTM that we shouldn't do the
> > basename(palloc(fn)) in is_heap_file() unless we actually need it -- so
> not
> > at the top of the function. (And surely "." and ".." should not occur
> here?
> > I think that's a result of copy/paste from the online checksum patch?
> >
> > We also do exactly the same basename(palloc(fn)) in sendFile().  Can we
> > find a way to reuse that duplication? Perhaps we want to split it into
> the
> > two pieces already out in sendFile() and pass it in as separate
> parameters?
>
> I've done the latter now, although it looks a bit weird that the second
> argument data is a subset of the first.  I couldn't find another way to
> not have it done twice, though.
>

I agree, but I think it's still cleaner.

On further look, there is actually no need to pstrdup() at all -- we never
used the modified part of the string anyway, because we don't care about
the oid (unlike pg_verify_checksums).

So I adjusted the patch by that.


> If not then this second one should at least only be done inside the if
> > (verify_checksums).
>
> We can't have both, as we need to call the is_heap_file() function in
> order to determine whether we should verify the checksums.
>

Right. I realize that -- thus the "if not". But I guess I was not clear in
what I meant -- see attached file for it.


> There is a bigger problem next to that though -- I believe  basename()
> does
> > not exist on Win32. I haven't tested it, but there is zero documentation
> of
> > it existing, which usually indicates it doesn't. That's the part that
> > definitely needs to get fixed.
> >
> > I think you need to look into the functionality in port/path.c, in
> > particular last_dir_separator()?
>
> Thanks for the pointer, I've used that now; I mentioned before that
> basename() might be a portability hazard, but couldn't find a good
> substitute myself.
>

Yeah, I have a recollection somewhere of running into this before, but I
couldn't find any references. But the complete lack of docs about it on
msdn.microsoft.com is a clear red flag :)
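
For anyone else running into the same portability issue, the port/path.c
approach boils down to something like this (a sketch only, not the
committed code):

#include "c.h"          /* also brings in port.h / last_dir_separator() */

/* Return the file-name portion of a path, portably. */
static const char *
path_filename(const char *path)
{
    const char *sep = last_dir_separator(path);

    return sep ? sep + 1 : path;
}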



>
> V6 of the patch is attached.
>
>
Excellent. I've done some mangling on it:

* Changed the is_heap_file to is_checksummed_file (and the associated
struct name), because this is really what it's about (we for example verify
checksums on indexes, which are clearly not heaps)
* Moved the list of files to the top of the file next to the other lists of
files/directories
* Added missing function prototype at the top, and changed the parameter
names to be a bit more clear
* Added some comments
* Changed the logic around the checksum-check to avoid the pstrdup() and to
not call the path functions unless necessary (per comment above)
* "filen" -> "file" in message
* xlog.h does not need to be included
* pgindent

Remaining question:

The check for (cnt % BLCKSZ != 0) currently does "continue", which means
that this block of data isn't actually sent to the client at all, which
seems completely wrong. We only want to prevent checksum validations.

I have moved the check up a bit, and refactored it so it continues to do
the actual transmission of the file if this path is hit.
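
In other words, the control flow now looks roughly like this (a sketch of
the intent rather than the committed code; the variables are those of
sendFile()):

if (verify_checksum && (cnt % BLCKSZ != 0))
{
    ereport(WARNING,
            (errmsg("cannot verify checksum in file \"%s\": read size %d "
                    "is not a multiple of the block size %d",
                    readfilename, (int) cnt, BLCKSZ)));
    verify_checksum = false;    /* stop verifying, but keep sending data */
}

if (verify_checksum)
{
    /* ... per-page checksum verification of buf ... */
}

/* the chunk is transmitted whether or not it was verified */
if (pq_putmessage('d', buf, cnt))
    ereport(ERROR,
            (errmsg("base backup could not send data, aborting backup")));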

I have pushed an updated patch with those changes. Please review the result
and let me know if I broke something :)


Thanks!!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: new function for tsquery creation

2018-04-03 Thread Dmitry Ivanov

The code in its current state looks messy and way too complicated;
there're lots of interleaving code branches. Thus, I decided to split
gettoken_query() into three independent tokenizers for phrase, web and
original (to_tsquery()) syntaxes. Documentation is included. Any
feedback is very welcome.


I'm sorry, I totally forgot to fix a few more things, the patch is 
attached below.


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5abb1c46fb..c3b7be6e4e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9609,6 +9609,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 phraseto_tsquery('english', 'The Fat Rats')
 'fat' - 'rat'

+   
+
+ 
+  websearch_to_tsquery
+ 
+  websearch_to_tsquery( config regconfig ,  query text)
+ 
+tsquery
+produce tsquery from a web search style query
+websearch_to_tsquery('english', '"fat rat" or rat')
+'fat' - 'rat' | 'rat'
+   

 
  
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..19f58511c8 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =

 PostgreSQL provides the
 functions to_tsquery,
-plainto_tsquery, and
-phraseto_tsquery
+plainto_tsquery,
+phraseto_tsquery and
+websearch_to_tsquery
 for converting a query to the tsquery data type.
 to_tsquery offers access to more features
 than either plainto_tsquery or
-phraseto_tsquery, but it is less forgiving
-about its input.
+phraseto_tsquery, but it is less forgiving about its
+input. websearch_to_tsquery is a simplified version
+of to_tsquery with an alternative syntax, similar
+to the one used by web search engines.

 

@@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat  Rats:C');
 

 
+
+websearch_to_tsquery( config regconfig,  querytext text) returns tsquery
+
+
+   
+websearch_to_tsquery creates a tsquery
+value from querytext using an alternative
+syntax in which simple unformatted text is a valid query.
+Unlike plainto_tsquery
+and phraseto_tsquery, it also recognizes certain
+operators. Moreover, this function should never raise syntax errors,
+which makes it possible to use raw user-supplied input for search.
+The following syntax is supported:
+
+ 
+   
+unquoted text: text not inside quote marks will be
+converted to terms separated by  operators, as
+if processed by
+plainto_tsquery.
+  
+ 
+ 
+   
+"quoted text": text inside quote marks will be
+converted to terms separated by -
+operators, as if processed by phraseto_tsquery.
+  
+ 
+ 
+  
+   OR: logical or will be converted to
+   the | operator.
+  
+ 
+ 
+  
+   -: the logical not operator, converted to the
+   the ! operator.
+  
+ 
+
+   
+   
+Examples:
+
+  select websearch_to_tsquery('english', 'The fat rats');
+   websearch_to_tsquery
+  -
+   'fat'  'rat'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '"supernovae stars" -crab');
+ websearch_to_tsquery
+  --
+   'supernova' - 'star'  !'crab'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '"sad cat" or "fat rat"');
+ websearch_to_tsquery
+  ---
+   'sad' - 'cat' | 'fat' - 'rat'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', 'signal -"segmentation fault"');
+   websearch_to_tsquery
+  ---
+   'signal'  !( 'segment' - 'fault' )
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '""" )( dummy \\ query -');
+   websearch_to_tsquery
+  --
+   'dummi'  'queri'
+  (1 row)
+
+
   
 
   
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..6055fb6b4e 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
 	query = parse_tsquery(text_to_cstring(in),
 		  pushval_morph,
 		  PointerGetDatum(&data),
-		  false);
+		  0);
 
 	PG_RETURN_TSQUERY(query);
 }
@@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
 	query = parse_tsquery(text_to_cstring(in),
 		  pushval_morph,
 		  PointerGetDatum(&data),
-		  true);
+		  P_TSQ_PLAIN);
 
 	PG_RETURN_POINTER(query);
 }
@@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
 	query = 

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Greg Stark
On 3 April 2018 at 11:35, Anthony Iliopoulos  wrote:
> Hi Robert,
>
> Fully agree, and the errseq_t fixes have dealt exactly with the issue
> of making sure that the error is reported to all file descriptors that
> *happen to be open at the time of error*. But I think one would have a
> hard time defending a modification to the kernel where this is further
> extended to cover cases where:
>
> process A does write() on some file offset which fails writeback,
> fsync() gets EIO and exit()s.
>
> process B does write() on some other offset which succeeds writeback,
> but fsync() gets EIO due to (uncleared) failures of earlier process.


Surely that's exactly what process B would want? If it calls fsync and
gets a success and later finds out that the file is corrupt and didn't
match what was in memory it's not going to be happy.

This seems like an attempt to co-opt fsync for a new and different
purpose for which it's poorly designed. It's not an async error
reporting mechanism for writes. It would be useless for that, as any
process could come along and open your file and eat the errors for
writes you performed. An async error reporting mechanism would have to
document which writes it was giving errors for and give you ways to
control that.
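
To make the error-eating hazard concrete (a sketch, with error checking
omitted):

#include <fcntl.h>
#include <unistd.h>

/* An unrelated process opening a data file and fsyncing it. */
static void
scrub_file(const char *path)
{
    int         fd = open(path, O_RDONLY);

    (void) fsync(fd);   /* may consume and clear the pending writeback error */
    close(fd);
    /* the original writer's later fsync() of the same file then reports success */
}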

The semantics described here are useless for everyone. For a program
needing to know the error status of the writes it executed, it doesn't
know which writes are included in which fsync call. For a program
using fsync for its original intended purpose of guaranteeing that all
writes are synced to disk, it no longer has any guarantee at all.


> This would be a highly user-visible change of semantics from edge-
> triggered to level-triggered behavior.

It was always documented as level-triggered. This edge-triggered
concept is a complete surprise to application writers.

-- 
greg



Re: new function for tsquery creation

2018-04-03 Thread Dmitry Ivanov

Hi everyone,

The code in its current state looks messy and way too complicated; 
there're lots of interleaving code branches. Thus, I decided to split 
gettoken_query() into three independent tokenizers for phrase, web and 
original (to_tsquery()) syntaxes. Documentation is included. Any 
feedback is very welcome.
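
To give an idea of the shape of the split (the enum and helper names below
are illustrative assumptions, not necessarily the identifiers used in the
patch):

typedef enum
{
    QUERY_SYNTAX_STANDARD,      /* to_tsquery() */
    QUERY_SYNTAX_PHRASE,        /* plainto_tsquery() / phraseto_tsquery() */
    QUERY_SYNTAX_WEB            /* websearch_to_tsquery() */
} QuerySyntax;

static ts_tokentype
gettoken_query(TSQueryParserState state, QuerySyntax syntax)
{
    switch (syntax)
    {
        case QUERY_SYNTAX_PHRASE:
            return gettoken_query_plain(state);
        case QUERY_SYNTAX_WEB:
            return gettoken_query_websearch(state);
        case QUERY_SYNTAX_STANDARD:
        default:
            return gettoken_query_standard(state);
    }
}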


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5abb1c46fb..c3b7be6e4e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9609,6 +9609,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 phraseto_tsquery('english', 'The Fat Rats')
 'fat' - 'rat'

+   
+
+ 
+  websearch_to_tsquery
+ 
+  websearch_to_tsquery( config regconfig ,  query text)
+ 
+tsquery
+produce tsquery from a web search style query
+websearch_to_tsquery('english', '"fat rat" or rat')
+'fat' - 'rat' | 'rat'
+   

 
  
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..19f58511c8 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =

 PostgreSQL provides the
 functions to_tsquery,
-plainto_tsquery, and
-phraseto_tsquery
+plainto_tsquery,
+phraseto_tsquery and
+websearch_to_tsquery
 for converting a query to the tsquery data type.
 to_tsquery offers access to more features
 than either plainto_tsquery or
-phraseto_tsquery, but it is less forgiving
-about its input.
+phraseto_tsquery, but it is less forgiving about its
+input. websearch_to_tsquery is a simplified version
+of to_tsquery with an alternative syntax, similar
+to the one used by web search engines.

 

@@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat  Rats:C');
 

 
+
+websearch_to_tsquery( config regconfig,  querytext text) returns tsquery
+
+
+   
+websearch_to_tsquery creates a tsquery
+value from querytext using an alternative
+syntax in which simple unformatted text is a valid query.
+Unlike plainto_tsquery
+and phraseto_tsquery, it also recognizes certain
+operators. Moreover, this function should never raise syntax errors,
+which makes it possible to use raw user-supplied input for search.
+The following syntax is supported:
+
+ 
+   
+unquoted text: text not inside quote marks will be
+converted to terms separated by  operators, as
+if processed by
+plainto_tsquery.
+  
+ 
+ 
+   
+"quoted text": text inside quote marks will be
+converted to terms separated by -
+operators, as if processed by phraseto_tsquery.
+  
+ 
+ 
+  
+   OR: logical or will be converted to
+   the | operator.
+  
+ 
+ 
+  
+   -: the logical not operator, converted to the
+   the ! operator.
+  
+ 
+
+   
+   
+Examples:
+
+  select websearch_to_tsquery('english', 'The fat rats');
+   websearch_to_tsquery
+  -
+   'fat'  'rat'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '"supernovae stars" -crab');
+ websearch_to_tsquery
+  --
+   'supernova' - 'star'  !'crab'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '"sad cat" or "fat rat"');
+ websearch_to_tsquery
+  ---
+   'sad' - 'cat' | 'fat' - 'rat'
+  (1 row)
+
+
+  select websearch_to_tsquery('english', 'signal -"segmentation fault"');
+   websearch_to_tsquery
+  ---
+   'signal'  !( 'segment' - 'fault' )
+  (1 row)
+
+
+  select websearch_to_tsquery('english', '""" )( dummy \\ query -');
+   websearch_to_tsquery
+  --
+   'dummi'  'queri'
+  (1 row)
+
+
   
 
   
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..6055fb6b4e 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
 	query = parse_tsquery(text_to_cstring(in),
 		  pushval_morph,
 		  PointerGetDatum(&data),
-		  false);
+		  0);
 
 	PG_RETURN_TSQUERY(query);
 }
@@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
 	query = parse_tsquery(text_to_cstring(in),
 		  pushval_morph,
 		  PointerGetDatum(&data),
-		  true);
+		  P_TSQ_PLAIN);
 
 	PG_RETURN_POINTER(query);
 }
@@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
 	query = parse_tsquery(text_to_cstring(in),
 		  pushval_morph,
 		  PointerGetDatum(&data),
-		  

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Anthony Iliopoulos
Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:
> On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos  wrote:
> > Given precisely that the dirty pages which cannot been written-out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an error
> > (assuming no other errors occurred in the meantime).
> 
> Like other people here, I think this is 100% unreasonable, starting
> with "the dirty pages which cannot been written out are practically
> thrown away".  Who decided that was OK, and on the basis of what
> wording in what specification?  I think it's always unreasonable to

If you insist on strict conformance to POSIX, indeed the linux
glibc configuration and associated manpage are probably wrong in
stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
matches the flexibility allowed by not supporting SIO.
There's a long history of brokenness between linux and posix,
and I think there was never an intention of conforming to the
standard.

> throw away the user's data.  If the writes are going to fail, then let
> them keep on failing every time.  *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.

I believe (as I tried to explain earlier) there is a certain assumption
being made that the writer and original owner of the data is responsible
for dealing with potential errors in order to avoid data loss (which
should be only of interest to the original writer anyway). It would
be very questionable for the interface to persist the error while
subsequent writes and fsyncs to different offsets may as well go through.
Another process may need to write into the file and fsync; being
unaware of those newly introduced semantics, it is now faced with EIO
because some unrelated previous process failed some earlier writes
and did not bother to clear the error for those writes. In a similar
scenario where the second process is aware of the new semantics, it would
naturally go ahead and clear the global error in order to proceed
with its own write()+fsync(), which would essentially amount to the
same problematic semantics you have now.

> Also, this really does make it impossible to write reliable programs.
> Imagine that, while the server is running, somebody runs a program
> which opens a file in the data directory, calls fsync() on it, and
> closes it.  If the fsync() fails, postgres is now borked and has no
> way of being aware of the problem.  If we knew, we could PANIC, but
> we'll never find out, because the unrelated process ate the error.
> This is exactly the sort of ill-considered behavior that makes fcntl()
> locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the issue
of making sure that the error is reported to all file descriptors that
*happen to be open at the time of error*. But I think one would have a
hard time defending a modification to the kernel where this is further
extended to cover cases where:

process A does write() on some file offset which fails writeback,
fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback,
but fsync() gets EIO due to (uncleared) failures of earlier process.

This would be a highly user-visible change of semantics from edge-
triggered to level-triggered behavior.
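
Spelled out as code, the two-process sequence above looks roughly like
this (a sketch with error handling elided; whether B sees EIO is exactly
the semantic question under discussion):

#include <fcntl.h>
#include <unistd.h>

static char bufA[8192], bufB[8192];

static void
process_a(const char *path)
{
    int         fd = open(path, O_WRONLY);

    pwrite(fd, bufA, sizeof(bufA), 0);      /* this offset later fails writeback */
    fsync(fd);                              /* reports EIO once, then the process exits */
    close(fd);
}

static void
process_b(const char *path)
{
    int         fd = open(path, O_WRONLY);

    pwrite(fd, bufB, sizeof(bufB), 8192);   /* unrelated offset, writeback succeeds */
    fsync(fd);  /* edge-triggered: success; level-triggered: EIO from A's failure */
    close(fd);
}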

> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds.  Then

That's the only way to think about fsync() guarantees unless you
are on a kernel that keeps retrying to persist dirty pages. Assuming
such a model, after repeated and unrecoverable hard failures the
process would have to explicitly inform the kernel to drop the dirty
pages. All the process could do at that point is read back to userspace
the dirty/failed pages and attempt to rewrite them at a different place
(which is current possible too). Most applications would not bother
though to inform the kernel and drop the permanently failed pages;
and thus someone eventually would hit the case that a large amount
of failed writeback pages are running his server out of memory,
at which point people will complain that those semantics are completely
unreasonable.

> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page.  Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.

Not sure I understand this case. The application may indeed re-write
a bunch of pages that have failed and proceed with fsync(). The kernel
will deal with combining the writeback of all the 

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-03 Thread Amit Langote
On 2018/04/03 14:45, Kyotaro HORIGUCHI wrote:
> Hello.
> 
> At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera  
> wrote:
>> Why do we need AccessExclusiveLock on all children of a relation that we
>> want to scan to search for rows not satisfying the constraint?  I think
>> it should be enough to get ShareLock, which prevents INSERT, no?  I have
>> a feeling I'm missing something here, but I don't know what, and all
>> tests pass with that change.

Thinking on this a bit, I see no problem with locking the children with
just ShareLock.  It was just paranoia that if we're going to lock the
table itself being attached with AEL, we must lock its children (if any)
with AEL, too.

> Mmm. I'm not sure if there's a lock-upgrade case but the
> following sentence is left at the last of one of the modified
> comments. ATRewriteTables is called with AEL after that (that has
> finally no effect in this case).
> 
> |   But we cannot risk a deadlock by taking
> | * a weaker lock now and the stronger one only when needed.
> 
> I don't actually find places where the children's lock can be
> raised but this seems just following the lock parent first
> principle.

No lock upgrade happen as of now.  The comment was added by the commit
972b6ec20bf [1], which removed the code that could cause such a deadlock.
The comment fragment is simply trying to explain why we don't postpone the
locking of children to a later time, say, to the point where we actually
know that they need to be scanned.  Previously the code next to the
comment used to lock the children using AccessShareLock, because at that
point we just needed to check if the table being attached is one of those
children; they were locked with AEL later only if it turned out that they
needed to be scanned to check the partition constraint.

> By the way check_default_allows_bound() (CREATE TABLE case)
> contains a similar usage of find_all_inheritors(default_rel,
> AEL).

Good catch.  Its job is more or less the same as
ValidatePartitionConstraints(), except that all the children (if any) are
scanned right away instead of queueing the work as in the AT case.

>> Also: the change proposed to remove the get_default_oid_from_partdesc()
>> call actually fixes the bug, but to me it was not at all obvious why.
> 
> CommandCounterIncrement updates the content of a relcache entry
> via invalidation. It can be surprising for callers of a function
> like StorePartitionBound.
> 
> CommandCounterIncrement
>  
>RelationCacheInvalidateEntry
>  RelationFlushRelation
>RelationClearRelation

Because of the CCI() after storing the default partition OID into the
system catalog, RelationClearRelation() would change what
rel->rd_partdesc points to, where 'rel' is the ATExecAttachPartition()'s
reference to the relcache entry of the parent that it passed to
StorePartitionBound().

So, whereas the rel->rd_partdesc wouldn't contain the default partition
before StorePartitionBound() was called, it would after.
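
To make that concrete, the hazard is essentially the following (an
illustration, not the actual code):

PartitionDesc before = RelationGetPartitionDesc(rel);  /* no default partition yet */

StorePartitionBound(attachrel, rel, cmd->bound);
CommandCounterIncrement();             /* the CCI being discussed */

/*
 * The invalidation delivered by the CCI rebuilds the parent's relcache
 * entry, so re-reading the descriptor can give a different answer: it now
 * knows about the partition that was just attached.
 */
PartitionDesc after = RelationGetPartitionDesc(rel);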

>> To figure out why, I first had to realize that
>> ValidatePartitionConstraints was lying to me, both in name and in
>> comments: it doesn't actually validate anything, it merely queues
>> entries so that alter table's phase 3 would do the validation.  I found
>> this extremely confusing, so I fixed the comments to match reality, and
>> later decided to rename the function also.
> 
> It is reasonable. Removing excessive extension of lower-level
> partitions is also reasonable. The previous code extended
> inheritances in a different manner for levels at odd and even
> depth.

I like the new code including the naming, but I notice this changes the
order in which we do the locking now.  There are still sites in the code
where the locking order is breadth-first, that is, as determined by
find_all_inheritors(), as this function previously did, too.

Also note that besides the breadth-first -> depth-first change, this also
changes the locking order of partitions for a given partitioned table.
The OIDs in partdesc->oids[] are canonically ordered (that is, the order
of their partition bounds), whereas find_inheritance_children(), which is
called by find_all_inheritors(), would lock them in the order in which the
individual OIDs were found in the system catalog.

Not sure if there is anything to be alarmed about here, but in all previous
discussions, this has been a thing to avoid.
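
For reference, the two orderings being contrasted look like this (sketch
only):

/* (a) breadth-first, catalog order: what find_all_inheritors() gives us */
List       *inhOIDs = find_all_inheritors(RelationGetRelid(rel), ShareLock, NULL);

/* (b) per-level, partition-bound order: walking the partition descriptor */
PartitionDesc partdesc = RelationGetPartitionDesc(rel);
int         i;

for (i = 0; i < partdesc->nparts; i++)
    LockRelationOid(partdesc->oids[i], ShareLock);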

>> At that point I was able to understand what the problem was: when
>> attaching the default partition, we were setting the constraint to be
>> validated for that partition to the correctly computed partition
>> constraint; and later in the same function we would set the constraint
>> to "the immature constraint" because we were now seeing that the default
>> partition OID was not invalid.  So it was rather a bug in
>> ValidatePartitionConstraints, in that it was accepting to set the
>> expression to validate when another expression had already been set!  I
>> added an assert to protect 

Re: Missing parse_merge.h?

2018-04-03 Thread Satoshi Nagayasu
2018-04-03 18:11 GMT+09:00 Pavan Deolasee :
>
>
> On Tue, Apr 3, 2018 at 2:37 PM, Satoshi Nagayasu  wrote:
>>
>> Hi,
>>
>> I just got a compile error as below on the latest HEAD.
>> ---
>> gcc -Wall -Wmissing-prototypes -Wpointer-arith
>> -Wdeclaration-after-statement -Wendif-labels
>> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
>> -fwrapv -fexcess-precision=standard -g -O2 -I. -I.
>> -I../../../src/include  -D_GNU_SOURCE   -c -o analyze.o analyze.c
>>
>> analyze.c:41:32: fatal error: parser/parse_merge.h: No such file or
>> directory
>>
>> #include "parser/parse_merge.h"
>>
>>^
>>
>> compilation terminated.
>>
>> make[3]: *** [analyze.o] Error 1
>> ---
>>
>> The latest commit [1] looks missing parse_merge.h.
>> Or did I make some mistake on building?
>>
>
> Looks like Simon forgot to add new files in that commit. I am trying to get
> in touch with him so that he can add the missing files and correct the
> situation. Sorry for the trouble.

Thanks, I will be waiting for the fix.

Regards,
-- 
Satoshi Nagayasu 



Re: Missing parse_merge.h?

2018-04-03 Thread Pavan Deolasee
On Tue, Apr 3, 2018 at 2:37 PM, Satoshi Nagayasu  wrote:

> Hi,
>
> I just got a compile error as below on the latest HEAD.
> ---
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -fexcess-precision=standard -g -O2 -I. -I.
> -I../../../src/include  -D_GNU_SOURCE   -c -o analyze.o analyze.c
>
> analyze.c:41:32: fatal error: parser/parse_merge.h: No such file or
> directory
>
> #include "parser/parse_merge.h"
>
>^
>
> compilation terminated.
>
> make[3]: *** [analyze.o] Error 1
> ---
>
> The latest commit [1] looks missing parse_merge.h.
> Or did I make some mistake on building?
>
>
Looks like Simon forgot to add new files in that commit. I am trying to get
in touch with him so that he can add the missing files and correct the
situation. Sorry for the trouble.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

2018-04-03 Thread Alexander Korotkov
On Sun, Apr 1, 2018 at 12:06 AM, Tomas Vondra 
wrote:

> On 03/31/2018 10:43 PM, Tomas Vondra wrote:
> > ...
> > But I'm pretty sure it may lead to surprising behavior - for example if
> > you disable incremental sorts (enable_incrementalsort=off), the plan
> > will switch to plain sort without the additional costs. So you'll get a
> > cheaper plan by disabling some operation. That's surprising.
> >
>
> To illustrate this is a valid issue, consider this trivial example:
>
> create table t (a int, b int, c int);
>
> insert into t select 10*random(), 10*random(), 10*random()
>   from generate_series(1,100) s(i);
>
> analyze t;
>
> explain select * from (select * from t order by a,b) foo order by a,b,c;
>
>QUERY PLAN
> 
>  Incremental Sort  (cost=133100.48..264139.27 rows=100 width=12)
>Sort Key: t.a, t.b, t.c
>Presorted Key: t.a, t.b
>->  Sort  (cost=132154.34..134654.34 rows=100 width=12)
>  Sort Key: t.a, t.b
>  ->  Seq Scan on t  (cost=0.00..15406.00 rows=100 width=12)
> (6 rows)
>
> set enable_incrementalsort = off;
>
> explain select * from (select * from t order by a,b) foo order by a,b,c;
>QUERY PLAN
> 
>  Sort  (cost=261402.69..263902.69 rows=100 width=12)
>Sort Key: t.a, t.b, t.c
>->  Sort  (cost=132154.34..134654.34 rows=100 width=12)
>  Sort Key: t.a, t.b
>  ->  Seq Scan on t  (cost=0.00..15406.00 rows=100 width=12)
> (5 rows)
>
> So the cost with incremental sort was 264139, and after disabling the
> incremental cost it dropped to 263902. Granted, the difference is
> negligible in this case, but it's still surprising.
>
> Also, it can be made much more significant by reducing the number of
> prefix groups in the data:
>
> truncate t;
>
> insert into t select 1,1,1 from generate_series(1,100) s(i);
>
> analyze t;
>
> set enable_incrementalsort = on;
>
> explain select * from (select * from t order by a,b) foo order by a,b,c;
>
>QUERY PLAN
> 
>  Incremental Sort  (cost=324165.83..341665.85 rows=100 width=12)
>Sort Key: t.a, t.b, t.c
>Presorted Key: t.a, t.b
>->  Sort  (cost=132154.34..134654.34 rows=100 width=12)
>  Sort Key: t.a, t.b
>  ->  Seq Scan on t  (cost=0.00..15406.00 rows=100 width=12)
> (6 rows)
>
> So that's 263902 vs. 341665, yet we still prefer the incremental mode.


The problem is well defined, thank you.
I'll check what can be done in this area today.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Missing parse_merge.h?

2018-04-03 Thread Satoshi Nagayasu
Hi,

I just got a compile error as below on the latest HEAD.
---
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -g -O2 -I. -I.
-I../../../src/include  -D_GNU_SOURCE   -c -o analyze.o analyze.c

analyze.c:41:32: fatal error: parser/parse_merge.h: No such file or directory

#include "parser/parse_merge.h"

   ^

compilation terminated.

make[3]: *** [analyze.o] Error 1
---

The latest commit [1] looks missing parse_merge.h.
Or did I make some mistake on building?

Regards,

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d204ef63776b8a00ca220adec23979091564e465
-- 
Satoshi Nagayasu 



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-04-03 Thread Heikki Linnakangas

On 27/03/18 21:02, Tom Lane wrote:

Robert Haas  writes:

On Tue, Mar 27, 2018 at 11:41 AM, Tom Lane  wrote:

This is ignoring the possibility of damaged data in between, ie
A ... B ... CHKPT ... C ...  a few zeroed pages ... D ... CHKPT ... E ... F



It's hard for me to believe that this case matters very much.  If
you're trying to run pg_rewind on a system where the WAL segments
contain a few zeroed pages, you're probably going to be hosed anyway,
if not by this particular thing then by something else.


Well, the point of checkpoints is that WAL data before the last one
should no longer matter anymore, isn't it?

But really this is just one illustration of the point, which is that
changing the WAL header definition as proposed *does* cost us reliability.
We can argue about whether better concurrency in WAL insertion is
worth that price, but claiming that the price is zero is flat wrong.

For me, the more important issue is that checking xl_prev not only
validates that the record is where it is supposed to be, but that you
arrived at it correctly.  Replacing it with xl_curr would keep the
guarantee of the former (in fact likely make it stronger); but we'd
completely lose the latter.


Yeah, removing xl_prev would certainly remove a useful cross-check from
the format. Maybe it would be worth it; I don't think it's that
critical. And adding some extra information in the page or segment
headers might alleviate that.


But let's go back to why we're considering this. The idea was to 
optimize this block:



/*
 * The duration the spinlock needs to be held is minimized by minimizing
 * the calculations that have to be done while holding the lock. The
 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
 * page headers. The mapping between "usable" byte positions and physical
 * positions (XLogRecPtrs) can be done outside the locked region, and
 * because the usable byte position doesn't include any headers, reserving
 * X bytes from WAL is almost as simple as "CurrBytePos += X".
 */
SpinLockAcquire(&Insert->insertpos_lck);

startbytepos = Insert->CurrBytePos;
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
Insert->CurrBytePos = endbytepos;
Insert->PrevBytePos = startbytepos;

SpinLockRelease(&Insert->insertpos_lck);


One trick that we could do is to replace that with a 128-bit atomic
compare-and-swap instruction. Modern 64-bit Intel systems have that; it's
called CMPXCHG16B. I don't know about other architectures. An atomic
fetch-and-add, as envisioned in the comment above, would presumably be 
better, but I suspect that a compare-and-swap would be good enough to 
move the bottleneck elsewhere again.
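
To sketch what that could look like with the GCC/Clang builtins on x86-64
(requires -mcx16 and may go through libatomic depending on the compiler;
purely illustrative, not a proposal-quality patch):

#include <stdbool.h>
#include <stdint.h>

/* Both byte positions packed into one 16-byte, 16-byte-aligned word. */
typedef struct
{
    uint64_t    CurrBytePos;
    uint64_t    PrevBytePos;
} XLogInsertPos;

static _Alignas(16) XLogInsertPos insertpos;

/* Returns the old positions: .CurrBytePos is start, .PrevBytePos is prev. */
static XLogInsertPos
ReserveXLogBytes(uint64_t size)
{
    XLogInsertPos cur, next;

    __atomic_load((__int128 *) &insertpos, (__int128 *) &cur, __ATOMIC_RELAXED);
    do
    {
        next.CurrBytePos = cur.CurrBytePos + size;
        next.PrevBytePos = cur.CurrBytePos;
    } while (!__atomic_compare_exchange((__int128 *) &insertpos,
                                        (__int128 *) &cur,
                                        (__int128 *) &next,
                                        false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));

    return cur;
}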


- Heikki



Re: ALTER TABLE ADD COLUMN fast default

2018-04-03 Thread Andrew Dunstan
On Fri, Mar 30, 2018 at 10:08 AM, Andrew Dunstan
 wrote:
> On Fri, Mar 30, 2018 at 5:00 AM, Andres Freund  wrote:

[ missing values are loaded in the TupleDesc by RelationBuildTupleDesc ]

>>> > I'm still strongly object to do doing this unconditionally for queries
>>> > that never need any of this.  We're can end up with a significant number
>>> > of large tuples in memory here, and then copy that through dozens of
>>> > tupledesc copies in queries.
>>
>>> We're just doing the same thing we do for default values.
>>
>> That's really not a defense. It's hurting us there too.
>>
>
>
> I will take a look and see if it can be avoided easily.
>
>


For missing values there are three places where we would need to load
them on demand: getmissingattr(), slot_getmissingattrs() and
expand_tuple(). However, none of those functions have obvious access
to the information required to load the missing values. We could
probably do something very ugly like getting the relid from the
TupleDesc's first attribute. Maybe someone else has a better option.
But I can't see a simple and clean way to do this.
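
Just to put the "ugly" idea on record as code (a sketch, not proposed for
commit):

/*
 * Sketch: lazily populate tupdesc->constr->missing by peeking at the
 * relation OID stored in the first attribute.
 */
static void
load_missing_values(TupleDesc tupdesc)
{
    Oid         relid;

    if (tupdesc->natts == 0 ||
        tupdesc->constr == NULL ||
        tupdesc->constr->missing != NULL)
        return;                 /* nothing to do, or already loaded */

    relid = TupleDescAttr(tupdesc, 0)->attrelid;

    /* ... look up atthasmissing/attmissingval for relid and fill it in ... */
}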


cheers

andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


