Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 13:04, Michael Paquier  wrote:
> On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:
>> Another consequence of this behavior is that initdb -S is never reliable,
>> so pg_rewind's use of it doesn't actually fix the problem it was
>> intended to solve.  It also means that initdb itself isn't crash-safe,
>> since the data file changes are made by the backend but initdb itself
>> is doing the fsyncs, and initdb has no way of knowing what files the
>> backend is going to create and therefore can't -- even theoretically
>> -- open them first.
>
> And pg_basebackup.  And pg_dump.  And pg_dumpall.  Anything using initdb
> -S or fsync_pgdata would enter those waters.

... but *only if they hit an I/O error* or they're on a FS that
doesn't reserve space and hit ENOSPC.

It still does 99% of the job. It still flushes all buffers to
persistent storage and maintains write ordering. It may not detect and
report failures to the user the way we'd expect it to, yes, and that's
not great. But it's hardly throw-up-our-hands-and-give-up territory
either. Also, at least for initdb, we can make initdb fsync() its own
files before close(). Annoying, but hardly the end of the world.
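
A minimal sketch of that initdb-side change (hypothetical helper, not the
actual initdb code; the point is just checking fsync() before close()):

#include <fcntl.h>
#include <unistd.h>

/* Write a file initdb itself creates, and fsync() it before close() so a
 * writeback failure is reported to us instead of being silently dropped. */
static int
write_and_sync(const char *path, const void *buf, size_t len)
{
	int			fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len ||
		fsync(fd) != 0)			/* check fsync() *before* close() */
	{
		close(fd);
		return -1;
	}
	return close(fd);
}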

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Andrey Borodin


> On 9 Apr 2018, at 23:04, Heikki Linnakangas  wrote:
> 
> On 09/04/18 18:21, Andrey Borodin wrote:
>>> On 9 Apr 2018, at 19:50, Teodor Sigaev  wrote:
 3. Why do we *not* lock the entry leaf page, if there is no
 match? We still need a lock to remember that we probed for that
 value and there was no match, so that we conflict with a tuple
 that might be inserted later. At least #3 is a bug. The attached
 patch adds an isolation test that demonstrates it. #1 and #2 are
 weird, and cause unnecessary locking, so I think we should fix
 those too, even if they don't lead to incorrect results.
>>> I can't find a hole here. Agree.
>> Please correct me if I'm wrong. Let's say we have posting trees for
>> word A and word B. We are looking for a document that contains both. We will
>> read through the whole posting tree of A, but only through some
>> segments of B. If we do not find anything in B, we only have to lock
>> the segments where we actually looked, not the whole posting tree of B.
> 
> True, that works. It was not clear from the code or comments that that was 
> intended. I'm not sure if that's worthwhile, compared to locking just the 
> posting tree root block.
From the text search POV this is rather coarse granularity: if you have
frequent words like "the", "a", "in", conflicts are inevitable.
I'm not sure we have the means for picking the optimal granularity: should it
be ranges of postings, ranges of pages of posting trees, entries, pages of
entries, or the whole index?
Technically, [time for locking] should be less than [time of transaction
retry]*[probability of conflict]. Under this constraint we should minimize
[time for locking] + [time of transaction retry]*[probability of conflict].
I suspect that [time for locking] is some orders of magnitude less than the
time of a transaction, so efforts should be skewed towards smaller granularity
to reduce [probability of conflict].
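In symbols (the same heuristic, nothing more rigorous): with granularity g,
locking overhead L(g), transaction retry cost R, and conflict probability P(g),

    \min_g \; L(g) + R \cdot P(g) \qquad \text{subject to} \qquad L(g) < R \cdot P(g)

and since L(g) \ll R, the optimum is pushed toward finer granularity, i.e.
smaller P(g).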
But all this is not real math and has no scientific rigor.

> I'll let Teodor decide..
+1. I believe this is very close to an optimal solution :)

Best regards, Andrey Borodin.

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Michael Paquier
On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:
> Another consequence of this behavior is that initdb -S is never reliable,
> so pg_rewind's use of it doesn't actually fix the problem it was
> intended to solve.  It also means that initdb itself isn't crash-safe,
> since the data file changes are made by the backend but initdb itself
> is doing the fsyncs, and initdb has no way of knowing what files the
> backend is going to create and therefore can't -- even theoretically
> -- open them first.

And pg_basebackup.  And pg_dump.  And pg_dumpall.  Anything using initdb
-S or fsync_pgdata would enter those waters.
--
Michael




Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-09 Thread Michael Paquier
On Mon, Apr 09, 2018 at 10:59:48AM -0300, Alvaro Herrera wrote:
> Amit Langote wrote:
>> I have reproduced this and found that the problem is that
>> perform_pruning_combine_step forgets to *copy* the bitmapset of the first
>> step in the handling of a COMBINE_INTERSECT step.
> 
> Pushed, thanks Amit and Andreas!

Álvaro, didn't you forget to actually push the patch?
--
Michael




Re: lazy detoasting

2018-04-09 Thread Andrew Gierth
> "Chapman" == Chapman Flack  writes:

 Chapman> AFAICS, that is *all* that comment block has to say about why
 Chapman> there's an active snapshot stack. I believe you are saying it
 Chapman> has another important function, namely that its top element is
 Chapman> what tells the executor what can be seen.

That's not precisely true - ultimately, the routines that do actual
scans take the snapshot to use as a parameter, and the executor mostly
references the snapshot from the EState; but a bunch of places do
require that ActiveSnapshot be set to the currently applicable snapshot
(e.g. for queries inside stable functions nested inside the main query).
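
As a concrete illustration (a sketch, not code from the tree —
PushActiveSnapshot/PopActiveSnapshot and GetTransactionSnapshot are the real
snapmgr.h routines, but the wrapper is made up):

#include "postgres.h"
#include "utils/snapmgr.h"

/* Run a callback with ActiveSnapshot set, as code reached from inside a
 * stable function must be able to assume. */
static void
run_with_active_snapshot(void (*callback) (void))
{
	PushActiveSnapshot(GetTransactionSnapshot());
	callback();					/* may rely on GetActiveSnapshot() here */
	PopActiveSnapshot();
}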

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] path toward faster partition pruning

2018-04-09 Thread Ashutosh Bapat
On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas  wrote:
> On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane  wrote:
>> David Rowley  writes:
>>> Sounds like you're saying that if we have too many alternative files
>>> then there's a chance that one could pass by luck.
>>
>> Yeah, exactly: it passed, but did it pass for the right reason?
>>
>> If there's just two expected-files, it's likely not a big problem,
>> but if you have a bunch it's something to worry about.
>>
>> I'm also wondering how come we had hash partitioning before and
>> did not have this sort of problem.  Is it just that we added a
>> new test that's more sensitive to the details of the hashing
>> (if so, could it be made less so)?  Or is there actually more
>> platform dependence now than before (and if so, why is that)?
>
> The existing hash partitioning tests did have some dependencies on the
> hash function, but they took care not to use the built-in hash
> functions.  Instead they did stuff like this:
>
> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
> CREATE TABLE mchash (a int, b text, c jsonb)
>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>
> I think that this approach should also be used for the new tests.
> Variant expected output files are a pain to maintain, and you
> basically just have to take whatever output you get as the right
> answer, because nobody knows what output a certain built-in hash
> function should produce for a given input except by running the code.
> If you do the kind of thing shown above, though, then you can easily
> see by inspection that you're getting the right answer.

+1.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: pruning disabled for array, enum, record, range type partition keys

2018-04-09 Thread Amit Langote
Thanks for the comment.

On 2018/04/09 23:22, Tom Lane wrote:
> Amit Langote  writes:
>> I noticed that the newly added pruning does not work if the partition key
>> is of one of the types that have a corresponding pseudo-type.
> 
> While I don't recall the details due to acute caffeine shortage,
> there are specific coding patterns that are used in the planner
> (e.g. in indxpath.c) to ensure that the code works with pseudotype
> opclasses as well as simple ones.  The existence of this bug
> indicates that those conventions were not followed in the pruning
> code.  I wonder whether this patch makes the pruning code look
> more like the pre-existing code, or even less like it.

Ah, I think I got it after staring at the (btree) index code for a bit.

What the pruning code got wrong is that it compares the expression type
(the type of the constant arg that will be compared with partition bound
datums when pruning) with the partopcintype to determine whether to look
up the cross-type comparison/hashing procedure, whereas what it should
compare with is the clause operator's oprighttype.  ISTM that if
op_in_opfamily() passed for the operator, that's the correct thing to do.

Once I changed the code to do it that way, no special considerations are
needed to handle partition keys of pseudo-types.

Attached please find the updated patch.

Thanks,
Amit
From 1c136a321153cca04c0842807c2cd166e79f2556 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 8 Dec 2017 19:09:31 +0900
Subject: [PATCH v2] Fix pruning code to determine comparison function
 correctly

It's unreliable to determine one using the constant expression's
type directly (for example, it doesn't work correctly when the
expression contains an array, enum, etc.).  Instead, use the righttype
of the operator, the one that supposedly passed the op_in_opfamily
test using the partitioning opfamily.
---
 src/backend/partitioning/partprune.c  | 29 +---
 src/test/regress/expected/partition_prune.out | 98 +++
 src/test/regress/sql/partition_prune.sql  | 44 +++-
 3 files changed, 161 insertions(+), 10 deletions(-)

diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 7666c6c412..2655d2caa2 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -1426,10 +1426,12 @@ match_clause_to_partition_key(RelOptInfo *rel,
 			OpExpr	   *opclause = (OpExpr *) clause;
 			Expr	   *leftop,
 					   *rightop;
-			Oid			commutator = InvalidOid,
+			Oid			op_lefttype,
+						op_righttype,
+						commutator = InvalidOid,
 						negator = InvalidOid;
 			Oid			cmpfn;
-			Oid			exprtype;
+			int			op_strategy;
 			bool		is_opne_listp = false;
 			PartClauseInfo *partclause;
 
@@ -1517,10 +1519,20 @@ match_clause_to_partition_key(RelOptInfo *rel,
 					return PARTCLAUSE_UNSUPPORTED;
 			}
 
-			/* Check if we're going to need a cross-type comparison function. */
-			exprtype = exprType((Node *) expr);
-			if (exprtype != part_scheme->partopcintype[partkeyidx])
+			/*
+			 * Check if we're going to need a cross-type comparison function to
+			 * use during pruning.
+			 */
+			get_op_opfamily_properties(OidIsValid(negator)
+									   ? negator : opclause->opno,
+									   partopfamily, false,
+									   &op_strategy, &op_lefttype, &op_righttype);
+			/* Use the cached one if no cross-type comparison. */
+			if (op_righttype == part_scheme->partopcintype[partkeyidx])
+				cmpfn = part_scheme->partsupfunc[partkeyidx].fn_oid;
+			else
 			{
+				/* Otherwise, look the correct one up in the catalog. */
 				switch (part_scheme->strategy)
 				{
 					case PARTITION_STRATEGY_LIST:
@@ -1528,13 +1540,14 @@ match_clause_to_partition_key(RelOptInfo *rel,
 						cmpfn =
 							get_opfamily_proc(part_scheme->partopfamily[partkeyidx],
 											  part_scheme->partopcintype[partkeyidx],
-											  exprtype, BTORDER_PROC);
+

Gotchas about pg_verify_checksums

2018-04-09 Thread Michael Paquier
Hi all,

I have not been giving much attention to the thread about enabling
checksums online, which has resulted in the revert of the feature, but
there is still pg_verify_checksums around.  So I looked at it a bit.

I have a couple of questions/gotchas about it:
1) The documentation states that the cluster needs to be offline.
Doesn't this imply that the cluster could also have been forcibly killed?
It seems to me that the documentation ought to say that the cluster needs
to be shut down cleanly instead.  Mentioning that only in the notes of
the documentation would be enough in my opinion.
2) On a cluster where checksums are disabled, aka the control file says
so, using --force does not result in incorrect blocks being reported.
This is caused by the check on PG_DATA_CHECKSUM_VERSION, which does not
match for a cluster where checksums have been disabled.  Wouldn't one
want to know this information as well, to know which portions of the data
folder have not yet been treated by checksum updates?
3) Is the force option actually useful?  I assume that --force is useful
if one wants to check how checksums are computed when a switch from off
to on is done, to see the progress of the operation or how much progress
has been made after an online switch, which is also what a228cc13
outlines.  Still, this requires the cluster to be offline...

Attached is a patch which addresses point 1) (with one markup fix as a
bonus).  I'd love to get more input about the other bullet points.
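
For reference, the clean-shutdown requirement in 1) boils down to a
control-file state check; a sketch follows (get_controlfile() is the real
src/common helper, though its exact signature is an assumption here and has
moved around across versions, and the error handling is simplified):

#include "postgres_fe.h"
#include "catalog/pg_control.h"
#include "common/controldata_utils.h"

/* Refuse to scan unless the cluster was shut down cleanly. */
static void
check_cluster_state(const char *DataDir, const char *progname)
{
	bool		crc_ok;
	ControlFileData *ControlFile = get_controlfile(DataDir, progname, &crc_ok);

	if (!crc_ok ||
		(ControlFile->state != DB_SHUTDOWNED &&
		 ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY))
	{
		fprintf(stderr, "%s: cluster must be shut down cleanly\n", progname);
		exit(1);
	}
}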

Thanks,
--
Michael
diff --git a/doc/src/sgml/ref/pg_verify_checksums.sgml b/doc/src/sgml/ref/pg_verify_checksums.sgml
index 463ecd5e1b..e1f12ad4c4 100644
--- a/doc/src/sgml/ref/pg_verify_checksums.sgml
+++ b/doc/src/sgml/ref/pg_verify_checksums.sgml
@@ -30,8 +30,8 @@ PostgreSQL documentation
  
   Description
   
-   pg_verify_checksums verifies data checksums in a PostgreSQL
-   cluster. It must be run against a cluster that's offline.
+   pg_verify_checksums verifies data checksums in a
+   PostgreSQL cluster.
   
  
 
@@ -97,7 +97,8 @@ PostgreSQL documentation
  
   Notes
   
-Can only be run when the server is offline.
+   The cluster must be shut down cleanly before running
+   pg_verify_checksums.
   
  
 




Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-09 Thread Amit Langote
On 2018/04/09 22:59, Alvaro Herrera wrote:
> Hello,
> 
> Amit Langote wrote:
> 
>> I have reproduced this and found that the problem is that
>> perform_pruning_combine_step forgets to *copy* the bitmapset of the first
>> step in the handling of a COMBINE_INTERSECT step.
> 
> Pushed, thanks Amit and Andreas!

Thanks!

Regards,
Amit




Re: Warnings and uninitialized variables in TAP tests

2018-04-09 Thread Michael Paquier
On Mon, Apr 09, 2018 at 09:46:29PM +0200, Magnus Hagander wrote:
> Applied, thanks.

Thanks for the commit.
--
Michael




Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-09 Thread Michael Paquier
On Mon, Apr 09, 2018 at 04:46:34PM -0400, Tom Lane wrote:
> After further contemplation I decided that that was, in fact, the only
> reasonable way to improve matters.  If we have multiple subdirectories
> independently firing the "make generated-headers" action, then we have
> parallel make hazards of just the same sort I was trying to prevent.
> So it's really an all-or-nothing proposition.  The MAKELEVEL hack
> plus wiring the prerequisite into the recursion rules is the best way
> to make that happen.

Oh.  Good idea.  Thanks for the fix.  Hopefully that won't cause much
noise when bisecting bugs, as this got in pretty quickly..

No more complaints from here.
--
Michael




Re: ON CONFLICT DO UPDATE for partitioned tables

2018-04-09 Thread Amit Langote
On 2018/03/27 13:27, Amit Langote wrote:
> On 2018/03/26 23:20, Alvaro Herrera wrote:
>> The one thing I wasn't terribly in love with is the four calls to
>> map_partition_varattnos(), creating the attribute map four times ... but
>> we already have it in the TupleConversionMap, no?  Looks like we could
>> save a bunch of work there.
> 
> Hmm, actually we can't use that map, assuming you're talking about the
> following map:
> 
>   TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
> 
> We can only use that to tell if we need converting expressions (as we
> currently do), but it cannot be used to actually convert the expressions.
> The map in question is for use by do_convert_tuple(), not to map varattnos
> in Vars using map_variable_attnos().
> 
> But it's definitely a bit undesirable to have various
> map_partition_varattnos() calls within ExecInitPartitionInfo() to
> initialize the same information (the map) multiple times.

I will try to think of doing something about this later this week.

>> And a final item is: can we have a whole-row expression in the clauses?
>> We currently don't handle those either, not even to throw an error.
>> [figures a test case] ... and now that I test it, it does crash!
>>
>> create table part (a int primary key, b text) partition by range (a);
>> create table part1 (b text, a int not null);
>> alter table part attach partition part1 for values from (1) to (1000);
>> insert into part values (1, 'two') on conflict (a)
>>   do update set b = format('%s (was %s)', excluded.b, part.b)
>>   where part.* *<> (1, text 'two');
>>
>> I think this means we should definitely handle found_whole_row.  (If you
>> create part1 in the normal way, it works as you'd expect.)
> 
> I agree.  That means we simply remove the Assert after the
> map_partition_varattnos call.
> 
>> I'm going to close a few other things first, then come back to this.
> 
> Attached find a patch to fix the whole-row expression issue.  I added your
> test to insert_conflict.sql.

Adding this to the open items list.

Thanks,
Amit




Re: WIP: Covering + unique indexes.

2018-04-09 Thread Peter Geoghegan
On Sun, Apr 8, 2018 at 11:19 PM, Teodor Sigaev  wrote:
> Thank you, pushed.

I noticed a few more issues following another pass-through of the patch:

* There is no pfree() within _bt_buildadd() for truncated tuples, even
though that's a context where it's clearly not okay.

* It might be a good idea to also pfree() the truncated tuple for most
other _bt_buildadd() callers. Even though it's arguably okay in other
cases, it seems worth being consistent about it (consistent with old
nbtree code).

* There should probably be some documentation around why it's okay
that we call index_truncate_tuple() with an exclusive buffer lock held
(during a page split). For example, there should probably be a comment
on the VARATT_IS_EXTERNAL() situation.

* Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
to inner tuples, which might be worth doing something about (perhaps
just renaming the macro).

I do not have the time to write a patch right away, but I should be
able to post one in a few days. I want to avoid sending several small
patches.

-- 
Peter Geoghegan



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 08:41, Andreas Karlsson  wrote:
> On 04/09/2018 02:16 PM, Craig Ringer wrote:
>>
>> I'd like a middle ground where the kernel lets us register our interest
>> and tells us if it lost something, without us having to keep eight million
>> FDs open for some long period. "Tell us about anything that happens under
>> pgdata/" or an inotify-style per-directory-registration option. I'd even say
>> that's ideal.
>
>
> Could there be a risk of a race condition here where fsync incorrectly
> returns success before we get the notification of that something went wrong?

We'd examine the notification queue only once all our checkpoint
fsync()s had succeeded, and before we updated the control file to
advance the redo position.
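
In outline (every function name below is hypothetical, since the kernel
interface being wished for doesn't exist; only the ordering matters):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for checkpointer internals and the imagined
 * kernel notification queue. */
extern void flush_all_dirty_segments(void);			/* all checkpoint fsync()s */
extern int	writeback_error_queue_nonempty(void);	/* imagined kernel API */
extern void update_control_file_redo_pointer(void);

static void
checkpoint_once(void)
{
	/* 1. Complete every checkpoint fsync() first. */
	flush_all_dirty_segments();

	/* 2. Only then drain the (hypothetical) error-notification queue. */
	if (writeback_error_queue_nonempty())
	{
		fprintf(stderr, "PANIC: writeback failed since last checkpoint\n");
		abort();				/* crash and redo rather than advance */
	}

	/* 3. Advance the redo position only if steps 1 and 2 passed. */
	update_control_file_redo_pointer();
}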

I'm intrigued by the suggestion upthread of using a kprobe or similar
to achieve this. It's a horrifying unportable hack that'd make kernel
people cry, and I don't know if we have any way to flush buffered
probe data to be sure we really get the news in time, but it's a cool
idea too.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund


On April 9, 2018 6:59:03 PM PDT, Craig Ringer  wrote:
>On 10 April 2018 at 04:37, Andres Freund  wrote:
>> Hi,
>>
>> On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
>>> Maybe. I'd certainly prefer automated recovery from temporary I/O
>>> issues (like full disk on thin-provisioning) without the database
>>> crashing and restarting. But I'm not sure it's worth the effort.
>>
>> Oh, I agree on that one. But that's more a question of how we force the
>> kernel's hand on allocating disk space. In most cases the kernel
>> allocates the disk space immediately, even if delayed allocation is in
>> effect. For the cases where that's not the case (if there are current
>> ones, rather than just past bugs), we should be able to make sure that's
>> not an issue by pre-zeroing the data and/or using fallocate.
>
>Nitpick: In most cases the kernel reserves disk space immediately,
>before returning from write(). NFS seems to be the main exception
>here.
>
>EXT4 and XFS don't allocate until later, doing so by performing actual
>writes to FS metadata, initializing disk blocks, etc. So we won't
>notice errors that are only detectable at actual time of allocation,
>like thin provisioning problems, until after write() returns and we
>face the same writeback issues.
>
>So I reckon you're safe from space-related issues if you're not on NFS
>(and whyyy would you do that?) and not thinly provisioned. I'm sure
>there are other corner cases, but I don't see any reason to expect
>space-exhaustion-related corruption problems on a sensible FS backed
>by a sensible block device. I haven't tested things like quotas,
>verified how reliable space reservation is under concurrency, etc as
>yet.

How's that not solved by pre zeroing and/or fallocate as I suggested above?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-09 Thread Andres Freund


On April 9, 2018 6:57:23 PM PDT, Alvaro Herrera  wrote:
>Andres Freund wrote:
>> 
>> On April 9, 2018 6:31:07 PM PDT, Alvaro Herrera
> wrote:
>
>> >Would it work to use this second pipe, to which each child writes a
>> >byte that postmaster never reads, and then rely on SIGPIPE when
>> >postmaster dies?  Then we never need to do a syscall.
>> 
>> I'm not following, could you expand on what you're suggesting?  Note
>> that you do not get SIGPIPE for already buffered writes.  Which
>> syscall can we avoid?
>
>Ah.  I was thinking we'd get SIGPIPE from the byte sent at the start,
>as
>soon as the kernel saw that postmaster abandoned the fd by dying.
>Scratch that then.

Had the same idea, but unfortunately reality, in the form of a test program, 
cured me of my hope ;)
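
(The experiment is easy to reproduce. A sketch along these lines shows that a
byte already sitting in the pipe buffer produces no SIGPIPE when the reader
exits; only the next write() fails:)

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int			fds[2];

	pipe(fds);
	if (fork() == 0)
	{
		/* child plays "postmaster": holds the read end, never reads, dies */
		close(fds[1]);
		sleep(1);
		_exit(0);
	}
	close(fds[0]);
	signal(SIGPIPE, SIG_IGN);	/* observe EPIPE instead of dying */

	write(fds[1], "x", 1);		/* buffered; no error, no signal */
	sleep(2);					/* reader exits meanwhile; still nothing */

	if (write(fds[1], "y", 1) < 0)	/* only a new write fails: EPIPE */
		perror("second write");
	return 0;
}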
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 04:37, Andres Freund  wrote:
> Hi,
>
> On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
>> Maybe. I'd certainly prefer automated recovery from temporary I/O
>> issues (like full disk on thin-provisioning) without the database
>> crashing and restarting. But I'm not sure it's worth the effort.
>
> Oh, I agree on that one. But that's more a question of how we force the
> kernel's hand on allocating disk space. In most cases the kernel
> allocates the disk space immediately, even if delayed allocation is in
> effect. For the cases where that's not the case (if there are current
> ones, rather than just past bugs), we should be able to make sure that's
> not an issue by pre-zeroing the data and/or using fallocate.

Nitpick: In most cases the kernel reserves disk space immediately,
before returning from write(). NFS seems to be the main exception
here.

EXT4 and XFS don't allocate until later, doing so by performing actual
writes to FS metadata, initializing disk blocks, etc. So we won't
notice errors that are only detectable at actual time of allocation,
like thin provisioning problems, until after write() returns and we
face the same writeback issues.

So I reckon you're safe from space-related issues if you're not on NFS
(and whyyy would you do that?) and not thinly provisioned. I'm sure
there are other corner cases, but I don't see any reason to expect
space-exhaustion-related corruption problems on a sensible FS backed
by a sensible block device. I haven't tested things like quotas,
verified how reliable space reservation is under concurrency, etc as
yet.
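
(For what it's worth, the pre-allocation Andres keeps suggesting looks
roughly like this: posix_fallocate() reserves the whole segment up front, so
ENOSPC surfaces at allocation time rather than at writeback. A sketch only,
with a hypothetical helper name, not actual PostgreSQL code:)

#include <fcntl.h>
#include <unistd.h>

/* Reserve space for a new segment before any page of it is dirtied. */
static int
reserve_segment(const char *path, off_t size)
{
	int			fd = open(path, O_CREAT | O_WRONLY, 0600);
	int			rc;

	if (fd < 0)
		return -1;
	rc = posix_fallocate(fd, 0, size);	/* returns errno-style code, 0 on success */
	if (rc != 0)
	{
		close(fd);
		unlink(path);
		return rc;				/* e.g. ENOSPC reported here, not at fsync() */
	}
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}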

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-09 Thread Alvaro Herrera
Andres Freund wrote:
> 
> On April 9, 2018 6:31:07 PM PDT, Alvaro Herrera  
> wrote:

> >Would it work to use this second pipe, to which each child writes a
> >byte that postmaster never reads, and then rely on SIGPIPE when
> >postmaster dies?  Then we never need to do a syscall.
> 
> I'm not following, could you expand on what you're suggesting?  Note
> that you do not get SIGPIPE for already buffered writes.  Which
> syscall can we avoid?

Ah.  I was thinking we'd get SIGPIPE from the byte sent at the start, as
soon as the kernel saw that postmaster abandoned the fd by dying.
Scratch that then.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 04:25, Mark Dilger  wrote:

> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.

Yes, it can, but not directly through the first error.

What can happen is that we think a block got written when it didn't.

If our in memory state diverges from our on disk state, we can make
subsequent WAL writes based on that wrong information. But that's
actually OK, since the standby will have replayed the original WAL
correctly.

I think the only time we'd run into trouble is if we evict the good
(but not written out) data from s_b and the fs buffer cache, then
later read in the old version of a block we failed to overwrite. Data
checksums (if enabled) might catch it unless the write left the whole
block stale. In that case we might generate a full page write with the
stale block and propagate that over WAL to the standby.

So I'd say standbys are relatively safe - very safe if the issue is
caught promptly, and less so over time. But AFAICS WAL-based
replication (physical or logical) is not a perfect defense for this.

However, remember, if your storage system is free of any sort of
overprovisioning, is on a non-network file system, and doesn't use
multipath (or sets it up right) this issue *is exceptionally unlikely
to affect you*.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Thomas Munro
On Tue, Apr 10, 2018 at 1:44 PM, Craig Ringer  wrote:
> On 10 April 2018 at 03:59, Andres Freund  wrote:
>> I don't think that's as hard as some people argued in this thread.  We
>> could very well open a pipe in postmaster with the write end open in
>> each subprocess, and the read end open only in checkpointer (and
>> postmaster, but unused there).  Whenever closing a file descriptor that
>> was dirtied in the current process, send it over the pipe to the
>> checkpointer. The checkpointer then can receive all those file
>> descriptors (making sure it's not above the limit, fsync(), close()ing
>> to make room if necessary).  The biggest complication would presumably
>> be to deduplicate the received filedescriptors for the same file,
>> without losing track of any errors.
>
> Yep. That'd be a cheaper way to do it, though it wouldn't work on
> Windows. Though we don't know how Windows behaves here at all yet.
>
> Prior discussion upthread had the checkpointer open()ing a file at the
> same time as a backend, before the backend writes to it. But passing
> the fd when the backend is done with it would be better.

How would that interlock with concurrent checkpoints?

I can see how to make that work if the share-fd-or-fsync-now logic
happens in smgrwrite() when called by FlushBuffer() while you hold
io_in_progress, but not if you defer it to some random time later.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 10 April 2018 at 03:59, Andres Freund  wrote:
> On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:
>> On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
>> > You could make the argument that it's OK to forget if the entire file
>> > system goes away. But actually, why is that ok?
>>
>> I was going to say that it'd be okay to clear the error flag on umount, since
>> any opened files would prevent unmounting; but then I realized we need to
>> consider the case of close()ing all FDs then opening them later.. in another
>> process.
>
>> On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
>> > notification descriptor open, where the kernel would inject events
>> > related to writeback failures of files under watch (potentially
>> > enriched to contain info regarding the exact failed pages and
>> > the file offset they map to).
>>
>> For postgres that'd require backend processes to open() a file such that,
>> following its close(), any writeback errors are "signalled" to the
>> checkpointer process...
>
> I don't think that's as hard as some people argued in this thread.  We
> could very well open a pipe in postmaster with the write end open in
> each subprocess, and the read end open only in checkpointer (and
> postmaster, but unused there).  Whenever closing a file descriptor that
> was dirtied in the current process, send it over the pipe to the
> checkpointer. The checkpointer then can receive all those file
> descriptors (making sure it's not above the limit, fsync(), close()ing
> to make room if necessary).  The biggest complication would presumably
> be to deduplicate the received filedescriptors for the same file,
> without losing track of any errors.

Yep. That'd be a cheaper way to do it, though it wouldn't work on
Windows. Though we don't know how Windows behaves here at all yet.

Prior discussion upthread had the checkpointer open()ing a file at the
same time as a backend, before the backend writes to it. But passing
the fd when the backend is done with it would be better.

We'd need a way to dup() the fd and pass it back to a backend when it
needed to reopen it sometimes, or just make sure to keep the oldest
copy of the fd when a backend reopens multiple times, but that's no
biggie.
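
(A sketch of the hand-off itself: strictly speaking the "pipe" would need to
be an AF_UNIX socketpair(), since SCM_RIGHTS ancillary data is how open
descriptors travel between processes on POSIX systems. The helper below is
illustrative, not PostgreSQL code:)

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass one open fd to the checkpointer's end of a Unix-domain socketpair. */
static int
send_fd_to_checkpointer(int sock, int fd_to_pass)
{
	char		dummy = 'F';	/* must send at least one real byte */
	struct iovec iov;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	union
	{
		struct cmsghdr hdr;
		char		buf[CMSG_SPACE(sizeof(int))];
	}			u;				/* union guarantees cmsghdr alignment */

	iov.iov_base = &dummy;
	iov.iov_len = 1;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

	return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}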

We'd still have to fsync() out early in the checkpointer if we ran out
of space in our FD list, and initscripts would need to change our
ulimit or we'd have to do it ourselves in the checkpointer. But
neither seems insurmountable.

FWIW, I agree that this is a corner case, but it's getting to be a
pretty big corner with the spread of overcommitted, deduplicating SANs,
cloud storage, etc. Not all I/O errors indicate permanent hardware
faults, disk failures, etc, as I outlined earlier. I'm very curious to
know what AWS EBS's error semantics are, and other cloud network block
stores. (I posted on Amazon forums
https://forums.aws.amazon.com/thread.jspa?threadID=279274&tstart=0 but
nothing so far).

I'm also not particularly inclined to trust that all file systems will
always reliably reserve space without having some cases where they'll
fail writeback on space exhaustion.

So we don't need to panic and freak out, but it's worth looking at the
direction the storage world is moving in, and whether this will become
a bigger issue over time.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-09 Thread Andres Freund


On April 9, 2018 6:36:19 PM PDT, Thomas Munro  
wrote:
>On Tue, Apr 10, 2018 at 12:53 PM, Andres Freund 
>wrote:
>> I coincidentally got pinged about our current approach causing
>> performance problems on FreeBSD and started writing a patch.  The
>> problem there appears to be that constantly attaching events to the
>read
>> pipe end, from multiple processes, causes significant contention
>inside
>> the kernel. Which isn't that surprising.   That's distinct from the
>> problem netbsd/openbsd reported a while back (superfluous wakeups).
>>
>> That person said he'd work on adding an equivalent of linux'
>> prctl(PR_SET_PDEATHSIG) to FreeBSD.
>
>Just an idea, not tested: what about a reusable WaitEventSet with zero
>timeout?  Using the kqueue patch, that'd call kevent() which'd return
>immediately and tell you if any postmaster death notifications had
>arrived on your queue since last time you asked.  It doesn't even touch
>the pipe, or any other kernel objects apart from your own queue IIUC.

We still create a lot of WES ad hoc in a number of places. I don't think this
would solve that?  The problem isn't that IsAlive is expensive, it's that
polling is expensive. It's possible that using kqueue would fix that, because
the highest frequency cases use a persistent WES.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-09 Thread Thomas Munro
On Tue, Apr 10, 2018 at 12:53 PM, Andres Freund  wrote:
> I coincidentally got pinged about our current approach causing
> performance problems on FreeBSD and started writing a patch.  The
> problem there appears to be that constantly attaching events to the read
> pipe end, from multiple processes, causes significant contention inside
> the kernel. Which isn't that surprising.   That's distinct from the
> problem netbsd/openbsd reported a while back (superfluous wakeups).
>
> That person said he'd work on adding an equivalent of linux'
> prctl(PR_SET_PDEATHSIG) to FreeBSD.

Just an idea, not tested: what about a reusable WaitEventSet with zero
timeout?  Using the kqueue patch, that'd call kevent() which'd return
immediately and tell you if any postmaster death notifications had
arrived on your queue since last time you asked.  It doesn't even touch
the pipe, or any other kernel objects apart from your own queue IIUC.
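
Something like this, if I understand the latch API correctly (a sketch; the
WaitEventSet routines are the real latch.h interface, but the wrapper is
made up):

#include "postgres.h"
#include "storage/latch.h"
#include "utils/memutils.h"

static WaitEventSet *pmdeath_wes = NULL;

/* Cheaply poll for postmaster death with a reusable, zero-timeout WES. */
static bool
PostmasterDeathPending(void)
{
	WaitEvent	ev;

	if (pmdeath_wes == NULL)
	{
		pmdeath_wes = CreateWaitEventSet(TopMemoryContext, 1);
		AddWaitEventToSet(pmdeath_wes, WL_POSTMASTER_DEATH,
						  PGINVALID_SOCKET, NULL, NULL);
	}

	/* timeout of 0: returns immediately with 0 or 1 ready events */
	return WaitEventSetWait(pmdeath_wes, 0, &ev, 1, 0) > 0;
}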

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Boolean partitions syntax

2018-04-09 Thread Kyotaro HORIGUCHI
Hello, I have returned to this.

I'd like to insist on proposing to use an existing parser element.

At Mon, 9 Apr 2018 10:11:08 -0400, "Jonathan S. Katz" 
 wrote in 
<27021281-2ed7-4cde-9d82-366af10b3...@excoventures.com>
> > On Apr 9, 2018, at 10:06 AM, Tom Lane  wrote:
> > It's premature to discuss whether this could be back-patched when
> > we haven't got an acceptable patch yet.
> 
> Understood.

At Fri, 2 Mar 2018 16:49:29 +0900, Amit Langote  
wrote in <2d49cd86-acb9-fc5b-8eb7-e467b01ec...@lab.ntt.co.jp>
> I had tried the approach your patch takes and had noticed that the syntax
> had stopped accepting negative values (a regression test fails due to that).
...
> I had tried fixing that as well, but it didn't readily work.

Just adding negation would work, the same way a_expr does it.

> | '-' a_expr	%prec UMINUS
> { $$ = doNegate($2, @1); }

Boolean and negative values are accepted as partition bound
values with the attached patch.

> =# create table p (a bool, b int) partition by list(a);
> CREATE TABLE
> =# create table ct partition of p for values in (true) partition by list (b);
> CREATE TABLE
> =# create table ct1 partition of ct for values in (-1, -2);
> CREATE TABLE

And an illegal negation is rejected.

> =# create table ct3 partition of ct for values in (-true);
> ERROR:  operator does not exist: - boolean
> LINE 1: create table ct3 partition of ct for values in (-true);

Interval has a conversion to text, so this behaves somewhat oddly,
but it seems to me that we don't need to reject that explicitly.

> create table c2 partition of p for values in (interval '1 year');
> CREATE TABLE
> =# \d+ c2
...
> Partition of: p FOR VALUES IN ('1 year')

or

> =# create table c1 partition of p for values in (interval '1 day');
> ERROR:  specified value cannot be cast to type integer for column "a"
> LINE 1: create table c1 partition of p for values in (interval '1 da...

The attached patch adds lines for error checking in some
functions like transformWindowFuncCall.  They are basically
unreachable, as such constructs are rejected by the parser, but they
seem to be needed for consistency.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dd0c26c11b..710f7a9dc1 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2805,9 +2805,9 @@ hash_partbound:
 		;
 
 partbound_datum:
-			Sconst			{ $$ = makeStringConst($1, @1); }
-			| NumericOnly	{ $$ = makeAConst($1, @1); }
-			| NULL_P		{ $$ = makeNullAConst(@1); }
+			AexprConst		{ $$ = $1; }
+			| '-' AexprConst	%prec UMINUS
+				{ $$ = doNegate($2, @1); }
 		;
 
 partbound_datum_list:
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index 0307738946..61f59f7527 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -506,6 +506,13 @@ check_agglevels_and_constraints(ParseState *pstate, Node *expr)
 			else
 				err = _("grouping operations are not allowed in EXECUTE parameters");
 
+			break;
+		case EXPR_KIND_PARTITION_BOUNDS:
+			if (isAgg)
+				err = _("aggregate functions are not allowed in partition bounds");
+			else
+				err = _("grouping operations are not allowed in partition bounds");
+
 			break;
 		case EXPR_KIND_TRIGGER_WHEN:
 			if (isAgg)
@@ -908,6 +915,9 @@ transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
 			break;
 		case EXPR_KIND_PARTITION_EXPRESSION:
 			err = _("window functions are not allowed in partition key expressions");
+			break;
+		case EXPR_KIND_PARTITION_BOUNDS:
+			err = _("window functions are not allowed in partition bounds");
 			break;
 		case EXPR_KIND_CALL_ARGUMENT:
 			err = _("window functions are not allowed in CALL arguments");
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 38fbe3366f..a7f3d86f75 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -1850,6 +1850,9 @@ transformSubLink(ParseState *pstate, SubLink *sublink)
 		case EXPR_KIND_CALL_ARGUMENT:
 			err = _("cannot use subquery in CALL argument");
 			break;
+		case EXPR_KIND_PARTITION_BOUNDS:
+			err = _("cannot use subquery in partition bounds");
+			break;
 
 			/*
 			 * There is intentionally no default: case here, so that the
@@ -3474,6 +3477,8 @@ ParseExprKindName(ParseExprKind exprKind)
 			return "WHEN";
 		case EXPR_KIND_PARTITION_EXPRESSION:
 			return "PARTITION BY";
+		case EXPR_KIND_PARTITION_BOUNDS:
+			return "partition bounds";
 		case EXPR_KIND_CALL_ARGUMENT:
 			return "CALL";
 		case EXPR_KIND_MERGE_WHEN_AND:
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 615aee6d15..a39e26dd76 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -2306,6 +2306,9 @@ check_srf_call_placement(ParseState *pstate, Node *last_srf, int location)
 		case 

Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-09 Thread Andres Freund


On April 9, 2018 6:31:07 PM PDT, Alvaro Herrera  wrote:
>Andres Freund wrote:
>
>> Another approach, that's simpler to implement, is to simply have a
>> second selfpipe, just for WL_POSTMASTER_DEATH.
>
>Would it work to use this second pipe, to which each child writes a
>byte
>that postmaster never reads, and then rely on SIGPIPE when postmaster
>dies?  Then we never need to do a syscall.

I'm not following, could you expand on what you're suggesting?  Note that you 
do not get SIGPIPE for already buffered writes.  Which syscall can we avoid?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-09 Thread Alvaro Herrera
Andres Freund wrote:

> Another approach, that's simpler to implement, is to simply have a
> second selfpipe, just for WL_POSTMASTER_DEATH.

Would it work to use this second pipe, to which each child writes a byte
that postmaster never reads, and then rely on SIGPIPE when postmaster
dies?  Then we never need to do a syscall.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-05 12:20:38 -0700, Andres Freund wrote:
> > While it's not POSIX, at least some platforms are capable of delivering
> > a separate signal on parent process death.  Perhaps using that where
> > available would be enough of an answer.
> 
> Yea, that'd work on linux. Which is probably the platform 80-95% of
> performance critical PG workloads run on.  There's
> JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE on windows, which might also work,
> but I'm not sure it provides enough opportunity for cleanup.

I coincidentally got pinged about our current approach causing
performance problems on FreeBSD and started writing a patch.  The
problem there appears to be that constantly attaching events to the read
pipe end, from multiple processes, causes significant contention inside
the kernel. Which isn't that surprising.   That's distinct from the
problem netbsd/openbsd reported a while back (superfluous wakeups).

That person said he'd work on adding an equivalent of linux'
prctl(PR_SET_PDEATHSIG) to FreeBSD.

It's not particularly hard to whip something up that kind of
works. Getting some of the details right isn't quite as clear,
however:

It's trivial to make PostmasterIsAlive() cheap. In my prototype I've
set up PR_SET_PDEATHSIG to deliver SIGINFO to postmaster children. That
sets parent_potentially_died to true. PostmasterIsAlive() only runs the
main body when it's true, making it very cheap in the common case.
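
(The setup in question reduces to roughly the following sketch. The signal
choice and names here are mine, not the prototype's, and the getppid()
re-check closes the race where the postmaster dies before prctl() takes
effect:)

#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <unistd.h>

static volatile sig_atomic_t parent_potentially_died = 0;

static void
parent_death_handler(int signo)
{
	parent_potentially_died = 1;
}

/* Call in each child right after fork(). */
static void
setup_parent_death_signal(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = parent_death_handler;
	sigaction(SIGUSR2, &sa, NULL);

	prctl(PR_SET_PDEATHSIG, SIGUSR2);

	/* The parent may already have died before prctl() ran. */
	if (getppid() == 1)
		parent_potentially_died = 1;
}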

What I'm however not quite as clear on is how to best change
WL_POSTMASTER_DEATH.

One approach I can come up with is to use the self-pipe for
both WL_POSTMASTER_DEATH and WL_LATCH_SET. As we can cheaply check if
either set->latch->is_set or parent_potentially_died, re-using the
selfpipe works well enough.  What I'm however not quite sure about is
how to best do so with epoll() - there we explicitly do not iterate over
all registered events, but use epoll_event->data to get just the wait
event associated with a readiness event.  Therefore we have to keep track
of whether a WaitEventSet has both WL_POSTMASTER_DEATH and WL_LATCH_SET
registered, and check in both WL_POSTMASTER_DEATH/WL_LATCH_SET whether
the event is being waited for.

Another approach, that's simpler to implement, is to simply have a
second selfpipe, just for WL_POSTMASTER_DEATH.


A third question is whether we want to keep postmaster_alive_fds if we
have PR_SET_PDEATHSIG. I'd want to continue re-checking that the parent
is actually dead, but we could also do so by kill()ing postmaster like
we used to do before 89fd72cbf26f5d2e3d86ab19c1ead73ab8fac0fe.  It's not
bad to get rid of unnecessary file descriptors.  That would imply a semantic
change however.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andreas Karlsson

On 04/09/2018 02:16 PM, Craig Ringer wrote:
> I'd like a middle ground where the kernel lets us register our interest
> and tells us if it lost something, without us having to keep eight
> million FDs open for some long period. "Tell us about anything that
> happens under pgdata/" or an inotify-style per-directory-registration
> option. I'd even say that's ideal.


Could there be a risk of a race condition here where fsync incorrectly
returns success before we get the notification that something went wrong?


Andreas



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Thomas Munro
On Tue, Apr 10, 2018 at 10:33 AM, Thomas Munro
 wrote:
> I wonder if anyone can tell us what Windows, AIX and HPUX do here.

I created a wiki page to track what we know (or think we know) about
fsync() on various operating systems:

https://wiki.postgresql.org/wiki/Fsync_Errors

If anyone has more information or sees mistakes, please go ahead and edit it.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Thomas Munro
On Tue, Apr 10, 2018 at 2:22 AM, Anthony Iliopoulos  wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>> Well, there seem to be kernels that seem to do exactly that already. At
>> least that's how I understand what this thread says about FreeBSD and
>> Illumos, for example. So it's not an entirely insane design, apparently.
>
> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does provide though indeed a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO even though their own writes were successfully persisted.

Right.  For anyone interested, here is the change you mentioned, and
an interesting one that came a bit earlier last year:

https://reviews.freebsd.org/rS316941 -- drop buffers after device goes away
https://reviews.freebsd.org/rS326029 -- update comment about EIO contract

Retrying may well be futile, but at least future fsync() calls won't
report success bogusly.  There may of course be more space-efficient
ways to represent that state as the comment implies, while never lying
to the user -- perhaps involving filesystem level or (pinned) inode
level errors that stop all writes until unmounted.  Something tells me
they won't resort to flakey fsync() error reporting.

I wonder if anyone can tell us what Windows, AIX and HPUX do here.

> [1] 
> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

Very interesting, thanks.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-04-09 Thread Andres Freund
Hi,

On 2017-06-20 13:01:35 -0700, Andres Freund wrote:
> For extensions it'd also be useful if it'd be possible to overwrite the
> error code.  E.g. for citus there's a distributed deadlock detector,
> running out of process because there's no way to interrupt lock waits
> locally, and we've to do some ugly hacking to generate proper error
> messages and code from another session.

What happened to this request?  Seems we're out of the crunch mode and
could round the feature out a little bit more...

Greetings,

Andres Freund



Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-04-09 Thread Daniel Gustafsson
> On 09 Apr 2018, at 02:47, Michael Paquier  wrote:
> 
> On Fri, Apr 06, 2018 at 11:18:34AM +0200, Daniel Gustafsson wrote:
>> Yep, I completely agree.  Attached are patches with the quotes removed and
>> rebased since Oids were taken etc.
> 
> I still find this idea interesting for plugin authors.  However, as
> feature freeze for v11 is in effect, I am marking the patch as returned
> with feedback.

Not sure what the feedback is though? The patch has been through thorough
review with all review comments addressed.

Will resubmit this for a v12 CF.

cheers ./daniel



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 2:25 PM, Tomas Vondra  wrote:
> 
> 
> 
> On 04/09/2018 11:08 PM, Andres Freund wrote:
>> Hi,
>> 
>> On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
>>> I can also imagine a master and standby that are similarly provisioned,
>>> and thus hit an out of disk error at around the same time, resulting in
>>> corruption on both, even if not the same corruption.
>> 
>> I think it's a grave mistake conflating ENOSPC issues (which we should
>> solve by making sure there's always enough space pre-allocated), with
>> EIO type errors.  The problem is different, the solution is different.

I'm happy to take your word for that.

> In any case, that certainly does not count as data corruption spreading
> from the master to standby.

Maybe not from the point of view of somebody looking at the code.  But a
user might see it differently.  If the data being loaded into the master
and getting replicated to the standby "causes" both to get corrupt, then
it seems like corruption spreading.  I put "causes" in quotes because there
is some argument to be made about "correlation does not prove cause" and so
forth, but it still feels like causation from an arms length perspective.
If there is a pattern of standby servers tending to fail more often right
around the time that the master fails, you'll have a hard time comforting
users, "hey, it's not technically causation."  If loading data into the
master causes the master to hit ENOSPC, and replicating that data to the
standby causes the standby to hit ENOSPC, and if the bug around ENOSPC has
not been fixed, then this looks like corruption spreading.

I'm certainly planning on taking a hard look at the disk allocation on my
standby servers right soon now.

mark




Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Michael Paquier
On Mon, Apr 09, 2018 at 09:36:40PM +0200, Magnus Hagander wrote:
> Applied, and pushed this way.

OK, thanks for the commit.
--
Michael




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 11:08 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
>> I can also imagine a master and standby that are similarly provisioned,
>> and thus hit an out of disk error at around the same time, resulting in
>> corruption on both, even if not the same corruption.
> 
> I think it's a grave mistake conflating ENOSPC issues (which we should
> solve by making sure there's always enough space pre-allocated), with
> EIO type errors.  The problem is different, the solution is different.
> 

In any case, that certainly does not count as data corruption spreading
from the master to standby.


-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
> I can also imagine a master and standby that are similarly provisioned,
> and thus hit an out of disk error at around the same time, resulting in
> corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should
solve by making sure there's always enough space pre-allocated), with
EIO type errors.  The problem is different, the solution is different.

Greetings,

Andres Freund



Re: Shared PostgreSQL libraries and symbol versioning

2018-04-09 Thread Tom Lane
Stephen Frost  writes:
> * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
>> On 4/5/18 02:04, Pavel Raiskup wrote:
>>> As a followup thought; there are probably two major obstacles ATM
>>> - the DSOs' symbols are not yet versioned, and
>>> - the build-system doesn't seem to know how to -lpq link against
>>> external libpq.so

> I've not looked but neither of those strike me as terribly difficult to
> overcome, assuming they need to be overcome.

I'm not excited about introducing YA cross-platform difference in our
library/linking behavior without fairly strong evidence that it's
necessary.  The available evidence points in the other direction.

As for linking against an external .so, commit dddfc4cb2 just went to
some lengths to make sure that that *wouldn't* happen.  But as long
as all the builds expect libpq.so to end up in the same place after
installation, I doubt it matters much what happens at build time.
You just need to control which build actually installs it.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 1:43 PM, Tomas Vondra  wrote:
> 
> 
> 
> On 04/09/2018 10:25 PM, Mark Dilger wrote:
>> 
>>> On Apr 9, 2018, at 12:13 PM, Andres Freund  wrote:
>>> 
>>> Hi,
>>> 
>>> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
 I think the simplest technological solution to this problem is to
 rewrite the entire backend and all supporting processes to use
 O_DIRECT everywhere.  To maintain adequate performance, we'll have to
 write a complete I/O scheduling system inside PostgreSQL.  Also, since
 we'll now have to make shared_buffers much larger -- since we'll no
 longer be benefiting from the OS cache -- we'll need to replace the
 use of malloc() with an allocator that pulls from shared_buffers.
 Plus, as noted, we'll need to totally rearchitect several of our
 critical frontend tools.  Let's freeze all other development for the
 next year while we work on that, and put out a notice that Linux is no
 longer a supported platform for any existing release.  Before we do
 that, we might want to check whether fsync() actually writes the data
 to disk in a usable way even with O_DIRECT.  If not, we should just
 de-support Linux entirely as a hopelessly broken and unsupportable
 platform.
>>> 
>>> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
>>> absurd, as is some of the proposed ways this is all supposed to
>>> work. But I think the case we're discussing is much closer to a near
>>> irresolvable corner case than anything else.
>>> 
>>> We're talking about the storage layer returning an irresolvable
>>> error. You're hosed even if we report it properly.  Yes, it'd be nice if
>>> we could report it reliably.  But that doesn't change the fact that what
>>> we're doing is ensuring that data is safely fsynced unless storage
>>> fails, in which case it's not safely fsynced anyway.
>> 
>> I was reading this thread up until now as meaning that the standby could
>> receive corrupt WAL data and become corrupted.  That seems a much bigger
>> problem than merely having the master become corrupted in some unrecoverable
>> way.  It is a long standing expectation that serious hardware problems on
>> the master can result in the master needing to be replaced.  But there has
>> not been an expectation that the one or more standby servers would be taken
>> down along with the master, leaving all copies of the database unusable.
>> If this bug corrupts the standby servers, too, then it is a whole different
>> class of problem than the one folks have come to expect.
>> 
>> Your comment reads as if this is a problem isolated to whichever server has
>> the problem, and will not get propagated to other servers.  Am I reading
>> that right?
>> 
>> Can anybody clarify this for non-core-hacker folks following along at home?
>> 
> 
> That's a good question. I don't see any guarantee it'd be isolated to
> the master node. Consider this example:
> 
> (0) checkpoint happens on the primary
> 
> (1) a page gets modified, a full-page gets written to WAL
> 
> (2) the page is written out to page cache
> 
> (3) writeback of that page fails (and gets discarded)
> 
> (4) we attempt to modify the page again, but we read the stale version
> 
> (5) we modify the stale version, writing the change to WAL
> 
> 
> The standby will get the full-page, and then a WAL from the stale page
> version. That doesn't seem like a story with a happy end, I guess. But I
> might be easily missing some protection built into the WAL ...

I can also imagine a master and standby that are similarly provisioned,
and thus hit an out of disk error at around the same time, resulting in
corruption on both, even if not the same corruption.  When choosing to
have one standby, or two standbys, or ten standbys, one needs to be able
to assume a certain amount of statistical independence between failures
on one server and failures on another.  If they are tightly correlated
dependent variables, then the conclusion that the probability of all
nodes failing simultaneously is vanishingly small becomes invalid.
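
To put rough numbers on that (hypothetical figures, assuming each node
fails independently with probability p over some window):

    P(all n nodes fail) = p^n              -- independence assumed
    p = 0.01, n = 3     =>  P = 0.000001   -- vanishingly small
    fully correlated failures  =>  P = 0.01, no better than a single node

Correlation wipes out the exponent, which is exactly the worry with
identically provisioned machines filling up together.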

mark


Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-09 Thread Tom Lane
I wrote:
> Michael Paquier  writes:
>> That takes care of the problem from the root of the directory, but when
>> doing the same from src/bin/ then the same issue shows up even if
>> src/Makefile is patched to handle install targets.

> Hm.  Not sure how far we want to go in that direction.  It's never really
> been the case that you could configure a maintainer-clean tree and then
> go and build in any random subdirectory without taking care of the
> generated headers.  In principle, we could do something like the hack
> Peter did with temp-install, where it's basically forced to be a
> prerequisite of "make check" in EVERY subdirectory and then we buy back
> some of the inefficiency by making it a no-op in child make runs.  Not
> sure that's a good thing though.

After further contemplation I decided that that was, in fact, the only
reasonable way to improve matters.  If we have multiple subdirectories
independently firing the "make generated-headers" action, then we have
parallel make hazards of just the same sort I was trying to prevent.
So it's really an all-or-nothing proposition.  The MAKELEVEL hack
plus wiring the prerequisite into the recursion rules is the best way
to make that happen.

Hence, done that way.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 10:25 PM, Mark Dilger wrote:
> 
>> On Apr 9, 2018, at 12:13 PM, Andres Freund  wrote:
>>
>> Hi,
>>
>> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>>> I think the simplest technological solution to this problem is to
>>> rewrite the entire backend and all supporting processes to use
>>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>>> we'll now have to make shared_buffers much larger -- since we'll no
>>> longer be benefiting from the OS cache -- we'll need to replace the
>>> use of malloc() with an allocator that pulls from shared_buffers.
>>> Plus, as noted, we'll need to totally rearchitect several of our
>>> critical frontend tools.  Let's freeze all other development for the
>>> next year while we work on that, and put out a notice that Linux is no
>>> longer a supported platform for any existing release.  Before we do
>>> that, we might want to check whether fsync() actually writes the data
>>> to disk in a usable way even with O_DIRECT.  If not, we should just
>>> de-support Linux entirely as a hopelessly broken and unsupportable
>>> platform.
>>
>> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
>> absurd, as is some of the proposed ways this is all supposed to
>> work. But I think the case we're discussing is much closer to a near
>> irresolvable corner case than anything else.
>>
>> We're talking about the storage layer returning an irresolvable
>> error. You're hosed even if we report it properly.  Yes, it'd be nice if
>> we could report it reliably.  But that doesn't change the fact that what
>> we're doing is ensuring that data is safely fsynced unless storage
>> fails, in which case it's not safely fsynced anyway.
> 
> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.  That seems a much bigger
> problem than merely having the master become corrupted in some unrecoverable
> way.  It is a long standing expectation that serious hardware problems on
> the master can result in the master needing to be replaced.  But there has
> not been an expectation that the one or more standby servers would be taken
> down along with the master, leaving all copies of the database unusable.
> If this bug corrupts the standby servers, too, then it is a whole different
> class of problem than the one folks have come to expect.
> 
> Your comment reads as if this is a problem isolated to whichever server has
> the problem, and will not get propagated to other servers.  Am I reading
> that right?
> 
> Can anybody clarify this for non-core-hacker folks following along at home?
> 

That's a good question. I don't see any guarantee it'd be isolated to
the master node. Consider this example:

(0) checkpoint happens on the primary

(1) a page gets modified, a full-page gets written to WAL

(2) the page is written out to page cache

(3) writeback of that page fails (and gets discarded)

(4) we attempt to modify the page again, but we read the stale version

(5) we modify the stale version, writing the change to WAL


The standby will get the full-page, and then a WAL from the stale page
version. That doesn't seem like a story with a happy end, I guess. But I
might be easily missing some protection built into the WAL ...

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
> Maybe. I'd certainly prefer automated recovery from a temporary I/O
> issues (like full disk on thin-provisioning) without the database
> crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the
kernel's hand on allocating disk space. In most cases the kernel
allocates the disk space immediately, even if delayed allocation is in
effect. For the cases where that's not the case (if there are current
ones, rather than just past bugs), we should be able to make sure that's
not an issue by pre-zeroing the data and/or using fallocate.
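
A minimal sketch of the fallocate() half of that idea (illustrative only;
real code would go through PostgreSQL's fd layer and error reporting):

#include <fcntl.h>

/*
 * Reserve on-disk space for [offset, offset + len) up front, so that
 * later buffered writes into that range cannot fail with ENOSPC at
 * writeback time.  Note that posix_fallocate() returns an error number
 * directly rather than setting errno.
 */
static int
reserve_file_space(int fd, off_t offset, off_t len)
{
	int			rc = posix_fallocate(fd, offset, len);

	if (rc != 0)				/* e.g. ENOSPC or EINTR */
		return rc;				/* caller decides: retry, ERROR, or PANIC */
	return 0;
}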

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 13:25:54 -0700, Mark Dilger wrote:
> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.

I don't see that as a real problem here. For one, the problematic
scenarios shouldn't readily apply; for another, WAL is checksummed.

There's the problem that a new basebackup would potentially become
corrupted however. And similarly pg_rewind.

Note that I'm not saying that we and/or linux shouldn't change
anything. Just that the apocalypse isn't here.


> Your comment reads as if this is a problem isolated to whichever server has
> the problem, and will not get propagated to other servers.  Am I reading
> that right?

I think that's basically right. There are cases where corruption could get
propagated, but they're not straightforward.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 10:04 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:
>> Isn't the expectation that when a fsync call fails, the next one will
>> retry writing the pages in the hope that it succeeds?
> 
> Some people expect that, I personally don't think it's a useful
> expectation.
> 

Maybe. I'd certainly prefer automated recovery from a temporary I/O
issues (like full disk on thin-provisioning) without the database
crashing and restarting. But I'm not sure it's worth the effort.

And most importantly, it's rather delusional to think the kernel
developers are going to be enthusiastic about that approach ...

>
> We should just deal with this by crash-recovery. The big problem I
> see is that you always need to keep a file descriptor open for
> pretty much any file written to inside and outside of postgres, to be
> guaranteed to see errors. And that'd solve that. Even if retrying
> would work, I'd advocate for that (I've done so in the past, and I've
> written code in pg that panics on fsync failure...).
> 

Sure. And it's likely way less invasive from kernel perspective.

>
> What we'd need to do however is to clear that bit during crash 
> recovery... Which is interesting from a policy perspective. Could be 
> that other apps wouldn't want that.
> 

IMHO it'd be enough if a remount clears it.

>
> I also wonder if we couldn't just somewhere read each relevant
> mounted filesystem's errseq value. Whenever checkpointer notices
> before finishing a checkpoint that it has changed, do a crash
> restart.
> 

Hmm, that's an interesting idea, and it's about the only thing that
would help us on older kernels. There's a wb_err in address_space, but
that's at the inode level. Not sure if there's something at the fs level.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 12:13 PM, Andres Freund  wrote:
> 
> Hi,
> 
> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>> I think the simplest technological solution to this problem is to
>> rewrite the entire backend and all supporting processes to use
>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>> we'll now have to make shared_buffers much larger -- since we'll no
>> longer be benefiting from the OS cache -- we'll need to replace the
>> use of malloc() with an allocator that pulls from shared_buffers.
>> Plus, as noted, we'll need to totally rearchitect several of our
>> critical frontend tools.  Let's freeze all other development for the
>> next year while we work on that, and put out a notice that Linux is no
>> longer a supported platform for any existing release.  Before we do
>> that, we might want to check whether fsync() actually writes the data
>> to disk in a usable way even with O_DIRECT.  If not, we should just
>> de-support Linux entirely as a hopelessly broken and unsupportable
>> platform.
> 
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.
> 
> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could
receive corrupt WAL data and become corrupted.  That seems a much bigger
problem than merely having the master become corrupted in some unrecoverable
way.  It is a long standing expectation that serious hardware problems on
the master can result in the master needing to be replaced.  But there has
not been an expectation that the one or more standby servers would be taken
down along with the master, leaving all copies of the database unusable.
If this bug corrupts the standby servers, too, then it is a whole different
class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has
the problem, and will not get propagated to other servers.  Am I reading
that right?

Can anybody clarify this for non-core-hacker folks following along at home?


mark




Re: Shared PostgreSQL libraries and symbol versioning

2018-04-09 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> On 4/5/18 02:04, Pavel Raiskup wrote:
> > Hello, for the support of multiple versions of PostgreSQL RPM packages on
> > one system, we are thinking about having only one libpq.so.5
> > (libecpg.so.6, libpgtypes.so.3 respectively) supported and about building
> > (linking) all the PostgreSQL package versions against that.  The pattern
> > would mean that we'd have to update the system-wide libraries before
> > adding support for new PostgreSQL major version and that the older
> > packages (say version <= 9.6) would be run against libs from newer PG
> > package (say libpq.so.5 from v10).
> > 
> > Do you think it is a good idea in general?
> 
> yes

Adding symbol versioning would probably be a good idea, yes, though we
don't have as much need as others since we've really had a rather stable
ABI for a long time.
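
(For readers following along: with GNU ld, symbol versioning is usually
added with a linker version script. A hypothetical sketch -- the file name
and version node below are made up, not something any packager ships:)

/* libpq.map -- hypothetical version script */
LIBPQ_5 {
	global:
		PQconnectdb;
		PQexec;
		PQfinish;
	local:
		*;
};

linked via something like: gcc -shared -Wl,--version-script=libpq.map ...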

> > As a followup thought; there are probably two major obstacles ATM
> > 
> >   - the DSOs' symbols are not yet versioned, and
> >   - the build-system doesn't seem to know how to -lpq link against
> > external libpq.so

I've not looked but neither of those strike me as terribly difficult to
overcome, assuming they need to be overcome.

> It's not clear to me why you would need these, given that Debian has
> been doing this for many years without this.

Huh?

objdump -T /usr/lib/x86_64-linux-gnu/libpq.so.5.10

...
00021a50 g    DF .text  014c  Base  pg_char_to_encoding
000270d0 g    DF .text  002c  Base  PQsslStruct
00013880 g    DF .text  01d2  Base  PQmakeEmptyPGresult
00017900 g    DF .text  0012  Base  PQmblen
0000ac20 g    DF .text  02c0  Base  PQencryptPasswordConn
0000fcc0 g    DF .text  00f3  Base  PQresetPoll
00014790 g    DF .text  00ef  Base  PQsendQuery
0001fc50 g    DF .text  0061  Base  initPQExpBuffer
000152a0 g    DF .text  0012  Base  PQsendDescribePortal
...

No, Debian doesn't do symbol versioning for libpq and I don't believe it
ever has.

There are other libraries in Debian that are using symbol
versioning, of course, but libpq isn't one of those.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Shared PostgreSQL libraries and symbol versioning

2018-04-09 Thread Peter Eisentraut
On 4/5/18 02:04, Pavel Raiskup wrote:
> Hello, for the support of multiple versions of PostgreSQL RPM packages on
> one system, we are thinking about having only one libpq.so.5
> (libecpg.so.6, libpgtypes.so.3 respectively) supported and about building
> (linking) all the PostgreSQL package versions against that.  The pattern
> would mean that we'd have to update the system-wide libraries before
> adding support for new PostgreSQL major version and that the older
> packages (say version <= 9.6) would be run against libs from newer PG
> package (say libpq.so.5 from v10).
> 
> Do you think it is a good idea in general?

yes

> As a followup thought; there are probably two major obstacles ATM
> 
>   - the DSOs' symbols are not yet versioned, and
>   - the build-system doesn't seem to know how to -lpq link against
> external libpq.so

It's not clear to me why you would need these, given that Debian has
been doing this for many years without this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:
> Isn't the expectation that when a fsync call fails, the next one will
> retry writing the pages in the hope that it succeeds?

Some people expect that, I personally don't think it's a useful
expectation.

We should just deal with this by crash-recovery.  The big problem I see
is that you always need to keep a file descriptor open for pretty much
any file written to inside and outside of postgres, to be guaranteed to
see errors. And that'd solve that.  Even if retrying would work, I'd
advocate for that (I've done so in the past, and I've written code in pg
that panics on fsync failure...).
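
A sketch of that discipline as code (hypothetical helper, not actual
PostgreSQL source):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Once fsync() has reported a failure, a retry cannot be trusted: the
 * kernel may already have marked the dirty pages clean, so a second
 * fsync() can "succeed" without the data ever reaching disk.  The only
 * safe escalation is a PANIC followed by WAL replay.
 */
static void
fsync_or_panic(int fd, const char *path)
{
	if (fsync(fd) != 0)
	{
		fprintf(stderr, "PANIC: fsync of \"%s\" failed: %s\n",
				path, strerror(errno));
		abort();				/* crash recovery replays from WAL */
	}
}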

What we'd need to do however is to clear that bit during crash
recovery... Which is interesting from a policy perspective. Could be
that other apps wouldn't want that.

I also wonder if we couldn't just somewhere read each relevant mounted
filesystem's errseq value. Whenever checkpointer notices before
finishing a checkpoint that it has changed, do a crash restart.


Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:
> On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
> > You could make the argument that it's OK to forget if the entire file
> > system goes away. But actually, why is that ok?
> 
> I was going to say that it'd be okay to clear the error flag on umount,
> since any opened files would prevent unmounting; but then I realized we
> need to consider the case of close()ing all FDs and then opening them
> later... in another process.

> On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
> > notification descriptor open, where the kernel would inject events
> > related to writeback failures of files under watch (potentially
> > enriched to contain info regarding the exact failed pages and
> > the file offset they map to).
> 
> For postgres that'd require backend processes to open() a file such that,
> following its close(), any writeback errors are "signalled" to the
> checkpointer process...

I don't think that's as hard as some people argued in this thread.  We
could very well open a pipe in postmaster with the write end open in
each subprocess, and the read end open only in checkpointer (and
postmaster, but unused there).  Whenever closing a file descriptor that
was dirtied in the current process, send it over the pipe to the
checkpointer. The checkpointer then can receive all those file
descriptors (making sure it's not above the limit, fsync()ing and close()ing
to make room if necessary).  The biggest complication would presumably
be to deduplicate the received file descriptors for the same file,
without losing track of any errors.

Even better, we could do so via a dedicated worker. That'd quite
possibly end up as a performance benefit.
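
One wrinkle: descriptors can't be transferred over a plain pipe; they have
to travel over an AF_UNIX socket via SCM_RIGHTS. A sketch of the sending
side (function name is made up, error handling trimmed):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Send one file descriptor over an AF_UNIX socketpair().  In the scheme
 * above, a backend would call this when closing a file it dirtied; the
 * checkpointer recvmsg()s the fd, fsync()s it, and only then close()s it.
 */
static int
send_fd_to_checkpointer(int sock, int fd_to_send)
{
	struct msghdr msg = {0};
	struct iovec iov;
	char		dummy = 'F';
	char		cbuf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr *cmsg;

	iov.iov_base = &dummy;		/* must carry at least one data byte */
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}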


> I was going to say that's fine for postgres, since it chdir()s into its
> basedir, but actually not fine for nondefault tablespaces..

I think it'd be fair to open PG_VERSION of all created
tablespaces. Would require some hooks to signal checkpointer (or
whichever process) to do so when creating one, but it shouldn't be too
hard.  Some people would complain because they can't do some nasty hacks
anymore, but it'd also save people's butts by preventing them from
accidentally unmounting.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 09:37 PM, Andres Freund wrote:
> 
> 
> On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos  wrote:
> 
>> I honestly do not expect that keeping around the failed pages will
>> be an acceptable change for most kernels, and as such the
>> recommendation
>> will probably be to coordinate in userspace for the fsync().
> 
> Why is that required? You could very well just keep around per-inode
> information about fatal failures that occurred. Report errors
> until that bit is explicitly cleared. Yes, that keeps some memory
> around until unmount if nobody clears it. But it's orders of
> magnitude less, and results in usable semantics.
>

Isn't the expectation that when a fsync call fails, the next one will
retry writing the pages in the hope that it succeeds?

Of course, it's also possible to do what you suggested, and simply mark
the inode as failed, in which case the next fsync can't possibly retry
the writes (e.g. after freeing some space on a thin-provisioned system),
but we'd get a reliable failure mode.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:
> 
> 
> On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos  wrote:
> 
> >I honestly do not expect that keeping around the failed pages will
> >be an acceptable change for most kernels, and as such the
> >recommendation
> >will probably be to coordinate in userspace for the fsync().
> 
> Why is that required? You could very well just keep around per-inode
> information about fatal failures that occurred. Report errors until that bit is
> explicitly cleared.  Yes, that keeps some memory around until unmount if 
> nobody clears it. But it's orders of magnitude less, and results in usable 
> semantics.

As discussed before, I think this could be acceptable, especially
if you pair it with an opt-in mechanism (only applications that
care to deal with this will have to), and would give it a shot.

Still need a way to deal with all other systems and prior kernel
releases that are eating fsync() writeback errors even over sync().

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
>> We already have dirty_bytes and dirty_background_bytes, for example. I
>> don't see why there couldn't be another limit defining how much dirty
>> data to allow before blocking writes altogether. I'm sure it's not that
>> simple, but you get the general idea - do not allow using all available
>> memory because of writeback issues, but don't throw the data away in
>> case it's just a temporary issue.
> 
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner. This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error. Apart from that, further interfaces would need to be provided
> for actually dealing with the error (again assuming non-transient
> issues that may not be fixed transparently and that temporary issues
> are taken care of by lower layers of the stack).
> 

I don't quite see how this is any different from other possible issues
when running multiple applications on the same system. One application
can generate a lot of dirty data, reaching dirty_bytes and forcing the
other applications on the same host to do synchronous writes.

Of course, you might argue that is a temporary condition - it will
resolve itself once the dirty pages get written to storage. In case of
an I/O issue, it is a permanent impact - it will not resolve itself
unless the I/O problem gets fixed.

I'm not sure what interfaces would need to be written. Possibly something
that says "drop dirty pages for these files" after the application gets
killed or something. That makes sense, of course.

>> Well, there seem to be kernels that seem to do exactly that already. At
>> least that's how I understand what this thread says about FreeBSD and
>> Illumos, for example. So it's not an entirely insane design, apparently.
> 
> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does indeed provide a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes succeed to be persisted.
> 

In my experience when you have a persistent I/O error on a device, it
likely affects all applications using that device. So unmounting the fs
to clear the dirty pages seems like an acceptable solution to me.

I don't see what else the application should do. In a way I'm suggesting
applications don't really want to be responsible for recovering (cleanup
of dirty pages etc.). We're more than happy to hand that over to the kernel,
e.g. because each kernel will do that differently. What we however do
want is reliable information about fsync outcome, which we need to
properly manage WAL, checkpoints etc.

> Ideally, you'd want a (potentially persistent) indication of error
> localized to a file region (mapping the corresponding failed writeback
> pages). NetBSD is already implementing fsync_ranges(), which could
> be a step in the right direction.
> 
>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
> 
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO). I think that durability is a rather complex
> cross-layer issue which has been grossly misunderstood similarly
> in the past (e.g. see [1]). It seems that both the OS and DB
> communities greatly benefit from a periodic reality check, and
> I see this as an opportunity for strengthening the IO stack in
> an end-to-end manner.
> 

Right. What I was getting to is that perhaps the current fsync()
behavior is not very practical for building actual applications.

> Best regards,
> Anthony
> 
> [1] 
> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
> 

Thanks. The paper looks interesting.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Warnings and uninitialized variables in TAP tests

2018-04-09 Thread Magnus Hagander
On Mon, Apr 9, 2018 at 3:15 AM, Michael Paquier  wrote:

> Hi all,
>
> While looking at the output of the TAP tests, I have seen warnings like
> the following:
> Use of uninitialized value $target_lsn in concatenation (.) or string at
> /home/foo/git/postgres/src/bin/pg_rewind/../../../src/test/perl/PostgresNode.pm line 1540.
> [...]
> ./src/bin/pg_basebackup/tmp_check/log/regress_log_010_pg_basebackup:Use
> of uninitialized value $str in print at
> /home/foo/git/postgres/src/bin/pg_basebackup/../../../src/test/perl/TestLib.pm line 244.
>
> The first one shows up something like 30 times, and the second only once.
>
> Attached is a patch to clean up all that.  In order to find all that, I
> simply ran the following and they pop up:
> find . -name "regress_log*" | xargs grep -i "uninitialized"
>
> What I have found is harmless, still they pollute the logs.
>

Applied, thanks.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:
> On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> > What about having buffered IO with implied fsync() atomicity via
> > O_SYNC?
> 
> You're kidding, right?  We could also just add sleep(30)'s all over the
> tree, and hope that that'll solve the problem.  There's a reason we
> don't permanently fsync everything. Namely that it'll be way too slow.

I am assuming you can apply the same principle of selectively using O_SYNC
at times and places that you'd currently actually call fsync().

Also assuming that you'd want to have a backwards-compatible solution for
all those kernels that don't keep the pages around, irrespective of future
fixes. Short of loading a kernel module and dealing with the problem directly,
the only other available options seem to be either O_SYNC, O_DIRECT or ignoring
the issue.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Justin Pryzby
On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
> You could make the argument that it's OK to forget if the entire file
> system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on umount, since any
opened files would prevent unmounting; but then I realized we need to consider
the case of close()ing all FDs and then opening them later... in another process.

I was going to say that's fine for postgres, since it chdir()s into its
basedir, but actually not fine for nondefault tablespaces..

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
> notification descriptor open, where the kernel would inject events
> related to writeback failures of files under watch (potentially
> enriched to contain info regarding the exact failed pages and
> the file offset they map to).

For postgres that'd require backend processes to open() a file such that,
following its close(), any writeback errors are "signalled" to the checkpointer
process...

Justin



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund


On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos  wrote:

>I honestly do not expect that keeping around the failed pages will
>be an acceptable change for most kernels, and as such the
>recommendation
>will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep around per-inode information
about fatal failures that occurred. Report errors until that bit is explicitly
cleared.  Yes, that keeps some memory around until unmount if nobody clears it. 
But it's orders of magnitude less, and results in usable semantics.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Magnus Hagander
On Mon, Apr 9, 2018 at 9:31 PM, Tom Lane  wrote:

> Magnus Hagander  writes:
> > Seems simple enough and the right thing to do, but I wonder if we should
> > really backpatch it. Yes, the behaviour is not great now, but there is also
> > a non-zero risk of breaking people's automated failover scripts if we
> > backpatch it, isn't it?
>
> Yeah, I'd vote against backpatching.  This doesn't seem like an essential
> fix.
>

Applied, and pushed this way.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Tom Lane
Magnus Hagander  writes:
> Seems simple enough and the right thing to do, but I wonder if we should
> really backpatch it. Yes, the behaviour is not great now, but there is also
> a non-zero risk of breaking people's automated failover scripts if we
> backpatch it, isn't it?

Yeah, I'd vote against backpatching.  This doesn't seem like an essential
fix.

regards, tom lane



Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Peter Geoghegan
On Mon, Apr 9, 2018 at 12:23 PM, Magnus Hagander  wrote:
> Seems simple enough and the right thing to do, but I wonder if we should
> really backpatch it. Yes, the behaviour is not great now, but there is also
> a non-zero risk of breaking people's automated failover scripts if we
> backpatch it, isn't it?

+1 to committing to master as a documented behavioral change.

-- 
Peter Geoghegan



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> What about having buffered IO with implied fsync() atomicity via
> O_SYNC?

You're kidding, right?  We could also just add sleep(30)'s all over the
tree, and hope that that'll solve the problem.  There's a reason we
don't permanently fsync everything. Namely that it'll be way too slow.

Greetings,

Andres Freund



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:
> Honestly I don't think there's *any* way to use the current interface
> to implement reliable operation. Even that embedded database using a
> single process and keeping every file open all the time (which means
> file descriptor limits limit its scalability) can be having silent
> corruption whenever some other process like a backup program comes
> along and calls fsync (or even sync?).

That is indeed true (sync would induce fsync on open inodes and clear
the error), and that's a nasty bug that apparently went unnoticed for
a very long time. Hopefully the errseq_t fixes in Linux 4.13 deal with at
least this issue, but similar fixes need to be adopted by many other
kernels (all those that mark failed pages as clean).

I honestly do not expect that keeping around the failed pages will
be an acceptable change for most kernels, and as such the recommendation
will probably be to coordinate in userspace for the fsync().

What about having buffered IO with implied fsync() atomicity via O_SYNC?
This would probably necessitate some helper threads that mask the
latency and present an async interface to the rest of PG, but sounds
less intrusive than going for DIO.
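
The core of the idea in code (a sketch; a real implementation would keep
the fd open and hand the writes to helper threads as described):

#include <fcntl.h>
#include <unistd.h>

/*
 * With O_SYNC, every write() behaves as if followed by fsync(), so an
 * I/O failure surfaces synchronously as the write()'s own return value
 * instead of being eaten by later writeback.  The cost is that every
 * write waits for storage, hence the helper threads mentioned above.
 */
static ssize_t
write_o_sync(const char *path, const void *buf, size_t len)
{
	int			fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
	ssize_t		n;

	if (fd < 0)
		return -1;
	n = write(fd, buf, len);	/* errno is meaningful right here */
	close(fd);
	return n;
}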

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Peter Geoghegan
On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund  wrote:
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

+1

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

Right. We seem to be implicitly assuming that there is a big
difference between a problem in the storage layer that we could in
principle detect, but don't, and any other problem in the storage
layer. I've read articles claiming that technologies like SMART are
not really reliable in a practical sense [1], so it seems to me that
there is reason to doubt that this gap is all that big.

That said, I suspect that the problems with running out of disk space
are serious practical problems. I have personally scoffed at stories
involving Postgres database corruption that gets attributed to
running out of disk space. Looks like I was dead wrong.

[1] https://danluu.com/file-consistency/ -- "Filesystem correctness"
-- 
Peter Geoghegan



Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Magnus Hagander
On Mon, Apr 9, 2018 at 7:11 AM, Michael Paquier  wrote:

> Hi all,
>
> I was just going through pg_rewind's code, and noticed the following
> pearl:
> /*
>  * Don't allow pg_rewind to be run as root, to avoid overwriting the
>  * ownership of files in the data directory. We need only check for root
>  * -- any other user won't have sufficient permissions to modify files in
>  * the data directory.
>  */
> #ifndef WIN32
> if (geteuid() == 0)
> {
>     fprintf(stderr, _("cannot be executed by \"root\"\n"));
>     fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
>             progname);
> }
> #endif
>
> While that's nice to inform the user about the problem, it actually
> does not prevent pg_rewind from running as root.  Attached is a patch, which
> needs a back-patch down to 9.5.
>
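
(For readers skimming: the fix is presumably just to bail out after
printing the message, along these lines -- a sketch, not the committed
patch:)

    if (geteuid() == 0)
    {
        fprintf(stderr, _("cannot be executed by \"root\"\n"));
        fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
                progname);
        exit(1);            /* the missing piece: actually refuse to run */
    }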

Seems simple enough and the right thing to do, but I wonder if we should
really backpatch it. Yes, the behaviour is not great now, but there is also
a non-zero risk of breaking people's automated failover scripts if we
backpatch it, isn't it?


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 08:29 PM, Mark Dilger wrote:
> 
>> On Apr 9, 2018, at 10:26 AM, Joshua D. Drake  wrote:
> 
>> We have plenty of YEARS of people not noticing this issue
> 
> I disagree.  I have noticed this problem, but blamed it on other things.
> For over five years now, I have had to tell customers not to use thin
> provisioning, and I have had to add code to postgres to refuse to perform
> inserts or updates if the disk volume is more than 80% full.  I have lost
> count of the number of customers who are running an older version of the
> product (because they refuse to upgrade) and come back with complaints that
> they ran out of disk and now their database is corrupt.  All this time, I
> have been blaming this on virtualization and thin provisioning.
> 

Yeah. There's a big difference between not noticing an issue because it
does not happen very often vs. attributing it to something else. If we
had the ability to revisit past data corruption cases, we would probably
discover a fair number of cases caused by this.

The other thing we probably need to acknowledge is that the environment
is changing significantly - things like thin provisioning are likely to get
even more common, increasing the incidence of these issues.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: using index or check in ALTER TABLE SET NOT NULL

2018-04-09 Thread Sergei Kornilov
Hello
I noticed my patch does not apply again. Here is an update against current HEAD.

regards, Sergei

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index bd22627..db98a98 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -215,8 +215,15 @@ WITH ( MODULUS numeric_literal, REM
 
  
   These forms change whether a column is marked to allow null
-  values or to reject null values.  You can only use SET
-  NOT NULL when the column contains no null values.
+  values or to reject null values.
+ 
+
+ 
+  You can only use SET NOT NULL when the column
+  contains no null values. A full table scan is performed to check
+  this condition unless there are valid CHECK
+  constraints that prohibit NULL values for the specified column.
+  In the latter case the table scan is skipped.
  
 
  
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 0f5932f..6fd3663 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1153,7 +1153,7 @@ check_default_allows_bound(Relation parent, Relation default_rel,
 	 * not contain any row that would belong to the new partition, we can
 	 * avoid scanning the default partition.
 	 */
-	if (PartConstraintImpliedByRelConstraint(default_rel, def_part_constraints))
+	if (ConstraintImpliedByRelConstraint(default_rel, def_part_constraints))
 	{
 		ereport(INFO,
 (errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
@@ -1197,8 +1197,8 @@ check_default_allows_bound(Relation parent, Relation default_rel,
 			 * that it will not contain any row that would belong to the new
 			 * partition, we can avoid scanning the child table.
 			 */
-			if (PartConstraintImpliedByRelConstraint(part_rel,
-	 def_part_constraints))
+			if (ConstraintImpliedByRelConstraint(part_rel,
+ def_part_constraints))
 			{
 ereport(INFO,
 		(errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 43b2fce..11dc92c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -160,7 +160,7 @@ typedef struct AlteredTableInfo
 	/* Information saved by Phases 1/2 for Phase 3: */
 	List	   *constraints;	/* List of NewConstraint */
 	List	   *newvals;		/* List of NewColumnValue */
-	bool		new_notnull;	/* T if we added new NOT NULL constraints */
+	bool		verify_new_notnull; /* T if we should recheck NOT NULL */
 	int			rewrite;		/* Reason for forced rewrite, if any */
 	Oid			newTableSpace;	/* new tablespace; 0 means no change */
 	bool		chgPersistence; /* T if SET LOGGED/UNLOGGED is used */
@@ -372,6 +372,7 @@ static ObjectAddress ATExecDropNotNull(Relation rel, const char *colName, LOCKMO
 static void ATPrepSetNotNull(Relation rel, bool recurse, bool recursing);
 static ObjectAddress ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
  const char *colName, LOCKMODE lockmode);
+static bool NotNullImpliedByRelConstraints(Relation rel, Form_pg_attribute attr);
 static ObjectAddress ATExecColumnDefault(Relation rel, const char *colName,
 	Node *newDefault, LOCKMODE lockmode);
 static ObjectAddress ATExecAddIdentity(Relation rel, const char *colName,
@@ -4462,7 +4463,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
 			 * Test the current data within the table against new constraints
 			 * generated by ALTER TABLE commands, but don't rebuild data.
 			 */
-			if (tab->constraints != NIL || tab->new_notnull ||
+			if (tab->constraints != NIL || tab->verify_new_notnull ||
 tab->partition_constraint != NULL)
 ATRewriteTable(tab, InvalidOid, lockmode);
 
@@ -4623,7 +4624,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 	}
 
 	notnull_attrs = NIL;
-	if (newrel || tab->new_notnull)
+	if (newrel || tab->verify_new_notnull)
 	{
 		/*
 		 * If we are rebuilding the tuples OR if we added any new NOT NULL
@@ -5658,7 +5659,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 			 * an OID column.  OID will be marked not null, but since it's
 			 * filled specially, there's no need to test anything.)
 			 */
-			tab->new_notnull |= colDef->is_not_null;
+			tab->verify_new_notnull |= colDef->is_not_null;
 		}
 	}
 
@@ -6062,8 +6063,17 @@ ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
 
 		CatalogTupleUpdate(attr_rel, &tuple->t_self, tuple);
 
-		/* Tell Phase 3 it needs to test the constraint */
-		tab->new_notnull = true;
+		/*
+		 * Verifying table data can be done in Phase 3 with a single table scan
+		 * for all constraints (both existing and new).
+		 * We can skip this check only if every new NOT NULL constraint is
+		 * implied by existing table constraints.
+		 */
+		if (!NotNullImpliedByRelConstraints(rel, (Form_pg_attribute) 

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
> I think the simplest technological solution to this problem is to
> rewrite the entire backend and all supporting processes to use
> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
> write a complete I/O scheduling system inside PostgreSQL.  Also, since
> we'll now have to make shared_buffers much larger -- since we'll no
> longer be benefiting from the OS cache -- we'll need to replace the
> use of malloc() with an allocator that pulls from shared_buffers.
> Plus, as noted, we'll need to totally rearchitect several of our
> critical frontend tools.  Let's freeze all other development for the
> next year while we work on that, and put out a notice that Linux is no
> longer a supported platform for any existing release.  Before we do
> that, we might want to check whether fsync() actually writes the data
> to disk in a usable way even with O_DIRECT.  If not, we should just
> de-support Linux entirely as a hopelessly broken and unsupportable
> platform.

Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
absurd, as are some of the proposed ways this is all supposed to
work. But I think the case we're discussing is much closer to a near
irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable
error. You're hosed even if we report it properly.  Yes, it'd be nice if
we could report it reliably.  But that doesn't change the fact that what
we're doing is ensuring that data is safely fsynced unless storage
fails, in which case it's not safely fsynced anyway.

Greetings,

Andres Freund



Re: Verbosity of genbki.pl

2018-04-09 Thread Tom Lane
Andres Freund  writes:
> On 2018-04-08 13:33:42 -0400, Tom Lane wrote:
>> Traditionally genbki.pl has printed "Writing foo" for every file
>> it writes out.

>> 2. Print just one message like "Generating postgres.bki and related
>> files", and I guess a second one for fmgroids.h and related files.

> +0.5.

Hearing no votes against, done that way.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas  wrote:
> Ouch.  If a process exits -- say, because the user typed \q into psql
> -- then you're talking about potentially calling fsync() on a really
> large number of file descriptors, flushing many gigabytes of data to
> disk.  And it may well be that you never actually wrote any data to
> any of those file descriptors -- those writes could have come from
> other backends.  Or you may have written a little bit of data through
> those FDs, but there could be lots of other data that you end up
> flushing incidentally.  Perfectly innocuous things like starting up a
> backend, running a few short queries, and then having that backend
> exit suddenly turn into something that could have a massive
> system-wide performance impact.
>
> Also, if a backend ever manages to exit without running through this
> code, or writes any dirty blocks afterward, then this still fails to
> fix the problem completely.  I guess that's probably avoidable -- we
> can put this late in the shutdown sequence and PANIC if it fails.
>
> I have a really tough time believing this is the right way to solve
> the problem.  We suffered for years because of ext3's desire to flush
> the entire page cache whenever any single file was fsync()'d, which
> was terrible.  Eventually ext4 became the norm, and the problem went
> away.  Now we're going to deliberately insert logic to do a very
> similar kind of terrible thing because the kernel developers have
> decided that fsync() doesn't have to do what it says on the tin?  I
> grant that there doesn't seem to be a better option, but I bet we're
> going to have a lot of really unhappy users if we do this.

What about the bug we fixed in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e
?  Say somebody does something along the lines of:

ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9

...and then restarts postgres.  Craig's proposal wouldn't cover this
case, because there was no opportunity to run fsync() after the first
crash, and there's now no way to go back and fsync() any stuff we
didn't fsync() before, because the kernel may have already thrown away
the error state, or may lie to us and tell us everything is fine
(because our new fd wasn't opened early enough).  I can't find the
original discussion that led to that commit right now, so I'm not
exactly sure what scenarios we were thinking about.  But I think it
would at least be a problem if full_page_writes=off or if you had
previously started the server with fsync=off and now wish to switch to
fsync=on after completing a bulk load or similar.  Recovery can read a
page, see that it looks OK, and continue, and then a later fsync()
failure can revert that page to an earlier state and now your database
is corrupted -- and there's absolute no way to detect this because
write() gives you the new page contents later, fsync() doesn't feel
obliged to tell you about the error because your fd wasn't opened
early enough, and eventually the write can be discarded and you'll
revert back to the old page version with no errors ever being reported
anywhere.

Another consequence of this behavior is that initdb -S is never reliable,
so pg_rewind's use of it doesn't actually fix the problem it was
intended to solve.  It also means that initdb itself isn't crash-safe,
since the data file changes are made by the backend but initdb itself
is doing the fsyncs, and initdb has no way of knowing what files the
backend is going to create and therefore can't -- even theoretically
-- open them first.

What's being presented to us as the API contract that we should expect
from buffered I/O is that if you open a file and read() from it, call
fsync(), and get no error, the kernel may nevertheless decide that
some previous write that it never managed to flush can't be flushed,
and then revert the page to the contents it had at some point in the
past.  That's more or less equivalent to letting a malicious
adversary randomly overwrite database pages with plausible-looking but
incorrect contents without notice, and hoping you can still build a
reliable system.  You can avoid the problem if you can always open an
fd for every file you want to modify before it's written and hold on
to it until after it's fsync'd, but that's pretty hard to guarantee in
the face of kill -9.
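
In code, the only pattern that contract makes safe looks roughly like this
(a sketch; PostgreSQL's real write path spans multiple processes, which is
the whole problem):

#include <fcntl.h>
#include <unistd.h>

/*
 * The fd is opened before the first write() and held until after a
 * successful fsync(), so the kernel has somewhere to deliver a writeback
 * error.  kill -9 between the write() and the fsync() destroys the
 * invariant, which is exactly the point above.
 */
static int
durable_write(const char *path, const void *buf, size_t len)
{
	int			fd = open(path, O_WRONLY | O_CREAT, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len ||
		fsync(fd) != 0)			/* same fd observes any writeback error */
	{
		close(fd);
		return -1;
	}
	return close(fd);			/* close only after fsync succeeded */
}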

I think the simplest technological solution to this problem is to
rewrite the entire backend and all supporting processes to use
O_DIRECT everywhere.  To maintain adequate performance, we'll have to
write a complete I/O scheduling system inside PostgreSQL.  Also, since
we'll now have to make shared_buffers much larger -- since we'll no
longer be benefiting from the OS cache -- we'll need to replace the
use of malloc() with an allocator that pulls from shared_buffers.
Plus, as noted, we'll need to totally rearchitect several of our
critical frontend tools.  Let's freeze 

Re: Documentation for bootstrap data conversion

2018-04-09 Thread Tom Lane
John Naylor  writes:
> On 4/9/18, Tom Lane  wrote:
>> I did note that in some internal comments, but forgot it when writing
>> this.  I agree that now that the conversion is done, it'd be better
>> to remove that special case.  Would you send a patch for that?

> Sure, attached.

Pushed, thanks.  I fixed the sgml markup too.

regards, tom lane



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  
> wrote:
> > Robert Haas wrote:

> >> I don't get this.  The executor surely had to (and did) open all of
> >> the relations somewhere even before this patch.
> >
> > I was worried that this coding could be seen as breaking modularity, or
> > trying to do excessive work.  However, after looking closer at it, it
> > doesn't really look like it's the case.  So, nevermind.
> 
> Well what I'm saying is that it shouldn't be necessary.  If the
> relations are being opened already and the pointers to the relcache
> entries are being saved someplace, you shouldn't need to re-open them
> elsewhere to get pointers to the relcache entries.

Oh, okay.  I can look into that.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  
>> wrote:
>> > I had reservations about a relation_open() in the new executor code.  It
>> > seemed a bit odd; we don't have any other relation_open in the executor
>> > anywhere.  However, setting up the pruneinfo needs some stuff from
>> > relcache that I don't see a reasonable mechanism to pass through
>> > planner.  I asked Andres about it on IM and while he didn't endorse the
>> > patch in any way, his quick opinion was that "it wasn't entirely
>> > insane".  I verified that we already hold lock on the relation.
>>
>> I don't get this.  The executor surely had to (and did) open all of
>> the relations somewhere even before this patch.
>
> Yeah.
>
> I was worried that this coding could be seen as breaking modularity, or
> trying to do excessive work.  However, after looking closer at it, it
> doesn't really look like it's the case.  So, nevermind.

Well what I'm saying is that it shouldn't be necessary.  If the
relations are being opened already and the pointers to the relcache
entries are being saved someplace, you shouldn't need to re-open them
elsewhere to get pointers to the relcache entries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 10:26 AM, Joshua D. Drake  wrote:

> We have plenty of YEARS of people not noticing this issue

I disagree.  I have noticed this problem, but blamed it on other things.
For over five years now, I have had to tell customers not to use thin
provisioning, and I have had to add code to postgres to refuse to perform
inserts or updates if the disk volume is more than 80% full.  I have lost
count of the number of customers who are running an older version of the
product (because they refuse to upgrade) and come back with complaints that
they ran out of disk and now their database is corrupt.  All this time, I
have been blaming this on virtualization and thin provisioning.
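
That kind of guard can be as simple as a statvfs() check before accepting
writes (hypothetical helper; the 80% threshold is the one mentioned above):

#include <stdbool.h>
#include <sys/statvfs.h>

/*
 * Return true if the filesystem holding `path` is more than `pct` percent
 * full.  A caller would reject INSERTs/UPDATEs when this trips, well
 * before ENOSPC can bite at writeback time.
 */
static bool
volume_too_full(const char *path, double pct)
{
	struct statvfs sv;

	if (statvfs(path, &sv) != 0)
		return false;			/* can't tell; let the caller decide */
	return 100.0 * (1.0 - (double) sv.f_bavail / (double) sv.f_blocks) > pct;
}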

mark



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Alvaro Herrera
Robert Haas wrote:
> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  
> wrote:
> > I had reservations about a relation_open() in the new executor code.  It
> > seemed a bit odd; we don't have any other relation_open in the executor
> > anywhere.  However, setting up the pruneinfo needs some stuff from
> > relcache that I don't see a reasonable mechanism to pass through
> > planner.  I asked Andres about it on IM and while he didn't endorse the
> > patch in any way, his quick opinion was that "it wasn't entirely
> > insane".  I verified that we already hold lock on the relation.
> 
> I don't get this.  The executor surely had to (and did) open all of
> the relations somewhere even before this patch.

Yeah.

I was worried that this coding could be seen as breaking modularity, or
trying to do excessive work.  However, after looking closer at it, it
doesn't really look like it's the case.  So, nevermind.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Verbosity of genbki.pl

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-08 13:33:42 -0400, Tom Lane wrote:
> Traditionally genbki.pl has printed "Writing foo" for every file
> it writes out.

> 2. Print just one message like "Generating postgres.bki and related
> files", and I guess a second one for fmgroids.h and related files.

+0.5.

- Andres



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Heikki Linnakangas

On 09/04/18 18:21, Andrey Borodin wrote:
>> On 9 Apr 2018, at 19:50, Teodor Sigaev  wrote:
>>> 3. Why do we *not* lock the entry leaf page, if there is no
>>> match? We still need a lock to remember that we probed for that
>>> value and there was no match, so that we conflict with a tuple
>>> that might be inserted later. At least #3 is a bug. The attached
>>> patch adds an isolation test that demonstrates it. #1 and #2 are
>>> weird, and cause unnecessary locking, so I think we should fix
>>> those too, even if they don't lead to incorrect results.
>> I can't find a hole here. Agree.
> Please correct me if I'm wrong. Let's say we have posting trees for
> word A and word B. We are looking for a document that contains both.
> We will read through all posting tree of A, but only through some
> segments of B. If we will not find anything in B, we have to lock
> only segments where we actually were looking, not all the posting
> tree of B.

True, that works. It was not clear from the code or comments that that
was intended. I'm not sure if that's worthwhile, compared to locking
just the posting tree root block. I'll let Teodor decide..

> BTW I do not think that we lock ranges. We lock possibility of
> appearance of tuples that we might find. Ranges are shortcuts for
> places where we were looking.. That's how I understand, chances are
> I'm missing something.

Yeah, that's one way of thinking about it.

- Heikki



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Gasper Zejn
On 09. 04. 2018 15:42, Tomas Vondra wrote:
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
>> A crazy idea would be to have a daemon that checks the logs and
>> stops Postgres when it seems something wrong.
>>
> That doesn't seem like a very practical way. It's better than nothing,
> of course, but I wonder how would that work with containers (where I
> think you may not have access to the kernel log at all). Also, I'm
> pretty sure the messages do change based on kernel version (and possibly
> filesystem) so parsing it reliably seems rather difficult. And we
> probably don't want to PANIC after I/O error on an unrelated device, so
> we'd need to understand which devices are related to PostgreSQL.
>
> regards
>

For a somewhat less (or more) crazy idea: creating a Linux kernel module
that uses kprobe/kretprobe to capture the file passed to fsync(), or even
the byte range within the file, together with the corresponding return
value, shouldn't be that hard. Kprobe has been part of the Linux kernel
for a really long time, and at first glance it seems it could be
backported to 2.6 too.

Then you could have stable log messages, or implement some kind of "fsync
error log notification" via whatever is the most sane way to get this
out of the kernel.
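A minimal sketch of such a kretprobe module (the probed symbol "do_fsync"
is an assumption and varies across kernel versions):

#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/sched.h>

/* Log every fsync-family call that returns an error. */
static int
fsync_ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    int ret = regs_return_value(regs);

    if (ret < 0)
        pr_warn("fsync returned %d for pid %d\n", ret, current->pid);
    return 0;
}

static struct kretprobe fsync_probe = {
    .kp.symbol_name = "do_fsync",  /* assumption: name varies by kernel */
    .handler = fsync_ret_handler,
};

static int __init fsync_watch_init(void)
{
    return register_kretprobe(&fsync_probe);
}

static void __exit fsync_watch_exit(void)
{
    unregister_kretprobe(&fsync_probe);
}

module_init(fsync_watch_init);
module_exit(fsync_watch_exit);
MODULE_LICENSE("GPL");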

If the kernel is new enough and has eBPF support (seems like >=4.4),
using bcc-tools[1] should enable you to write a quick script to get
exactly that info via perf events[2].

Obviously, that's a stopgap solution ...


Kind regards,
Gasper


[1] https://github.com/iovisor/bcc
[2]
https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Joshua D. Drake

On 04/09/2018 09:45 AM, Robert Haas wrote:

> On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer wrote:
>> In the mean time, I propose that we fsync() on close() before we age FDs out
>> of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
>> we don't seem to have many better options. At least it'll only flush what we
>> actually wrote to the OS buffers not what we may have in shared_buffers. If
>> the bgwriter does the same thing, we should be 100% safe from this problem
>> on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
>> full_page_writes options that people can turn off if they know the risks /
>> know their storage is safe / don't care.
>
> I have a really tough time believing this is the right way to solve
> the problem.  We suffered for years because of ext3's desire to flush
> the entire page cache whenever any single file was fsync()'d, which
> was terrible.  Eventually ext4 became the norm, and the problem went
> away.  Now we're going to deliberately insert logic to do a very
> similar kind of terrible thing because the kernel developers have
> decided that fsync() doesn't have to do what it says on the tin?  I
> grant that there doesn't seem to be a better option, but I bet we're
> going to have a lot of really unhappy users if we do this.


I don't have a better option, but whatever we do, it should be an optional
(GUC) change. We have plenty of YEARS of people not noticing this issue,
and Robert's correct: if we go back to an era of things like stalls, it is
going to look bad on us no matter how we describe the problem.

Thanks,

JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *




Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Heikki Linnakangas

On 09/04/18 18:04, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> Remember, the purpose of predicate locks is to lock key ranges, not physical
>> pages or tuples in the index. We use leaf pages as handy shortcut for "any
>> key value that would belong on this page", but it is just an implementation
>> detail.
>
> Hmm ... so, thinking about pending list locking, would it work to
> acquire locks on the posting tree's root of each item in the pending
> list, when the item is put in the pending list? (even if we insert the
> item in the pending list instead of its posting tree).

Hmm, you mean, when inserting a new tuple? Yes, that would be correct. I
don't think it would perform very well, though. If you have to traverse
down to the posting trees anyway, then you might as well insert the new
tuples there directly, and forget about the pending list.


- Heikki



Re: Online enabling of checksums

2018-04-09 Thread Magnus Hagander
On Sat, Apr 7, 2018 at 6:22 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-04-07 08:57:03 +0200, Magnus Hagander wrote:
> > Note however that I'm sans-laptop until Sunday, so I will revert it then
> > or possibly Monday.
>
> I'll deactivate the isolationtester tests until then. They've been
> intermittently broken for days now and prevent other tests from being
> exercised.
>

Thanks.

I've pushed the revert now, and left the pg_verify_checksums in place for
the time being.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-09 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Apr 08, 2018 at 11:05:09PM -0400, Tom Lane wrote:
>> Hm.  I'd tested "make -j all", but not going directly to "install".
>> Does it help if you add
>> $(SUBDIRS:%=install-%-recurse): | submake-generated-headers
>> to src/Makefile?

> That takes care of the problem from the root of the directory, but when
> doing the same from src/bin/ then the same issue shows up even if
> src/Makefile is patched to handle install targets.

Hm.  Not sure how far we want to go in that direction.  It's never really
been the case that you could configure a maintainer-clean tree and then
go and build in any random subdirectory without taking care of the
generated headers.  In principle, we could do something like the hack
Peter did with temp-install, where it's basically forced to be a
prerequisite of "make check" in EVERY subdirectory and then we buy back
some of the inefficiency by making it a no-op in child make runs.  Not
sure that's a good thing though.

Independently of that, though, I noticed that part of the issue in your
"make install" example is that relpath.c, along with some other frontend
code, includes catalog/catalog.h which includes pg_class.h (and thereby
now pg_class_d.h).  Since src/common/Makefile thinks the only generated
header it needs to worry about is errcodes.h, building src/common first
now falls over.  But allowing frontend code to include catalog.h is
pretty insane: that pulls in relcache.h as well, and surely we don't
want to tie our hands to the point that relcache.h has to be frontend
clean.

What we need to do is take OIDCHARS and TABLESPACE_VERSION_DIRECTORY
out of catalog.h and put them in some more frontend-friendly header.
My first thought was pg_tablespace_d.h, but my second one is relpath.h.
Comments?
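For reference, the two definitions that would move are tiny and
self-contained (as currently defined in catalog.h):

#define OIDCHARS	10			/* max chars printed by %u */

#define TABLESPACE_VERSION_DIRECTORY	"PG_" PG_MAJORVERSION "_" \
									CppAsString2(CATALOG_VERSION_NO)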

regards, tom lane



Re: Optimization of range queries

2018-04-09 Thread Teodor Sigaev

Hi!

12 years ago I proposed a patch which could "union" OR clauses into one
range clause where possible. At that time pgsql could not use IS NULL as
an index clause, so the patch doesn't support that.

https://www.postgresql.org/message-id/flat/45742C51.9020602%40sigaev.ru

See option number 4); all the others are already committed.


Konstantin Knizhnik wrote:

Hi hackers,

Postgres optimizer is not able to build efficient execution plan for the 
following query:


explain select * from  people_raw where not ("ID"<2068113880 AND "INN" 
is not null) and "ID"<=2068629726 AND "INN" is not null;

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on people_raw  (cost=74937803.72..210449640.49 rows=121521030 width=336)
   Recheck Cond: ("ID" <= 2068629726)
   Filter: (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))
   ->  Bitmap Index Scan on "People_pkey"  (cost=0.00..74907423.47 rows=2077021718 width=0)
         Index Cond: ("ID" <= 2068629726)
(5 rows)


Here the table is very large, but the query affects only a relatively small
number of rows, located in the range [2068113880, 2068629726].

But unfortunately the optimizer doesn't take that into account.
Moreover, keeping both "INN" IS NOT NULL and "INN" IS NULL as operands
of the AND is not smart:

  (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))

If I remove the "is not null" condition, then the plan is perfect:

explain select * from people_raw where not ("ID"<2068113880) and "ID"<=2068629726;

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Index Scan using "People_pkey" on people_raw  (cost=0.58..196745.57 rows=586160 width=336)
   Index Cond: (("ID" >= 2068113880) AND ("ID" <= 2068629726))
(2 rows)

Before starting to investigate the problem, I would like to hear opinions,
and maybe some advice, from people familiar with the optimizer: how
difficult would it be to handle this case, and where should I look?

Thanks in advance,



--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer  wrote:
> In the mean time, I propose that we fsync() on close() before we age FDs out
> of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
> we don't seem to have many better options. At least it'll only flush what we
> actually wrote to the OS buffers not what we may have in shared_buffers. If
> the bgwriter does the same thing, we should be 100% safe from this problem
> on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
> full_page_writes options that people can turn off if they know the risks /
> know their storage is safe / don't care.

Ouch.  If a process exits -- say, because the user typed \q into psql
-- then you're talking about potentially calling fsync() on a really
large number of file descriptors, flushing many gigabytes of data to
disk.  And it may well be that you never actually wrote any data to
any of those file descriptors -- those writes could have come from
other backends.  Or you may have written a little bit of data through
those FDs, but there could be lots of other data that you end up
flushing incidentally.  Perfectly innocuous things like starting up a
backend, running a few short queries, and then having that backend
exit suddenly turn into something that could have a massive
system-wide performance impact.

Also, if a backend ever manages to exit without running through this
code, or writes any dirty blocks afterward, then this still fails to
fix the problem completely.  I guess that's probably avoidable -- we
can put this late in the shutdown sequence and PANIC if it fails.
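Concretely, that late shutdown step could look something like this sketch
(it assumes fd.c's VFD cache internals, i.e. VfdCache, SizeVfdCache and
VFD_CLOSED, and ignores retry semantics):

/*
 * Sketch only: at backend exit, fsync every still-open virtual FD and
 * PANIC if the kernel reports a failure, so the error cannot be lost.
 */
static void
FsyncAllOpenFilesOrPanic(void)
{
    Size        i;

    for (i = 1; i < SizeVfdCache; i++)
    {
        if (VfdCache[i].fd != VFD_CLOSED &&
            fsync(VfdCache[i].fd) != 0)
            elog(PANIC, "could not fsync file \"%s\": %m",
                 VfdCache[i].fileName);
    }
}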

I have a really tough time believing this is the right way to solve
the problem.  We suffered for years because of ext3's desire to flush
the entire page cache whenever any single file was fsync()'d, which
was terrible.  Eventually ext4 became the norm, and the problem went
away.  Now we're going to deliberately insert logic to do a very
similar kind of terrible thing because the kernel developers have
decided that fsync() doesn't have to do what it says on the tin?  I
grant that there doesn't seem to be a better option, but I bet we're
going to have a lot of really unhappy users if we do this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Optimization of range queries

2018-04-09 Thread Konstantin Knizhnik

Hi hackers,

Postgres optimizer is not able to build efficient execution plan for the 
following query:


explain select * from  people_raw where not ("ID"<2068113880 AND "INN" 
is not null) and "ID"<=2068629726 AND "INN" is not null;

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on people_raw  (cost=74937803.72..210449640.49 rows=121521030 width=336)
   Recheck Cond: ("ID" <= 2068629726)
   Filter: (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))
   ->  Bitmap Index Scan on "People_pkey"  (cost=0.00..74907423.47 rows=2077021718 width=0)
         Index Cond: ("ID" <= 2068629726)
(5 rows)


Here the table is very large, but the query affects only a relatively small
number of rows, located in the range [2068113880, 2068629726].

But unfortunately the optimizer doesn't take that into account.
Moreover, keeping both "INN" IS NOT NULL and "INN" IS NULL as operands
of the AND is not smart:

 (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))

If I remove the "is not null" condition, then the plan is perfect:

explain select * from people_raw where not ("ID"<2068113880) and "ID"<=2068629726;

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Index Scan using "People_pkey" on people_raw  (cost=0.58..196745.57 rows=586160 width=336)
   Index Cond: (("ID" >= 2068113880) AND ("ID" <= 2068629726))
(2 rows)

Before starting to investigate the problem, I would like to hear opinions,
and maybe some advice, from people familiar with the optimizer: how
difficult would it be to handle this case, and where should I look?

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-04-09 Thread Andrey Borodin
Hi!

Work on the patch goes on; there has been some off-list discussion of the
patch with the author.
The request for advice still stands.

I think that we should move this patch to the next CF, so I'm marking the
patch as Needs review.

Best regards, Andrey Borodin.


Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Robert Haas
On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  wrote:
> I had reservations about a relation_open() in the new executor code.  It
> seemed a bit odd; we don't have any other relation_open in the executor
> anywhere.  However, setting up the pruneinfo needs some stuff from
> relcache that I don't see a reasonable mechanism to pass through
> planner.  I asked Andres about it on IM and while he didn't endorse the
> patch in any way, his quick opinion was that "it wasn't entirely
> insane".  I verified that we already hold lock on the relation.

I don't get this.  The executor surely had to (and did) open all of
the relations somewhere even before this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Incremental sort

2018-04-09 Thread Alexander Korotkov
On Sat, Apr 7, 2018 at 11:57 PM, Tomas Vondra wrote:

> On 04/07/2018 06:23 PM, Tom Lane wrote:
> > Teodor Sigaev  writes:
> >>> I dunno, how would you estimate whether this is actually a win or not?
> >>> I don't think our model of sort costs is anywhere near refined enough
> >>> or accurate enough to reliably predict whether this is better than
> >>> just doing it in one step.  Even if the cost model is good, it's not
> >>> going to be better than our statistics about the number/size of the
> >>> groups in the first column(s), and that's a notoriously unreliable
> stat.
> >
> >> I think that improvement in cost calculation of sort should be a
> >> separate patch, not directly connected to this one. Postpone patches
> >> till other part will be ready to get max improvement for postponed ones
> >> doesn't seem to me very good, especially if it suggests some improvement
> >> right now.
> >
> > No, you misunderstand the point of my argument.  Without a reasonably
> > reliable cost model, this patch could easily make performance *worse*
> > not better for many people, due to choosing incremental-sort plans
> > where they were really a loss.
> >
>
> Yeah. Essentially the patch could push the planner to pick a path that
> has low startup cost (and very high total cost), assuming it'll only
> need to read small part of the input. But if the number of groups in the
> input is low (perhaps just one huge group), that would be a regression.
>

Yes, I think the biggest risk here is too small a number of groups.  More
precisely, the risk is groups that are too large, even while the total
number of groups might be large enough.

> > If we were at the start of a development cycle and work were being
> > promised to be done later in the cycle to improve the planning aspect,
> > I'd be more charitable about it.  But this isn't merely the end of a
> > cycle, it's the *last day*.  Now is not the time to commit stuff that
> > needs, or even just might need, follow-on work.
> >
>
> +1 to that
>
> FWIW I'm willing to spend some time on the patch for PG12, particularly
> on the planner / costing part. The potential gains are too interesting
>

Thank you very much for your efforts on reviewing this patch.
I'm looking forward work with you on this feature for PG12.

FWIW, I think that we're moving this patch in the right direction by
providing separate paths for incremental sort.  It's much better than
deciding between full or incremental sort in place.  For sure, considering
extra paths might cause a planning time regression.  But I think the
same statement is true about many other planning optimizations.
One thing we can do is to make enable_incrementalsort = off by
default.  Then only users who understand the importance of incremental
sort will pay the extra planning time.
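For illustration, that would just be an ordinary planner GUC; a sketch of
the guc.c boolean table entry (a fragment, not standalone code; the wording
and the off default are only a proposal):

/* Sketch of a ConfigureNamesBool entry in guc.c. */
{
	{"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
		gettext_noop("Enables the planner's use of incremental sort steps."),
		NULL
	},
	&enable_incrementalsort,
	false,
	NULL, NULL, NULL
},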

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Greg Stark
On 9 April 2018 at 15:22, Anthony Iliopoulos  wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to
control how much memory is allowed to be full of dirty buffers in the
first place to ensure that the system won't get memory starved if it
can't clean them fast enough. That isn't even about persistent
hardware errors. Even when the hardware is working perfectly it can
only flush buffers so fast.  The whole point of the kernel is to
abstract away shared resources. It's not like user space has any
better view of the situation here. If Postgres implemented all this in
DIRECT_IO it would have exactly the same problem only with less
visibility into what the rest of the system is doing. If every
application implemented its own buffer cache we would be back in the
same boat only with a fragmented memory allocation.

> This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error.

I still think we're speaking two different languages. There's no
application anywhere that's going to "clear the error". The
application has done the writes and if it's calling fsync it wants to
wait until the filesystem can arrange for the write to be persisted.
If the application could manage without the persistence then it
wouldn't have called fsync.

The only way to "clear out" the error would be by having the writes
succeed. There's no reason to think that wouldn't be possible
sometime. The filesystem could remap blocks or an administrator could
replace degraded raid device components. The only thing Postgres could
do to recover would be create a new file and move the data (reading
from the dirty buffer in memory!) to a new file anyways so we would
"clear the error" by just no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the
documentation promised, and it's what Postgres always expected. It
sounds like the kernel implementors looked at it as some kind of
communication channel to communicate status reports for specific writes
back to user-space. That's a much more complex problem and would have an
entirely different interface. I think this is why we're having so much
difficulty communicating.



> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does provide though indeed a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO even though their own writes were successfully persisted.

Well if they're writing to the same file that had a previous error I
doubt there are many applications that would be happy to consider
their writes "persisted" when the file was corrupt. Ironically the
earlier discussion quoted talked about how applications that wanted
more granular communication would be using O_DIRECT -- but what we
have is fsync trying to be *too* granular such that it's impossible to
get any strong guarantees about anything with it.

>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
>
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO).

Honestly I don't think there's *any* way to use the current interface
to implement reliable operation. Even an embedded database using a
single process and keeping every file open all the time (which means
file descriptor limits constrain its scalability) can suffer silent
corruption whenever some other process, like a backup program, comes
along and calls fsync (or even sync?).



-- 
greg



Re: [HACKERS] path toward faster partition pruning

2018-04-09 Thread Robert Haas
On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane  wrote:
> David Rowley  writes:
>> Sounds like you're saying that if we have too many alternative files
>> then there's a chance that one could pass by luck.
>
> Yeah, exactly: it passed, but did it pass for the right reason?
>
> If there's just two expected-files, it's likely not a big problem,
> but if you have a bunch it's something to worry about.
>
> I'm also wondering how come we had hash partitioning before and
> did not have this sort of problem.  Is it just that we added a
> new test that's more sensitive to the details of the hashing
> (if so, could it be made less so)?  Or is there actually more
> platform dependence now than before (and if so, why is that)?

The existing hash partitioning tests did have some dependencies on the
hash function, but they took care not to use the built-in hash
functions.  Instead they did stuff like this:

CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
$$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
CREATE TABLE mchash (a int, b text, c jsonb)
  PARTITION BY HASH (a test_int4_ops, b test_text_ops);

I think that this approach should also be used for the new tests.
Variant expected output files are a pain to maintain, and you
basically just have to take whatever output you get as the right
answer, because nobody knows what output a certain built-in hash
function should produce for a given input except by running the code.
If you do the kind of thing shown above, though, then you can easily
see by inspection that you're getting the right answer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Transform for pl/perl

2018-04-09 Thread Tom Lane
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
> Tom Lane  writes:
>> I think you'd have to convert to text and back.  That's kind of icky,
>> but it beats failing.

> I had a look, and that's what the PL/Python transform does.  Attached is
> a patch that does that for PL/Perl too, but only if the value is
> actually > PG_INT64_MAX.

> The secondary output files are for Perls with 32bit IV/UV types, but I
> haven't been able to test them, since Debian's Perl uses 64bit integers
> even on 32bit platforms.

Ugh.  I really don't want to maintain a separate expected-file for this,
especially not if it's going to be hard to test.  Can we choose another
way of exercising the code path?

Another issue with this code as written is that on 32-bit-UV platforms,
at least some compilers will give warnings about the constant-false
predicate.  Not sure about a good solution for that.  Maybe it's a
sufficient reason to invent uint8_numeric so we don't need a range check.

regards, tom lane



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Teodor Sigaev



Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> Remember, the purpose of predicate locks is to lock key ranges, not physical
>> pages or tuples in the index. We use leaf pages as handy shortcut for "any
>> key value that would belong on this page", but it is just an implementation
>> detail.
>
> Hmm ... so, thinking about pending list locking, would it work to
> acquire locks on the posting tree's root of each item in the pending
> list, when the item is put in the pending list? (even if we insert the
> item in the pending list instead of its posting tree).

Items in the pending list don't use a posting tree or posting list; the
pending list is just a list of pairs (heap ItemPointer, entry) represented
as IndexTuples. There is no order in the pending list, so Heikki suggests
always locking the metapage when fastupdate=on, instead of locking the
whole relation. If fastupdate is off, then the insertion process will not
change the metapage.
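In code terms, the two sides would pair up roughly like this sketch
(buffer and variable names are illustrative):

/* Scan side: with fastupdate on, a conflicting tuple may later arrive
 * via the pending list, and every pending-list insertion touches the
 * metapage, so a predicate lock on the metapage catches it. */
if (GinGetUseFastUpdate(index))
	PredicateLockPage(index, GIN_METAPAGE_BLKNO, snapshot);

/* Insert side: adding to the pending list checks for conflicts
 * against locks taken on the metapage. */
CheckForSerializableConflictIn(index, NULL, metabuffer);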


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Andrey Borodin

> On 9 Apr 2018, at 19:50, Teodor Sigaev wrote:
>> 
>> 3. Why do we *not* lock the entry leaf page, if there is no match? We still 
>> need a lock to remember that we probed for that value and there was no 
>> match, so that we conflict with a tuple that might be inserted later.
>> At least #3 is a bug. The attached patch adds an isolation test that 
>> demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so I 
>> think we should fix those too, even if they don't lead to incorrect results.
> 
> I can't find a hole here. Agree.
Please correct me if I'm wrong.
Let's say we have posting trees for word A and word B.
We are looking for a document that contains both.
We will read through all of the posting tree of A, but only through some
segments of B. If we find nothing in B, we have to lock only the segments
where we actually looked, not the whole posting tree of B.

BTW I do not think that we lock ranges. We lock the possibility of
appearance of tuples that we might find. Ranges are shortcuts for the
places where we were looking. That's how I understand it; chances are
I'm missing something.

Best regards, Andrey Borodin.


Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Alvaro Herrera
Heikki Linnakangas wrote:

> Remember, the purpose of predicate locks is to lock key ranges, not physical
> pages or tuples in the index. We use leaf pages as handy shortcut for "any
> key value that would belong on this page", but it is just an implementation
> detail.

Hmm ... so, thinking about pending list locking, would it work to
acquire locks on the posting tree's root of each item in the pending
list, when the item is put in the pending list? (even if we insert the
item in the pending list instead of its posting tree).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transform for pl/perl

2018-04-09 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
>> I tried fixing this by adding an 'if (SvUV(in))' clause to
>> SV_to_JsonbValue, but I couldn't find a function to create a numeric
>> value from an uint64.  If it's not possible, should we error on UVs
>> greater than PG_INT64_MAX?
>
> I think you'd have to convert to text and back.  That's kind of icky,
> but it beats failing.

I had a look, and that's what the PL/Python transform does.  Attached is
a patch that does that for PL/Perl too, but only if the value is
actually > PG_INT64_MAX.
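The core of the conversion is just a few lines in SV_to_JsonbValue, roughly
like this sketch (the local variable names are illustrative):

/* Sketch: route a UV above PG_INT64_MAX through text into numeric. */
if (SvUOK(in) && SvUV(in) > (UV) PG_INT64_MAX)
{
	char		buf[32];

	snprintf(buf, sizeof(buf), "%llu", (unsigned long long) SvUV(in));
	out.type = jbvNumeric;
	out.val.numeric = DatumGetNumeric(
		DirectFunctionCall3(numeric_in,
							CStringGetDatum(buf),
							ObjectIdGetDatum(InvalidOid),
							Int32GetDatum(-1)));
}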

The secondary output files are for Perls with 32bit IV/UV types, but I
haven't been able to test them, since Debian's Perl uses 64bit integers
even on 32bit platforms.

> Or we could add a not-visible-to-SQL uint8-to-numeric function in
> numeric.c.  Not sure if this is enough use-case to justify that
> though.

I don't think this one use-case is enough, but it's worth keeping in
mind if it keeps cropping up.

- ilmari
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen

>From acf968b4df81797fc06868dac87123413f3f4167 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Thu, 5 Apr 2018 16:23:59 +0100
Subject: [PATCH] Handle integers > PG_INT64_MAX in PL/Perl JSONB transform

---
 .../jsonb_plperl/expected/jsonb_plperl.out|  15 +-
 .../jsonb_plperl/expected/jsonb_plperl_1.out  | 223 ++
 .../jsonb_plperl/expected/jsonb_plperlu.out   |  15 +-
 .../jsonb_plperl/expected/jsonb_plperlu_1.out | 223 ++
 contrib/jsonb_plperl/jsonb_plperl.c   |  20 +-
 contrib/jsonb_plperl/sql/jsonb_plperl.sql |   9 +
 contrib/jsonb_plperl/sql/jsonb_plperlu.sql|  10 +
 7 files changed, 512 insertions(+), 3 deletions(-)
 create mode 100644 contrib/jsonb_plperl/expected/jsonb_plperl_1.out
 create mode 100644 contrib/jsonb_plperl/expected/jsonb_plperlu_1.out

diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out
index 99a2e8e135..c311a603f0 100644
--- a/contrib/jsonb_plperl/expected/jsonb_plperl.out
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out
@@ -52,6 +52,19 @@ SELECT testRegexpResultToJsonb();
  0
 (1 row)
 
+CREATE FUNCTION testUVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+as $$
+$val = ~0;
+return $val;
+$$;
+SELECT testUVToJsonb();
+testuvtojsonb 
+--
+ 18446744073709551615
+(1 row)
+
 CREATE FUNCTION roundtrip(val jsonb) RETURNS jsonb
 LANGUAGE plperl
 TRANSFORM FOR TYPE jsonb
@@ -207,4 +220,4 @@ SELECT roundtrip('{"1": {"2": [3, 4, 5]}, "2": 3}');
 
 \set VERBOSITY terse \\ -- suppress cascade details
 DROP EXTENSION plperl CASCADE;
-NOTICE:  drop cascades to 6 other objects
+NOTICE:  drop cascades to 7 other objects
diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl_1.out b/contrib/jsonb_plperl/expected/jsonb_plperl_1.out
new file mode 100644
index 00..c425c73b9c
--- /dev/null
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl_1.out
@@ -0,0 +1,223 @@
+CREATE EXTENSION jsonb_plperl CASCADE;
+NOTICE:  installing required extension "plperl"
+CREATE FUNCTION testHVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = {a => 1, b => 'boo', c => undef};
+return $val;
+$$;
+SELECT testHVToJsonb();
+  testhvtojsonb  
+-
+ {"a": 1, "b": "boo", "c": null}
+(1 row)
+
+CREATE FUNCTION testAVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = [{a => 1, b => 'boo', c => undef}, {d => 2}];
+return $val;
+$$;
+SELECT testAVToJsonb();
+testavtojsonb
+-
+ [{"a": 1, "b": "boo", "c": null}, {"d": 2}]
+(1 row)
+
+CREATE FUNCTION testSVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = 1;
+return $val;
+$$;
+SELECT testSVToJsonb();
+ testsvtojsonb 
+---
+ 1
+(1 row)
+
+-- this revealed a bug in the original implementation
+CREATE FUNCTION testRegexpResultToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+return ('1' =~ m(0\t2));
+$$;
+SELECT testRegexpResultToJsonb();
+ testregexpresulttojsonb 
+-
+ 0
+(1 row)
+
+CREATE FUNCTION testUVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+as $$
+$val = ~0;
+return $val;
+$$;
+SELECT testUVToJsonb();
+ testuvtojsonb 
+---
+ 4294967295
+(1 row)
+
+CREATE FUNCTION roundtrip(val jsonb) RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+return $_[0];
+$$;
+SELECT roundtrip('null');
+ roundtrip 
+---
+ null
+(1 row)
+
+SELECT roundtrip('1');
+ roundtrip 
+---
+ 1
+(1 row)
+
+SELECT roundtrip('1E+131071');
+ERROR:  

Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Alexander Korotkov
Hi!

Thank you for taking a look at this patch.  I really appreciate your
attention over complex subjects like this.

On Mon, Apr 9, 2018 at 1:33 PM, Heikki Linnakangas wrote:
> On 28/03/18 19:53, Teodor Sigaev wrote:
>> As I understand, scan should lock any visited page, but now it's true
>> only for posting tree. Seems, it also should lock pages in entry tree
>> because concurrent processes could add new entries which could be
>> matched by partial search, for example. BTW, partial search (see
>> collectMatchBitmap()) locks the entry tree correctly, but regular
>> startScanEntry() doesn't lock the entry page in case of a posting
>> tree, only in case of a posting list.
>
> I think this needs some high-level comments or README to explain how the
> locking works. It seems pretty ad hoc at the moment. And incorrect.

I agree that the explanation in the README is insufficient.

> 1. Why do we lock all posting tree pages, even though they all represent
> the same value? Isn't it enough to lock the root of the posting tree?
>
> 2. Why do we lock any posting tree pages at all, if we lock the entry tree
> page anyway? Isn't the lock on the entry tree page sufficient to cover the
> key value?
>

I already expressed similar concerns in [1].  The idea of locking posting
tree leaves was to get more granular locks.  I think you've correctly
described it in the commit message here:

With a very large posting tree, it would possibly be better to lock the
posting tree leaf pages instead, so that with a "skip scan" and a query
like "A & B" you could avoid unnecessary conflict if a new tuple is
inserted with A but !B. But let's keep this simple.

However, it's very complex and error prone.  So, +1 for simplifying it for v11.


> 3. Why do we *not* lock the entry leaf page, if there is no match? We
> still need a lock to remember that we probed for that value and there was
> no match, so that we conflict with a tuple that might be inserted later.
>

+1

> At least #3 is a bug. The attached patch adds an isolation test that
> demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so I
> think we should fix those too, even if they don't lead to incorrect results.
>
> Remember, the purpose of predicate locks is to lock key ranges, not
> physical pages or tuples in the index. We use leaf pages as handy shortcut
> for "any key value that would belong on this page", but it is just an
> implementation detail.
>
> I took a stab at fixing those issues, as well as the bug when fastupdate
> is turned on concurrently. Does the attached patch look sane to you?


Teodor has already answered you, and I'd like to mention that your patch
looks good to me too.

1. https://www.postgresql.org/message-id/CAPpHfdvED_-7KbJp-ei4zRZu1brLgkJt4CA-uxF0iRO9WX2Sqw%40mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: lazy detoasting

2018-04-09 Thread Chapman Flack
On 04/08/2018 02:01 AM, Andrew Gierth wrote:

>  Chapman> (d) some other reason I haven't thought of ?
> 
> It has to be pushed as the active snapshot so that it's the snapshot
> that the executor uses to run the query to populate the tuplestore which
> becomes the "held" portal content.

That seems closest to my (b) guess.

> I think you're confusing the stack of active snapshots with the registry
> of snapshots. The snapshot of a portal that's not currently running in
> the executor won't be on the stack, but it will be registered (which is
> enough to maintain the session's reported xmin, which is what prevents
> visible data from being vacuumed away).

That's why I was asking. The first paragraph in snapmgr.c seems to say
that the registry and the active stack both exist as ways to keep the
snapshot alive, with reachability from either one being sufficient:

* We keep track of snapshots in two ways: those "registered" by
* resowner.c, and the "active snapshot" stack.  All snapshots in
* either of them live in persistent memory.  When a snapshot is
* no longer in any of these lists (tracked by separate refcounts
* on each snapshot), its memory can be freed.

AFAICS, that is *all* that comment block has to say about why there's
an active snapshot stack. I believe you are saying it has another
important function, namely that its top element is what tells the
executor what can be seen. That makes sense, and to clarify that
was why I was asking.
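Spelled out in the usual calling pattern (a minimal sketch), the two roles
are visible side by side:

/* Registration keeps the snapshot -- and the session's xmin -- alive;
 * pushing it is what makes the executor actually scan with it. */
Snapshot	snap = RegisterSnapshot(GetTransactionSnapshot());

PushActiveSnapshot(snap);
/* ... run the query that fills the held portal's tuplestore ... */
PopActiveSnapshot();			/* no longer what the executor sees */
UnregisterSnapshot(snap);		/* refcount gone; memory may be freed */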

I suppose the reason that's not mentioned in the comments is that it
was so obviously the ultimate purpose of the whole scheme that nobody
writing or reviewing the comments could imagine not knowing it. :)

-Chap



Re: WIP: Covering + unique indexes.

2018-04-09 Thread Teodor Sigaev

Thanks to both of you, pushed

Shinoda, Noriyoshi wrote:

Hi!

Thank you for your response.

I think your proposal is good.

Regards,

Noriyoshi Shinoda

From: Alexander Korotkov [mailto:a.korot...@postgrespro.ru]
Sent: Monday, April 9, 2018 11:22 PM
To: Shinoda, Noriyoshi
Cc: PostgreSQL Hackers; Teodor Sigaev; Peter Geoghegan; Jeff Janes;
Anastasia Lubennikova
Subject: Re: WIP: Covering + unique indexes.

Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi wrote:

I tested this feature and found a documentation gap for the columns added to
the pg_constraint catalog.
The attached patch will add a description of the 'conincluding' column to
the manual of the pg_constraint catalog.

Thank you for pointing this!

I think we need more wordy explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com 
The Russian Postgres Company



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Teodor Sigaev

Hi!


> 1. Why do we lock all posting tree pages, even though they all represent the
> same value? Isn't it enough to lock the root of the posting tree?
> 2. Why do we lock any posting tree pages at all, if we lock the entry tree page
> anyway? Isn't the lock on the entry tree page sufficient to cover the key value?
> 3. Why do we *not* lock the entry leaf page, if there is no match? We still need
> a lock to remember that we probed for that value and there was no match, so that
> we conflict with a tuple that might be inserted later.
>
> At least #3 is a bug. The attached patch adds an isolation test that
> demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so I think
> we should fix those too, even if they don't lead to incorrect results.

I can't find a hole here. Agree.

> I took a stab at fixing those issues, as well as the bug when fastupdate is
> turned on concurrently. Does the attached patch look sane to you?

I like the idea of using metapage locking, thank you. The patch seems good;
will you push it?

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



RE: WIP: Covering + unique indexes.

2018-04-09 Thread Shinoda, Noriyoshi
Hi!

Thank you for your response.
I think your proposal is good.

Regards,
Noriyoshi Shinoda

From: Alexander Korotkov [mailto:a.korot...@postgrespro.ru]
Sent: Monday, April 9, 2018 11:22 PM
To: Shinoda, Noriyoshi
Cc: PostgreSQL Hackers; Teodor Sigaev; Peter Geoghegan; Jeff Janes;
Anastasia Lubennikova
Subject: Re: WIP: Covering + unique indexes.

Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi wrote:
I tested this feature and found a documentation gap for the columns added to
the pg_constraint catalog.
The attached patch will add a description of the 'conincluding' column to the
manual of the pg_constraint catalog.

Thank you for pointing this!
I think we need more wordy explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: pruning disabled for array, enum, record, range type partition keys

2018-04-09 Thread Tom Lane
Amit Langote  writes:
> I noticed that the newly added pruning does not work if the partition key
> is of one of the types that have a corresponding pseudo-type.

While I don't recall the details due to acute caffeine shortage,
there are specific coding patterns that are used in the planner
(e.g. in indxpath.c) to ensure that the code works with pseudotype
opclasses as well as simple ones.  The existence of this bug
indicates that those conventions were not followed in the pruning
code.  I wonder whether this patch makes the pruning code look
more like the pre-existing code, or even less like it.

regards, tom lane



Re: WIP: Covering + unique indexes.

2018-04-09 Thread Alexander Korotkov
Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <
noriyoshi.shin...@hpe.com> wrote:

> I tested this feature and found a documentation gap for the columns added
> to the pg_constraint catalog.
> The attached patch will add a description of the 'conincluding' column
> to the manual of the pg_constraint catalog.
>

Thank you for pointing this!
I think we need more wordy explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg_constraint-2.patch
Description: Binary data


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>
> We already have dirty_bytes and dirty_background_bytes, for example. I
> don't see why there couldn't be another limit defining how much dirty
> data to allow before blocking writes altogether. I'm sure it's not that
> simple, but you get the general idea - do not allow using all available
> memory because of writeback issues, but don't throw the data away in
> case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie"
pages may occupy. Not sure how helpful it would be in the long run
since this tends to be highly application-specific, and for something
with a large data footprint one would end up tuning this accordingly
in a system-wide manner. This has the potential to leave other
applications running in the same system with very little memory, in
cases where for example original application crashes and never clears
the error. Apart from that, further interfaces would need to be provided
for actually dealing with the error (again assuming non-transient
issues that may not be fixed transparently and that temporary issues
are taken care of by lower layers of the stack).

> Well, there seem to be kernels that seem to do exactly that already. At
> least that's how I understand what this thread says about FreeBSD and
> Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right
there (since 2017), mentioning that there can be no recovery from
EIO at the block layer and this needs to be done differently. No
idea how an application running on top of either FreeBSD or Illumos
would actually recover from this error (and clear it out), other
than remounting the fs in order to force dropping of relevant pages.
It does provide though indeed a persistent error indication that
would allow Pg to simply reliably panic. But again this does not
necessarily play well with other applications that may be using
the filesystem reliably at the same time, and are now faced with
EIO even though their own writes were successfully persisted.

Ideally, you'd want a (potentially persistent) indication of error
localized to a file region (mapping the corresponding failed writeback
pages). NetBSD is already implementing fsync_range(), which could
be a step in the right direction.

> One has to wonder how many applications actually use this correctly,
> considering PostgreSQL cares about data durability/consistency so much
> and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have
a very simple process model (e.g. embedded DBs that can abort a
txn on fsync() EIO). I think that durability is a rather complex
cross-layer issue which has been grossly misunderstood similarly
in the past (e.g. see [1]). It seems that both the OS and DB
communities greatly benefit from a periodic reality check, and
I see this as an opportunity for strengthening the IO stack in
an end-to-end manner.

Best regards,
Anthony

[1] 
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf


