Re: [HACKERS] GSoC 2017: weekly progress reports (week 4) and patch for hash index

2018-04-09 Thread Amit Kapila
On Sat, Apr 7, 2018 at 7:47 PM, Teodor Sigaev  wrote:
> Thanks to everyone, pushed
>

Thanks!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Problem while setting the fpw with SIGHUP

2018-04-09 Thread Kyotaro HORIGUCHI

At Fri, 6 Apr 2018 17:59:58 +0530, Amit Kapila  wrote 
in 
> On Fri, Apr 6, 2018 at 1:50 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello.
> >
> > At Wed, 04 Apr 2018 17:26:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <[email protected]>
> >> > In general, I was wondering why in the first place this variable
> >> > (full_page_writes) is a SIGHUP variable?  I think if the user tries to
> >> > switch it to 'on' from 'off', it won't guarantee the recovery from
> >> > torn pages.  Yeah, one can turn it to 'off' from 'on' without any
> >> > problem, but as the reverse doesn't guarantee anything, it can confuse
> >> > users. What do you think?
> >>
> >> I tend to agree with you. It works as expected after the next
> >> checkpoint. So define the variable as "it can be changed any time
> >> but has an effect at the next checkpoint time", then remove
> >> XLOG_FPW_CHANGE record. And that eliminates the problem of
> >> concurrent calls since the checkpointer becomes the only modifier
> >> of the variable. And the problematic function
> >> UpdateFullPageWrites also no longer needs to write a WAL
> >> record. The information is conveyed only by checkpoint records.
> >
> > I noticed that XLOG_FPW_CHANGE(fpw=false) is still required by
> > pg_start/stop_backup to detect FPW being turned off without waiting
> > for the next checkpoint record. But XLOG_FPW_CHANGE(true) is not
> > required, since no one uses that information. It can even be harmful
> > when it is written in the wrong place.
> >
> > In the attached patch, shared fullPageWrites is updated only at
> > REDO point
> >
> 
> I am not completely sure if that is the right option because this
> would mean that we are defining the new scope for a GUC variable.  I
> guess we should take the input of others as well.  I am not sure what
> is the right way to do that, but maybe we can start a new thread with
> a proper subject and description rather than discussing this under
> some related bug fix patch discussion.  I guess we can try that after
> CF unless some other people pitch in and share their feedback.

I'd like to refrain from starting a new thread, since this topic is
registered as an open item (in the Old Bugs section). Or would starting
a new thread and then relinking it from that page be preferable?

I'm a bit surprised that this conflicted so soon. While rebasing, for
anyone's information, I considered fixing the comments and documentation,
but all existing comments and documentation can be read as describing
both the old and the new behavior. That said, the patch contains a
small addition about the new behavior.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fe0401335e0ac1a9ae36dbdb5c2db82f9081a1a3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 6 Apr 2018 13:57:48 +0900
Subject: [PATCH] Change FPW handling

The GUC full_page_writes currently takes effect immediately. That
creates a race condition between config reload on the checkpointer and
StartupXLOG. But since full page images are meaningful only when they
are attached to all WAL records following a checkpoint, there is no
problem if we update the shared FPW only at the REDO point.  On the
other hand, the online backup mechanism on a standby requires knowing
whether FPW is turned off before the next checkpoint record arrives.

As a result, with this patch a change of full_page_writes takes effect
at the REDO point, and an additional XLOG_FPW_CHANGE record is written
only when it is turned off. This is sufficient for standby backup to
work properly, reduces complexity, and prevents the race condition.
---
 doc/src/sgml/config.sgml  |   4 +-
 src/backend/access/transam/xlog.c | 114 +-
 src/backend/postmaster/checkpointer.c |   6 --
 src/include/access/xlog.h |   1 -
 src/include/catalog/pg_control.h  |   2 +-
 5 files changed, 34 insertions(+), 93 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d5f2d23c4..7ea42c25e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2598,7 +2598,9 @@ include_dir 'conf.d'

 This parameter can only be set in the postgresql.conf
 file or on the server command line.
-The default is on.
+
+The default is on. A change to this parameter takes
+effect at the next checkpoint.

   
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4a47395174..ba9cf07d38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -202,14 +202,6 @@ static XLogRecPtr LastRec;
 static XLogRecPtr receivedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
-/*
- * During recovery, lastFullPageWrites keeps track of full_page_writes that
- * the replayed WAL records indicate. It's initialized with full_page_writes
- * that the recovery starting checkpoint record indicates, and then updated
- * each time XLOG_FPW_CHANGE record is replayed.
- *

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Greg Stark
On 8 April 2018 at 22:47, Anthony Iliopoulos  wrote:
> On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
>> On 8 April 2018 at 04:27, Craig Ringer  wrote:
>> > On 8 April 2018 at 10:16, Thomas Munro 
>
> The question is, what should the kernel and application do in cases
> where this is simply not possible (according to freebsd that keeps
> dirty pages around after failure, for example, -EIO from the block
> layer is a contract for unrecoverable errors so it is pointless to
> keep them dirty). You'd need a specialized interface to clear-out
> the errors (and drop the dirty pages), or potentially just remount
> the filesystem.

Well, firstly, that's not necessarily the question. ENOSPC is not an
unrecoverable error. And even an unrecoverable error for a single write
doesn't mean the write will never be able to succeed in the future.
But secondly, doesn't such an interface already exist? When the device
is dropped, any dirty pages already get dropped with it. What's the
point in dropping them but keeping the failing device?

But just to underline the point: "pointless to keep them dirty" is
exactly backwards from the application's point of view. If the error
writing to persistent media really is unrecoverable, then it's all the
more critical that the pages be kept so the data can be copied to some
other device. The last thing user space expects is that if the data
can't be written to persistent storage, it is also immediately deleted
from RAM. (And the *really* last thing user space expects is for this
to happen with no error returned.)
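A minimal sketch of the application-side consequence (my illustration, not PostgreSQL code): given these kernel semantics, the only safe reaction to an fsync() failure is to treat the data as lost and bail out, never to retry, since a later fsync() can report success without the data ever having reached stable storage:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write buf durably to path.  Returns 0 on durable success, -1 on any
 * failure.  On an fsync() failure the caller must assume the page cache
 * copy is no longer trustworthy. */
int durable_write(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
	{
		/* Do NOT retry fsync(): the kernel may already have marked the
		 * dirty pages clean (or dropped them), so a retry can falsely
		 * report success. */
		int save_errno = errno;

		close(fd);
		errno = save_errno;
		return -1;
	}
	return close(fd);
}
```

In a real system the failure branch would escalate (e.g. PANIC and recover from WAL) rather than just return an error code; the sketch only shows the "no retry" rule.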

-- 
greg



Re: Verbosity of genbki.pl

2018-04-09 Thread John Naylor
On 4/9/18, Teodor Sigaev  wrote:
>> 1. Print nothing at all.  That's more in keeping with our modern
>> build practices, but maybe it's too big a change?
>>
>> 2. Print just one message like "Generating postgres.bki and related
>> files", and I guess a second one for fmgroids.h and related files.
>>
>> I don't have a strong preference.  Opinions?
>
> The second option, please. I'd like to see some sign of each stage being done

The attached patch does #2.

-John Naylor
diff --git a/src/backend/catalog/Catalog.pm b/src/backend/catalog/Catalog.pm
index 3b3bb6b..02bd6d4 100644
--- a/src/backend/catalog/Catalog.pm
+++ b/src/backend/catalog/Catalog.pm
@@ -340,7 +340,6 @@ sub RenameTempFile
 	my $final_name = shift;
 	my $extension  = shift;
 	my $temp_name  = $final_name . $extension;
-	print "Writing $final_name\n";
 	rename($temp_name, $final_name) || die "rename: $temp_name: $!";
 }
 
diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index d0afcc7..2689f73 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -217,6 +217,7 @@ my %lookup_kind = (
 
 # Generate postgres.bki, postgres.description, postgres.shdescription,
 # and pg_*_d.h headers.
+print "Generating BKI files and symbol definition headers...\n";
 
 # version marker for .bki file
 print $bki "# PostgreSQL $major_version\n";
diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl
index 3b112c6..4f5af79 100644
--- a/src/backend/utils/Gen_fmgrtab.pl
+++ b/src/backend/utils/Gen_fmgrtab.pl
@@ -84,6 +84,8 @@ my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol(
 my $INTERNALlanguageId = Catalog::FindDefinedSymbolFromData(
 	$catalog_data{pg_language}, 'INTERNALlanguageId');
 
+print "Generating fmgrtab.c, fmgroids.h, and fmgrprotos.h...\n";
+
 # Collect certain fields from pg_proc.dat.
 my @fmgr = ();
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index b267c19..f953460 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -275,7 +275,6 @@ s{PG_VERSION_STR "[^"]+"}{PG_VERSION_STR "PostgreSQL $self->{strver}$extraver, c
 			'fmgrtab.c', '../../../src/include/access/transam.h')
 		)
 	{
-		print "Generating fmgrtab.c, fmgroids.h, fmgrprotos.h...\n";
 		system(
 "perl -I ../catalog Gen_fmgrtab.pl -I../../../src/include/ $pg_language_dat $pg_proc_dat");
 	}
@@ -479,7 +478,6 @@ EOF
 'src/backend/catalog/postgres.bki',
 "src/include/catalog/$bki"))
 		{
-			print "Generating BKI files and symbol definition headers...\n";
 			chdir('src/backend/catalog');
 			my $bki_srcs = join(' ../../../src/include/catalog/', @bki_srcs);
 			system("perl genbki.pl --set-version=$self->{majorver} $bki_srcs");


Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-09 Thread Amit Langote
Hi Andreas.

On 2018/04/08 3:33, Andreas Seltenreich wrote:
> Hi,
> 
> testing with master at 039eb6e92f yielded another query triggering an
> assertion.

Thanks for the report.

> Backtrace and query against the regression database below.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x7f25474cf42a in __GI_abort () at abort.c:89
> #2  0x556c14b75bb3 in ExceptionalCondition (
> conditionName=conditionName@entry=0x556c14d11510 "!(((context) != ((void 
> *)0) && (const Node*)((context)))->type) == T_AllocSetContext) || 
> const Node*)((context)))->type) == T_SlabContext) || const 
> Node*)((context)))->type) == T_Generatio"..., 
> errorType=errorType@entry=0x556c14bcac95 "BadArgument", 
> fileName=fileName@entry=0x556c14d11480 
> "../../../../src/include/utils/memutils.h", lineNumber=lineNumber@entry=129)
> at assert.c:54
> #3  0x556c14ba28cb in GetMemoryChunkContext (pointer=) at 
> ../../../../src/include/utils/memutils.h:129
> #4  pfree (pointer=) at mcxt.c:1033
> #5  0x556c1495fc01 in bms_int_members (a=, b= out>) at bitmapset.c:917
> #6  0x556c149d910a in perform_pruning_combine_step 
> (context=0x7ffe80889b20, step_results=0x7f253e3efcc0, cstep=0x7f253e3f13a0)
> at partprune.c:2697
> #7  get_matching_partitions (context=0x7ffe80889b20, pruning_steps= out>) at partprune.c:317
> #8  0x556c149d9596 in prune_append_rel_partitions 
> (rel=rel@entry=0x7f253e3eb3e8) at partprune.c:262
> #9  0x556c14989e21 in set_append_rel_size (rte=0x7f254819d810, rti=2, 
> rel=0x7f253e3eb3e8, root=0x7f254819d3c8) at allpaths.c:907

[  ]

> select
>   sample_0.dd as c0,
>   subq_1.c3 as c1,
>   subq_1.c0 as c2,
>   subq_1.c2 as c3,
>   subq_1.c3 as c4,
>   sample_0.bb as c5,
>   subq_1.c0 as c6,
>   pg_catalog.pg_current_wal_flush_lsn() as c7,
>   public.func_with_bad_set() as c8,
>   sample_0.bb as c9,
>   sample_0.aa as c10,
>   sample_0.dd as c11
> from
>   public.d as sample_0 tablesample bernoulli (2.8) ,
>   lateral (select
>   subq_0.c1 as c0,
>   sample_0.aa as c1,
>   subq_0.c0 as c2,
>   sample_0.cc as c3,
>   subq_0.c0 as c4,
>   subq_0.c1 as c5
>   from
>   (select
> sample_1.a as c0,
> (select s from public.reloptions_test limit 1 offset 2)
>as c1
>   from
> public.pagg_tab_ml as sample_1 tablesample system (3.6)
>   where select c from public.test_tbl3 limit 1 offset 2)
>  <= cast(null as test_type3))
> or (((select n from testxmlschema.test2 limit 1 offset 1)
><= true)
>   or (sample_0.bb is not NULL)))
>   and ((true)
> or ((cast(null as varbit) >= (select varbitcol from 
> public.brintest limit 1 offset 3)
>   )
>   and ((select macaddrcol from public.brintest limit 1 offset 
> 6)
><> cast(null as macaddr)
> or ((sample_1.a is NULL)
>   and ((sample_1.c is not NULL)
> or (sample_1.c is NULL as subq_0
>   where (select salary from public.rtest_emp limit 1 offset 3)
>  = (select pg_catalog.min(newsal) from public.rtest_emplog)
> 
>   limit 13) as subq_1
> where sample_0.aa is NULL
> limit 140;

I have reproduced this and found that the problem is that
perform_pruning_combine_step forgets to *copy* the bitmapset of the first
step in the handling of a COMBINE_INTERSECT step.
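The shape of the bug, reduced to a self-contained toy (Set, combine_intersect, etc. are illustrative names, not the partprune.c code): the combine step aliased the first step's set and then ran a destructive intersection on it, so cached step results could be scribbled on or freed. Taking a copy first, as the patch does with bms_copy(), keeps the cached results intact:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct { size_t n; int *vals; } Set;

Set *set_make(const int *vals, size_t n)
{
	Set *s = malloc(sizeof(Set));

	s->n = n;
	s->vals = malloc(n * sizeof(int));
	memcpy(s->vals, vals, n * sizeof(int));
	return s;
}

/* Deep copy, analogous to bms_copy(). */
Set *set_copy(const Set *src) { return set_make(src->vals, src->n); }

void set_free(Set *s) { free(s->vals); free(s); }

/* Keep only members of 'a' also present in 'b'.  Like bms_int_members(),
 * this destructively modifies 'a' in place. */
void set_intersect_inplace(Set *a, const Set *b)
{
	size_t out = 0;

	for (size_t i = 0; i < a->n; i++)
		for (size_t j = 0; j < b->n; j++)
			if (a->vals[i] == b->vals[j])
			{
				a->vals[out++] = a->vals[i];
				break;
			}
	a->n = out;
}

/* The fixed combine step: start from a *copy* of the first step's set,
 * so the destructive intersection never touches storage still owned by
 * the cached step results. */
Set *combine_intersect(Set *steps[], size_t nsteps)
{
	Set *result = set_copy(steps[0]);	/* the bms_copy() analogue */

	for (size_t i = 1; i < nsteps; i++)
		set_intersect_inplace(result, steps[i]);
	return result;
}
```

With the aliasing version (`result = steps[0]`), the caller's first step would be mutated or freed behind its back — the same ownership confusion that tripped the assertion in pfree().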

Attached fixes that.  I see that Michael Paquier has added this to the
open items list.  Thanks, Michael.

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 417e1fee81..0f0080928a 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -2923,7 +2923,8 @@ perform_pruning_combine_step(PartitionPruneContext *context,
 				if (firststep)
 				{
 					/* Copy step's result the first time. */
-					result->bound_offsets = step_result->bound_offsets;
+					result->bound_offsets =
+						bms_copy(step_result->bound_offsets);
 					result->scan_null = step_result->scan_null;
 					result->scan_default = step_result->scan_default;
 					firststep = false;


Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-09 Thread Amit Langote
On 2018/04/09 17:50, Amit Langote wrote:
> Attached fixes that.  I see that Michael Paquier has added this to the
> open items list.  Thanks, Michael.
> 
> https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

Oops, it was Tom who added that.  Thank you!

Regards,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-04-09 Thread Amit Langote
Hi David.

On 2018/04/09 12:48, David Rowley wrote:
> While looking at the docs in [1], I saw that we still mention:
> 
> 4. Ensure that the constraint_exclusion configuration parameter is not
> disabled in postgresql.conf. If it is, queries will not be optimized
> as desired.
> 
> This is no longer true. The attached patch removes it.
> 
> [1] https://www.postgresql.org/docs/10/static/ddl-partitioning.htm
Thanks.  I was aware of the changes that would need to be made, but you
beat me to writing the patch itself.

About the patch:

While users no longer need to enable constraint_exclusion for select
queries, one still needs it for update/delete queries, because the new
pruning logic is only invoked for the former.  Alas...

Also, further down on that page, there is a 5.10.4 Partitioning and
Constraint Exclusion sub-section.  I think it would also need some tweaks
due to new developments.

I updated your patch to fix that.  Please take a look.

Thanks,
Amit
From a4fe924936fe623ff95e6aa050b8fd7d22dbbb84 Mon Sep 17 00:00:00 2001
From: "[email protected]" 
Date: Mon, 9 Apr 2018 15:43:32 +1200
Subject: [PATCH v2] Remove mention of constraint_exclusion in partitioning
 docs

As of 9fdb675fc5d2, this GUC no longer has an effect on partition pruning.
---
 doc/src/sgml/ddl.sgml | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index feb2ab7792..eed8753e24 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3194,13 +3194,14 @@ CREATE INDEX ON measurement (logdate);
   
  
 
-  
-   
-Ensure that the 
-configuration parameter is not disabled in 
postgresql.conf.
-If it is, queries will not be optimized as desired.
-   
-  
+ 
+  
+   Ensure that the 
+   configuration parameter is not disabled in 
postgresql.conf.
+   While enabling it is not required for select queries, not doing so will
+   result in update and delete queries not being optimized as desired.
+  
+ 
 

 
@@ -3767,10 +3768,12 @@ ANALYZE measurement;

 

-Constraint exclusion is a query optimization 
technique
-that improves performance for partitioned tables defined in the
-fashion described above (both declaratively partitioned tables and those
-implemented using inheritance).  As an example:
+Constraint exclusion is a query optimization
+technique that improves performance for partitioned tables defined in the
+fashion described above.  While it is used only for update and delete
+queries in the case of declaratively partitioned tables, it is used for all
+queries in the case of table partitioning implemented using inheritance.
+As an example:
 
 
 SET constraint_exclusion = on;
-- 
2.11.0



Re: Documentation for bootstrap data conversion

2018-04-09 Thread John Naylor
On 4/7/18, Tom Lane  wrote:
> John and I are probably both too close to the patch to be able to
> review this documentation for clarity and usefulness, so if anyone
> else wants to have a look, please comment.

No argument there, but I did want to note some minor details:

1.
 reformat_dat_file.pl preserves blank lines
 and comment lines as-is.

As it is now, it will actually collapse consecutive blank lines into
one. I maintain that was necessary during conversion to get some
semblance of consistency, but now it may not be a good idea to tie
developers' hands in surprising ways if they want double blank lines
in some places. It would be pretty easy to remove this behavior.
Apologies if it was not documented well enough.

2. I noticed the use of
pg_xxx.h
pg_xxx_d.h
where I would expect . Not sure if it matters.

3. It seems the preferred style is to refer to "bootstrap" relations
rather than "bootstrapped" relations. The attached patch makes code
comments more like the docs in this regard.
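The blank-line collapsing mentioned in point 1 amounts to something like the following (a C sketch of the assumed logic, not the actual Perl in reformat_dat_file.pl):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Copy lines from 'in' (NULL-terminated array of strings) to 'out',
 * collapsing each run of consecutive blank lines to a single blank line.
 * Returns the number of lines written. */
size_t collapse_blank_lines(const char **in, const char **out)
{
	size_t n = 0;
	bool prev_blank = false;

	for (size_t i = 0; in[i] != NULL; i++)
	{
		bool blank = (in[i][0] == '\0');

		if (!(blank && prev_blank))
			out[n++] = in[i];
		prev_blank = blank;
	}
	return n;
}
```

Removing the behavior would mean dropping the `prev_blank` suppression so every input line is passed through unchanged.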

-John Naylor
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 59cd4b1..3da7861 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -77,7 +77,7 @@ int			numattr;			/* number of attributes for cur. rel */
 /*
  * Basic information associated with each type.  This is used before
  * pg_type is filled, so it has to cover the datatypes used as column types
- * in the core "bootstrapped" catalogs.
+ * in the core bootstrap catalogs.
  *
  *		XXX several of these input/output functions do catalog scans
  *			(e.g., F_REGPROCIN scans pg_proc).  this obviously creates some
diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 56312de..d0afcc7 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -520,7 +520,7 @@ sub gen_pg_attribute
 	{
 		my $table = $catalogs{$table_name};
 
-		# Currently, all bootstrapped relations also need schemapg.h
+		# Currently, all bootstrap relations also need schemapg.h
 		# entries, so skip if the relation isn't to be in schemapg.h.
 		next if !$table->{schema_macro};
 
@@ -540,7 +540,7 @@ sub gen_pg_attribute
 			morph_row_for_pgattr(\%row, $schema, $attr, $priornotnull);
 			$priornotnull &= ($row{attnotnull} eq 't');
 
-			# If it's bootstrapped, put an entry in postgres.bki.
+			# If it's a bootstrap relation, put an entry in postgres.bki.
 			print_bki_insert(\%row, $schema) if $table->{bootstrap};
 
 			# Store schemapg entries for later.
diff --git a/src/include/catalog/pg_attribute.h b/src/include/catalog/pg_attribute.h
index 8eef7d2..71c24ff 100644
--- a/src/include/catalog/pg_attribute.h
+++ b/src/include/catalog/pg_attribute.h
@@ -4,7 +4,7 @@
  *	  definition of the system "attribute" relation (pg_attribute)
  *
  * The initial contents of pg_attribute are generated at compile time by
- * genbki.pl, so there is no pg_attribute.dat file.  Only "bootstrapped"
+ * genbki.pl, so there is no pg_attribute.dat file.  Only bootstrap
  * relations need be included.
  *
  *
diff --git a/src/include/catalog/pg_class.dat b/src/include/catalog/pg_class.dat
index e1450e3..3c4a122 100644
--- a/src/include/catalog/pg_class.dat
+++ b/src/include/catalog/pg_class.dat
@@ -12,7 +12,7 @@
 
 [
 
-# Note: only "bootstrapped" relations, ie those marked BKI_BOOTSTRAP, need to
+# Note: only bootstrap relations, ie those marked BKI_BOOTSTRAP, need to
 # have entries here.  Be sure that the OIDs listed here match those given in
 # their CATALOG and BKI_ROWTYPE_OID macros, and that the relnatts values are
 # correct.
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index ae7e89b..8decacc 100644
--- a/src/include/catalog/pg_type.dat
+++ b/src/include/catalog/pg_type.dat
@@ -99,7 +99,7 @@
   typoutput => 'oidvectorout', typreceive => 'oidvectorrecv',
   typsend => 'oidvectorsend', typalign => 'i' },
 
-# hand-built rowtype entries for bootstrapped catalogs
+# hand-built rowtype entries for bootstrap catalogs
 # NB: OIDs assigned here must match the BKI_ROWTYPE_OID declarations
 { oid => '71',
   typname => 'pg_type', typlen => '-1', typbyval => 'f', typtype => 'c',


pruning disabled for array, enum, record, range type partition keys

2018-04-09 Thread Amit Langote
Hi.

I noticed that the newly added pruning does not work if the partition key
is of one of the types that have a corresponding pseudo-type.

-- array type list partition key
create table arrpart (a int[]) partition by list (a);
create table arrpart1 partition of arrpart for values in ('{1}');
create table arrpart2 partition of arrpart for values in ('{2, 3}', '{4, 5}');
explain (costs off) select * from arrpart where a = '{1}';
   QUERY PLAN

 Append
   ->  Seq Scan on arrpart1
 Filter: (a = '{1}'::integer[])
   ->  Seq Scan on arrpart2
 Filter: (a = '{1}'::integer[])
(5 rows)

For pruning, we normally rely on the type's operator class information in
the system catalogs being up to date, and if it isn't we give up on
pruning.  For example, if the pg_amproc entry for a given type and AM type
(btree, hash, etc.) has not been populated, we may fail to prune using a
clause that contains an expression of said type.  While this is the
theory for the normal cases, we should make an exception for types
represented by pseudo-types.  For those types, we never have pg_amproc
entries with the "real" type listed; instead, the pg_amproc entries
contain the corresponding pseudo-type.  For example, there are no
pg_amproc entries with int4[] (really, its OID) as amproclefttype and/or
amprocrighttype; anyarray is listed there instead.
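In other words, the lookup has to be allowed to match through the pseudo-type. A toy model of that rule (the table contents and the pseudo-type mapping here are illustrative; this is not the actual catalog code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *amtype; const char *proc; } ProcEntry;

/* Toy pg_amproc: only pseudo-type rows exist, as in the real catalogs. */
static const ProcEntry amproc[] = {
	{"anyarray", "btarraycmp"},
	{"anyenum", "hashenum"},
};

/* Map a concrete type to its pseudo-type, if any (assumed naming). */
static const char *pseudo_type_for(const char *type)
{
	if (strchr(type, '['))
		return "anyarray";
	if (strncmp(type, "enum_", 5) == 0)
		return "anyenum";
	return NULL;
}

const char *lookup_amproc(const char *type)
{
	for (size_t i = 0; i < sizeof(amproc) / sizeof(amproc[0]); i++)
		if (strcmp(amproc[i].amtype, type) == 0)
			return amproc[i].proc;

	/* Exact type not found: retry with its pseudo-type, mirroring the
	 * exception the patch makes for TYPTYPE_PSEUDO opclass input types. */
	const char *pseudo = pseudo_type_for(type);

	if (pseudo != NULL)
		for (size_t i = 0; i < sizeof(amproc) / sizeof(amproc[0]); i++)
			if (strcmp(amproc[i].amtype, pseudo) == 0)
				return amproc[i].proc;
	return NULL;
}
```

An exact-only lookup for "int4[]" would come back empty and pruning would be given up on; the fallback is what makes the catalog entries usable.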

Attached is a patch that tries to fix that and adds relevant tests.

Thanks,
Amit
From c7945da855973b606b5aa012295e2c0ae93c39c5 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 8 Dec 2017 19:09:31 +0900
Subject: [PATCH v1] Fix pruning when partition key of array, enum, record type

We never have pg_amproc catalog entries with "real" array, enum, record,
range types as leftop and rightop types.  Instead, AM procedures
manipulating such types have entries with the corresponding "pseudo-types"
listed as leftop and rightop types.  For example, for enums, all
procedure entries are marked with anyenum as their leftop and rightop
types.  So, if we pass "real" type OIDs to get_opfamily_member() or
get_opfamily_proc(), we get back InvalidOid for these type categories.
Whereas we would normally give up on performing pruning in that case,
don't do that in this case.
---
 src/backend/partitioning/partprune.c  |  4 +-
 src/test/regress/expected/partition_prune.out | 98 +++
 src/test/regress/sql/partition_prune.sql  | 44 +++-
 3 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 417e1fee81..bd1b99102d 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -1519,7 +1519,9 @@ match_clause_to_partition_key(RelOptInfo *rel,
 
 			/* Check if we're going to need a cross-type comparison function. */
 			exprtype = exprType((Node *) expr);
-			if (exprtype != part_scheme->partopcintype[partkeyidx])
+			if (exprtype != part_scheme->partopcintype[partkeyidx] &&
+				get_typtype(part_scheme->partopcintype[partkeyidx]) !=
+				TYPTYPE_PSEUDO)
 			{
 				switch (part_scheme->strategy)
 				{
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index df3fca025e..69e679d930 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2399,3 +2399,101 @@ select * from boolp where a = (select value from boolvalues where not value);
 
 drop table boolp;
 reset enable_indexonlyscan;
+--
+-- check that pruning works properly when the partition key is of array, enum,
+-- or record type
+--
+-- array type list partition key
+create table arrpart (a int[]) partition by list (a);
+create table arrpart1 partition of arrpart for values in ('{1}');
+create table arrpart2 partition of arrpart for values in ('{2, 3}', '{4, 5}');
+explain (costs off) select * from arrpart where a = '{1}';
+   QUERY PLAN   
+
+ Append
+   ->  Seq Scan on arrpart1
+ Filter: (a = '{1}'::integer[])
+(3 rows)
+
+explain (costs off) select * from arrpart where a = '{1, 2}';
+QUERY PLAN
+--
+ Result
+   One-Time Filter: false
+(2 rows)
+
+explain (costs off) select * from arrpart where a in ('{4, 5}', '{1}');
+  QUERY PLAN  
+--
+ Append
+   ->  Seq Scan on arrpart1
+ Filter: ((a = '{4,5}'::integer[]) OR (a = '{1}'::integer[]))
+   ->  Seq Scan on arrpart2
+ Filter: ((a = '{4,5}'::integer[]) OR (a = '{1}'::integer[]))
+(5 rows)
+
+drop table arrpart;
+-- enum type list partition key
+create type colors as enu

Re: postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly

2018-04-09 Thread Etsuro Fujita

(2018/04/07 4:17), Andres Freund wrote:

On 2018-03-05 17:07:10 -0500, Tom Lane wrote:

Meanwhile, I'm back to wondering what could possibly have affected
the planner's estimates, if pg_proc and pg_statistic didn't change.
I confess bafflement ... but we've now eliminated the autovacuum-
did-it theory entirely, so it's time to start looking someplace else.
I wonder if something in the postgres_fdw remote join machinery
is not as deterministic as it should be.


I wonder if temporarily changing postgres_fdw's test to specify an extra
config that installs auto_explain in full aggressiveness (i.e. including
costs etc) and enables debug3 logging could help narrow this down?


+1, because we cannot rule out the possibility that the plan instability 
is caused by such unexpected behavior of postgres_fdw.


Best regards,
Etsuro Fujita



Re: pruning disabled for array, enum, record, range type partition keys

2018-04-09 Thread Amit Langote
On 2018/04/09 19:14, Amit Langote wrote:
> Hi.
> 
> I noticed that the newly added pruning does not work if the partition key
> is of one of the types that have a corresponding pseudo-type.
> 
> -- array type list partition key
> create table arrpart (a int[]) partition by list (a);
> create table arrpart1 partition of arrpart for values in ('{1}');
> create table arrpart2 partition of arrpart for values in ('{2, 3}', '{4, 5}');
> explain (costs off) select * from arrpart where a = '{1}';
>QUERY PLAN
> 
>  Append
>->  Seq Scan on arrpart1
>  Filter: (a = '{1}'::integer[])
>->  Seq Scan on arrpart2
>  Filter: (a = '{1}'::integer[])
> (5 rows)
> 
> For pruning, we normally rely on the type's operator class information in
> the system catalogs being up to date, and if it isn't we give up on
> pruning.  For example, if the pg_amproc entry for a given type and AM type
> (btree, hash, etc.) has not been populated, we may fail to prune using a
> clause that contains an expression of said type.  While this is the
> theory for the normal cases, we should make an exception for types
> represented by pseudo-types.  For those types, we never have pg_amproc
> entries with the "real" type listed; instead, the pg_amproc entries
> contain the corresponding pseudo-type.  For example, there are no
> pg_amproc entries with int4[] (really, its OID) as amproclefttype and/or
> amprocrighttype; anyarray is listed there instead.
> 
> Attached is a patch that tries to fix that and adds relevant tests.

Added to the open items list.

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

Thanks,
Amit




Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-09 Thread Kyotaro HORIGUCHI
At Fri, 6 Apr 2018 08:37:57 +0200, Pavel Stehule  
wrote in 
> 2018-04-06 8:21 GMT+02:00 Ashutosh Bapat :
> 
> > On Tue, Apr 3, 2018 at 10:48 AM, Ashutosh Bapat
> >  wrote:
> > >>
> > >> Why is this done appropriately at ExecInitExpr() time, rather than at
> > >> plan time? Seems like eval_const_expressions() would be a bit more
> > >> appropriate (being badly named aside...)?
> > >
> > > That seems to be a better idea. Here's patch.
> > >
> >
> > Previous patch didn't try to fold the ConvertRowtypeExpr::arg into a Const.
> >
> > postgres=# create table t1 (a int, b int, c int) partition by range(a);
> > postgres=# create table t1p1 partition of t1 for values from (0) to
> > (100) partition by range(b);
> > postgres=# create table t1p1p1 partition of t1p1 for values from (0) to
> > (50);
> > postgres=# explain verbose select (1, 2, 3)::t1p1p1::t1p1::t1; --
> > notice Rowexpression here.
> > QUERY PLAN
> > ---
> >  Result  (cost=0.00..0.01 rows=1 width=32)
> >Output: (ROW(1, 2, 3)::t1p1p1)::t1
> > (2 rows)
> >
> > Here's patch fixing that. With this patch
> > postgres=# explain verbose select (1, 2, 3)::t1p1p1::t1p1::t1;
> > QUERY PLAN
> > ---
> >  Result  (cost=0.00..0.01 rows=1 width=32)
> >Output: '(1,2,3)'::t1
> > (2 rows)
> >
> >
> +1

I don't think this is only about constants.  With the patch,
non-constants also get a rather strange conversion.


> =# explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i * 2, 
> i * 3 from generate_series(0, 10) i) x(a, b, c);
>  QUERY PLAN
> -
...
>Output: (ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1

Conversions between scalar values cannot be assumed safely
composed each other for general inputs but it is known to be safe
for the ConvertRowtypeExpr case.. I think.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-09 Thread Etsuro Fujita

(2018/04/07 8:25), Robert Haas wrote:

On Thu, Apr 5, 2018 at 6:21 AM, Etsuro Fujita
  wrote:

Attached is an updated version of the patch set plus the patch in [1]. Patch
0003_foreign-routing-fdwapi-6.patch can be applied on top of patch
0001_postgres-fdw-refactoring-6.patch and
0002_copy-from-check-constraint-fix.patch.


Committed.  Let me say that you do nice work.


Thanks to Robert for committing!  Thanks to everyone who got involved in 
this patch for all of the help with review and commentary!


Best regards,
Etsuro Fujita



Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-09 Thread Ashutosh Bapat
On Mon, Apr 9, 2018 at 3:49 PM, Kyotaro HORIGUCHI
 wrote:
>
> I don't think this is only about constants.  With the patch,
> non-constants also get a rather strange conversion.
>
>
>> =# explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i * 2, 
>> i * 3 from generate_series(0, 10) i) x(a, b, c);
>>  QUERY PLAN
>> -
> ...
>>Output: (ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1
>
> Conversions between scalar values cannot be assumed safely
> composed each other for general inputs but it is known to be safe
> for the ConvertRowtypeExpr case.. I think.

I am not able to parse this statement.

What output do you see without the patch?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Fixing a trivial typo in comment in rewriteManip.c

2018-04-09 Thread Kyotaro HORIGUCHI
While I was looking at a patch, I happened to notice that there's a
trivial typo in rewriteManip.c:

> a ConvertRowTypeExpr to map back to the rowtype expected by the expression.

The correct name for "ConvertRowTypeExpr" is "ConvertRowtypeExpr".

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index abad1bf7e4..f1f4212b5d 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1205,7 +1205,7 @@ replace_rte_variables_mutator(Node *node,
  * If the expression tree contains a whole-row Var for the target RTE,
  * *found_whole_row is set to true.  In addition, if to_rowtype is
  * not InvalidOid, we replace the Var with a Var of that vartype, inserting
- * a ConvertRowTypeExpr to map back to the rowtype expected by the expression.
+ * a ConvertRowtypeExpr to map back to the rowtype expected by the expression.
  * (Therefore, to_rowtype had better be a child rowtype of the rowtype of the
  * RTE we're changing references to.)  Callers that don't provide to_rowtype
  * should report an error if *found_row_type is true; we don't do that here


Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Heikki Linnakangas

On 28/03/18 19:53, Teodor Sigaev wrote:

Hi!

I slightly modified the test for a clean demonstration of the difference between
fastupdate on and off. I also added CheckForSerializableConflictIn() to the
fastupdate-off codepath, but only in the case of a non-empty pending list.

The next question I see: why do we not lock entry leaf pages in some cases?


Why should we?


As I understand it, a scan should lock any visited page, but currently that's true only for
the posting tree. It seems it should also lock pages in the entry tree, because concurrent
processes could add new entries which could be matched by a partial search, for
example. BTW, partial search (see collectMatchBitmap()) correctly locks the entry
tree, but regular startScanEntry() doesn't lock the entry page in the posting-tree
case, only in the posting-list case.
I think this needs some high-level comments or README to explain how the 
locking works. It seems pretty ad hoc at the moment. And incorrect.


1. Why do we lock all posting tree pages, even though they all represent 
the same value? Isn't it enough to lock the root of the posting tree?


2. Why do we lock any posting tree pages at all, if we lock the entry 
tree page anyway? Isn't the lock on the entry tree page sufficient to 
cover the key value?


3. Why do we *not* lock the entry leaf page, if there is no match? We 
still need a lock to remember that we probed for that value and there 
was no match, so that we conflict with a tuple that might be inserted later.


At least #3 is a bug. The attached patch adds an isolation test that 
demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so 
I think we should fix those too, even if they don't lead to incorrect 
results.


Remember, the purpose of predicate locks is to lock key ranges, not 
physical pages or tuples in the index. We use leaf pages as handy 
shortcut for "any key value that would belong on this page", but it is 
just an implementation detail.
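To illustrate the principle, here is a toy model (plain Python, hypothetical names, nothing to do with the actual predicate lock manager): a reader locks the page its probed key *would* fall on, even when no matching tuple exists, so that a later insert into that gap conflicts with the earlier scan.

```python
import bisect

class PageRangeLocks:
    """Toy model of page-level predicate locks: a lock on a page
    stands for 'any key value that would belong on this page',
    including keys that are currently absent (the gaps)."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)  # upper split keys of each page
        self.readers = {}                     # page index -> set of xact ids

    def page_of(self, key):
        # the page whose key range would contain `key`
        return bisect.bisect_right(self.boundaries, key)

    def lock_for_scan(self, xact, key):
        # taken even when the probe finds no match: the gap is locked too
        self.readers.setdefault(self.page_of(key), set()).add(xact)

    def conflicts_on_insert(self, writer, key):
        return {x for x in self.readers.get(self.page_of(key), set())
                if x != writer}

locks = PageRangeLocks(boundaries=[100, 200, 300])
locks.lock_for_scan("r1", 150)               # r1 probed for 150; no match
print(locks.conflicts_on_insert("w2", 170))  # same page/gap -> {'r1'}
```

The no-match case is exactly bug #3 above: the lock must be taken on the probe, not on the match.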


I took a stab at fixing those issues, as well as the bug when fastupdate 
is turned on concurrently. Does the attached patch look sane to you?


- Heikki
>From b1ccb28fa8249d644382fd0b9c2a6ab94f6395e7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 9 Apr 2018 13:31:42 +0300
Subject: [PATCH 1/1] Re-think predicate locking on GIN indexes.

The principle behind the locking was not very well thought-out, and not
documented. Add a section in the README to explain how it's supposed to
work, and change the code so that it actually works that way.

This fixes two bugs:

1. If fast update was turned on concurrently, subsequent inserts to the
   pending list would not conflict with predicate locks that were acquired
   earlier, on entry pages. The included 'predicate-gin-fastupdate' test
   demonstrates that. To fix, make all scans acquire a predicate lock on
   the metapage. That lock represents a scan of the pending list, whether
   or not there is a pending list at the moment. Forget about the
   optimization to skip locking/checking for locks, when fastupdate=off.
   Maybe some of that was safe, but I couldn't convince myself of it, so
   better to rip it out and keep things simple.

2. If a scan finds no match, it still needs to lock the entry page. The
   point of predicate locks is to lock the gaps between values, whether
   or not there is a match. The included 'predicate-gin-nomatch' test
   tests that case.

In addition to those two bug fixes, this removes some unnecessary locking,
following the principle laid out in the README. Because all items in
a posting tree have the same key value, a lock on the posting tree root is
enough to cover all the items. (With a very large posting tree, it would
possibly be better to lock the posting tree leaf pages instead, so that a
"skip scan" with a query like "A & B" could avoid an unnecessary conflict
if a new tuple is inserted with A but !B. But let's keep this simple.)

Also, some spelling and whitespace fixes.
---
 src/backend/access/gin/README  |  34 ++
 src/backend/access/gin/ginbtree.c  |  13 ++-
 src/backend/access/gin/gindatapage.c   |  25 +++--
 src/backend/access/gin/ginfast.c   |   8 ++
 src/backend/access/gin/ginget.c| 116 ++---
 src/backend/access/gin/gininsert.c |  34 ++
 src/backend/access/gin/ginutil.c   |   7 --
 src/backend/access/gin/ginvacuum.c |   1 -
 src/backend/access/gist/gist.c |   2 +-
 src/backend/storage/lmgr/README-SSI|  22 ++--
 src/include/access/gin_private.h   |   7 +-
 .../expected/predicate-gin-fastupdate.out  |  30 ++
 .../isolation/expected/predicate-gin-nomatch.out   |  15 +++
 src/test/isolation/expected/predicate-gin.out  |   4 +-
 src/test/isolation/isolation_schedule  |   2 +
 .../isolation/specs/predicate-gin-fastupdate.spec  |  49 +
 .../isolation/specs/predicate-gin-nomatch.

Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-09 Thread Ashutosh Bapat
On Mon, Apr 9, 2018 at 3:53 PM, Ashutosh Bapat
 wrote:
> On Mon, Apr 9, 2018 at 3:49 PM, Kyotaro HORIGUCHI
>  wrote:
>>
>> I don't think it is not only on constatns.  With the patch,
>> non-constants are .. getting a rather strange conversion.
>>
>>
>>> =# explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i * 
>>> 2, i * 3 from generate_series(0, 10) i) x(a, b, c);
>>>  QUERY PLAN
>>> -
>> ...
>>>Output: (ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1
>>
>> Conversions between scalar values cannot be assumed safely
>> composed each other for general inputs but it is known to be safe
>> for the ConvertRowtypeExpr case.. I think.
>
> I am not able to parse this statement.
>
> What output do you see without the patch?
>

Without the patch, I see
explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i *
2, i * 3 from generate_series(0, 10) i) x(a, b, c);
  QUERY PLAN
--
 Function Scan on pg_catalog.generate_series i  (cost=0.00..15.00
rows=1000 width=32)
   Output: ((ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1p1)::t1
   Function Call: generate_series(0, 10)
(3 rows)

Only difference between the two outputs is direct conversion of t1p1p1
row into t1 row, which is what is expected with this patch. Are you
suggesting that the optimization attempted in the patch is not safe?
If yes, can you please explain why, and give a scenario showing its
"un"safety?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
> On 8 April 2018 at 22:47, Anthony Iliopoulos  wrote:
> > On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> >> On 8 April 2018 at 04:27, Craig Ringer  wrote:
> >> > On 8 April 2018 at 10:16, Thomas Munro 
> >
> > The question is, what should the kernel and application do in cases
> > where this is simply not possible (according to freebsd that keeps
> > dirty pages around after failure, for example, -EIO from the block
> > layer is a contract for unrecoverable errors so it is pointless to
> > keep them dirty). You'd need a specialized interface to clear-out
> > the errors (and drop the dirty pages), or potentially just remount
> > the filesystem.
> 
> Well firstly that's not necessarily the question. ENOSPC is not an
> unrecoverable error. And even unrecoverable errors for a single write
> doesn't mean the write will never be able to succeed in the future.

To make things a bit simpler, let us focus on EIO for the moment.
The contract between the block layer and the filesystem layer is
assumed to be that, when an EIO is propagated up to the fs,
you may assume that all possibilities for recovering have
been exhausted in lower layers of the stack. Mind you, I am not
claiming that this contract is either documented or necessarily
respected (in fact there have been studies on the error propagation
and handling of the block layer, see [1]). Let us assume that
this is the design contract though (which appears to be the case
across a number of open-source kernels), and if not - it's a bug.
In this case, indeed the specific write()s will never be able
to succeed in the future, at least not as long as the BIOs are
allocated to the specific failing LBAs.

> But secondly doesn't such an interface already exist? When the device
> is dropped any dirty pages already get dropped with it. What's the
> point in dropping them but keeping the failing device?

I think there are degrees of failure. There are certainly cases
where one may encounter localized unrecoverable medium errors
(specific to certain LBAs) that are non-maskable from the block
layer and below. That does not mean that the device is dropped
at all, so it does make sense to continue all other operations
to all other regions of the device that are functional. In cases
of total device failure, then the filesystem will prevent you
from proceeding anyway.

> But just to underline the point. "pointless to keep them dirty" is
> exactly backwards from the application's point of view. If the error
> writing to persistent media really is unrecoverable then it's all the
> more critical that the pages be kept so the data can be copied to some
> other device. The last thing user space expects to happen is if the
> data can't be written to persistent storage then also immediately
> delete it from RAM. (And the *really* last thing user space expects is
> for this to happen and return no error.)

Right. This implies, though, that apart from the kernel having
to keep around the dirtied-but-unrecoverable pages for an
unbounded time, there would further need to be an interface for obtaining
the exact failed pages so that you can read them back. This in
turn means that there needs to be an association between the
fsync() caller and the specific dirtied pages that the caller
intends to drain (for which we'd need an fsync_range(), among
other things). BTW, currently the failed writebacks are not
dropped from memory, but rather marked clean. They could be
lost though due to memory pressure or due to explicit request
(e.g. proc drop_caches), unless mlocked.

There is a clear responsibility of the application to keep
its buffers around until a successful fsync(). The kernels
do report the error (albeit with all the complexities of
dealing with the interface), at which point the application
may not assume that the write()s were ever even buffered
in the kernel page cache in the first place.
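As a sketch of what that contract means for the application (plain Python, POSIX-style os-level I/O; the pattern, not any particular implementation), the data buffer must outlive the write() and be released only once fsync() reports success:

```python
import os

def durable_write(path, data):
    """Write `data` to `path`, keeping our own copy until fsync succeeds.

    If fsync raises (EIO, ENOSPC, ...), we must NOT assume the kernel
    still holds the dirty pages -- they may already be marked clean --
    so the only safe source of the data is the buffer the caller kept.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:                      # write() may be partial
            n = os.write(fd, view)
            view = view[n:]
        os.fsync(fd)                     # raises OSError on writeback failure
    finally:
        os.close(fd)
    # only now is it safe to discard the application-side copy of `data`
    return len(data)
```

On OSError the caller still holds the bytes and can recover them elsewhere, e.g. write them to an alternate device.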

What you seem to be asking for is the capability of dropping
buffers over the (kernel) fence and indemnifying the application
from any further responsibility, i.e. a hard assurance
that either the kernel will persist the pages or it will
keep them around till the application recovers them
asynchronously, the filesystem is unmounted, or the system
is rebooted.

Best regards,
Anthony

[1] 
https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf



Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-09 Thread Kyotaro HORIGUCHI
At Mon, 9 Apr 2018 15:53:04 +0530, Ashutosh Bapat 
 wrote in 

> On Mon, Apr 9, 2018 at 3:49 PM, Kyotaro HORIGUCHI
>  wrote:
> >
> > I don't think it is not only on constatns.  With the patch,
> > non-constants are .. getting a rather strange conversion.
> >
> >
> >> =# explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i * 
> >> 2, i * 3 from generate_series(0, 10) i) x(a, b, c);
> >>  QUERY PLAN
> >> -
> > ...
> >>Output: (ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1
> >
> > Conversions between scalar values cannot be assumed safely
> > composed each other for general inputs but it is known to be safe
> > for the ConvertRowtypeExpr case.. I think.
> 
> I am not able to parse this statement.

I apologize for the unreadable statement..

> What output do you see without the patch?

I got the following on the master.

>   Output: ((ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1p1)::t1

And I expect the following with this patch.

>   Output: ROW(i.i, (i.i * 2), (i.i * 3))::t1

And what the current patch generates looks incomplete to me.

> >>Output: (ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1


I'll try to reword the unreadable statement..

We get a similar composition of casts on scalar values:

> =# explain verbose select 1::text::int::float;
..
>   Output: '1'::double precision

But we don't get the same for a non-constant:

> =# explain verbose select x::text::int::float from generate_series(0, 10) x;
...
>Output: (((x)::text)::integer)::double precision

This seems reasonable, since we cannot assume that "::double
precision" and "::text::integer::double precision" are equivalent
on arbitrary (or undecided; anyway, I'm not sure it is true)
input. But ConvertRowtypeExpr seems to be composable (or
mergeable) for arbitrary input. Otherwise, composition (or
merging) of ConvertRowtypeExpr should not be performed at all.
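As a toy illustration of why such composition is safe (plain Python, a deliberately simplified model in which a rowtype conversion is just a column remapping, assuming, as in an inheritance tree, that each ancestor's columns are present in every descendant):

```python
def convert_row(row, src_cols, dst_cols):
    """Toy ConvertRowtypeExpr: reorder/drop columns by name, nothing else.
    Unlike a scalar cast, it never changes any column's value."""
    by_name = dict(zip(src_cols, row))
    return tuple(by_name[c] for c in dst_cols)

# hypothetical column lists: t1p1p1 is the leaf, t1 the topmost parent
t1p1p1 = ("a", "b", "c", "d")
t1p1   = ("a", "b", "c")
t1     = ("c", "a")

row = (1, 2, 3, 4)
nested = convert_row(convert_row(row, t1p1p1, t1p1), t1p1, t1)
direct = convert_row(row, t1p1p1, t1)
print(nested == direct)   # True: the intermediate step can be dropped
```

Chained scalar casts like ::text::integer::double precision can change the value at each step, so no such collapse is valid for them in general.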

# I wish this makes sense..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-09 Thread Kyotaro HORIGUCHI
At Mon, 9 Apr 2018 16:07:33 +0530, Ashutosh Bapat 
 wrote in 

> On Mon, Apr 9, 2018 at 3:53 PM, Ashutosh Bapat
>  wrote:
> > On Mon, Apr 9, 2018 at 3:49 PM, Kyotaro HORIGUCHI
> >  wrote:
> >>
> >> I don't think it is not only on constatns.  With the patch,
> >> non-constants are .. getting a rather strange conversion.
> >>
> >>
> >>> =# explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i * 
> >>> 2, i * 3 from generate_series(0, 10) i) x(a, b, c);
> >>>  QUERY PLAN
> >>> -
> >> ...
> >>>Output: (ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1
> >>
> >> Conversions between scalar values cannot be assumed safely
> >> composed each other for general inputs but it is known to be safe
> >> for the ConvertRowtypeExpr case.. I think.
> >
> > I am not able to parse this statement.
> >
> > What output do you see without the patch?
> >
> 
> Without the patch, I see
> explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i *
> 2, i * 3 from generate_series(0, 10) i) x(a, b, c);
>   QUERY PLAN
> --
>  Function Scan on pg_catalog.generate_series i  (cost=0.00..15.00
> rows=1000 width=32)
>Output: ((ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1p1)::t1
>Function Call: generate_series(0, 10)
> (3 rows)
> 
> Only difference between the two outputs is direct conversion of t1p1p1
> row into t1 row, which is what is expected with this patch. Are you
> suggesting that the optimization attempted in the patch is not safe?
> If yes, can you please explain why, and give a scenario showing its
> "un"safety?

I understood the reason for the current output. Maybe I meant the
contrary: we can remove intermediate conversions more
aggressively. I assumed that ::t1p1p1 and ::t1 yield the same
output for any input. Is that right?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-09 Thread Ashutosh Bapat
On Mon, Apr 9, 2018 at 4:29 PM, Kyotaro HORIGUCHI
 wrote:
> At Mon, 9 Apr 2018 16:07:33 +0530, Ashutosh Bapat 
>  wrote in 
> 
>> On Mon, Apr 9, 2018 at 3:53 PM, Ashutosh Bapat
>>  wrote:
>> > On Mon, Apr 9, 2018 at 3:49 PM, Kyotaro HORIGUCHI
>> >  wrote:
>> >>
>> >> I don't think it is not only on constatns.  With the patch,
>> >> non-constants are .. getting a rather strange conversion.
>> >>
>> >>
>> >>> =# explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i 
>> >>> * 2, i * 3 from generate_series(0, 10) i) x(a, b, c);
>> >>>  QUERY PLAN
>> >>> -
>> >> ...
>> >>>Output: (ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1
>> >>
>> >> Conversions between scalar values cannot be assumed safely
>> >> composed each other for general inputs but it is known to be safe
>> >> for the ConvertRowtypeExpr case.. I think.
>> >
>> > I am not able to parse this statement.
>> >
>> > What output do you see without the patch?
>> >
>>
>> Without the patch, I see
>> explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i *
>> 2, i * 3 from generate_series(0, 10) i) x(a, b, c);
>>   QUERY PLAN
>> --
>>  Function Scan on pg_catalog.generate_series i  (cost=0.00..15.00
>> rows=1000 width=32)
>>Output: ((ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1p1)::t1
>>Function Call: generate_series(0, 10)
>> (3 rows)
>>
>> Only difference between the two outputs is direct conversion of t1p1p1
>> row into t1 row, which is what is expected with this patch. Are you
>> suggesting that the optimization attempted in the patch is not safe?
>> If yes, can you please explain why, and give a scenario showing its
>> "un"safety?
>
> I understood the reason for the current output. Maybe I meant the
> contrary, we can remove intermediate conversions more
> aggressively. I assumed that ::t1p1p1 and ::t1 yield the same
> output from any input. Is it right?

We can't directly cast ROW() into t1 unless it's compatible with a
leaf child row, hence the ROW()::t1p1p1 cast. But I think that's a single
expression, not two as it looks.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Fixing a trivial typo in comment in rewriteManip.c

2018-04-09 Thread Heikki Linnakangas

On 09/04/18 13:26, Kyotaro HORIGUCHI wrote:

While I was looking on some patch, I happened to notice that
there's a trivial typo in rewriteManip.c


a ConvertRowTypeExpr to map back to the rowtype expected by the expression.


The correct name for "ConvertRowTypeExpr" is "ConvertRowtypeExpr"


Fixed, thanks!

- Heikki



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Teodor Sigaev
Attached is a test case that demonstrates a case where we miss a serialization 
failure, when fastupdate is turned on concurrently. It works on v10, but fails 
to throw a serialization error on v11.

Thank you for researching!

Proof of concept:
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 43b2fce2c5..b8291f96e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10745,6 +10745,7 @@ ATExecSetRelOptions(Relation rel, List *defList, 
AlterTableType operation,

case RELKIND_INDEX:
case RELKIND_PARTITIONED_INDEX:
(void) index_reloptions(rel->rd_amroutine->amoptions, newOptions, 
true);

+   TransferPredicateLocksToHeapRelation(rel);
break;
default:
ereport(ERROR,

it fixes the pointed-out bug, but will give false positives. The right place for that is the
ginoptions function, but ginoptions doesn't have access to the relation structure
and I don't see a reason why it should.


--
Teodor Sigaev   E-mail: [email protected]
   WWW: http://www.sigaev.ru/



Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-09 Thread Rushabh Lathia
Added to the open items list.

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

On Tue, Apr 3, 2018 at 2:52 PM, Amit Langote 
wrote:

> On 2018/04/03 14:45, Kyotaro HORIGUCHI wrote:
> > Hello.
> >
> > At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera <
> [email protected]> wrote:
> >> Why do we need AccessExclusiveLock on all children of a relation that we
> >> want to scan to search for rows not satisfying the constraint?  I think
> >> it should be enough to get ShareLock, which prevents INSERT, no?  I have
> >> a feeling I'm missing something here, but I don't know what, and all
> >> tests pass with that change.
>
> Thinking on this a bit, I see no problem with locking the children with
> just ShareLock.  It was just paranoia that if we're going to lock the
> table itself being attached with AEL, we must lock its children (if any) with
> AEL, too.
>
> > Mmm. I'm not sure if there's a lock-upgrade case, but the
> > following sentence is left at the end of one of the modified
> > comments. ATRewriteTables is called with AEL after that (which has
> > finally no effect in this case).
> >
> > |   But we cannot risk a deadlock by
> taking
> > | * a weaker lock now and the stronger one only when needed.
> >
> > I don't actually find places where the children's lock can be
> > raised but this seems just following the lock parent first
> > principle.
>
> No lock upgrade happen as of now.  The comment was added by the commit
> 972b6ec20bf [1], which removed the code that could cause such a deadlock.
> The comment fragment is simply trying to explain why we don't postpone the
> locking of children to a later time, say, to the point where we actually
> know that they need to be scanned.  Previously the code next to the
> comment used to lock the children using AccessShareLock, because at that
> point we just needed to check if the table being attached is one of those
> children and then later locked with AEL if it turned out that they need to
> be scanned to check the partition constraint.
>
> > By the way check_default_allows_bound() (CREATE TABLE case)
> > contains a similar usage of find_all_inheritors(default_rel,
> > AEL).
>
> Good catch.  Its job is more or less same as
> ValidatePartitionConstraints(), except all the children (if any) are
> scanned right away instead of queuing it like in the AT case.
>
> >> Also: the change proposed to remove the get_default_oid_from_partdesc()
> >> call actually fixes the bug, but to me it was not at all obvious why.
> >
> > CommandCounterIncrement updates the content of a relcache entry
> > via invalidation. It can be surprising for callers of a function
> > like StorePartitionBound.
> >
> > CommandCounterIncrement
> >  
> >RelationCacheInvalidateEntry
> >  RelationFlushRelation
> >RelationClearRelation
>
> Because of the CCI() after storing the default partition OID into the
> system catalog, RelationClearRelation() would changes what
> rel->rd_partdesc points to where 'rel' is the ATExecAttachPartition()'s
> reference to the relcache entry of the parent that it passed to
> StorePartitionBound.
>
> So, whereas the rel->rd_partdesc wouldn't contain the default partition
> before StorePartitionBound() was called, it would after.
>
> >> To figure out why, I first had to realize that
> >> ValidatePartitionConstraints was lying to me, both in name and in
> >> comments: it doesn't actually validate anything, it merely queues
> >> entries so that alter table's phase 3 would do the validation.  I found
> >> this extremely confusing, so I fixed the comments to match reality, and
> >> later decided to rename the function also.
> >
> > It is reasonable. Removing excessive extension of lower-level
> > partitions is also reasonable. The previous code extended
> > inheritances in different manner for levels at odd and even
> > depth.
>
> I like the new code including the naming, but I notice this changes the
> order in which we do the locking now.  There are still sites in the code
> where the locking order is breadth-first, that is, as determined by
> find_all_inheritors(), as this function would too previously.
>
> Also note that beside the breadth-first -> depth-first change, this also
> changes the locking order of partitions for a given partitioned table.
> The OIDs in partdesc->oids[] are canonically ordered (that is order of
> their partition bounds), whereas find_inheritance_children() that's called
> by find_all_inheritors() would lock them in the order in which the
> individual OIDs were found in the system catalog.
>
> Not sure if there is anything to be alarmed of here, but in all previous
> discussions, this has been a thing to avoid.
>
> >> At that point I was able to understand what the problem was: when
> >> attaching the default partition, we were setting the constraint to be
> >> validated for that partition to the correctly computed partition
> >> constraint; and later in the same function we wo

Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Teodor Sigaev

Ugh, I missed your last email where you proposed another locking protocol. Reading.

Teodor Sigaev wrote:
Attached is a test case that demonstrates a case where we miss a serialization 
failure, when fastupdate is turned on concurrently. It works on v10, but fails 
to throw a serialization error on v11.

Thank you for reserching!

Proof of concept:
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 43b2fce2c5..b8291f96e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10745,6 +10745,7 @@ ATExecSetRelOptions(Relation rel, List *defList, 
AlterTableType operation,

     case RELKIND_INDEX:
     case RELKIND_PARTITIONED_INDEX:
     (void) index_reloptions(rel->rd_amroutine->amoptions, newOptions, 
true);

+   TransferPredicateLocksToHeapRelation(rel);
     break;
     default:
     ereport(ERROR,

it fixes pointed bug, but will gives false positives. Right place for that in 
ginoptions function, but ginoptions doesn't has an access to relation structure 
and I don't see a reason why it should.




--
Teodor Sigaev   E-mail: [email protected]
   WWW: http://www.sigaev.ru/



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Geoff Winkless
On 9 April 2018 at 11:50, Anthony Iliopoulos  wrote:

> What you seem to be asking for is the capability of dropping
> buffers over the (kernel) fence and idemnifying the application
> from any further responsibility, i.e. a hard assurance
> that either the kernel will persist the pages or it will
> keep them around till the application recovers them
> asynchronously, the filesystem is unmounted, or the system
> is rebooted.
>

That seems like a perfectly reasonable position to take, frankly.

The whole _point_ of an Operating System should be that you can do exactly
that. As a developer I should be able to call write() and fsync() and know
that if both calls have succeeded then the result is on disk, no matter
what another application has done in the meantime. If that's a "difficult"
problem then that's the OS's problem, not mine. If the OS doesn't do that,
it's _not_doing_its_job_.

Geoff


Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-09 Thread Kyotaro HORIGUCHI
At Mon, 9 Apr 2018 16:43:22 +0530, Ashutosh Bapat 
 wrote in 

> On Mon, Apr 9, 2018 at 4:29 PM, Kyotaro HORIGUCHI
>  wrote:
> > At Mon, 9 Apr 2018 16:07:33 +0530, Ashutosh Bapat 
> >  wrote in 
> > 
> >> On Mon, Apr 9, 2018 at 3:53 PM, Ashutosh Bapat
> >>  wrote:
> >> > On Mon, Apr 9, 2018 at 3:49 PM, Kyotaro HORIGUCHI
> >> >  wrote:
> >> >>
> >> >> I don't think it is not only on constatns.  With the patch,
> >> >> non-constants are .. getting a rather strange conversion.
> >> >>
> >> >>
> >> >>> =# explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, 
> >> >>> i * 2, i * 3 from generate_series(0, 10) i) x(a, b, c);
> >> >>>  QUERY PLAN
> >> >>> -
> >> >> ...
> >> >>>Output: (ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1
> >> >>
> >> >> Conversions between scalar values cannot be assumed safely
> >> >> composed each other for general inputs but it is known to be safe
> >> >> for the ConvertRowtypeExpr case.. I think.
> >> >
> >> > I am not able to parse this statement.
> >> >
> >> > What output do you see without the patch?
> >> >
> >>
> >> Without the patch, I see
> >> explain verbose select (a, b, c)::t1p1p1::t1p1::t1 from (select i, i *
> >> 2, i * 3 from generate_series(0, 10) i) x(a, b, c);
> >>   QUERY PLAN
> >> --
> >>  Function Scan on pg_catalog.generate_series i  (cost=0.00..15.00
> >> rows=1000 width=32)
> >>Output: ((ROW(i.i, (i.i * 2), (i.i * 3))::t1p1p1)::t1p1)::t1
> >>Function Call: generate_series(0, 10)
> >> (3 rows)
> >>
> >> Only difference between the two outputs is direct conversion of t1p1p1
> >> row into t1 row, which is what is expected with this patch. Are you
> >> suggesting that the optimization attempted in the patch is not safe?
> >> If yes, can you please explain why, and give a scenario showing its
> >> "un"safety?
> >
> > I understood the reason for the current output. Maybe I meant the
> > contrary, we can remove intermediate conversions more
> > aggressively. I assumed that ::t1p1p1 and ::t1 yield the same
> > output from any input. Is it right?
> 
> We can't directly cast Row() into t1 unless it's compatible with a
> leaf child row hence ROW()::t1p1p1 cast. But I think that's a single
> expression not two as it looks.

I had misunderstood that ConvertRowtypeExpr is only for partitioned
tables. I noticed that it is also used for inheritance. So it works
on inheritance as follows. RowExpr makes it work differently.

> =# explain verbose select (1, 2, 3, 4, 5)::jt1p1p1::jt1p1::jt1; --
...
>Output: '(1,2,3)'::jt1

Thanks for replying patiently.

The new code doesn't seem to work as written.

>   arg = eval_const_expressions_mutator((Node *) cre->arg,
>context);
> 
>   /*
>* In case of a nested ConvertRowtypeExpr, we can convert the
>* leaf row directly to the topmost row format without any
>* intermediate conversions.
>*/
>   while (arg != NULL && IsA(arg, ConvertRowtypeExpr))
>   arg = (Node *) castNode(ConvertRowtypeExpr, arg)->arg;

This runs depth-first, so the while loop seems to run at most
once. I suppose the "arg =" assignment and the while loop were meant
to be transposed, like this:

>  arg = (Node *) cre->arg;
>  while (arg != NULL && IsA(arg, ConvertRowtypeExpr))
>  arg = (Node *) castNode(ConvertRowtypeExpr, arg)->arg;
> 
>  arg = eval_const_expressions_mutator(arg, context);

It looks fine except the above point.
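To illustrate the suggested ordering with a toy model (plain Python, hypothetical classes; this deliberately ignores the point that the innermost conversion from a RowExpr has to be kept): stripping the raw nested chain before recursing collapses any depth of wrappers into a single conversion.

```python
class Convert:
    """Toy stand-in for ConvertRowtypeExpr: wraps an arg, targets a rowtype."""
    def __init__(self, arg, target):
        self.arg, self.target = arg, target

def mutate(node):
    """Sketch of the transposed order: strip the raw nested Convert
    chain first, then recurse into what remains."""
    if isinstance(node, Convert):
        arg = node.arg
        while isinstance(arg, Convert):   # collapse intermediate conversions
            arg = arg.arg
        return Convert(mutate(arg), node.target)
    return node

nested = Convert(Convert(Convert("leaf-row", "t1p1p1"), "t1p1"), "t1")
out = mutate(nested)
print(out.target, out.arg)   # t1 leaf-row : one conversion left
```

With the recursion run first instead, each inner node has already been rewritten by the time the while loop inspects it, so the loop can only ever strip one level.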

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fixing a trivial typo in comment in rewriteManip.c

2018-04-09 Thread Kyotaro HORIGUCHI
At Mon, 9 Apr 2018 14:22:33 +0300, Heikki Linnakangas  wrote 
in <[email protected]>
> On 09/04/18 13:26, Kyotaro HORIGUCHI wrote:
> > While I was looking on some patch, I happened to notice that
> > there's a trivial typo in rewriteManip.c
> > 
> >> a ConvertRowTypeExpr to map back to the rowtype expected by the
> >> expression.
> > The correct name for "ConvertRowTypeExpr" is "ConvertRowtypeExpr"
> 
> Fixed, thanks!

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Optimizing nested ConvertRowtypeExpr execution

2018-04-09 Thread Ashutosh Bapat
On Mon, Apr 9, 2018 at 5:34 PM, Kyotaro HORIGUCHI
 wrote:
>
> The new code doesn't seem to work as written.
>
>>   arg = eval_const_expressions_mutator((Node *) cre->arg,
>>context);
>>
>>   /*
>>* In case of a nested ConvertRowtypeExpr, we can convert the
>>* leaf row directly to the topmost row format without any
>>* intermediate conversions.
>>*/
>>   while (arg != NULL && IsA(arg, ConvertRowtypeExpr))
>>   arg = (Node *) castNode(ConvertRowtypeExpr, arg)->arg;
>
> This runs depth-first so the while loop seems to run at most
> once. I suppose that the "arg =" and the while loop are
> transposed as intention.

Yes. I have modelled it after the RelabelType case a few lines above in the
same function.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Craig Ringer
On 9 April 2018 at 18:50, Anthony Iliopoulos  wrote:

>
> There is a clear responsibility of the application to keep
> its buffers around until a successful fsync(). The kernels
> do report the error (albeit with all the complexities of
> dealing with the interface), at which point the application
> may not assume that the write()s were ever even buffered
> in the kernel page cache in the first place.
>
> What you seem to be asking for is the capability of dropping
> buffers over the (kernel) fence and indemnifying the application
> from any further responsibility, i.e. a hard assurance
> that either the kernel will persist the pages or it will
> keep them around till the application recovers them
> asynchronously, the filesystem is unmounted, or the system
> is rebooted.
>

That's what Pg appears to assume now, yes.

Whether that's reasonable is a whole different topic.

I'd like a middle ground where the kernel lets us register our interest and
tells us if it lost something, without us having to keep eight million FDs
open for some long period. "Tell us about anything that happens under
pgdata/" or an inotify-style per-directory-registration option. I'd even
say that's ideal.

In the mean time, I propose that we fsync() on close() before we age FDs
out of the LRU on backends. Yes, that will hurt throughput and cause
stalls, but we don't seem to have many better options. At least it'll only
flush what we actually wrote to the OS buffers, not what we may have in
shared_buffers. If the bgwriter does the same thing, we should be 100% safe
from this problem on 4.13+, and it'd be trivial to make it a GUC much like
the fsync or full_page_writes options that people can turn off if they know
the risks / know their storage is safe / don't care.

Some keen person could later optimise it by adding an fsync worker
thread pool in backends, so we don't block the main thread. Frankly
that might be a nice thing to have in the checkpointer anyway. But it's out
of scope for fixing this in durability terms.

I'm partway through a patch that makes fsync panic on errors now. Once
that's done, the next step will be to force fsync on close() in md and see
how we go with that.

Thoughts?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [sqlsmith] Failed assertion in create_gather_path

2018-04-09 Thread Amit Kapila
On Sun, Apr 8, 2018 at 1:04 PM, Jeevan Chalke
 wrote:
> Hi,
>
> At some places, I have observed that we are adding a partial path even when
> rel's consider_parallel is false. Due to this, the partial path added has
> parallel_safe set to false, and the assertion in create_gather_path()
> later fails.
>

Few Comments:
1.
@@ -2196,6 +2196,10 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
pathkeys, required_outer));
  }

+ /* If parallelism is not possible, return. */
+ if (!rel->consider_parallel || !bms_is_empty(required_outer))
+ return;

In this case shouldn't we set the rel's consider_parallel flag
correctly rather than avoiding adding the path to it as we do in
recurse_set_operations?

2.
@@ -331,7 +331,7 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
  * to build a partial path for this relation.  But there's no point in
  * considering any path but the cheapest.
  */
- if (final_rel->partial_pathlist != NIL)
+ if (final_rel->consider_parallel && final_rel->partial_pathlist != NIL)

What problem did you see here or is it just for additional safety?
Ideally, if consider_parallel is false for a rel, its partial
pathlist should also be NIL.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [sqlsmith] Failed assertion in create_gather_path

2018-04-09 Thread Amit Kapila
On Mon, Apr 9, 2018 at 5:52 PM, Amit Kapila  wrote:
> On Sun, Apr 8, 2018 at 1:04 PM, Jeevan Chalke
>  wrote:
>> Hi,
>>
>> At some places, I have observed that we are adding a partial path even when
>> rel's consider_parallel is false. Due to this, the partial path added has
>> parallel_safe set to false, and the assertion in create_gather_path()
>> later fails.
>>
>
> Few Comments:
> 1.
> @@ -2196,6 +2196,10 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo 
> *rel,
> pathkeys, required_outer));
>   }
>
> + /* If parallelism is not possible, return. */
> + if (!rel->consider_parallel || !bms_is_empty(required_outer))
> + return;
>
> In this case shouldn't we set the rel's consider_parallel flag
> correctly rather than avoiding adding the path to it
>

In the above sentence, I meant to say *partial* path.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Boolean partitions syntax

2018-04-09 Thread Peter Eisentraut
On 4/7/18 11:16, Jonathan S. Katz wrote:
> The last line yielding:
> 
>     ERROR:  syntax error at or near "TRUE"
>     LINE 3: FOR VALUES IN (TRUE);
> 
> [Omitted from example: the “records_active” partition]
> 
> I’m glad to see this was added to the open items. I would strongly
> suggest fixing
> this prior to the 11 release as it is unintuitive from a user standpoint
> to use ‘TRUE’

I think this is actually more accurately classified as an existing bug
in PostgreSQL 10.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Store 2PC GID in commit/abort WAL recs for logical decoding

2018-04-09 Thread Heikki Linnakangas

On 28/03/18 19:47, Simon Riggs wrote:

Store 2PC GID in commit/abort WAL recs for logical decoding


This forgot to update the comments in xl_xact_commit and xl_xact_abort, 
for the new fields.



+
+   if (parsed->xinfo & XACT_XINFO_HAS_GID)
+   {
+   int gidlen;
+   strcpy(parsed->twophase_gid, data);
+   gidlen = strlen(parsed->twophase_gid) + 1;
+   data += MAXALIGN(gidlen);
+   }
+   }
+
+   if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+   {
+   xl_xact_origin xl_origin;
+
+   /* we're only guaranteed 4 byte alignment, so copy onto stack */
+   memcpy(&xl_origin, data, sizeof(xl_origin));
+
+   parsed->origin_lsn = xl_origin.origin_lsn;
+   parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+   data += sizeof(xl_xact_origin);
}


There seems to be some confusion about the padding here. Firstly, what's 
the point of zero-padding the GID length to the next MAXALIGN boundary, 
which would be 8 bytes on 64-bit systems, if the start is only 
guaranteed 4-byte alignment, per the comment at the memcpy() above? 
Secondly, if we're memcpying the fields that follow anyway, why bother 
with any alignment padding at all?


I propose the attached.

- Heikki
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 3b3c95f810..2e4ea62e62 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -105,10 +105,8 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
 
 		if (parsed->xinfo & XACT_XINFO_HAS_GID)
 		{
-			int gidlen;
 			strlcpy(parsed->twophase_gid, data, sizeof(parsed->twophase_gid));
-			gidlen = strlen(data) + 1;
-			data += MAXALIGN(gidlen);
+			data += strlen(data) + 1;
 		}
 	}
 
@@ -116,7 +114,7 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
 	{
 		xl_xact_origin xl_origin;
 
-		/* we're only guaranteed 4 byte alignment, so copy onto stack */
+		/* we're not guaranteed any alignment, so copy onto stack */
 		memcpy(&xl_origin, data, sizeof(xl_origin));
 
 		parsed->origin_lsn = xl_origin.origin_lsn;
@@ -189,10 +187,8 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
 
 		if (parsed->xinfo & XACT_XINFO_HAS_GID)
 		{
-			int gidlen;
 			strlcpy(parsed->twophase_gid, data, sizeof(parsed->twophase_gid));
-			gidlen = strlen(data) + 1;
-			data += MAXALIGN(gidlen);
+			data += strlen(data) + 1;
 		}
 	}
 
@@ -200,7 +196,7 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
 	{
 		xl_xact_origin xl_origin;
 
-		/* we're only guaranteed 4 byte alignment, so copy onto stack */
+		/* we're not guaranteed any alignment, so copy onto stack */
 		memcpy(&xl_origin, data, sizeof(xl_origin));
 
 		parsed->origin_lsn = xl_origin.origin_lsn;
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..5c05d545c4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1129,9 +1129,11 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
 
 	if (replorigin)
+	{
 		/* Move LSNs forward for this replication origin */
 		replorigin_session_advance(replorigin_session_origin_lsn,
    gxact->prepare_end_lsn);
+	}
 
 	XLogFlush(gxact->prepare_end_lsn);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b88d4ccf74..1dcf825625 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5248,7 +5248,6 @@ XactLogCommitRecord(TimestampTz commit_time,
 	xl_xact_origin xl_origin;
 
 	uint8		info;
-	int			gidlen = 0;
 
 	Assert(CritSectionCount > 0);
 
@@ -5314,10 +5313,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		Assert(twophase_gid != NULL);
 
 		if (XLogLogicalInfoActive())
-		{
 			xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
-			gidlen = strlen(twophase_gid) + 1; /* include '\0' */
-		}
 	}
 
 	/* dump transaction origin information */
@@ -5371,12 +5367,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	{
 		XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
 		if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
-		{
-			static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
-			XLogRegisterData((char*) twophase_gid, gidlen);
-			if (MAXALIGN(gidlen) != gidlen)
-XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
-		}
+			XLogRegisterData((char *) twophase_gid, strlen(twophase_gid) + 1);
 	}
 
 	if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -5410,7 +5401,6 @@ XactLogAbortRecord(TimestampTz abort_time,
 	xl_xact_origin xl_origin;
 
 	uint8		info;
-	int			gidlen = 0;
 
 	Assert(CritSectionCount > 0);
 
@@ -5449,10 +5439,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 		Ass

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
> On 9 April 2018 at 11:50, Anthony Iliopoulos  wrote:
> 
> > What you seem to be asking for is the capability of dropping
> > buffers over the (kernel) fence and indemnifying the application
> > from any further responsibility, i.e. a hard assurance
> > that either the kernel will persist the pages or it will
> > keep them around till the application recovers them
> > asynchronously, the filesystem is unmounted, or the system
> > is rebooted.
> >
> 
> That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of
this design decision: mainly, how you would recover memory when no
application is interested in clearing the error. At which point
other applications with different priorities will find this position
rather unreasonable since there can be no way out of it for them.
Good luck convincing any OS kernel upstream to go with this design.

> The whole _point_ of an Operating System should be that you can do exactly
> that. As a developer I should be able to call write() and fsync() and know
> that if both calls have succeeded then the result is on disk, no matter
> what another application has done in the meantime. If that's a "difficult"
> problem then that's the OS's problem, not mine. If the OS doesn't do that,
> it's _not_doing_its_job_.

No OS kernel that I know of provides any promises for atomicity of a
write()+fsync() sequence, unless one is using O_SYNC. It doesn't
provide you with isolation either, as this is delegated to userspace,
where processes that share a file should coordinate accordingly.

It's not a difficult problem, but rather the kernels provide a common
denominator of possible interfaces and designs that could accommodate
a wider range of potential application scenarios for which the kernel
cannot possibly anticipate requirements. There have been plenty of
experimental works for providing a transactional (ACID) filesystem
interface to applications. On the opposite end, there have been quite
a few commercial databases that completely bypass the kernel storage
stack. But I would assume it is reasonable to figure out something
between those two extremes that can work in a "portable" fashion.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:
> 
> I'd like a middle ground where the kernel lets us register our interest and
> tells us if it lost something, without us having to keep eight million FDs
> open for some long period. "Tell us about anything that happens under
> pgdata/" or an inotify-style per-directory-registration option. I'd even
> say that's ideal.

I see what you are saying. So basically you'd keep the notification
descriptor open, where the kernel would inject events
related to writeback failures of files under watch (potentially
enriched to contain info regarding the exact failed pages and
the file offset they map to). The kernel wouldn't even have to
maintain per-page bits to trace the errors, since they will be
consumed by the process that reads the events (or discarded,
when the notification fd is closed).

Assuming this would be possible, wouldn't Pg still need to deal
with synchronizing writers and related issues (since this would
be merely a notification mechanism - not prevent any process
from continuing), which I understand would be rather intrusive
for the current Pg multi-process design?

But other than that, this interface could in principle be implemented
similarly in the BSDs via kqueue(), I suppose, to provide what you
need.

Best regards,
Anthony



Re: Transform for pl/perl

2018-04-09 Thread Dagfinn Ilmari Mannsåker
Peter Eisentraut  writes:

> On 3/15/18 03:46, Pavel Stehule wrote:
>> It looks well
>> 
>> the patch has tests and doc,
>> there are not any warnings or compilation issues
>> all tests passed
>> 
>> I'll mark this patch as ready for commiter
>
> committed

I played around with this a bit, and noticed that the number handling
doesn't cope with Perl UVs (unsigned integers) in the (IV_MAX, UV_MAX]
range:


ilmari@[local]:5432 ~=# CREATE FUNCTION testUVToJsonb() RETURNS jsonb
ilmari@[local]:5432 ~-# LANGUAGE plperl TRANSFORM FOR TYPE jsonb
ilmari@[local]:5432 ~-# as $$
ilmari@[local]:5432 ~$# $val = ~0;
ilmari@[local]:5432 ~$# elog(NOTICE, "value is $val");
ilmari@[local]:5432 ~$# return $val;
ilmari@[local]:5432 ~$# $$;
CREATE FUNCTION
Time: 6.795 ms
ilmari@[local]:5432 ~=# select testUVToJsonb();
NOTICE:  value is 18446744073709551615
┌───┐
│ testuvtojsonb │
├───┤
│ -1│
└───┘
(1 row)

I tried fixing this by adding an 'if (SvUV(in))' clause to
SV_to_JsonbValue, but I couldn't find a function to create a numeric
value from an uint64.  If it's not possible, should we error on UVs
greater than PG_INT64_MAX?

- ilmari
-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
 the consequences of."  -- Skud's Meta-Law



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
>> On 9 April 2018 at 11:50, Anthony Iliopoulos  wrote:
>>
>>> What you seem to be asking for is the capability of dropping
>>> buffers over the (kernel) fence and indemnifying the application
>>> from any further responsibility, i.e. a hard assurance
>>> that either the kernel will persist the pages or it will
>>> keep them around till the application recovers them
>>> asynchronously, the filesystem is unmounted, or the system
>>> is rebooted.
>>>
>>
>> That seems like a perfectly reasonable position to take, frankly.
> 
> Indeed, as long as you are willing to ignore the consequences of
> this design decision: mainly, how you would recover memory when no
> application is interested in clearing the error. At which point
> other applications with different priorities will find this position
> rather unreasonable since there can be no way out of it for them.

Sure, but the question is whether the system can reasonably operate
after some of the writes failed and the data got lost. Because if it
can't, then recovering the memory is rather useless. It might be better
to stop the system in that case, forcing the system administrator to
resolve the issue somehow (fail-over to a replica, perform recovery from
the last checkpoint, ...).

We already have dirty_bytes and dirty_background_bytes, for example. I
don't see why there couldn't be another limit defining how much dirty
data to allow before blocking writes altogether. I'm sure it's not that
simple, but you get the general idea - do not allow using all available
memory because of writeback issues, but don't throw the data away in
case it's just a temporary issue.

> Good luck convincing any OS kernel upstream to go with this design.
> 

Well, there seem to be kernels that seem to do exactly that already. At
least that's how I understand what this thread says about FreeBSD and
Illumos, for example. So it's not an entirely insane design, apparently.

The question is whether the current design makes it any easier for
user-space developers to build reliable systems. We have tried using it,
and unfortunately the answers seems to be "no" and "Use direct I/O and
manage everything on your own!"

>> The whole _point_ of an Operating System should be that you can do exactly
>> that. As a developer I should be able to call write() and fsync() and know
>> that if both calls have succeeded then the result is on disk, no matter
>> what another application has done in the meantime. If that's a "difficult"
>> problem then that's the OS's problem, not mine. If the OS doesn't do that,
>> it's _not_doing_its_job_.
> 
> No OS kernel that I know of provides any promises for atomicity of a
> write()+fsync() sequence, unless one is using O_SYNC. It doesn't
> provide you with isolation either, as this is delegated to userspace,
> where processes that share a file should coordinate accordingly.
> 

We can (and do) take care of the atomicity and isolation. Implementation
of those parts is obviously very application-specific, and we have WAL
and locks for that purpose. I/O on the other hand seems to be a generic
service provided by the OS - at least that's how we saw it until now.

> It's not a difficult problem, but rather the kernels provide a common
> denominator of possible interfaces and designs that could accommodate
> a wider range of potential application scenarios for which the kernel
> cannot possibly anticipate requirements. There have been plenty of
> experimental works for providing a transactional (ACID) filesystem
> interface to applications. On the opposite end, there have been quite
> a few commercial databases that completely bypass the kernel storage
> stack. But I would assume it is reasonable to figure out something
> between those two extremes that can work in a "portable" fashion.
> 

Users ask us about this quite often, actually. The question is usually
about "RAW devices" and performance, but ultimately it boils down to
buffered vs. direct I/O. So far our answer has been that we rely on the
kernel to do this reliably, because it knows how to do that correctly and we simply
don't have the manpower to implement it (portable, reliable, handling
different types of storage, ...).

One has to wonder how many applications actually use this correctly,
considering PostgreSQL cares about data durability/consistency so much
and yet we've been misunderstanding how it works for 20+ years.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Boolean partitions syntax

2018-04-09 Thread Jonathan S. Katz

> On Apr 9, 2018, at 8:28 AM, Peter Eisentraut wrote:
> 
> On 4/7/18 11:16, Jonathan S. Katz wrote:
>> The last line yielding:
>> 
>> ERROR:  syntax error at or near "TRUE"
>> LINE 3: FOR VALUES IN (TRUE);
>> 
>> [Omitted from example: the “records_active” partition]
>> 
>> I’m glad to see this was added to the open items. I would strongly
>> suggest fixing
>> this prior to the 11 release as it is unintuitive from a user standpoint
>> to use ‘TRUE’
> 
> I think this is actually more accurately classified as an existing bug
> in PostgreSQL 10.

+1 based on running the above scenario on my 10.3 instance and
receiving the same error.  Is there a chance the fix could make it into
10.4 then?

Thanks,

Jonathan




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> 
> A crazy idea would be to have a daemon that checks the logs and
> stops Postgres when it seems something is wrong.
> 

That doesn't seem like a very practical way. It's better than nothing,
of course, but I wonder how would that work with containers (where I
think you may not have access to the kernel log at all). Also, I'm
pretty sure the messages do change based on kernel version (and possibly
filesystem) so parsing it reliably seems rather difficult. And we
probably don't want to PANIC after I/O error on an unrelated device, so
we'd need to understand which devices are related to PostgreSQL.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Abhijit Menon-Sen
At 2018-04-09 15:42:35 +0200, [email protected] wrote:
>
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> > 
> > A crazy idea would be to have a daemon that checks the logs and
> > stops Postgres when it seems something is wrong.
> > 
> 
> That doesn't seem like a very practical way.

Not least because Craig's tests showed that you can't rely on *always*
getting an error message in the logs.

-- Abhijit



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 04:00 AM, Craig Ringer wrote:
> On 9 April 2018 at 07:16, Andres Freund wrote:
>
>> I think the danger presented here is far smaller than some of the
>> statements in this thread might make one think.
>
> Clearly it's not happening a huge amount or we'd have a lot of noise
> about Pg eating people's data, people shouting about how unreliable it
> is, etc. We don't. So it's not some earth shattering imminent threat to
> everyone's data. It's gone unnoticed, or the root cause unidentified,
> for a long time.
> 

Yeah, it clearly isn't the case that everything we do suddenly got
pointless. It's fairly annoying, though.

> I suspect we've written off a fair few issues in the past as "it's
> bad hardware" when actually, the hardware fault was the trigger for
> a Pg/kernel interaction bug. And blamed containers for things that
> weren't really the container's fault. But even so, if it were
> happening tons, we'd hear more noise.
> 

Right. Write errors are fairly rare, and we've probably ignored a fair
number of cases demonstrating this issue. It kinda reminds me of the
wisdom that not seeing planes with bullet holes in the engine does not
mean engines don't need armor [1].

[1]
https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d



regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-09 Thread Alvaro Herrera
Hello,

Amit Langote wrote:

> I have reproduced this and found that the problem is that
> perform_pruning_combine_step forgets to *copy* the bitmapset of the first
> step in the handling of an COMBINE_INTERSECT step.

Pushed, thanks Amit and Andreas!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transform for pl/perl

2018-04-09 Thread Tom Lane
[email protected] (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:
> I tried fixing this by adding an 'if (SvUV(in))' clause to
> SV_to_JsonbValue, but I couldn't find a function to create a numeric
> value from an uint64.  If it's not possible, should we error on UVs
> greater than PG_INT64_MAX?

I think you'd have to convert to text and back.  That's kind of icky,
but it beats failing.

Or we could add a not-visible-to-SQL uint8-to-numeric function in
numeric.c.  Not sure if this is enough use-case to justify that
though.

regards, tom lane



Re: Boolean partitions syntax

2018-04-09 Thread Tom Lane
"Jonathan S. Katz"  writes:
> +1 based on running the above scenario on my 10.3 instance and
> receiving the same error.  Is there a chance the fix could make it into
> 10.4 then?

It's premature to discuss whether this could be back-patched when
we haven't got an acceptable patch yet.

regards, tom lane



RE: WIP: Covering + unique indexes.

2018-04-09 Thread Shinoda, Noriyoshi
Hi, 

I tested this feature and found that documentation is missing for the
columns added to the pg_constraint catalog.
The attached patch adds a description of the 'conincluding' column to
the pg_constraint catalog documentation.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Teodor Sigaev [mailto:[email protected]] 
Sent: Monday, April 9, 2018 3:20 PM
To: Peter Geoghegan 
Cc: Jeff Janes ; Alexander Korotkov 
; Anastasia Lubennikova 
; PostgreSQL Hackers 

Subject: Re: WIP: Covering + unique indexes.

Thank you, pushed.


Peter Geoghegan wrote:
> On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev  wrote:
>> Thank you, fixed
> 
> I suggest that we remove some unneeded amcheck tests, as in the 
> attached patch. They don't seem to add anything.
> 

-- 
Teodor Sigaev  E-mail: [email protected]
   WWW: http://www.sigaev.ru/



pg_constraint.patch
Description: pg_constraint.patch


Re: Boolean partitions syntax

2018-04-09 Thread Jonathan S. Katz

> On Apr 9, 2018, at 10:06 AM, Tom Lane  wrote:
> 
> "Jonathan S. Katz"  writes:
>> +1 based on running the above scenario on my 10.3 instance and
>> receiving the same error.  Is there a chance the fix could make it into
>> 10.4 then?
> 
> It's premature to discuss whether this could be back-patched when
> we haven't got an acceptable patch yet.

Understood.

Jonathan




Re: Documentation for bootstrap data conversion

2018-04-09 Thread Tom Lane
John Naylor  writes:
> On 4/7/18, Tom Lane  wrote:
>>  reformat_dat_file.pl preserves blank lines
>>  and comment lines as-is.

> As it is now, it will actually collapse consecutive blank lines into
> one. I maintain that was necessary during conversion to get some
> semblance of consistency, but now it may not be a good idea to tie
> developers' hands in surprising ways if they want double blank lines
> in some places. It would be pretty easy to remove this behavior.
> Apologies if it was not documented well enough.

I did note that in some internal comments, but forgot it when writing
this.  I agree that now that the conversion is done, it'd be better
to remove that special case.  Would you send a patch for that?

> 2. I noticed the use of
> pg_xxx.h
> pg_xxx_d.h
> where I would expect . Not sure if it matters.

Good point.  It likely doesn't matter in terms of the output, but
considering I was going to the trouble of using those instead of just
 everywhere, would be better to choose the right one.

> 3. It seems the preferred style is to refer to "bootstrap" relations
> rather than "bootstrapped" relations. The attached patch makes code
> comments more like the docs in this regard.

Meh, I think either is fine really.  I do recall changing something
in bki.sgml that referred to both "bootstrap relations" and "bootstrap
catalogs" in practically the same sentence.  I think that *is* confusing,
because it's not obvious whether relation and catalog are meant to be
interchangeable terms (and, in general, they aren't).  If we wanted to
do anything here I'd be more interested in s/relation/catalog/g in this
usage.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>
> We already have dirty_bytes and dirty_background_bytes, for example. I
> don't see why there couldn't be another limit defining how much dirty
> data to allow before blocking writes altogether. I'm sure it's not that
> simple, but you get the general idea - do not allow using all available
> memory because of writeback issues, but don't throw the data away in
> case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie"
pages may occupy. Not sure how helpful it would be in the long run
since this tends to be highly application-specific, and for something
with a large data footprint one would end up tuning this accordingly
in a system-wide manner. This has the potential to leave other
applications running in the same system with very little memory, in
cases where for example original application crashes and never clears
the error. Apart from that, further interfaces would need to be provided
for actually dealing with the error (again assuming non-transient
issues that may not be fixed transparently and that temporary issues
are taken care of by lower layers of the stack).

> Well, there seem to be kernels that seem to do exactly that already. At
> least that's how I understand what this thread says about FreeBSD and
> Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right
there (since 2017), mentioning that there can be no recovery from
EIO at the block layer and this needs to be done differently. No
idea how an application running on top of either FreeBSD or Illumos
would actually recover from this error (and clear it out), other
than remounting the fs in order to force dropping of relevant pages.
It does, though, indeed provide a persistent error indication that
would allow Pg to reliably panic. But again this does not
necessarily play well with other applications that may be using
the filesystem reliably at the same time, and are now faced with
EIO while their own writes succeed to be persisted.

Ideally, you'd want a (potentially persistent) indication of error
localized to a file region (mapping the corresponding failed writeback
pages). NetBSD is already implementing fsync_ranges(), which could
be a step in the right direction.

> One has to wonder how many applications actually use this correctly,
> considering PostgreSQL cares about data durability/consistency so much
> and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have
a very simple process model (e.g. embedded DBs that can abort a
txn on fsync() EIO). I think that durability is a rather complex
cross-layer issue which has been grossly misunderstood similarly
in the past (e.g. see [1]). It seems that both the OS and DB
communities greatly benefit from a periodic reality check, and
I see this as an opportunity for strengthening the IO stack in
an end-to-end manner.

Best regards,
Anthony

[1] 
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf



Re: WIP: Covering + unique indexes.

2018-04-09 Thread Alexander Korotkov
Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <
[email protected]> wrote:

> I tested this feature and found that documentation is missing for the
> columns added to the pg_constraint catalog.
> The attached patch adds a description of the 'conincluding' column to
> the documentation of the pg_constraint catalog.
>

Thank you for pointing this out!
I think we need a more thorough explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg_constraint-2.patch
Description: Binary data


Re: pruning disabled for array, enum, record, range type partition keys

2018-04-09 Thread Tom Lane
Amit Langote  writes:
> I noticed that the newly added pruning does not work if the partition key
> is of one of the types that have a corresponding pseudo-type.

While I don't recall the details due to acute caffeine shortage,
there are specific coding patterns that are used in the planner
(e.g. in indxpath.c) to ensure that the code works with pseudotype
opclasses as well as simple ones.  The existence of this bug
indicates that those conventions were not followed in the pruning
code.  I wonder whether this patch makes the pruning code look
more like the pre-existing code, or even less like it.

regards, tom lane



RE: WIP: Covering + unique indexes.

2018-04-09 Thread Shinoda, Noriyoshi
Hi!

Thank you for your response.
Your proposal looks good to me.

Regards,
Noriyoshi Shinoda

From: Alexander Korotkov [mailto:[email protected]]
Sent: Monday, April 9, 2018 11:22 PM
To: Shinoda, Noriyoshi 
Cc: PostgreSQL Hackers ; Teodor Sigaev 
; Peter Geoghegan ; Jeff Janes 
; Anastasia Lubennikova 
Subject: Re: WIP: Covering + unique indexes.

Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi
<[email protected]> wrote:
I tested this feature and found a document shortage in the columns added to the 
pg_constraint catalog.
The attached patch will add the description of the 'conincluding' column to the 
manual of the pg_constraint catalog.

Thank you for pointing this!
I think we need more wordy explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Teodor Sigaev

Hi!


> 1. Why do we lock all posting tree pages, even though they all represent
> the same value? Isn't it enough to lock the root of the posting tree?
>
> 2. Why do we lock any posting tree pages at all, if we lock the entry
> tree page anyway? Isn't the lock on the entry tree page sufficient to
> cover the key value?
>
> 3. Why do we *not* lock the entry leaf page, if there is no match? We
> still need a lock to remember that we probed for that value and there
> was no match, so that we conflict with a tuple that might be inserted
> later.
>
> At least #3 is a bug. The attached patch adds an isolation test that
> demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so
> I think we should fix those too, even if they don't lead to incorrect
> results.

I can't find a hole here. Agree.

> I took a stab at fixing those issues, as well as the bug when fastupdate
> is turned on concurrently. Does the attached patch look sane to you?

I like the idea of using metapage locking, thank you. The patch seems
good; will you push it?

--
Teodor Sigaev   E-mail: [email protected]
   WWW: http://www.sigaev.ru/



Re: WIP: Covering + unique indexes.

2018-04-09 Thread Teodor Sigaev

Thanks to both of you, pushed

Shinoda, Noriyoshi wrote:

Hi!

Thank you for your response.

I think that it is good with your proposal.

Regards,

Noriyoshi Shinoda

*From:*Alexander Korotkov [mailto:[email protected]]
*Sent:* Monday, April 9, 2018 11:22 PM
*To:* Shinoda, Noriyoshi 
*Cc:* PostgreSQL Hackers ; Teodor Sigaev 
; Peter Geoghegan ; Jeff Janes 
; Anastasia Lubennikova 

*Subject:* Re: WIP: Covering + unique indexes.

Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi wrote:


I tested this feature and found a document shortage in the columns added to
the pg_constraint catalog.
The attached patch will add the description of the 'conincluding' column to
the manual of the pg_constraint catalog.

Thank you for pointing this!

I think we need more wordy explanation here.  My proposal is attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com 
The Russian Postgres Company



--
Teodor Sigaev   E-mail: [email protected]
   WWW: http://www.sigaev.ru/



Re: lazy detoasting

2018-04-09 Thread Chapman Flack
On 04/08/2018 02:01 AM, Andrew Gierth wrote:

>  Chapman> (d) some other reason I haven't thought of ?
> 
> It has to be pushed as the active snapshot so that it's the snapshot
> that the executor uses to run the query to populate the tuplestore which
> becomes the "held" portal content.

That seems closest to my (b) guess.

> I think you're confusing the stack of active snapshots with the registry
> of snapshots. The snapshot of a portal that's not currently running in
> the executor won't be on the stack, but it will be registered (which is
> enough to maintain the session's reported xmin, which is what prevents
> visible data from being vacuumed away).

That's why I was asking. The first paragraph in snapmgr.c seems to say
that the registry and the active stack both exist as ways to keep the
snapshot alive, with reachability from either one being sufficient:

* We keep track of snapshots in two ways: those "registered" by
* resowner.c, and the "active snapshot" stack.  All snapshots in
* either of them live in persistent memory.  When a snapshot is
* no longer in any of these lists (tracked by separate refcounts
* on each snapshot), its memory can be freed.

AFAICS, that is *all* that comment block has to say about why there's
an active snapshot stack. I believe you are saying it has another
important function, namely that its top element is what tells the
executor what can be seen. That makes sense, and to clarify that
was why I was asking.

I suppose the reason that's not mentioned in the comments is that it
was so obviously the ultimate purpose of the whole scheme that nobody
writing or reviewing the comments could imagine not knowing it. :)
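The two-list lifetime rule quoted above can be sketched as a toy refcount
model (ordinary standalone C with invented names, not backend code): a
snapshot reachable from either the active stack or the registry stays
alive, and is freed only once it has left both.

```c
#include <assert.h>

/* Toy stand-in for snapmgr.c's two refcounts: "active" for the active
 * snapshot stack, "regd" for registrations by resource owners.  The
 * snapshot's memory may be released only when both counts reach zero. */
typedef struct ToySnapshot
{
    int active_count;
    int regd_count;
    int freed;              /* 1 once the snapshot's memory is released */
} ToySnapshot;

static void maybe_free(ToySnapshot *snap)
{
    if (snap->active_count == 0 && snap->regd_count == 0)
        snap->freed = 1;
}

static void push_active(ToySnapshot *snap)        { snap->active_count++; }
static void pop_active(ToySnapshot *snap)         { snap->active_count--; maybe_free(snap); }
static void register_snapshot(ToySnapshot *snap)  { snap->regd_count++; }
static void unregister_snapshot(ToySnapshot *snap){ snap->regd_count--; maybe_free(snap); }
```

In these terms, a held portal's snapshot has been popped off the active
stack but remains registered, which is what keeps the session's reported
xmin — and hence the visible data — pinned.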

-Chap



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Alexander Korotkov
Hi!

Thank you for taking a look at this patch.  I really appreciate your
attention to complex subjects like this.

On Mon, Apr 9, 2018 at 1:33 PM, Heikki Linnakangas  wrote:

> On 28/03/18 19:53, Teodor Sigaev wrote:
>> As I understand, scan should lock any visited page, but now it's true
>> only for posting tree. Seems, it also should lock pages in entry tree
>> because concurrent processes could add new entries which could be
>> matched by partial search, for example. BTW, partial search (see
>> collectMatchBitmap()) locks the entry tree correctly, but regular
>> startScanEntry() doesn't lock the entry page in the case of a posting
>> tree, only in the case of a posting list.
>
> I think this needs some high-level comments or README to explain how the
> locking works. It seems pretty ad hoc at the moment. And incorrect.

I agree that the explanation in the README is insufficient.

> 1. Why do we lock all posting tree pages, even though they all represent
> the same value? Isn't it enough to lock the root of the posting tree?
>
> 2. Why do we lock any posting tree pages at all, if we lock the entry tree
> page anyway? Isn't the lock on the entry tree page sufficient to cover the
> key value?

I already raised similar concerns in [1].  The idea of locking posting
tree leaves was to get more granular locks.  I think you've described it
correctly in the commit message here:

With a very large posting tree, it would possibly be better to lock
the posting tree leaf pages instead, so that with a "skip scan" query
like "A & B" you could avoid an unnecessary conflict if a new tuple
is inserted with A but !B. But let's keep this simple.

However, it's very complex and error prone.  So, +1 for simplifying it
for v11.


> 3. Why do we *not* lock the entry leaf page, if there is no match? We
> still need a lock to remember that we probed for that value and there was
> no match, so that we conflict with a tuple that might be inserted later.
>

+1

> At least #3 is a bug. The attached patch adds an isolation test that
> demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so I
> think we should fix those too, even if they don't lead to incorrect results.
>
> Remember, the purpose of predicate locks is to lock key ranges, not
> physical pages or tuples in the index. We use leaf pages as handy shortcut
> for "any key value that would belong on this page", but it is just an
> implementation detail.
>
> I took a stab at fixing those issues, as well as the bug when fastupdate
> is turned on concurrently. Does the attached patch look sane to you?


Teodor has already answered you, and I'd like to mention that your patch
looks good to me too.

1. https://www.postgresql.org/message-id/CAPpHfdvED_-7KbJp-ei4zRZu1brLgkJt4CA-uxF0iRO9WX2Sqw%40mail.gmail.com
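To illustrate the "leaf page as shortcut for a key range" point with a toy
model (standalone C, invented names, not backend code): a predicate lock
taken on the leaf page covering a probed key must conflict with any later
insert into that page's key range — and, per #3 above, this applies even
when the probe itself found no match.

```c
#include <assert.h>

/* A leaf page stands in for the key range [lo, hi) of values that would
 * be stored on it.  A probe for key k — match or no match — must leave a
 * lock on the page covering k, so that a later insert of any key in
 * [lo, hi) is detected as a rw-conflict. */
typedef struct PageLock
{
    long lo;
    long hi;
} PageLock;

static int insert_conflicts(const PageLock *lock, long key)
{
    return key >= lock->lo && key < lock->hi;
}
```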

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Transform for pl/perl

2018-04-09 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> [email protected] (Dagfinn Ilmari Mannsåker) writes:
>> I tried fixing this by adding an 'if (SvUV(in))' clause to
>> SV_to_JsonbValue, but I couldn't find a function to create a numeric
>> value from an uint64.  If it's not possible, should we error on UVs
>> greater than PG_INT64_MAX?
>
> I think you'd have to convert to text and back.  That's kind of icky,
> but it beats failing.

I had a look, and that's what the PL/Python transform does.  Attached is
a patch that does that for PL/Perl too, but only if the value is
actually > PG_INT64_MAX.

The secondary output files are for Perls with 32bit IV/UV types, but I
haven't been able to test them, since Debian's Perl uses 64bit integers
even on 32bit platforms.

> Or we could add a not-visible-to-SQL uint8-to-numeric function in
> numeric.c.  Not sure if this is enough use-case to justify that
> though.

I don't think this one use-case is enough, but it's worth keeping in
mind if it keeps cropping up.

- ilmari
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen

From acf968b4df81797fc06868dac87123413f3f4167 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Thu, 5 Apr 2018 16:23:59 +0100
Subject: [PATCH] Handle integers > PG_INT64_MAX in PL/Perl JSONB transform

---
 .../jsonb_plperl/expected/jsonb_plperl.out|  15 +-
 .../jsonb_plperl/expected/jsonb_plperl_1.out  | 223 ++
 .../jsonb_plperl/expected/jsonb_plperlu.out   |  15 +-
 .../jsonb_plperl/expected/jsonb_plperlu_1.out | 223 ++
 contrib/jsonb_plperl/jsonb_plperl.c   |  20 +-
 contrib/jsonb_plperl/sql/jsonb_plperl.sql |   9 +
 contrib/jsonb_plperl/sql/jsonb_plperlu.sql|  10 +
 7 files changed, 512 insertions(+), 3 deletions(-)
 create mode 100644 contrib/jsonb_plperl/expected/jsonb_plperl_1.out
 create mode 100644 contrib/jsonb_plperl/expected/jsonb_plperlu_1.out

diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out
index 99a2e8e135..c311a603f0 100644
--- a/contrib/jsonb_plperl/expected/jsonb_plperl.out
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out
@@ -52,6 +52,19 @@ SELECT testRegexpResultToJsonb();
  0
 (1 row)
 
+CREATE FUNCTION testUVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+as $$
+$val = ~0;
+return $val;
+$$;
+SELECT testUVToJsonb();
+testuvtojsonb 
+--
+ 18446744073709551615
+(1 row)
+
 CREATE FUNCTION roundtrip(val jsonb) RETURNS jsonb
 LANGUAGE plperl
 TRANSFORM FOR TYPE jsonb
@@ -207,4 +220,4 @@ SELECT roundtrip('{"1": {"2": [3, 4, 5]}, "2": 3}');
 
 \set VERBOSITY terse \\ -- suppress cascade details
 DROP EXTENSION plperl CASCADE;
-NOTICE:  drop cascades to 6 other objects
+NOTICE:  drop cascades to 7 other objects
diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl_1.out b/contrib/jsonb_plperl/expected/jsonb_plperl_1.out
new file mode 100644
index 00..c425c73b9c
--- /dev/null
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl_1.out
@@ -0,0 +1,223 @@
+CREATE EXTENSION jsonb_plperl CASCADE;
+NOTICE:  installing required extension "plperl"
+CREATE FUNCTION testHVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = {a => 1, b => 'boo', c => undef};
+return $val;
+$$;
+SELECT testHVToJsonb();
+  testhvtojsonb  
+-
+ {"a": 1, "b": "boo", "c": null}
+(1 row)
+
+CREATE FUNCTION testAVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = [{a => 1, b => 'boo', c => undef}, {d => 2}];
+return $val;
+$$;
+SELECT testAVToJsonb();
+testavtojsonb
+-
+ [{"a": 1, "b": "boo", "c": null}, {"d": 2}]
+(1 row)
+
+CREATE FUNCTION testSVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = 1;
+return $val;
+$$;
+SELECT testSVToJsonb();
+ testsvtojsonb 
+---
+ 1
+(1 row)
+
+-- this revealed a bug in the original implementation
+CREATE FUNCTION testRegexpResultToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+return ('1' =~ m(0\t2));
+$$;
+SELECT testRegexpResultToJsonb();
+ testregexpresulttojsonb 
+-
+ 0
+(1 row)
+
+CREATE FUNCTION testUVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+as $$
+$val = ~0;
+return $val;
+$$;
+SELECT testUVToJsonb();
+ testuvtojsonb 
+---
+ 4294967295
+(1 row)
+
+CREATE FUNCTION roundtrip(val jsonb) RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+return $_[0];
+$$;
+SELECT roundtrip('null');
+ roundtrip 
+---
+ null
+(1 row)
+
+SELECT roundtrip('1');
+ roundtrip 
+---
+ 1
+(1 row)
+
+SELECT roundtrip('1E+131071');
+ERROR:  cannot convert infinite value to jsonb
+C

Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Alvaro Herrera
Heikki Linnakangas wrote:

> Remember, the purpose of predicate locks is to lock key ranges, not physical
> pages or tuples in the index. We use leaf pages as handy shortcut for "any
> key value that would belong on this page", but it is just an implementation
> detail.

Hmm ... so, thinking about pending list locking, would it work to
acquire locks on the posting tree's root of each item in the pending
list, when the item is put in the pending list? (even if we insert the
item in the pending list instead of its posting tree).

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Andrey Borodin

> On 9 Apr 2018, at 19:50, Teodor Sigaev  wrote:
>> 
>> 3. Why do we *not* lock the entry leaf page, if there is no match? We still 
>> need a lock to remember that we probed for that value and there was no 
>> match, so that we conflict with a tuple that might be inserted later.
>> At least #3 is a bug. The attached patch adds an isolation test that 
>> demonstrates it. #1 and #2 are weird, and cause unnecessary locking, so I 
>> think we should fix those too, even if they don't lead to incorrect results.
> 
> I can't find a hole here. Agree.
Please correct me if I'm wrong.
Let's say we have posting trees for word A and word B.
We are looking for a document that contains both.
We will read through the whole posting tree of A, but only through some
segments of B. If we don't find anything in B, we have to lock only the
segments where we actually looked, not the whole posting tree of B.

BTW, I do not think that we lock ranges. We lock against the appearance
of tuples that we might have found. Ranges are shortcuts for the places
where we were looking. That's how I understand it; chances are I'm
missing something.

Best regards, Andrey Borodin.


Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Teodor Sigaev



Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> Remember, the purpose of predicate locks is to lock key ranges, not
>> physical pages or tuples in the index. We use leaf pages as handy
>> shortcut for "any key value that would belong on this page", but it is
>> just an implementation detail.
>
> Hmm ... so, thinking about pending list locking, would it work to
> acquire locks on the posting tree's root of each item in the pending
> list, when the item is put in the pending list? (even if we insert the
> item in the pending list instead of its posting tree).


Items in the pending list don't use a posting tree or list; the pending
list is just a list of (heap ItemPointer, entry) pairs represented as
IndexTuples. There is no order in the pending list, so Heikki suggests
always locking the metapage instead of locking the whole relation when
fastupdate=on. If fastupdate is off, then the insertion process will not
change the metapage.


--
Teodor Sigaev   E-mail: [email protected]
   WWW: http://www.sigaev.ru/



Re: Transform for pl/perl

2018-04-09 Thread Tom Lane
[email protected] (Dagfinn Ilmari Mannsåker) writes:
> Tom Lane  writes:
>> I think you'd have to convert to text and back.  That's kind of icky,
>> but it beats failing.

> I had a look, and that's what the PL/Python transform does.  Attached is
> a patch that does that for PL/Perl too, but only if the value is
> actually > PG_INT64_MAX.

> The secondary output files are for Perls with 32bit IV/UV types, but I
> haven't been able to test them, since Debian's Perl uses 64bit integers
> even on 32bit platforms.

Ugh.  I really don't want to maintain a separate expected-file for this,
especially not if it's going to be hard to test.  Can we choose another
way of exercising the code path?

Another issue with this code as written is that on 32-bit-UV platforms,
at least some compilers will give warnings about the constant-false
predicate.  Not sure about a good solution for that.  Maybe it's a
sufficient reason to invent uint8_numeric so we don't need a range check.
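As a minimal sketch of the approach under discussion (standalone C;
uv_to_text() is an invented helper name — in the backend, the resulting
string would then be handed to numeric_in()):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values up to PG_INT64_MAX can go through the ordinary int8 path;
 * only larger UVs need the text round-trip into numeric. */
static int needs_text_path(uint64_t uv)
{
    return uv > (uint64_t) INT64_MAX;
}

/* Invented helper: render the UV as decimal text, ready to be parsed
 * by a numeric-input routine such as numeric_in(). */
static const char *uv_to_text(uint64_t uv, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%" PRIu64, uv);
    return buf;
}
```

On a 32-bit-UV build, needs_text_path() is constantly false — exactly the
constant-false predicate some compilers will warn about.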

regards, tom lane



Re: [HACKERS] path toward faster partition pruning

2018-04-09 Thread Robert Haas
On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane  wrote:
> David Rowley  writes:
>> Sounds like you're saying that if we have too many alternative files
>> then there's a chance that one could pass by luck.
>
> Yeah, exactly: it passed, but did it pass for the right reason?
>
> If there's just two expected-files, it's likely not a big problem,
> but if you have a bunch it's something to worry about.
>
> I'm also wondering how come we had hash partitioning before and
> did not have this sort of problem.  Is it just that we added a
> new test that's more sensitive to the details of the hashing
> (if so, could it be made less so)?  Or is there actually more
> platform dependence now than before (and if so, why is that)?

The existing hash partitioning tests did have some dependencies on the
hash function, but they took care not to use the built-in hash
functions.  Instead they did stuff like this:

CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
$$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
CREATE TABLE mchash (a int, b text, c jsonb)
  PARTITION BY HASH (a test_int4_ops, b test_text_ops);

I think that this approach should also be used for the new tests.
Variant expected output files are a pain to maintain, and you
basically just have to take whatever output you get as the right
answer, because nobody knows what output a certain built-in hash
function should produce for a given input except by running the code.
If you do the kind of thing shown above, though, then you can easily
see by inspection that you're getting the right answer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Greg Stark
On 9 April 2018 at 15:22, Anthony Iliopoulos  wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to
control how much memory is allowed to be full of dirty buffers in the
first place to ensure that the system won't get memory starved if it
can't clean them fast enough. That isn't even about persistent
hardware errors. Even when the hardware is working perfectly it can
only flush buffers so fast.  The whole point of the kernel is to
abstract away shared resources. It's not like user space has any
better view of the situation here. If Postgres implemented all this in
DIRECT_IO it would have exactly the same problem only with less
visibility into what the rest of the system is doing. If every
application implemented its own buffer cache we would be back in the
same boat only with a fragmented memory allocation.

> This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error.

I still think we're speaking two different languages. There's no
application anywhere that's going to "clear the error". The
application has done the writes and if it's calling fsync it wants to
wait until the filesystem can arrange for the write to be persisted.
If the application could manage without the persistence then it
wouldn't have called fsync.

The only way to "clear out" the error would be by having the writes
succeed. There's no reason to think that wouldn't be possible
sometime. The filesystem could remap blocks or an administrator could
replace degraded raid device components. The only thing Postgres could
do to recover would be create a new file and move the data (reading
from the dirty buffer in memory!) to a new file anyways so we would
"clear the error" by just no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the
documentation promised and it's what Postgres always expected. It
sounds like the kernel implementors looked at it as some kind of
communication channel for reporting the status of specific writes
back to user space. That's a much more complex problem and would have
an entirely different interface. I think this is why we're having so
much difficulty communicating.



> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does provide though indeed a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes succeed to be persisted.

Well if they're writing to the same file that had a previous error I
doubt there are many applications that would be happy to consider
their writes "persisted" when the file was corrupt. Ironically the
earlier discussion quoted talked about how applications that wanted
more granular communication would be using O_DIRECT -- but what we
have is fsync trying to be *too* granular such that it's impossible to
get any strong guarantees about anything with it.

>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
>
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO).

Honestly, I don't think there's *any* way to use the current interface
to implement reliable operation. Even an embedded database using a
single process and keeping every file open all the time (which means
file descriptor limits limit its scalability) can suffer silent
corruption whenever some other process, like a backup program, comes
along and calls fsync (or even sync?).
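In other words, the only policy an application can safely build on the
current interface is to treat any fsync() failure as "write-back state
unknown" and stop trusting the file. A minimal sketch (standalone C; the
abort() stands in for PostgreSQL's PANIC-and-recover-from-WAL):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* On current Linux the kernel reports a write-back error only once and
 * then clears it, so a successful *retry* of fsync() proves nothing
 * about the earlier dirty data.  Treat any failure as fatal instead of
 * retrying. */
static int fsync_or_die(int fd)
{
    if (fsync(fd) < 0)
    {
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
        abort();    /* i.e. PANIC; recover from the WAL, not by retrying */
    }
    return 0;
}
```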



-- 
greg



Re: [HACKERS] [PATCH] Incremental sort

2018-04-09 Thread Alexander Korotkov
On Sat, Apr 7, 2018 at 11:57 PM, Tomas Vondra 
wrote:

> On 04/07/2018 06:23 PM, Tom Lane wrote:
> > Teodor Sigaev  writes:
> >>> I dunno, how would you estimate whether this is actually a win or not?
> >>> I don't think our model of sort costs is anywhere near refined enough
> >>> or accurate enough to reliably predict whether this is better than
> >>> just doing it in one step.  Even if the cost model is good, it's not
> >>> going to be better than our statistics about the number/size of the
> >>> groups in the first column(s), and that's a notoriously unreliable stat.
> >
> >> I think that improvement in cost calculation of sort should be a
> >> separate patch, not directly connected to this one. Postpone patches
> >> till other part will be ready to get max improvement for postponed ones
> >> doesn't seem to me very good, especially if it suggests some improvement
> >> right now.
> >
> > No, you misunderstand the point of my argument.  Without a reasonably
> > reliable cost model, this patch could easily make performance *worse*
> > not better for many people, due to choosing incremental-sort plans
> > where they were really a loss.
> >
>
> Yeah. Essentially the patch could push the planner to pick a path that
> has low startup cost (and very high total cost), assuming it'll only
> need to read small part of the input. But if the number of groups in the
> input is low (perhaps just one huge group), that would be a regression.
>

Yes, I think the biggest risk here is too small a number of groups.  More
precisely, the risk is groups that are too large, even while the total
number of groups might be large enough.

> > If we were at the start of a development cycle and work were being
> > promised to be done later in the cycle to improve the planning aspect,
> > I'd be more charitable about it.  But this isn't merely the end of a
> > cycle, it's the *last day*.  Now is not the time to commit stuff that
> > needs, or even just might need, follow-on work.
> >
>
> +1 to that
>
> FWIW I'm willing to spend some time on the patch for PG12, particularly
> on the planner / costing part. The potential gains are too interesting
>

Thank you very much for your efforts on reviewing this patch.
I'm looking forward to working with you on this feature for PG12.

FWIW, I think that we're moving this patch in the right direction by
providing separate paths for incremental sort.  It's much better than
deciding between full or incremental sort in place.  For sure, considering
extra paths might cause a planning time regression, but I think the
same is true of many other planning optimizations.  One thing we can
do is to make enable_incrementalsort = off by default.  Then only users
who understand the importance of incremental sort will pay the extra
planning time.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Robert Haas
On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  wrote:
> I had reservations about a relation_open() in the new executor code.  It
> seemed a bit odd; we don't have any other relation_open in the executor
> anywhere.  However, setting up the pruneinfo needs some stuff from
> relcache that I don't see a reasonable mechanism to pass through
> planner.  I asked Andres about it on IM and while he didn't endorse the
> patch in any way, his quick opinion was that "it wasn't entirely
> insane".  I verified that we already hold lock on the relation.

I don't get this.  The executor surely had to (and did) open all of
the relations somewhere even before this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-04-09 Thread Andrey Borodin
Hi!

The work on the patch goes on; there has been some off-list discussion of
the patch with the author. The request for advice still stands.

I think that we should move this patch to the next CF, so I'm marking the
patch as needs review.

Best regards, Andrey Borodin.


Optimization of range queries

2018-04-09 Thread Konstantin Knizhnik

Hi hackers,

Postgres optimizer is not able to build efficient execution plan for the 
following query:


explain select * from  people_raw where not ("ID"<2068113880 AND "INN" 
is not null) and "ID"<=2068629726 AND "INN" is not null;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan on people_raw  (cost=74937803.72..210449640.49 rows=121521030 width=336)
   Recheck Cond: ("ID" <= 2068629726)
   Filter: (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))
   ->  Bitmap Index Scan on "People_pkey"  (cost=0.00..74907423.47 rows=2077021718 width=0)
         Index Cond: ("ID" <= 2068629726)
(5 rows)


Here the table is very large, but the query affects only a relatively
small number of rows, located in the range [2068113880, 2068629726].

Unfortunately, the optimizer doesn't take that into account.
Moreover, keeping both "INN" IS NOT NULL and "INN" IS NULL as operands
of the same AND is not smart:

 (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))

If I remove the "is not null" condition, then the plan is perfect:

explain select * from  people_raw where not ("ID"<2068113880) and 
"ID"<=2068629726;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using "People_pkey" on people_raw  (cost=0.58..196745.57 rows=586160 width=336)
   Index Cond: (("ID" >= 2068113880) AND ("ID" <= 2068629726))
(2 rows)

Before starting an investigation of the problem, I would like to hear
opinions and perhaps some advice from people familiar with the optimizer:
how difficult would it be to handle this case, and where should I look?
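For what it's worth, the missing simplification is plain Boolean algebra:
under "INN" IS NOT NULL, the ("INN" IS NULL) branch of the OR is dead,
leaving a simple range on "ID". A brute-force check of the equivalence
(standalone C, modelling the NULL flag explicitly; illustration only):

```c
#include <assert.h>

/* NOT ("ID" < 2068113880 AND "INN" IS NOT NULL)
 *     AND "ID" <= 2068629726 AND "INN" IS NOT NULL        -- original  */
static int original_pred(long id, int inn_is_null)
{
    return !(id < 2068113880L && !inn_is_null)
        && id <= 2068629726L && !inn_is_null;
}

/* "ID" >= 2068113880 AND "ID" <= 2068629726 AND "INN" IS NOT NULL */
static int simplified_pred(long id, int inn_is_null)
{
    return id >= 2068113880L && id <= 2068629726L && !inn_is_null;
}
```

The second form is what the range-friendly plan above evaluates; the two
predicates agree on every input (NULL semantics of the outer AND aside,
since a NULL test result rejects the row just as false does).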

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer  wrote:
> In the mean time, I propose that we fsync() on close() before we age FDs out
> of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
> we don't seem to have many better options. At least it'll only flush what we
> actually wrote to the OS buffers not what we may have in shared_buffers. If
> the bgwriter does the same thing, we should be 100% safe from this problem
> on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
> full_page_writes options that people can turn off if they know the risks /
> know their storage is safe / don't care.

Ouch.  If a process exits -- say, because the user typed \q into psql
-- then you're talking about potentially calling fsync() on a really
large number of file descriptors, flushing many gigabytes of data to
disk.  And it may well be that you never actually wrote any data
through any of those file descriptors -- those writes could have come
from other backends.  Or you may have written a little bit of data
through those FDs, but there could be lots of other data that you end
up flushing incidentally.  Perfectly innocuous things like starting up
a backend, running a few short queries, and then having that backend
exit suddenly turn into something that could have a massive
system-wide performance impact.

Also, if a backend ever manages to exit without running through this
code, or writes any dirty blocks afterward, then this still fails to
fix the problem completely.  I guess that's probably avoidable -- we
can put this late in the shutdown sequence and PANIC if it fails.

I have a really tough time believing this is the right way to solve
the problem.  We suffered for years because of ext3's desire to flush
the entire page cache whenever any single file was fsync()'d, which
was terrible.  Eventually ext4 became the norm, and the problem went
away.  Now we're going to deliberately insert logic to do a very
similar kind of terrible thing because the kernel developers have
decided that fsync() doesn't have to do what it says on the tin?  I
grant that there doesn't seem to be a better option, but I bet we're
going to have a lot of really unhappy users if we do this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Optimization of range queries

2018-04-09 Thread Teodor Sigaev

Hi!

12 years ago I proposed a patch which could "union" OR clauses into one 
range clause when possible. At that time pgsql could not use IS NULL 
as an index clause, so the patch doesn't support that


https://www.postgresql.org/message-id/flat/45742C51.9020602%40sigaev.ru

option number 4), all other are already committed.


Konstantin Knizhnik wrote:

Hi hackers,

Postgres optimizer is not able to build efficient execution plan for the 
following query:


explain select * from  people_raw where not ("ID"<2068113880 AND "INN" 
is not null) and "ID"<=2068629726 AND "INN" is not null;

  QUERY PLAN
 

  Bitmap Heap Scan on people_raw  (cost=74937803.72..210449640.49 
rows=121521030 width=336)

    Recheck Cond: ("ID" <= 2068629726)
    Filter: (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS 
NULL)))
    ->  Bitmap Index Scan on "People_pkey" (cost=0.00..74907423.47 
rows=2077021718 width=0)

  Index Cond: ("ID" <= 2068629726)
(5 rows)


Here the table is very large, but the query affects only a relatively small 
number of rows located in the range [2068113880, 2068629726].

But unfortunately the optimizer doesn't take that into account.
Moreover, keeping both "is not null" and "is null" as operands of the same 
AND is not smart:

  (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS NULL)))

If I remove the "is not null" condition, then the plan is perfect:

explain select * from  people_raw where not ("ID"<2068113880) and 
"ID"<=2068629726;

  QUERY PLAN
 

  Index Scan using "People_pkey" on people_raw  (cost=0.58..196745.57 
rows=586160 width=336)

    Index Cond: (("ID" >= 2068113880) AND ("ID" <= 2068629726))
(2 rows)

Before starting an investigation of the problem, I would like to know the 
opinion, and maybe some advice, of people familiar with the optimizer:

how difficult it will be to handle this case, and where to look.

Thanks in advance,



--
Teodor Sigaev  E-mail: [email protected]
  WWW: http://www.sigaev.ru/



Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-09 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Apr 08, 2018 at 11:05:09PM -0400, Tom Lane wrote:
>> Hm.  I'd tested "make -j all", but not going directly to "install".
>> Does it help if you add
>> $(SUBDIRS:%=install-%-recurse): | submake-generated-headers
>> to src/Makefile?

> That takes care of the problem from the root of the directory, but when
> doing the same from src/bin/ then the same issue shows up even if
> src/Makefile is patched to handle install targets.

Hm.  Not sure how far we want to go in that direction.  It's never really
been the case that you could configure a maintainer-clean tree and then
go and build in any random subdirectory without taking care of the
generated headers.  In principle, we could do something like the hack
Peter did with temp-install, where it's basically forced to be a
prerequisite of "make check" in EVERY subdirectory and then we buy back
some of the inefficiency by making it a no-op in child make runs.  Not
sure that's a good thing though.

Independently of that, though, I noticed that part of the issue in your
"make install" example is that relpath.c, along with some other frontend
code, includes catalog/catalog.h which includes pg_class.h (and thereby
now pg_class_d.h).  Since src/common/Makefile thinks the only generated
header it needs to worry about is errcodes.h, building src/common first
now falls over.  But allowing frontend code to include catalog.h is
pretty insane: that pulls in relcache.h as well, and surely we don't
want to tie our hands to the point that relcache.h has to be frontend
clean.

What we need to do is take OIDCHARS and TABLESPACE_VERSION_DIRECTORY
out of catalog.h and put them in some more frontend-friendly header.
My first thought was pg_tablespace_d.h, but my second one is relpath.h.
Comments?

regards, tom lane



Re: Online enabling of checksums

2018-04-09 Thread Magnus Hagander
On Sat, Apr 7, 2018 at 6:22 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-04-07 08:57:03 +0200, Magnus Hagander wrote:
> > Note however that I'm sans-laptop until Sunday, so I will revert it then
> or
> > possibly Monday.
>
> I'll deactive the isolationtester tests until then. They've been
> intermittently broken for days now and prevent other tests from being
> exercised.
>

Thanks.

I've pushed the revert now, and left the pg_verify_checksums in place for
the time being.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Heikki Linnakangas

On 09/04/18 18:04, Alvaro Herrera wrote:

Heikki Linnakangas wrote:


Remember, the purpose of predicate locks is to lock key ranges, not physical
pages or tuples in the index. We use leaf pages as a handy shortcut for "any
key value that would belong on this page", but it is just an implementation
detail.


Hmm ... so, thinking about pending list locking, would it work to
acquire locks on the posting tree's root of each item in the pending
list, when the item is put in the pending list? (even if we insert the
item in the pending list instead of its posting tree).


Hmm, you mean, when inserting a new tuple? Yes, that would be correct. I 
don't think it would perform very well, though. If you have to traverse 
down to the posting trees, anyway, then you might as well insert the new 
tuples there directly, and forget about the pending list.


- Heikki



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Joshua D. Drake

On 04/09/2018 09:45 AM, Robert Haas wrote:

On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer  wrote:

In the mean time, I propose that we fsync() on close() before we age FDs out
of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
we don't seem to have many better options. At least it'll only flush what we
actually wrote to the OS buffers not what we may have in shared_buffers. If
the bgwriter does the same thing, we should be 100% safe from this problem
on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
full_page_writes options that people can turn off if they know the risks /
know their storage is safe / don't care.

I have a really tough time believing this is the right way to solve
the problem.  We suffered for years because of ext3's desire to flush
the entire page cache whenever any single file was fsync()'d, which
was terrible.  Eventually ext4 became the norm, and the problem went
away.  Now we're going to deliberately insert logic to do a very
similar kind of terrible thing because the kernel developers have
decided that fsync() doesn't have to do what it says on the tin?  I
grant that there doesn't seem to be a better option, but I bet we're
going to have a lot of really unhappy users if we do this.


I don't have a better option, but whatever we do, it should be an optional
(GUC) change. We have plenty of YEARS of people not noticing this issue, and
Robert's correct: if we go back to an era of things like stalls, it is going
to look bad on us no matter how we describe the problem.

Thanks,

JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *




Re: Documentation for bootstrap data conversion

2018-04-09 Thread John Naylor
On 4/9/18, Tom Lane  wrote:
>> As it is now, it will actually collapse consecutive blank lines into
>> one. I maintain that was necessary during conversion to get some
>> semblance of consistency, but now it may not be a good idea to tie
>> developers' hands in surprising ways if they want double blank lines
>> in some places. It would be pretty easy to remove this behavior.
>> Apologies if it was not documented well enough.
>
> I did note that in some internal comments, but forgot it when writing
> this.  I agree that now that the conversion is done, it'd be better
> to remove that special case.  Would you send a patch for that?

Sure, attached.

-John Naylor
diff --git a/src/include/catalog/reformat_dat_file.pl b/src/include/catalog/reformat_dat_file.pl
index bbceb16..038ba7b 100644
--- a/src/include/catalog/reformat_dat_file.pl
+++ b/src/include/catalog/reformat_dat_file.pl
@@ -7,8 +7,7 @@
 #
 #Metadata entries (if any) come first, with normal attributes
 #starting on the following line, in the same order they would be in
-#the corresponding table. Comments and non-consecutive blank lines
-#are preserved.
+#the corresponding table. Comments and blank lines are preserved.
 #
 # Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
@@ -109,7 +108,6 @@ foreach my $catname (@catnames)
 	my $catalog = $catalogs{$catname};
 	my @attnames;
 	my $schema = $catalog->{columns};
-	my $prev_blank = 0;
 
 	foreach my $column (@$schema)
 	{
@@ -158,27 +156,22 @@ foreach my $catname (@catnames)
 			my $data_str = format_hash(\%values, @attnames);
 			print $dat $data_str;
 			print $dat " },\n";
-			$prev_blank = 0;
 		}
 
 		# Strings -- handle accordingly or ignore. It was necessary to
 		# ignore bare commas during the initial data conversion. This
 		# should be a no-op now, but we may as well keep that behavior.
-		# Note: We don't update $prev_blank if we ignore a string.
 
-		# Preserve non-consecutive blank lines.
+		# Preserve blank lines.
 		elsif ($data =~ /^\s*$/)
 		{
-			next if $prev_blank;
 			print $dat "\n";
-			$prev_blank = 1;
 		}
 
 		# Preserve comments or brackets that are on their own line.
 		elsif ($data =~ /^\s*(\[|\]|#.*?)\s*$/)
 		{
 			print $dat "$1\n";
-			$prev_blank = 0;
 		}
 	}
 	close $dat;


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Gasper Zejn
On 09. 04. 2018 15:42, Tomas Vondra wrote:
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
>> An crazy idea would be to have a daemon that checks the logs and
>> stops Postgres when it seems something wrong.
>>
> That doesn't seem like a very practical way. It's better than nothing,
> of course, but I wonder how would that work with containers (where I
> think you may not have access to the kernel log at all). Also, I'm
> pretty sure the messages do change based on kernel version (and possibly
> filesystem) so parsing it reliably seems rather difficult. And we
> probably don't want to PANIC after I/O error on an unrelated device, so
> we'd need to understand which devices are related to PostgreSQL.
>
> regards
>

For a somewhat less (or more) crazy idea: creating a Linux kernel
module with kprobe/kretprobe that captures the file passed to fsync (or
even the byte range within the file) and the corresponding return value
shouldn't be that hard. Kprobe has been a part of the Linux kernel for a
really long time, and at first glance it seems like it could be
backported to 2.6 too.

Then you could have stable log messages or implement some kind of "fsync
error log notification" via whatever is the most sane way to get this
out of the kernel.

If the kernel is new enough and has eBPF support (seems like >=4.4),
using bcc-tools[1] should enable you to write a quick script to get
exactly that info via perf events[2].

Obviously, that's a stopgap solution ...


Kind regards,
Gasper


[1] https://github.com/iovisor/bcc
[2]
https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-04-09 Thread Heikki Linnakangas

On 09/04/18 18:21, Andrey Borodin wrote:



On 9 Apr 2018, at 19:50, Teodor Sigaev wrote:


3. Why do we *not* lock the entry leaf page, if there is no
match? We still need a lock to remember that we probed for that
value and there was no match, so that we conflict with a tuple
that might be inserted later. At least #3 is a bug. The attached
patch adds an isolation test that demonstrates it. #1 and #2 are
weird, and cause unnecessary locking, so I think we should fix
those too, even if they don't lead to incorrect results.


I can't find a hole here. Agree.

Please correct me if I'm wrong. Let's say we have posting trees for
word A and word B. We are looking for a document that contains both.
We will read through the whole posting tree of A, but only through some
segments of B. If we do not find anything in B, we have to lock
only the segments where we actually were looking, not the whole posting
tree of B.


True, that works. It was not clear from the code or comments that that 
was intended. I'm not sure if that's worthwhile, compared to locking 
just the posting tree root block. I'll let Teodor decide..



BTW, I do not think that we lock ranges. We lock the possibility of
tuples that we might find appearing. Ranges are shortcuts for the
places where we were looking. That's how I understand it; chances are
I'm missing something.


Yeah, that's one way of thinking about it.

- Heikki



Re: Verbosity of genbki.pl

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-08 13:33:42 -0400, Tom Lane wrote:
> Traditionally genbki.pl has printed "Writing foo" for every file
> it writes out.

> 2. Print just one message like "Generating postgres.bki and related
> files", and I guess a second one for fmgroids.h and related files.

+0.5.

- Andres



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Alvaro Herrera
Robert Haas wrote:
> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  
> wrote:
> > I had reservations about a relation_open() in the new executor code.  It
> > seemed a bit odd; we don't have any other relation_open in the executor
> > anywhere.  However, setting up the pruneinfo needs some stuff from
> > relcache that I don't see a reasonable mechanism to pass through
> > planner.  I asked Andres about it on IM and while he didn't endorse the
> > patch in any way, his quick opinion was that "it wasn't entirely
> > insane".  I verified that we already hold lock on the relation.
> 
> I don't get this.  The executor surely had to (and did) open all of
> the relations somewhere even before this patch.

Yeah.

I was worried that this coding could be seen as breaking modularity, or
trying to do excessive work.  However, after looking closer at it, it
doesn't really look like it's the case.  So, nevermind.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Mark Dilger

> On Apr 9, 2018, at 10:26 AM, Joshua D. Drake  wrote:

> We have plenty of YEARS of people not noticing this issue

I disagree.  I have noticed this problem, but blamed it on other things.
For over five years now, I have had to tell customers not to use thin
provisioning, and I have had to add code to postgres to refuse to perform
inserts or updates if the disk volume is more than 80% full.  I have lost
count of the number of customers who are running an older version of the
product (because they refuse to upgrade) and come back with complaints that
they ran out of disk and now their database is corrupt.  All this time, I
have been blaming this on virtualization and thin provisioning.

mark



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  
>> wrote:
>> > I had reservations about a relation_open() in the new executor code.  It
>> > seemed a bit odd; we don't have any other relation_open in the executor
>> > anywhere.  However, setting up the pruneinfo needs some stuff from
>> > relcache that I don't see a reasonable mechanism to pass through
>> > planner.  I asked Andres about it on IM and while he didn't endorse the
>> > patch in any way, his quick opinion was that "it wasn't entirely
>> > insane".  I verified that we already hold lock on the relation.
>>
>> I don't get this.  The executor surely had to (and did) open all of
>> the relations somewhere even before this patch.
>
> Yeah.
>
> I was worried that this coding could be seen as breaking modularity, or
> trying to do excessive work.  However, after looking closer at it, it
> doesn't really look like it's the case.  So, nevermind.

Well what I'm saying is that it shouldn't be necessary.  If the
relations are being opened already and the pointers to the relcache
entries are being saved someplace, you shouldn't need to re-open them
elsewhere to get pointers to the relcache entries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  
> wrote:
> > Robert Haas wrote:

> >> I don't get this.  The executor surely had to (and did) open all of
> >> the relations somewhere even before this patch.
> >
> > I was worried that this coding could be seen as breaking modularity, or
> > trying to do excessive work.  However, after looking closer at it, it
> > doesn't really look like it's the case.  So, nevermind.
> 
> Well what I'm saying is that it shouldn't be necessary.  If the
> relations are being opened already and the pointers to the relcache
> entries are being saved someplace, you shouldn't need to re-open them
> elsewhere to get pointers to the relcache entries.

Oh, okay.  I can look into that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Documentation for bootstrap data conversion

2018-04-09 Thread Tom Lane
John Naylor  writes:
> On 4/9/18, Tom Lane  wrote:
>> I did note that in some internal comments, but forgot it when writing
>> this.  I agree that now that the conversion is done, it'd be better
>> to remove that special case.  Would you send a patch for that?

> Sure, attached.

Pushed, thanks.  I fixed the sgml markup too.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas  wrote:
> Ouch.  If a process exits -- say, because the user typed \q into psql
> -- then you're talking about potentially calling fsync() on a really
> large number of file descriptors, flushing many gigabytes of data to
> disk.  And it may well be that you never actually wrote any data to
> any of those file descriptors -- those writes could have come from
> other backends.  Or you may have written a little bit of data through
> those FDs, but there could be lots of other data that you end up
> flushing incidentally.  Perfectly innocuous things like starting up a
> backend, running a few short queries, and then having that backend
> exit suddenly turn into something that could have a massive
> system-wide performance impact.
>
> Also, if a backend ever manages to exit without running through this
> code, or writes any dirty blocks afterward, then this still fails to
> fix the problem completely.  I guess that's probably avoidable -- we
> can put this late in the shutdown sequence and PANIC if it fails.
>
> I have a really tough time believing this is the right way to solve
> the problem.  We suffered for years because of ext3's desire to flush
> the entire page cache whenever any single file was fsync()'d, which
> was terrible.  Eventually ext4 became the norm, and the problem went
> away.  Now we're going to deliberately insert logic to do a very
> similar kind of terrible thing because the kernel developers have
> decided that fsync() doesn't have to do what it says on the tin?  I
> grant that there doesn't seem to be a better option, but I bet we're
> going to have a lot of really unhappy users if we do this.

What about the bug we fixed in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e
?  Say somebody does something along the lines of:

ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9

...and then restarts postgres.  Craig's proposal wouldn't cover this
case, because there was no opportunity to run fsync() after the first
crash, and there's now no way to go back and fsync() any stuff we
didn't fsync() before, because the kernel may have already thrown away
the error state, or may lie to us and tell us everything is fine
(because our new fd wasn't opened early enough).  I can't find the
original discussion that led to that commit right now, so I'm not
exactly sure what scenarios we were thinking about.  But I think it
would at least be a problem if full_page_writes=off or if you had
previously started the server with fsync=off and now wish to switch to
fsync=on after completing a bulk load or similar.  Recovery can read a
page, see that it looks OK, and continue, and then a later fsync()
failure can revert that page to an earlier state and now your database
is corrupted -- and there's absolute no way to detect this because
write() gives you the new page contents later, fsync() doesn't feel
obliged to tell you about the error because your fd wasn't opened
early enough, and eventually the write can be discarded and you'll
revert back to the old page version with no errors ever being reported
anywhere.

Another consequence of this behavior is that initdb -S is never reliable,
so pg_rewind's use of it doesn't actually fix the problem it was
intended to solve.  It also means that initdb itself isn't crash-safe,
since the data file changes are made by the backend but initdb itself
is doing the fsyncs, and initdb has no way of knowing what files the
backend is going to create and therefore can't -- even theoretically
-- open them first.

What's being presented to us as the API contract that we should expect
from buffered I/O is that if you open a file and read() from it, call
fsync(), and get no error, the kernel may nevertheless decide that
some previous write that it never managed to flush can't be flushed,
and then revert the page to the contents it had at some point in the
past.  That's more or less equivalent to letting a malicious
adversary randomly overwrite database pages with plausible-looking but
incorrect contents without notice and hoping you can still build a
reliable system.  You can avoid the problem if you can always open an
fd for every file you want to modify before it's written and hold on
to it until after it's fsync'd, but that's pretty hard to guarantee in
the face of kill -9.

I think the simplest technological solution to this problem is to
rewrite the entire backend and all supporting processes to use
O_DIRECT everywhere.  To maintain adequate performance, we'll have to
write a complete I/O scheduling system inside PostgreSQL.  Also, since
we'll now have to make shared_buffers much larger -- since we'll no
longer be benefiting from the OS cache -- we'll need to replace the
use of malloc() with an allocator that pulls from shared_buffers.
Plus, as noted, we'll need to totally rearchitect several of our
critical frontend tools.  Let's freeze all other development for the
next year while we work on that, and put out a notice that Linux is no
longer a supported platform for any existing release.  Before we do
that, we might want to check whether fsync() actually writes the data
to disk in a usable way even with O_DIRECT.  If not, we should just
de-support Linux entirely as a hopelessly broken and unsupportable
platform.

Re: Verbosity of genbki.pl

2018-04-09 Thread Tom Lane
Andres Freund  writes:
> On 2018-04-08 13:33:42 -0400, Tom Lane wrote:
>> Traditionally genbki.pl has printed "Writing foo" for every file
>> it writes out.

>> 2. Print just one message like "Generating postgres.bki and related
>> files", and I guess a second one for fmgroids.h and related files.

> +0.5.

Hearing no votes against, done that way.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
> I think the simplest technological solution to this problem is to
> rewrite the entire backend and all supporting processes to use
> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
> write a complete I/O scheduling system inside PostgreSQL.  Also, since
> we'll now have to make shared_buffers much larger -- since we'll no
> longer be benefiting from the OS cache -- we'll need to replace the
> use of malloc() with an allocator that pulls from shared_buffers.
> Plus, as noted, we'll need to totally rearchitect several of our
> critical frontend tools.  Let's freeze all other development for the
> next year while we work on that, and put out a notice that Linux is no
> longer a supported platform for any existing release.  Before we do
> that, we might want to check whether fsync() actually writes the data
> to disk in a usable way even with O_DIRECT.  If not, we should just
> de-support Linux entirely as a hopelessly broken and unsupportable
> platform.

Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
absurd, as is some of the proposed ways this is all supposed to
work. But I think the case we're discussing is much closer to a near
irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable
error. You're hosed even if we report it properly.  Yes, it'd be nice if
we could report it reliably.  But that doesn't change the fact that what
we're doing is ensuring that data is safely fsynced unless storage
fails, in which case it's not safely fsynced anyway.

Greetings,

Andres Freund



Re: using index or check in ALTER TABLE SET NOT NULL

2018-04-09 Thread Sergei Kornilov
Hello
I notice my patch does not apply again. Here is update to current HEAD

regards, Sergeidiff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index bd22627..db98a98 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -215,8 +215,15 @@ WITH ( MODULUS numeric_literal, REM
 
  
   These forms change whether a column is marked to allow null
-  values or to reject null values.  You can only use SET
-  NOT NULL when the column contains no null values.
+  values or to reject null values.
+ 
+
+ 
+  You can only use SET NOT NULL when the column 
+  contains no null values. Full table scan is performed to check 
+  this condition unless there are valid CHECK 
+  constraints that prohibit NULL values for specified column.
+  In the latter case table scan is skipped.
  
 
  
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 0f5932f..6fd3663 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1153,7 +1153,7 @@ check_default_allows_bound(Relation parent, Relation default_rel,
 	 * not contain any row that would belong to the new partition, we can
 	 * avoid scanning the default partition.
 	 */
-	if (PartConstraintImpliedByRelConstraint(default_rel, def_part_constraints))
+	if (ConstraintImpliedByRelConstraint(default_rel, def_part_constraints))
 	{
 		ereport(INFO,
 (errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
@@ -1197,8 +1197,8 @@ check_default_allows_bound(Relation parent, Relation default_rel,
 			 * that it will not contain any row that would belong to the new
 			 * partition, we can avoid scanning the child table.
 			 */
-			if (PartConstraintImpliedByRelConstraint(part_rel,
-	 def_part_constraints))
+			if (ConstraintImpliedByRelConstraint(part_rel,
+ def_part_constraints))
 			{
 ereport(INFO,
 		(errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 43b2fce..11dc92c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -160,7 +160,7 @@ typedef struct AlteredTableInfo
 	/* Information saved by Phases 1/2 for Phase 3: */
 	List	   *constraints;	/* List of NewConstraint */
 	List	   *newvals;		/* List of NewColumnValue */
-	bool		new_notnull;	/* T if we added new NOT NULL constraints */
+	bool		verify_new_notnull; /* T if we should recheck NOT NULL */
 	int			rewrite;		/* Reason for forced rewrite, if any */
 	Oid			newTableSpace;	/* new tablespace; 0 means no change */
 	bool		chgPersistence; /* T if SET LOGGED/UNLOGGED is used */
@@ -372,6 +372,7 @@ static ObjectAddress ATExecDropNotNull(Relation rel, const char *colName, LOCKMO
 static void ATPrepSetNotNull(Relation rel, bool recurse, bool recursing);
 static ObjectAddress ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
  const char *colName, LOCKMODE lockmode);
+static bool NotNullImpliedByRelConstraints(Relation rel, Form_pg_attribute attr);
 static ObjectAddress ATExecColumnDefault(Relation rel, const char *colName,
 	Node *newDefault, LOCKMODE lockmode);
 static ObjectAddress ATExecAddIdentity(Relation rel, const char *colName,
@@ -4462,7 +4463,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
 			 * Test the current data within the table against new constraints
 			 * generated by ALTER TABLE commands, but don't rebuild data.
 			 */
-			if (tab->constraints != NIL || tab->new_notnull ||
+			if (tab->constraints != NIL || tab->verify_new_notnull ||
 tab->partition_constraint != NULL)
 ATRewriteTable(tab, InvalidOid, lockmode);
 
@@ -4623,7 +4624,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 	}
 
 	notnull_attrs = NIL;
-	if (newrel || tab->new_notnull)
+	if (newrel || tab->verify_new_notnull)
 	{
 		/*
 		 * If we are rebuilding the tuples OR if we added any new NOT NULL
@@ -5658,7 +5659,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 			 * an OID column.  OID will be marked not null, but since it's
 			 * filled specially, there's no need to test anything.)
 			 */
-			tab->new_notnull |= colDef->is_not_null;
+			tab->verify_new_notnull |= colDef->is_not_null;
 		}
 	}
 
@@ -6062,8 +6063,17 @@ ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
 
 		CatalogTupleUpdate(attr_rel, &tuple->t_self, tuple);
 
-		/* Tell Phase 3 it needs to test the constraint */
-		tab->new_notnull = true;
+		/*
+		 * Verifying table data can be done in Phase 3 with single table scan
+		 * for all constraint (both existing and new)
+		 * We can skip this check only if every new NOT NULL contraint
+		 * implied by existed table constraints
+		 */
+		if (!NotNullImpliedByRelConstraints(rel, (Form_pg_attribu

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra


On 04/09/2018 08:29 PM, Mark Dilger wrote:
> 
>> On Apr 9, 2018, at 10:26 AM, Joshua D. Drake  wrote:
> 
>> We have plenty of YEARS of people not noticing this issue
> 
> I disagree.  I have noticed this problem, but blamed it on other things.
> For over five years now, I have had to tell customers not to use thin
> provisioning, and I have had to add code to postgres to refuse to perform
> inserts or updates if the disk volume is more than 80% full.  I have lost
> count of the number of customers who are running an older version of the
> product (because they refuse to upgrade) and come back with complaints that
> they ran out of disk and now their database is corrupt.  All this time, I
> have been blaming this on virtualization and thin provisioning.
> 

Yeah. There's a big difference between not noticing an issue because it
does not happen very often vs. attributing it to something else. If we
had the ability to revisit past data corruption cases, we would probably
discover a fair number of cases caused by this.

The other thing we probably need to acknowledge is that the environment
changes significantly - things like thin provisioning are likely to get
even more common, increasing the incidence of these issues.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Magnus Hagander
On Mon, Apr 9, 2018 at 7:11 AM, Michael Paquier  wrote:

> Hi all,
>
> I was just going through pg_rewind's code, and noticed the following
> pearl:
> /*
>  * Don't allow pg_rewind to be run as root, to avoid overwriting the
>  * ownership of files in the data directory. We need only check for
> root
>  * -- any other user won't have sufficient permissions to modify files
> in
>  * the data directory.
>  */
> #ifndef WIN32
> if (geteuid() == 0)
> {
> fprintf(stderr, _("cannot be executed by \"root\"\n"));
> fprintf(stderr, _("You must run %s as the PostgreSQL
> superuser.\n"),
> progname);
> }
> #endif
>
> While that's nice to inform the user about the problem, it does not
> actually prevent pg_rewind from running as root.  Attached is a patch, which
> needs a back-patch down to 9.5.
>

Seems simple enough and the right thing to do, but I wonder if we should
really backpatch it. Yes, the behaviour is not great now, but there is also
a non-zero risk of breaking people's automated failover scripts if we
backpatch it, isn't there?


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Peter Geoghegan
On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund  wrote:
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

+1

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

Right. We seem to be implicitly assuming that there is a big
difference between a problem in the storage layer that we could in
principle detect, but don't, and any other problem in the storage
layer. I've read articles claiming that technologies like SMART are
not really reliable in a practical sense [1], so it seems to me that
there is reason to doubt that this gap is all that big.

That said, I suspect that running out of disk space poses serious
practical problems. I have personally scoffed at stories of Postgres
database corruption attributed to running out of disk space. Looks like
I was dead wrong.

[1] https://danluu.com/file-consistency/ -- "Filesystem correctness"
-- 
Peter Geoghegan



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:
> Honestly I don't think there's *any* way to use the current interface
> to implement reliable operation. Even that embedded database using a
> single process and keeping every file open all the time (which means
> file descriptor limits limit its scalability) can be having silent
> corruption whenever some other process like a backup program comes
> along and calls fsync (or even sync?).

That is indeed true (sync would induce fsync on open inodes and clear
the error), and that's a nasty bug that apparently went unnoticed for
a very long time. Hopefully the errseq_t linux 4.13 fixes deal with at
least this issue, but similar fixes need to be adopted by many other
kernels (all those that mark failed pages as clean).

I honestly do not expect that keeping around the failed pages will
be an acceptable change for most kernels, and as such the recommendation
will probably be to coordinate in userspace for the fsync().

What about having buffered IO with implied fsync() atomicity via O_SYNC?
This would probably necessitate some helper threads that mask the
latency and present an async interface to the rest of PG, but sounds
less intrusive than going for DIO.

Best regards,
Anthony



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund
On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> What about having buffered IO with implied fsync() atomicity via
> O_SYNC?

You're kidding, right?  We could also just add sleep(30)'s all over the
tree, and hope that that'll solve the problem.  There's a reason we
don't permanently fsync everything. Namely that it'll be way too slow.

Greetings,

Andres Freund



Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Peter Geoghegan
On Mon, Apr 9, 2018 at 12:23 PM, Magnus Hagander  wrote:
> Seems simple enough and the right thing to do, but I wonder if we should
> really backpatch it. Yes, the behaviour is not great now, but there is also
> a non-zero risk of breaking people's automated failover scripts if we
> backpatch it, isn't there?

+1 to committing to master as a documented behavioral change.

-- 
Peter Geoghegan



Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Tom Lane
Magnus Hagander  writes:
> Seems simple enough and the right thing to do, but I wonder if we should
> really backpatch it. Yes, the behaviour is not great now, but there is also
> a non-zero risk of breaking people's automated failover scripts if we
> backpatch it, isn't there?

Yeah, I'd vote against backpatching.  This doesn't seem like an essential
fix.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Andres Freund


On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos  wrote:

>I honestly do not expect that keeping around the failed pages will
>be an acceptable change for most kernels, and as such the
>recommendation
>will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information about
fatal failures that occurred. Report errors until that bit is explicitly
cleared.  Yes, that keeps some memory around until unmount if nobody clears it. 
But it's orders of magnitude less, and results in usable semantics.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Fix pg_rewind which can be run as root user

2018-04-09 Thread Magnus Hagander
On Mon, Apr 9, 2018 at 9:31 PM, Tom Lane  wrote:

> Magnus Hagander  writes:
> > Seems simple enough and the right thing to do, but I wonder if we should
> > really backpatch it. Yes, the behaviour is not great now, but there is
> > also
> > a non-zero risk of breaking people's automated failover scripts if we
> > backpatch it, isn't there?
>
> Yeah, I'd vote against backpatching.  This doesn't seem like an essential
> fix.
>

Applied, and pushed this way.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Justin Pryzby
On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
> You could make the argument that it's OK to forget if the entire file
> system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on unmount, since
any opened files would prevent unmounting; but then I realized we need to
consider the case of close()ing all FDs and then opening them later... in
another process.

I was going to say that's fine for postgres, since it chdir()s into its
basedir, but it's actually not fine for nondefault tablespaces...

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
> notification descriptor open, where the kernel would inject events
> related to writeback failures of files under watch (potentially
> enriched to contain info regarding the exact failed pages and
> the file offset they map to).

For postgres that'd require backend processes to open() a file such that,
following its close(), any writeback errors are "signalled" to the checkpointer
process...

Justin



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:
> On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> > What about having buffered IO with implied fsync() atomicity via
> > O_SYNC?
> 
> You're kidding, right?  We could also just add sleep(30)'s all over the
> tree, and hope that that'll solve the problem.  There's a reason we
> don't permanently fsync everything. Namely that it'll be way too slow.

I am assuming you can apply the same principle of selectively using O_SYNC
at times and places that you'd currently actually call fsync().

Also assuming that you'd want to have a backwards-compatible solution for
all those kernels that don't keep the pages around, irrespective of future
fixes. Short of loading a kernel module and dealing with the problem directly,
the only other available options seem to be either O_SYNC, O_DIRECT or ignoring
the issue.

Best regards,
Anthony



Re: Warnings and uninitialized variables in TAP tests

2018-04-09 Thread Magnus Hagander
On Mon, Apr 9, 2018 at 3:15 AM, Michael Paquier  wrote:

> Hi all,
>
> While looking at the output of the TAP tests, I have seen warnings like
> the following:
> Use of uninitialized value $target_lsn in concatenation (.) or string at
> /home/foo/git/postgres/src/bin/pg_rewind/../../../src/
> test/perl/PostgresNode.pm
> line 1540.
> [...]
> ./src/bin/pg_basebackup/tmp_check/log/regress_log_010_pg_basebackup:Use
> of uninitialized value $str in print at
> /home/foo/git/postgres/src/bin/pg_basebackup/../../../
> src/test/perl/TestLib.pm
> line 244.
>
> The first one shows something like 30 times, and the second only once.
>
> Attached is a patch to clean up all that.  In order to find all that, I
> simply ran the following and they pop up:
> find . -name "regress_log*" | xargs grep -i "uninitialized"
>
> What I have found is harmless, still they pollute the logs.
>

Applied, thanks.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Tomas Vondra
On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
>> We already have dirty_bytes and dirty_background_bytes, for example. I
>> don't see why there couldn't be another limit defining how much dirty
>> data to allow before blocking writes altogether. I'm sure it's not that
>> simple, but you get the general idea - do not allow using all available
>> memory because of writeback issues, but don't throw the data away in
>> case it's just a temporary issue.
> 
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner. This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error. Apart from that, further interfaces would need to be provided
> for actually dealing with the error (again assuming non-transient
> issues that may not be fixed transparently and that temporary issues
> are taken care of by lower layers of the stack).
> 

I don't quite see how this is any different from other possible issues
when running multiple applications on the same system. One application
can generate a lot of dirty data, reaching dirty_bytes and forcing the
other applications on the same host to do synchronous writes.

Of course, you might argue that is a temporary condition - it will
resolve itself once the dirty pages get written to storage. In case of
an I/O issue, it is a permanent impact - it will not resolve itself
unless the I/O problem gets fixed.

Not sure what interfaces would need to be written? Possibly something
that says "drop dirty pages for these files" after the application gets
killed or something. That makes sense, of course.

>> Well, there seem to be kernels that seem to do exactly that already. At
>> least that's how I understand what this thread says about FreeBSD and
>> Illumos, for example. So it's not an entirely insane design, apparently.
> 
> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does, however, provide a persistent error indication that
> would allow Pg to simply and reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes succeed to be persisted.
> 

In my experience when you have a persistent I/O error on a device, it
likely affects all applications using that device. So unmounting the fs
to clear the dirty pages seems like an acceptable solution to me.

I don't see what else the application should do? In a way I'm suggesting
applications don't really want to be responsible for recovering (cleanup
of dirty pages, etc.). We're more than happy to hand that over to the kernel,
e.g. because each kernel will do that differently. What we however do
want is reliable information about fsync outcome, which we need to
properly manage WAL, checkpoints etc.

> Ideally, you'd want a (potentially persistent) indication of error
> localized to a file region (mapping the corresponding failed writeback
> pages). NetBSD is already implementing fsync_ranges(), which could
> be a step in the right direction.
> 
>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
> 
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO). I think that durability is a rather complex
> cross-layer issue which has been grossly misunderstood similarly
> in the past (e.g. see [1]). It seems that both the OS and DB
> communities greatly benefit from a periodic reality check, and
> I see this as an opportunity for strengthening the IO stack in
> an end-to-end manner.
> 

Right. What I was getting to is that perhaps the current fsync()
behavior is not very practical for building actual applications.

> Best regards,
> Anthony
> 
> [1] 
> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
> 

Thanks. The paper looks interesting.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Anthony Iliopoulos
On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:
> 
> 
> On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos  
> wrote:
> 
> >I honestly do not expect that keeping around the failed pages will
> >be an acceptable change for most kernels, and as such the
> >recommendation
> >will probably be to coordinate in userspace for the fsync().
> 
> Why is that required? You could very well just keep per-inode information
> about fatal failures that occurred. Report errors until that bit is
> explicitly cleared.  Yes, that keeps some memory around until unmount if 
> nobody clears it. But it's orders of magnitude less, and results in usable 
> semantics.

As discussed before, I think this could be acceptable, especially
if you pair it with an opt-in mechanism (only applications that
care to deal with this will have to), and would give it a shot.

Still need a way to deal with all other systems and prior kernel
releases that are eating fsync() writeback errors even over sync().

Best regards,
Anthony


