Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-26 09:02:10 +1030, Andrew Dunstan wrote:
> Thanks for this, all looks good. Here is the consolidated patch,
> rebased. If there are no further comments I propose to commit this in
> a few days' time.

Here's a bit of post commit review:

@@ -1310,13 +1679,11 @@ slot_getsomeattrs(TupleTableSlot *slot, int attnum)
 
/*
 * If tuple doesn't have all the atts indicated by tupleDesc, read the
-* rest as null
+* rest as NULLs or missing values
 */
-   for (; attno < attnum; attno++)
-   {
-   slot->tts_values[attno] = (Datum) 0;
-   slot->tts_isnull[attno] = true;
-   }
+   if (attno < attnum)
+   slot_getmissingattrs(slot, attno, attnum);
+
slot->tts_nvalid = attnum;
 }

It's worthwhile to note that this'll re-process *all* missing values,
even if they've already been deformed. I.e. if
slot_getmissingattrs(.., first-missing)
slot_getmissingattrs(.., first-missing + 1)
slot_getmissingattrs(.., first-missing + 2)
is called, all three missing values will be copied every time. That's
because tts_nvalid isn't taken into account.  I wonder if slot_getmissingattrs
could take tts_nvalid into account?

I also wonder if this doesn't deserve an unlikely(), to avoid the cost
of spilling registers in the hot branch of not missing any values.
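Concretely, both points could look something like this (sketch only):

	/*
	 * Skip attributes already deformed by an earlier call (tts_nvalid may
	 * already cover some missing values), and hint that missing values are
	 * the uncommon case.
	 */
	if (unlikely(attno < attnum))
		slot_getmissingattrs(slot, Max(attno, slot->tts_nvalid), attnum);
	slot->tts_nvalid = attnum;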


+   else
+   {
+   /* if there is a missing values array we must process them one by one */
+   for (missattnum = lastAttNum - 1;
+        missattnum >= startAttNum;
+        missattnum--)
+   {
+           slot->tts_values[missattnum] = attrmiss[missattnum].ammissing;
+           slot->tts_isnull[missattnum] =
+                   !attrmiss[missattnum].ammissingPresent;
+   }
+   }
+}
+}

Why is this done backwards? It's noticeably slower to walk arrays
backwards on some CPU microarchitectures.
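I.e. the equivalent forward walk would be:

	for (missattnum = startAttNum; missattnum < lastAttNum; missattnum++)
	{
		slot->tts_values[missattnum] = attrmiss[missattnum].ammissing;
		slot->tts_isnull[missattnum] = !attrmiss[missattnum].ammissingPresent;
	}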



@@ -176,6 +179,23 @@ CreateTupleDescCopyConstr(TupleDesc tupdesc)
}
}
 


@@ -469,6 +503,29 @@ equalTupleDescs(TupleDesc tupdesc1, TupleDesc tupdesc2)
if (strcmp(defval1->adbin, defval2->adbin) != 0)
return false;
}
+   if (constr1->missing)
+   {
+   if (!constr2->missing)
+   return false;
+   for (i = 0; i < tupdesc1->natts; i++)
+   {
+   AttrMissing *missval1 = constr1->missing + i;
+   AttrMissing *missval2 = constr2->missing + i;

It's a bit odd to not use array indexing here?
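I.e.:

			AttrMissing *missval1 = &constr1->missing[i];
			AttrMissing *missval2 = &constr2->missing[i];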


@@ -625,7 +625,15 @@ ExecCopySlotMinimalTuple(TupleTableSlot *slot)
if (slot->tts_mintuple)
return heap_copy_minimal_tuple(slot->tts_mintuple);
if (slot->tts_tuple)
-   return minimal_tuple_from_heap_tuple(slot->tts_tuple);
+   {
+   if (TTS_HAS_PHYSICAL_TUPLE(slot) &&
+   HeapTupleHeaderGetNatts(slot->tts_tuple->t_data)
+   < slot->tts_tupleDescriptor->natts)
+   return minimal_expand_tuple(slot->tts_tuple,
+                               slot->tts_tupleDescriptor);
+   else
+   return minimal_tuple_from_heap_tuple(slot->tts_tuple);
+   }
 

What's the TTS_HAS_PHYSICAL_TUPLE doing here? How can this be relevant
given the previous tts_mintuple check? Am I missing something?
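If not, the check could presumably be reduced to (sketch):

	if (slot->tts_tuple)
	{
		if (HeapTupleHeaderGetNatts(slot->tts_tuple->t_data) <
			slot->tts_tupleDescriptor->natts)
			return minimal_expand_tuple(slot->tts_tuple,
										slot->tts_tupleDescriptor);
		else
			return minimal_tuple_from_heap_tuple(slot->tts_tuple);
	}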



@@ -563,10 +569,63 @@ RelationBuildTupleDesc(Relation relation)

           MemoryContextAllocZero(CacheMemoryContext,
                                  relation->rd_rel->relnatts *
                                  sizeof(AttrDefault));
-   attrdef[ndef].adnum = attp->attnum;
+   attrdef[ndef].adnum = attnum;
attrdef[ndef].adbin = NULL;
+
ndef++;
}
+
+   /* Likewise for a missing value */
+   if (attp->atthasmissing)
+   {
+   Datum   missingval;
+   boolmissingNull;
+
+   /* Do we have a missing value? */
+   missingval = heap_getattr(pg_attribute_tuple,
+                             Anum_pg_attribute_attmissingval,
+                             pg_attribute_desc->rd_att,
+                             &missingNull);
+   if (!missingNull)
+   {
+   /* Yes, fetch from the array 

Re: Problem while setting the fpw with SIGHUP

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 03:40:59PM +0900, Kyotaro HORIGUCHI wrote:
> The attached does that. I don't like that it uses ControlFileLock
> to exclude concurrent UpdateFullPageWrites and StartupXLOG but
> WALInsertLock cannot be used since UpdateFullPageWrites may take
> the same lock.

You visibly forgot your patch.
--
Michael


signature.asc
Description: PGP signature


Re: Problem while setting the fpw with SIGHUP

2018-03-28 Thread Kyotaro HORIGUCHI
At Tue, 27 Mar 2018 22:02:26 +0900, Michael Paquier  wrote 
in <20180327130226.ga1...@paquier.xyz>
> On Tue, Mar 27, 2018 at 09:01:20PM +0900, Kyotaro HORIGUCHI wrote:
> > The current UpdateFullPageWrites is safe on standby and promotion
> > so what we should consider is only the non-standby case. I think
> > what we should do is just calling RecoveryInProgress() at the
> > beginning of CheckPointerMain, which is just the same thing with
> > InitPostgres, but before setting up signal handler to avoid
> > processing SIGHUP before being ready to insert xlog.
> 
> Your proposal does not fix the issue for a checkpointer process started
> on a standby.  After a promotion, if SIGHUP is issued with a change in
> full_page_writes, then the initialization of InitXLogInsert() would
> happen again in the critical section of UpdateFullPageWrites().  The
> window is rather small for normal promotions as the startup process
> requests a checkpoint which would do the initialization, and much larger
> for fallback_promote where the startup process is in charge of doing the
> end-of-recovery checkpoint.

Yeah. I realized that after sending the mail.

I looked closer and found several problems there.

- On standby, StartupXLOG calls UpdateFullPageWrites and
  checkpointer can call the same function simultaneously, but the
  function does not expect concurrent calls.

- StartupXLOG can make a concurrent write to
  Insert->fullPageWrite so it needs to be locked.

- At the very end of recovery, the startup process ignores any
  change of the full_page_writes GUC; it sticks with the
  startup-time value. This leads to loss of an XLOG_FPW_CHANGE
  record. (StartupXLOG does not consider GUC changes applied by
  reload.)

- If checkpointer calls UpdateFullPageWrites concurrently with
  StartupXLOG changing SharedRecoveryInProgress to false, the
  corresponding XLOG_FPW_CHANGE record may be lost.

So, if we don't accept the current behavior, what I think we
should do is all of the following.

A. In StartupXLOG, protect writes to Insert->fullPageWrites with
  the WAL insert exclusive lock. Do the same thing for the read in
  UpdateFullPageWrites.

B. Surround the whole of UpdateFullPageWrites with some kind of lock
  to exclude concurrent calls. The attached uses ControlFileLock.
  This also excludes the function from running concurrently with the
  change of SharedRecoveryInProgress, to avoid loss of an
  XLOG_FPW_CHANGE record.

C. After exiting recovery mode, call UpdateFullPageWrites from
  StartupXLOG if shared fullPageWrites is found to have changed from
  the last known value. If checkpointer does the same thing at the
  same time, one of them completes the work.

D. Call RecoveryInProgress to set up xlog working area.

The attached does that. I don't like that it uses ControlFileLock
to exclude concurrent UpdateFullPageWrites and StartupXLOG but
WALInsertLock cannot be used since UpdateFullPageWrites may take
the same lock.
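
A rough sketch of point B (assuming ControlFileLock, as in the attached):

void
UpdateFullPageWrites(void)
{
	/*
	 * Sketch only: take ControlFileLock to exclude both concurrent calls
	 * of this function and StartupXLOG's switch of
	 * SharedRecoveryInProgress, so that no XLOG_FPW_CHANGE record is lost.
	 */
	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

	/*
	 * ... existing body: update Insert->fullPageWrites under the WAL
	 * insertion locks and write an XLOG_FPW_CHANGE record if needed ...
	 */

	LWLockRelease(ControlFileLock);
}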

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] [PATCH] Lockable views

2018-03-28 Thread Tatsuo Ishii
> On Tue, 27 Mar 2018 23:28:04 +0900
> Yugo Nagata  wrote:
> 
> I found the previous patch was broken and it can't handle
> views that have subqueries, as below:
> 
>  CREATE VIEW lock_view6 AS SELECT * from (select * from lock_tbl1) sub;
> 
> I fixed this and attached the updated version including additional tests.

This patch gives a warning while compiling:

lockcmds.c:186:1: warning: no semicolon at end of struct or union
 } LockViewRecurse_context;
 ^
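
The warning means the last member of the struct is missing its trailing
semicolon; the definition should end like this (member names here are
hypothetical, for illustration only):

typedef struct
{
	LOCKMODE	lockmode;	/* hypothetical members */
	bool		nowait;		/* note the trailing semicolon */
} LockViewRecurse_context;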

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: PATCH: Configurable file mode mask

2018-03-28 Thread Michael Paquier
On Tue, Mar 27, 2018 at 04:21:09PM -0400, David Steele wrote:
> These updates address Michael's latest review and implement group access
> for pg_basebackup, pg_receivewal, and pg_recvlogical.  A new internal
> GUC, data_directory_group_access, allows remote processes to determine
> the correct mode using the existing SHOW protocol command.

Neat idea.  This is another use case where the SHOW command in the
replication protocol proves to be useful.

> I have dropped patch 01, which added the pg_resetwal tests.  The tests
> Peter added recently are sufficient for this patch so I'll pursue adding
> the other tests separately to avoid noise on this thread.

That's fair.

Some nits from patch 1...

+   ok (check_mode_recursive($node_master->data_dir(), 0700, 0600));
[...]
+   ok (check_mode_recursive($node_master->data_dir(), 0700, 0600));
Incorrect indentations (space after "ok", yes that's a nit...).

+   /* Set dir/file mode mask */
+   umask(PG_MODE_MASK_DEFAULT);
I would still see that rather committed as a separate patch.  I am fine
to delegate the decision to the committer who is hopefully going to pick
up this patch.

And more things about patch 1 which are not nits.

In pg_backup_tar.c, we still have a 0600 hardcoded.  You should use
PG_FILE_MODE_DEFAULT there as well.

dsm_impl_posix() can create a new segment with shm_open using mode
0600.  Such segments live in pg_dynshmem, which would be included in
backups.  First, it seems to me that this should use
PG_FILE_MODE_DEFAULT.  Then, with patch 2 it seems to me that if an
instance uses DSM with group access enabled, there is a risk of
breaking a simple backup which uses for example "tar" without
--exclude filters (sometimes scripts are not smart enough to skip the
same contents as base backups).  So it seems to me that DSM should
also be made aware of group access to ease the life of users.
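
For dsm_impl_posix() that would be on the order of (sketch; the exact
flag set used by dsm_impl.c is my assumption):

	/* was: shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600) */
	fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, PG_FILE_MODE_DEFAULT);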

And then for patch 2, a couple of issues spotted..

+   /* for previous versions set the default group access */
+   if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_GROUP_ACCESS)
+   {
+   DataDirGroupAccess = false;
+   return false;
+   }
Enforcing DataDirGroupAccess to false for servers older than v11 is
fine, but RetrieveDataDirGroupAccess() should return true.  If I read
your patch correctly, support for pg_basebackup on older servers
would just fail.
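
In other words (sketch):

	/* for previous versions, set the default group access and succeed */
	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_GROUP_ACCESS)
	{
		DataDirGroupAccess = false;
		return true;
	}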

+#if !defined(WIN32) && !defined(__CYGWIN__)
+   if ((stat_buf.st_mode & PG_DIR_MODE_DEFAULT) == PG_DIR_MODE_DEFAULT)
+   {
+   umask(PG_MODE_MASK_ALLOW_GROUP);
+   DataDirGroupAccess = true;
+   }
This should use SetConfigOption() or I am missing something?
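Something like the following, assuming the GUC name from patch 2:

	umask(PG_MODE_MASK_ALLOW_GROUP);
	SetConfigOption("data_directory_group_access", "true",
					PGC_INTERNAL, PGC_S_OVERRIDE);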

+/*
+ * Does the data directory allow group read access?  The default is false, i.e.
+ * mode 0700, but may be changed in checkDataDir() to true, i.e. mode 0750.
+ */
+bool DataDirGroupAccess = false;

Instead of a boolean, I would suggest using an integer, which is more
consistent with log_file_mode.  Actually, about that, should we complain
if log_file_mode is set to an incompatible value?
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] [PATCH] Lockable views

2018-03-28 Thread Tatsuo Ishii
>> I found the previous patch was broken and it can't handle
>> views that have subqueries, as below:
>> 
>>  CREATE VIEW lock_view6 AS SELECT * from (select * from lock_tbl1) sub;
>> 
>> I fixed this and attached the updated version including additional tests.
> 
> This patch gives a warning while compiling:
> 
> lockcmds.c:186:1: warning: no semicolon at end of struct or union
>  } LockViewRecurse_context;
>  ^

Also I get a regression test failure:

*** /usr/local/src/pgsql/current/postgresql/src/test/regress/expected/lock.out  
2018-03-28 15:24:13.805314577 +0900
--- /usr/local/src/pgsql/current/postgresql/src/test/regress/results/lock.out   
2018-03-28 15:42:07.975592594 +0900
***
*** 118,124 
  
   lock_tbl1
   lock_view6
! (2 rows)
  
  ROLLBACK;
  -- Verify that we can lock a table with inheritance children.
--- 118,125 
  
   lock_tbl1
   lock_view6
!  mvtest_tm
! (3 rows)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Enhance pg_stat_wal_receiver view to display connected host

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 03:41:33PM +1100, Haribabu Kommi wrote:
> On Wed, Mar 28, 2018 at 12:54 PM, Michael Paquier 
> wrote:
> Updated patch attached.

Thanks, switched as ready for committer.
--
Michael


signature.asc
Description: PGP signature


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 02:13:19PM +1030, Andrew Dunstan wrote:
> On Wed, Mar 28, 2018 at 12:57 PM, Michael Paquier  wrote:
>> Something that would address the issue would be to enforce a segment
>> switch after each checkpoint, but that's a high price to pay on mostly
>> idle systems with large WAL segments, which is not appealing either, and
>> this even if the checkpoint skip logic has been fixed in v10 with the
>> concept of "important" WAL records.
> 
> If the system is mostly idle would it really matter that much?

We cannot assume that all archive commands support compression, even if
the rest of a forcibly switched segment is filled with zeros, so that
would cause extra I/O effort for such instances.  I see quite a lot of
downsides to that.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Surjective functional indexes

2018-03-28 Thread Konstantin Knizhnik



On 27.03.2018 22:08, Simon Riggs wrote:

On 23 March 2018 at 15:54, Simon Riggs  wrote:


So please could you make the change?

Committed, but I still think that change would be good.


Thank you.
But I am still not sure whether replacing the bitmap with a List or an
array of Oids is really a good idea.

There are two aspects: size and speed. It seems to me that memory
overhead is the most important one. After all, you rejected your own
idea of autotuning these parameters because it required an extra 16
bytes in the statistics. But the RelationData structure is even more
space-critical: its size is multiplied by the number of relations and
backends. A bitmap seems to provide the most compact representation of
the projection index list.

Concerning speed, iterating through the list of all indexes while
checking each one's presence in the bitmap is certainly more expensive
than directly iterating through an Oid list or array. But since the
bitmap check is constant time, both approaches have the same
complexity, and for the typical case (fewer than 5 normal indexes per
table and 1-2 functional indexes) the difference in performance cannot
be measured.

A bitmap also provides a convenient interface for adding new members.
To construct an array of Oids I would first have to determine the
number of indexes, so I would have to perform two loops.

So if you and Alvaro insist on this change, then I will definitely do
it. But from my point of view, using a bitmap here is an acceptable
solution.
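
For reference, the constant-time check I mean is just something like
this (sketch; the RelationData field name is my assumption):

	/* is the index at position indexno a projection index? */
	if (bms_is_member(indexno, relation->rd_projidx))
	{
		/* recheck the indexed expression on old and new tuple versions */
	}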




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] why not parallel seq scan for slow functions

2018-03-28 Thread Amit Kapila
On Tue, Mar 27, 2018 at 10:59 PM, Robert Haas  wrote:
> On Tue, Mar 27, 2018 at 7:42 AM, Robert Haas  wrote:
>> On Tue, Mar 27, 2018 at 1:45 AM, Amit Kapila  wrote:
>>> If we don't want to go with the upperrel logic, then maybe we should
>>> consider just merging some of the other changes from my previous patch
>>> in 0003* patch you have posted and then see if it gets rid of all the
>>> cases where we are seeing a regression with this new approach.
>>
>> Which changes are you talking about?
>
> I realized that this version could be optimized in the case where the
> scanjoin_target and the topmost scan/join rel have the same
> expressions in the target list.
>

Good idea; such an optimization will ensure that the cases reported
above will not regress.  However, isn't it somewhat defeating the
idea you are using in the patch, which is to avoid modifying the path
in-place?  In any case, I think one will still see regressions in cases
where this optimization doesn't apply.  For example,

DO $$
DECLARE count integer;
BEGIN
For count In 1..100 Loop
Execute 'explain Select sum(thousand)from tenk1 group by ten';
END LOOP;
END;
$$;

The above block takes 43700.0289 ms on HEAD and 45025.3779 ms with the
patch, which is approximately a 3% regression.  In this case, the
regression is smaller than in the previously reported cases, but I guess
we might see a bigger regression if the number of columns is larger or
with a different set of queries.   I think the cases which can slow
down are those where we need to use a physical tlist in
create_projection_plan and the caller has requested CP_LABEL_TLIST.  I
have checked the regression tests and there seem to be more cases which
will be impacted.  Another such example from the regression tests is:
select count(*) >= 0 as ok from pg_available_extension_versions;


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



new buildfarm with gcc & clang "trunk" -Werror?

2018-03-28 Thread Fabien COELHO


Dear devs,

I've been running buildfarm members moonjelly and seawasp, which use
weekly-recompiled trunk versions of gcc & clang to build the postgres
HEAD branch, for about 6 months.


These helped uncover bugs in both gcc & clang, which were reported and 
fixed, so there is a benefit on this side.


These compilations currently fail with -Werror. "gcc trunk" generates 44
warnings about snprintf-related possible truncations, and "clang trunk"
generates 68 warnings about undeclared strl* functions.


Would the project consider it appropriate to:

 - fix these warnings (other than putting -Wno-format-truncation or
   similar workarounds...).

 - add permanent gcc/clang trunk beasts with -Werror
   (if so, I'd suggest cannonball & seanettle for the names).

The rationale is that postgres will eventually need to be compilable
with future versions of these compilers, so the earlier issues are
detected, the better. Also, warnings help detect latent bugs.


However these beasts would be more unstable than others, eg they would
start failing when a new warning is added. They are somewhat unstable
already, though, as every few months the postgres compilation fails
because of a new bug in the compilers.


--
Fabien.



Re: Proposal: http2 wire format

2018-03-28 Thread Craig Ringer
On 28 March 2018 at 16:02, Andres Freund  wrote:

> On 2018-03-26 22:44:09 +0200, Damir Simunic wrote:
> > > *NONE* of the interesting problems are solved by HTTP2. You *still*
> > > need a full blown protocol ontop of it. So no, this doesn't change
> that.
> >
> > If you had to nominate only one of those problems, which one would you
> consider the most interesting?
>
> A few random, very tired, points:
>
> - consolidated message for common tasks:
>   - (bind, [describe?,] execute) to reduce overhead of prepared
> statement execution (both in messages, as well as branches)
>   - (anonymous parse, bind, describe, execute) to make it cheaper to
> send statements with out-of-line parameters
> - get rid of length limits of individual fields, probably w/ some variable
>   length encoding (simple 7 bit?)
>

In preparation for the eventually-inevitable 64-bit field sizes, yes.

This should be on the protocol todo wiki.


> - allow *streaming* of large datums


Yes, very much +1 there. That's already on the wiki. Yeah:

* Permit lazy fetches of large values, at least out-of-line TOASTED values
http://www.postgresql.org/message-id/53ff0ef8@2ndquadrant.com


> - type-level decisions about binary type transport, right now it's a lot
>   of effort (including potentially additional roundtrips), to get the
>   few important types to be transported in binary fashion. E.g. floating
>   points are really expensive to stringify, bytea as text gets a lot
>   bigger etc, but a lot of other types don't benefit a lot
>

Yeah, as distinct from now, where the client has to specify
param-by-param, and where libpq doesn't support mixing text and binary
formats in result sets at all.

Again, needs wiki. I'll add.

> - annotate COMMIT, PREPARE TRANSACTION, COMMIT PREPARED with LSN of
>   associated WAL record
>

Already on the wiki, as is the related job of sending the xid of a txn to
the client when one is assigned.


> - have a less insane cancellation handling
>

+100

> - nested table support
>
Can you elaborate on that one?


A few other points that come to mind for me are:

* labeled result sets (useful for stored procs, etc, as came up recently
with trying to figure out how to let stored procs have OUT params and
multiple result sets)

* room for other resultset formats later. Like Damir, I really want to add
protobuf or json serializations of result sets at some point, mainly so we
can return "entity graphs" in graph representation rather than left-join
projection.

* Robert Haas was talking about some issues relating to sync and the COPY
BOTH protocol a while ago, which we'd want to address.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Proposal: http2 wire format

2018-03-28 Thread Tatsuo Ishii
> A few random, very tired, points:
> 
> - consolidated message for common tasks:
>   - (bind, [describe?,] execute) to reduce overhead of prepared
> statement execution (both in messages, as well as branches)
>   - (anonymous parse, bind, describe, execute) to make it cheaper to
> send statements with out-of-line parameters
> - get rid of length limits of individual fields, probably w/ some variable
>   length encoding (simple 7 bit?)
> - allow *streaming* of large datums
> - type-level decisions about binary type transport, right now it's a lot
>   of effort (including potentially additional roundtrips), to get the
>   few important types to be transported in binary fashion. E.g. floating
>   points are really expensive to stringify, bytea as text gets a lot
>   bigger etc, but a lot of other types don't benefit a lot
> - annotate COMMIT, PREPARE TRANSACTION, COMMIT PREPARED with LSN of
>   associated WAL record
> - have a less insane cancellation handling
> - nested table support

I would like the portal/statement name to be added to the response
messages (i.e. parse complete, bind complete, close complete, and
command complete). Currently it's not easy to recognize which
response corresponds to which message, which makes certain
applications such as Pgpool-II hard to implement and inefficient.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Cast jsonb to numeric, int, float, bool

2018-03-28 Thread Teodor Sigaev
I think it should support from/to numeric/bool/text only. If we want to
have casts from numeric to other numeric types, then it should be the full
set (int2, int4, int8, float4, float8).


I was too optimistic about casting to/from text. It already exists, and it
works differently from the suggested casts:


1) select '"foo"'::jsonb::text;  -> "foo"  // with ""
2) select '123'::jsonb::text;-> 123
3) select '123'::jsonb::int4;-> 123

It seems that sometimes the desirable result for 1) would be just
'foo' without the "" decoration, but we cannot have two different explicit
casts for the same types. So I withdraw my objections about text casting,
but I still believe that we need one of two variants:

1) numeric/bool
2) bool, numeric/all variants of numeric types


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: new function for tsquery creartion

2018-03-28 Thread Aleksander Alekseev
Hello Dmitry,

> A few gotchas:
> 
> I haven't touched gettoken_tsvector() yet. As a result, the following
> queries produce errors:
> 
> select websearch_to_tsquery('simple', );
> ERROR:  syntax error in tsquery: "'"
> 
> select websearch_to_tsquery('simple', '\');
> ERROR:  there is no escaped character: "\"
> 
> Maybe there's more. The question is: should we fix those, or it's fine as it
> is? I don't have a strong opinion about this.

It doesn't sound right to me to accept any input as a general rule but
sometimes return errors nevertheless. That API would be complicated for
the users. Thus I suggest accepting any garbage and trying our best to
interpret it.

-- 
Best regards,
Aleksander Alekseev


signature.asc
Description: PGP signature


Re: Proposal: http2 wire format

2018-03-28 Thread Craig Ringer
On 28 March 2018 at 00:42, Damir Simunic 
wrote:

>
>
> I'm rapidly losing interest. Unless this goes back toward the concrete and
> practical I think it's going nowhere.
>
>
>
> Your message is exactly what I was hoping for. Thanks for your guidance
> and support, really appreciate you.
>
> Let me now get busy and earn your continued interest and support.
>
>
I spent a lot of time reviewing what you wrote and proposed, looked over
your proof of concept, and offered feedback. Much of which you ignored.
I've been trying to help and took a fair bit of time to do so.

I've outlined what I think needs to happen to push this in a practical
direction.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Proposal: http2 wire format

2018-03-28 Thread Andres Freund
On 2018-03-26 22:44:09 +0200, Damir Simunic wrote:
> > *NONE* of the interesting problems are solved by HTTP2. You *still*
> > need a full blown protocol ontop of it. So no, this doesn't change that.
>
> If you had to nominate only one of those problems, which one would you 
> consider the most interesting?

A few random, very tired, points:

- consolidated message for common tasks:
  - (bind, [describe?,] execute) to reduce overhead of prepared
statement execution (both in messages, as well as branches)
  - (anonymous parse, bind, describe, execute) to make it cheaper to
send statements with out-of-line parameters
- get rid of length limits of individual fields, probably w/ some variable
  length encoding (simple 7 bit?)
- allow *streaming* of large datums
- type-level decisions about binary type transport, right now it's a lot
  of effort (including potentially additional roundtrips), to get the
  few important types to be transported in binary fashion. E.g. floating
  points are really expensive to stringify, bytea as text gets a lot
  bigger etc, but a lot of other types don't benefit a lot
- annotate COMMIT, PREPARE TRANSACTION, COMMIT PREPARED with LSN of
  associated WAL record
- have a less insane cancellation handling
- nested table support
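
For illustration, a minimal sketch of the kind of 7-bit variable-length
encoding mentioned above (protobuf-style base-128 varint; purely
illustrative, not a concrete wire-format proposal):

/* Encode v as a base-128 varint: 7 data bits per byte, high bit = "more". */
static int
encode_varint(uint64 v, unsigned char *buf)
{
	int			len = 0;

	do
	{
		unsigned char b = v & 0x7F;

		v >>= 7;
		if (v != 0)
			b |= 0x80;			/* continuation bit: more bytes follow */
		buf[len++] = b;
	} while (v != 0);

	return len;					/* 1..10 bytes for a 64-bit value */
}

A 64-bit length then costs one byte for values below 128, removing the
fixed 32-bit length fields and their limits at the same time.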

Greetings,

Andres Freund



Re: Oddity in COPY FROM handling of check constraints on partition tables

2018-03-28 Thread Ashutosh Bapat
On Wed, Mar 28, 2018 at 6:58 AM, Amit Langote
 wrote:
>> which violates the constraint on the column b (ie, b > 0), so this
>> should abort.  The reason for that is because CopyFrom looks at the
>> parent relation's constraints, not the partition's constraints, when
>> checking the constraint against the input row.
>
> Good catch, thanks!
>

+1

>> Attached is a patch for fixing this issue.
>
> That looks good to me.  This one would need to be back-patched to v10.
Thanks. Please add it to the next commitfest so that it doesn't get lost.
We cannot add this to the v11 open items since it isn't exactly a v11
bug.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [PATCH] Lockable views

2018-03-28 Thread Yugo Nagata
On Wed, 28 Mar 2018 15:45:09 +0900 (JST)
Tatsuo Ishii  wrote:

> >> I found the previous patch was broken and it can't handle
> >> views that have subqueries, as below:
> >> 
> >>  CREATE VIEW lock_view6 AS SELECT * from (select * from lock_tbl1) sub;
> >> 
> >> I fixed this and attached the updated version including additional tests.
> > 
> > This patch gives a warning while compiling:
> > 
> > lockcmds.c:186:1: warning: no semicolon at end of struct or union
> >  } LockViewRecurse_context;
> >  ^
> 
> Also I get a regression test failure:

Thank you for your reviewing my patch.
I attached the updated patch, v10.

Regards,

> 
> *** 
> /usr/local/src/pgsql/current/postgresql/src/test/regress/expected/lock.out
> 2018-03-28 15:24:13.805314577 +0900
> --- /usr/local/src/pgsql/current/postgresql/src/test/regress/results/lock.out 
> 2018-03-28 15:42:07.975592594 +0900
> ***
> *** 118,124 
>   
>lock_tbl1
>lock_view6
> ! (2 rows)
>   
>   ROLLBACK;
>   -- Verify that we can lock a table with inheritance children.
> --- 118,125 
>   
>lock_tbl1
>lock_view6
> !  mvtest_tm
> ! (3 rows)
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp


-- 
Yugo Nagata 
diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
index b2c7203..96d477c 100644
--- a/doc/src/sgml/ref/lock.sgml
+++ b/doc/src/sgml/ref/lock.sgml
@@ -46,6 +46,11 @@ LOCK [ TABLE ] [ ONLY ] name [ * ]
   
 
   
+   When a view is specified to be locked, all relations appearing in the view
+   definition query are also locked recursively with the same lock mode. 
+  
+
+  
When acquiring locks automatically for commands that reference
tables, PostgreSQL always uses the least
restrictive lock mode possible. LOCK TABLE
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 6479dcb..e8a8877 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -23,11 +23,15 @@
 #include "utils/acl.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
+#include "rewrite/rewriteHandler.h"
+#include "access/heapam.h"
+#include "nodes/nodeFuncs.h"
 
-static void LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait);
-static AclResult LockTableAclCheck(Oid relid, LOCKMODE lockmode);
+static void LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, Oid userid);
+static AclResult LockTableAclCheck(Oid relid, LOCKMODE lockmode, Oid userid);
 static void RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid,
 			 Oid oldrelid, void *arg);
+static void LockViewRecurse(Oid reloid, Oid root_reloid, LOCKMODE lockmode, bool nowait);
 
 /*
  * LOCK TABLE
@@ -62,8 +66,10 @@ LockTableCommand(LockStmt *lockstmt)
 		  RangeVarCallbackForLockTable,
		  (void *) &lockstmt->mode);
 
-		if (recurse)
-			LockTableRecurse(reloid, lockstmt->mode, lockstmt->nowait);
+		if (get_rel_relkind(reloid) == RELKIND_VIEW)
+			LockViewRecurse(reloid, reloid, lockstmt->mode, lockstmt->nowait);
+		else if (recurse)
+			LockTableRecurse(reloid, lockstmt->mode, lockstmt->nowait, GetUserId());
 	}
 }
 
@@ -86,15 +92,17 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid, Oid oldrelid,
 		return;	/* woops, concurrently dropped; no permissions
  * check */
 
-	/* Currently, we only allow plain tables to be locked */
-	if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)
+
+	/* Currently, we only allow plain tables or views to be locked */
+	if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE &&
+		relkind != RELKIND_VIEW)
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("\"%s\" is not a table",
+ errmsg("\"%s\" is not a table or view",
 		rv->relname)));
 
 	/* Check permissions. */
-	aclresult = LockTableAclCheck(relid, lockmode);
+	aclresult = LockTableAclCheck(relid, lockmode, GetUserId());
 	if (aclresult != ACLCHECK_OK)
 		aclcheck_error(aclresult, get_relkind_objtype(get_rel_relkind(relid)), rv->relname);
 }
@@ -107,7 +115,7 @@ RangeVarCallbackForLockTable(const RangeVar *rv, Oid relid, Oid oldrelid,
  * multiply-inheriting children more than once, but that's no problem.
  */
 static void
-LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
+LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait, Oid userid)
 {
 	List	   *children;
 	ListCell   *lc;
@@ -120,7 +128,7 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
 		AclResult	aclresult;
 
 		/* Check permissions before acquiring the lock. */
-		aclresult = LockTableAclCheck(childreloid, lockmode);
+		aclresult = LockTableAclCheck(childreloid, lockmode, userid);
 		if (aclresult != ACLCHECK_OK)
 		{
 			char	   *relname = get_rel_name(childreloid);
@@ -157,15 +165,120 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool 

Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Pavan Deolasee
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan  wrote:

> On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee
>  wrote:
> > + * When index-to-heap verification is requested, a Bloom filter is
> > + * used to fingerprint all tuples in the target index, as the index
> > + * is traversed to verify its structure.  A heap scan later verifies
> > + * the presence in the heap of all index tuples fingerprinted within
> > + * the Bloom filter.
> > + *
> >
> > Is that correct? Aren't we actually verifying the presence in the index
> of
> > all
> > heap tuples?
>
> I think that you could describe it either way. We're verifying the
> presence of heap tuples in the heap that ought to have been in the
> index (that is, most of those that were fingerprinted).
>

Hmm Ok. It still sounds backwards to me, but then English is not my first
or even second language. "A heap scan later verifies the presence in the
heap of all index tuples fingerprinted" sounds as if we expect to find all
fingerprinted index tuples in the heap. But in reality, we check if all
heap tuples have a fingerprinted index tuple.


>
> You're right - there is a narrow window for REPEATABLE READ and
> SERIALIZABLE transactions. This is a regression in v6, which
> removed the TransactionXmin test.
>
> I am tempted to fix this by calling GetLatestSnapshot() instead of
> GetTransactionSnapshot(). However, that has a problem of its own -- it
> won't work in parallel mode, and we're dealing with parallel
> restricted functions, not parallel unsafe functions. I don't want to
> change that to fix such a narrow issue. IMV, a better fix is to treat
> this as a serialization failure. Attached patch, which applies on top
> of v7, shows what I mean.
>

Ok. I am happy with that.


>
> I think that this bug is practically impossible to hit, because we use
> the xmin from the pg_index tuple during "is index safe to use?"
> indcheckxmin/TransactionXmin checks (the patch that I attached adds a
> similar though distinct check), which raises another question for a
> REPEATABLE READ xact. That question is: How is a REPEATABLE READ
> transaction supposed to see the pg_index entry to get the index
> relation's oid to call a verification function in the first place?


Well, the pg_index entry will be visible, and should be visible;
otherwise how would you even maintain a newly created index? I am not
sure, but I guess we take fresh MVCC snapshots for catalog searches,
even in RR transactions.


> My
> point is that there is no need for a more complicated solution than
> what I propose.
>

I agree on that.


>
> I don't think so. The way we compute OldestXmin for
> IndexBuildHeapRangeScan() is rather like a snapshot acquisition --
> GetOldestXmin() locks the proc array in shared mode. As you pointed
> out, the fact that it comes after everything else (fingerprinting)
> means that it must be equal to or later than what index scans saw,
> that allowed them to do the kill_prior_tuple() stuff (set LP_DEAD
> bits).
>
>
That's probably true.


>
> > Are there any interesting cases around
> > INSERT_IN_PROGRESS/DELETE_IN_PROGRESS tuples, especially if those
> > tuples were inserted/deleted by our own transaction? It is probably
> > worth thinking about.
>

Anything here that you would like to check? I understand that you may see
such tuples only for catalog tables.


>
> IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
> ShareUpdateExclusiveLock (it just says SharedLock), because that lock
> strength doesn't have anything to do with IndexBuildHeapRangeScan()
> when it operates with an MVCC snapshot. I think that this means that
> this patch doesn't need to update comments within
> IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
> independent.
>

Ok, I agree. But note that we are now invoking that code with
AccessShareLock, whereas the existing callers use either ShareLock or
ShareUpdateExclusiveLock. That probably does not matter, but it's a
change worth noting.


>
> > Is
> > there anything we can do to lessen that burden like telling other
> backends
> > to
> > ignore our xmin while computing OldestXmin (like vacuum does)?
>
> I don't think so. The transaction involved is still an ordinary user
> transaction.
>

While that's true, I am wondering whether there is anything we can do to
stop a consistency-checking utility from creating other non-trivial side
effects on the state of the database, even if that means we cannot check
all heap tuples. For example, could there be a way to allow concurrent
vacuum or HOT pruning to continue pruning away dead tuples, even if that
means running bt_check_index() in some specialised way (such as not
allowing it in a transaction block) where the heap scan might miss some
tuples? I don't know the answer, but given that bt_check_index() can
sometimes take hours if not days to finish, it seems worth thinking
about, or at least documenting the side effects somewhere.

Thanks,
Pavan

-- 
 

Re: Oddity in COPY FROM handling of check constraints on partition tables

2018-03-28 Thread Etsuro Fujita
(2018/03/28 10:28), Amit Langote wrote:
>> Attached is a patch for fixing this issue.
> 
> That looks good to me.  This one would need to be back-patched to v10.

Thanks for the review!

Best regards,
Etsuro Fujita



Re: [HACKERS] [PATCH] Lockable views

2018-03-28 Thread Tatsuo Ishii
> On Wed, 28 Mar 2018 15:45:09 +0900 (JST)
> Tatsuo Ishii  wrote:
> 
>> >> I found the previous patch was broken and it can't handle
>> >> views that have subqueries, as below:
>> >> 
>> >>  CREATE VIEW lock_view6 AS SELECT * from (select * from lock_tbl1) sub;
>> >> 
>> >> I fixed this and attached the updated version including additional tests.
>> > 
>> > This patch gives a warning while compiling:
>> > 
>> > lockcmds.c:186:1: warning: no semicolon at end of struct or union
>> >  } LockViewRecurse_context;
>> >  ^
>> 
>> Also I get a regression test failure:
> 
> Thank you for your reviewing my patch.
> I attached the updated patch, v10.

Thanks. Looks good to me. I marked the patch as "Ready for Committer".
Unless there's an objection, especially from Robert or Thomas Munro, I
am going to commit/push it.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] [PATCH] Incremental sort

2018-03-28 Thread Teodor Sigaev

BTW, patch had conflicts with master.  Please, find rebased version attached.


Despite the conflicts with master, the patch looks committable. Does
anybody object to committing it?

The patch has received several rounds of review over 2 years, and it
seems to me that keeping it out of the sources may cause us to lose it,
even though it offers performance improvements in rather wide use cases.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Parallel Aggregates for string_agg and array_agg

2018-03-28 Thread Tomas Vondra


On 03/28/2018 05:28 AM, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 03/27/2018 04:51 AM, David Rowley wrote:
>>> Seems I didn't mean "trans types". I should have said aggregate
>>> function argument types.
> 
>> I'm not sure that's better than the check proposed by Tom. An argument
>> type without send/receive function does not necessarily mean we can't
>> serialize/deserialize the trans value. Because who says the argument
>> value will be embedded in the trans value?
> 
> In general we would not know that, but *for these specific serial/
> deserial functions*, we know exactly what they will do.  Also, IIRC,
> the trans type is declared as INTERNAL, so we don't really have any
> hope of identifying the behavior by inspecting that type declaration.
> 

Sure, which is why I thought your solution was better, as it was
targeted at those particular functions.

> Getting a solution that would work for other polymorphic serialization
> functions seems like a bit of a research project to me.  In the meantime,
> I think David's right that what we need to look at is the actual input
> type of the aggregate, and then assume that what's to be serialized is
> an array of that.  Conceivably an aggregate could be built that uses
> these serial/deserial functions and yet its input type is something else
> than what it constructs an array of ... but I find it a bit hard to
> wrap my brain around what that would be exactly.
> 

But David's fix doesn't check that the aggregate produces an array of the
input type (or anyarray). It could easily be an aggregate computing a
bloom filter or something like that, which has no such issues in the
serial/deserial functions.

Also, if it's checking aggref->aggargtypes, it'll reject anyelement
parameters, no?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Jsonb transform for pl/python

2018-03-28 Thread Peter Eisentraut
On 3/21/18 18:50, Peter Eisentraut wrote:
> On 3/12/18 11:26, Nikita Glukhov wrote:
>> I have reviewed this patch.  Attached new 6th version of the patch with
>> v5-v6 delta-patch.
> 
> Thanks for the update.  I'm working on committing this.

Committed.

Everything seemed to work well.  I just did a bit of cosmetic cleanups.
I did a fair amount of tweaking on the naming of functions, the
extensions and library names, etc., to make it look like the existing
plpython transforms.  I also changed it so that the transform would
support mappings and sequences other than dict and list.  I removed the
non-ASCII test and the test with the huge numbers.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Pavan Deolasee
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan  wrote:

>
> I don't think so. The transaction involved is still an ordinary user
> transaction.
>
>
Mostly a nitpick, but I guess we should leave a comment
after IndexBuildHeapScan() saying heap_endscan() is not necessary
since IndexBuildHeapScan()
does that internally. I stumbled upon that while looking for any potential
leaks. I know at least one other caller of IndexBuildHeapScan() doesn't
bother to say anything either, but it's helpful.

FWIW I also looked at the 0001 patch and it looks fine to me.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PL/pgSQL nested CALL with transactions

2018-03-28 Thread Tomas Vondra


On 03/28/2018 02:54 PM, Peter Eisentraut wrote:
> On 3/27/18 20:43, Tomas Vondra wrote:
>>>> 3) utility.c
>>>>
>>>> I find this condition rather confusing:
>>>>
>>>> (!(context == PROCESS_UTILITY_TOPLEVEL ||
>>>>    context == PROCESS_UTILITY_QUERY_NONATOMIC) ||
>>>>    IsTransactionBlock())
>>>>
>>>> I mean, negated || with another || - it took me a while to parse what
>>>> that means. I suggest doing this instead:
>>>>
>>>> #define ProcessUtilityIsAtomic(context)\
>>>>    (!(context == PROCESS_UTILITY_TOPLEVEL ||
>>>>       context == PROCESS_UTILITY_QUERY_NONATOMIC))
>>>>
>>>> (ProcessUtilityIsAtomic(context) || IsTransactionBlock())
>>> fixed
>>>
>> Ummm, I still see the original code here.
> 
> I put the formula into a separate variable isAtomicContext instead of
> repeating it twice.  I think that makes it clearer.  I'm not sure
> splitting it up like above makes it better, because the
> IsTransactionBlock() is part of the "is atomic".  Maybe adding a comment
> would make it clearer.
> 

I see. I thought "fixed" meant you had adopted the #define, but that's
not the case. I don't think an extra comment is needed - the variable
name is descriptive enough IMHO.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [PATCH] Incremental sort

2018-03-28 Thread Teodor Sigaev

BTW, patch had conflicts with master.  Please, find rebased version attached.


Sorry, but patch conflicts with master again.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Oddity in COPY FROM handling of check constraints on partition tables

2018-03-28 Thread Etsuro Fujita

(2018/03/28 18:51), Ashutosh Bapat wrote:

> On Wed, Mar 28, 2018 at 6:58 AM, Amit Langote wrote:
>
>>> Attached is a patch for fixing this issue.
>>
>> That looks good to me.  This one would need to be back-patched to v10.
>
> Thanks. Please add it to the next commitfest so that it doesn't get lost.
> We cannot add this to the v11 open items since it isn't exactly a v11
> bug.


OK, done.

Best regards,
Etsuro Fujita



Re: PL/pgSQL nested CALL with transactions

2018-03-28 Thread Peter Eisentraut
On 3/27/18 20:43, Tomas Vondra wrote:
>>> 3) utility.c
>>>
>>> I find this condition rather confusing:
>>>
>>> (!(context == PROCESS_UTILITY_TOPLEVEL ||
>>>context == PROCESS_UTILITY_QUERY_NONATOMIC) ||
>>>IsTransactionBlock())
>>>
>>> I mean, negated || with another || - it took me a while to parse what
>>> that means. I suggest doing this instead:
>>>
>>> #define ProcessUtilityIsAtomic(context)\
>>>(!(context == PROCESS_UTILITY_TOPLEVEL ||
>>>   context == PROCESS_UTILITY_QUERY_NONATOMIC))
>>>
>>> (ProcessUtilityIsAtomic(context) || IsTransactionBlock())
>> fixed
>>
> Ummm, I still see the original code here.

I put the formula into a separate variable isAtomicContext instead of
repeating it twice.  I think that makes it clearer.  I'm not sure
splitting it up like above makes it better, because the
IsTransactionBlock() is part of the "is atomic".  Maybe adding a comment
would make it clearer.
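
I.e. (sketch):

	bool		isAtomicContext =
		(!(context == PROCESS_UTILITY_TOPLEVEL ||
		   context == PROCESS_UTILITY_QUERY_NONATOMIC) ||
		 IsTransactionBlock());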

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical decoding of two-phase transactions

2018-03-28 Thread Simon Riggs
On 28 March 2018 at 16:28, Nikhil Sontakke  wrote:

> Simon, 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch
> is the exact patch that you had posted for an earlier commit.

0003 Pushed

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Rangejoin rebased

2018-03-28 Thread David Steele
On 3/2/18 11:44 AM, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 11:12 AM, Alexander Kuzmenkov
>  wrote:
>> On 16.01.2018 10:49, Jeff Davis wrote:
>>> My proposed fix is to make an internal opfamily identical to the
>>> external one, such that it's not recognized as part of the same EC,
>>> and the planner won't try to eliminate it. It loses out on potential
>>> optimizations, but those are mostly theoretical since the btree
>>> opclass ordering for ranges is not very interesting to a user.
>>
>> I think I figured out what to do with missing sort directions. We can change
>> select_outer_pathkeys_for_merge() to generate the pathkeys we need. Also,
>> find_mergeclauses_for_outer_pathkeys() has to be changed too, so that it
>> knows which pathkeys are compatible to which range join clauses.
>>
>> About the patch, do I understand it right that you are working on the next
>> version now?
> 
> I think we are quite clearly past the deadline to submit a new patch
> for inclusion in v11 at this point.

It seems that a new patch is needed here but none has been presented.
I've marked this Waiting on Author for the moment, but I really think it
should be marked Returned with Feedback and submitted to the next CF
when a new patch is ready.

Regards,
-- 
-David
da...@pgmasters.net



Re: [bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary

2018-03-28 Thread Fujii Masao
On Wed, Mar 28, 2018 at 11:24 AM, Michael Paquier  wrote:
> On Wed, Mar 28, 2018 at 04:45:49AM +0900, Fujii Masao wrote:
>> The patch looks good to me! Barring objection, I will commit it
>> and back-patch to 9.5 where pg_rewind was added.
>
> Thanks in advance!  I just eyeballed the patch again and I don't see
> issues with what's used here.  The thing should apply smoothly back to
> 9.5, of course let me know if you need help or an extra pair of eyes.

Pushed. Thanks!

Regards,

-- 
Fujii Masao



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 13:52:24 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Given, as explained nearby, we already do store transient data in the
> > ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
> > even a whiff of trouble, I'm currently not inclined to see a huge issue
> > here.  It'd be great if you could expand on your concerns here a bit, we
> > gotta figure out a way forward.
> 
> Just what I said.  There's a lot of code that knows how to follow tuple
> update chains, probably not all of it in core, and this will break it.
> But only in seldom-exercised corner cases, which is the worst of all
> possible worlds from a reliability standpoint.

How will it break it? They'll see an invalid ctid and conclude that the
tuple is dead? Without any changes that's already something that can
happen if a later tuple in the chain has been pruned away.  Sure, that
code won't realize it should error out because the tuple is now in a
different partition, but neither would an infomask bit.

I think my big problem is that I just don't see what the worst that can
happen is. We'd potentially see a broken ctid chain, something that very
commonly happens, and consider the tuple to be invisible.  That seems
pretty sane behaviour for unadapted code, and not any worse than other
potential solutions.


> (I don't think ON CONFLICT is a counterexample because, IIUC, it's not
> a persistent state.)

Hm, it can be persistent if we error out at the right moment. But it's
not super common to encounter that over a long time, I grant you
that. Not that this'd be super persistent either; vacuum/pruning would
normally remove the tuple as well, it's dead after all.


> >> I would've been happier about expending an infomask bit towards this
> >> purpose.  Just eyeing what we've got, I can't help noticing that
> >> HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
> >> in a partitioned table.  Perhaps making these tests depend on
> >> partitioned-ness would be unworkably messy, but it's worth thinking
> >> about.
> 
> > They previously couldn't be set together IIRC, so we could just (mask &
> > (HEAP_MOVED_OFF |HEAP_MOVED_IN)) == (HEAP_MOVED_OFF |HEAP_MOVED_IN) but
> > that'd be permanently eating two infomask bits.
> 
> Hmm.  That objection only matters if we have realistic intentions of
> reclaiming those bits in future, which I've not heard anyone making
> serious effort towards.

I plan to submit a patch early v12 that keeps track of the last time a
table has been fully scanned (and when it was created). With part of the
goal being debuggability and part being able to reclaim things like
these bits.


Greetings,

Andres Freund




Re: Sample values for pg_stat_statements

2018-03-28 Thread David Steele
On 3/21/18 1:31 PM, David Steele wrote:
> 
> On 3/10/18 9:02 AM, Tomas Vondra wrote:
>>
>> I've looked at this patch today. I like the idea / intent in general, as
>> it helps with some investigation tasks. That being said, I have a couple
>> of questions/comments based on read through the patch:
> 
> It looks like there are some privacy concerns with this patch as well as
> the suggestions in the review from Tomas.
> 
> Will you be providing an updated patch for this CF?  If not, I think we
> should mark the entry Returned with Feedback for now and let you
> resubmit when you have an updated patch.

Since it doesn't appear that a new patch will be ready for this CF I
have marked this entry Returned with Feedback.

Regards,
-- 
-David
da...@pgmasters.net



Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Wednesday, March 28, 2018, Fabien COELHO  wrote:

>
>
>> And if we introduce a csv-specific fieldsep, then we multiply this wrong
>> design. The fix in this direction is renaming fieldsep to
>> fieldsep-unaligned - but that is probably too big a change too. So this
>> design is not something I can mark as a good solution.
>>
>
> Good, we somehow agree on something!
>
>> I can live with it because it is much better than using pipe as the
>> separator for csv, and because the real impact is small - for almost all
>> people it will be invisible - but I don't have a good feeling about it.
>
>
Concretely...I'm in favor of the "\pset fieldsep_csv ," solution and csv
format should always use its existing value.  Teach \pset fieldsep to fail
if the current format is csv.  Being able to specify the csv fieldsep like
 "\pset format csv ;" would be a plus.

Unaligned format could grow its own fieldsep if it wanted to but it can
also just use the default provided fieldsep var whose default value is
pipe.  If it did grow one it would need to understand "not set" in order to
preserve existing behavior.  We don't have that problem with csv.

I don't believe we can modify fieldsep without causing unwanted grief.

David J.


Re: Reopen logfile on SIGHUP

2018-03-28 Thread David Steele
On 2/27/18 8:54 PM, Michael Paquier wrote:
> On Tue, Feb 27, 2018 at 05:52:20PM -0500, Tom Lane wrote:
>> It already does treat SIGUSR1 as a log rotation request.  Apparently
>> the point of this patch is that some people don't find that easy enough
>> to use, which is fair, because finding out the collector's PID from
>> outside isn't very easy.
> 
> True enough.  The syslogger does not show up in pg_stat_activity either,
> so I think that being able to do so would help for this case.

There does not seem to be any consensus on this patch so I'm marking it
Waiting on Author for the time being.  At the end of the CF I'll mark it
Returned with Feedback if there is no further activity.

Regards,
-- 
-David
da...@pgmasters.net



Re: Rangejoin rebased

2018-03-28 Thread David Steele
On 3/28/18 2:23 PM, Andres Freund wrote:
> On 2018-03-28 14:18:42 -0400, David Steele wrote:
>> It seems that a new patch is needed here but none has been presented.
>> I've marked this Waiting on Author for the moment, but I really think it
>> should be marked Returned with Feedback and submitted to the next CF
>> when a new patch is ready.
> 
> I'd just do so now. There's not been any progress for months, and
> there's been an update request weeks ago...
Done.

-- 
-David
da...@pgmasters.net



Re: Function to track shmem reinit time

2018-03-28 Thread David Steele
On 3/4/18 11:17 AM, Tomas Vondra wrote:
> 
> Furthermore, the patch is yet another victim of fd1a421fe - fixing the
> pg_proc entries is trivial, but a new version is needed.
> 
> I'd also like to see an example/explanation how this improves this
> situation compared to only having pg_postmaster_start_time.
> 
> So I'm setting this as waiting on author for now.

I'm not sure why this was set back to Needs Review since it neither
applies cleanly nor builds.

I'm setting this entry to Waiting on Author, but based on the discussion
I think it should be Returned with Feedback.

Regards,
-- 
-David
da...@pgmasters.net



Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Wednesday, March 28, 2018, David G. Johnston 
wrote:

> On Wednesday, March 28, 2018, Pavel Stehule 
> wrote:
>
>>
>> Are there some possible alternatives?

>>>
>>> Given the date and the fact that the cf end is 3 days away, the proposed
>>> short term alternative is Daniel's version, that I feel is reasonable. Ok,
>>> people have to do two pset to get comma-separated csv, otherwise they get
>>> pipe-separated csv in one pset.
>>>
>>
>>
> Could someone post how captions, rows-only, and footer pset settings
> factor into this?  Specifically are they fixed to on/off or will they
> hide/show if users request them explicitly?
>
>
I found the original post that covers that; indeed we simply fix these
values, which is what I was thinking.

> and something like --csv-fieldsep should be provided (I like the prefixing
> to tie the option lexically to the master --csv option).
>

Or, really, just make --csv take an optional argument which, if present, is
the delimiter.  No separate argument needed, and we ignore the pset
fieldsep argument like we ignore everything else.

Need to decide whether to "not ignore" --tuples-only, which doesn't seem
controversial aside from being a pset argument that isn't being ignored in
this design...

David J.


Re: csv format for psql

2018-03-28 Thread Joshua D. Drake

On 03/28/2018 12:35 PM, David G. Johnston wrote:
> On Monday, March 26, 2018, Daniel Verite wrote:
>>
>> We could even support only the comma and make it non-configurable
>> based on the fact it's Comma-Separated-Values, not
>> Whatever-Separated-Values, except that won't do much
>> to serve the users interests, as the reality is that
>> people use various separators.
>
> I like to call it "Character Separated Values" now for just that reason.


Isn't the actual wording Character Delimited Values? I may be splitting
hairs here, but every single time I use anything to import a CSV or other
delimited file (TAB or | usually), that is what the import screen says.


JD



David J.



--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *



Re: Using base backup exclusion filters to reduce data transferred with pg_rewind

2018-03-28 Thread Fujii Masao
On Wed, Mar 28, 2018 at 7:54 AM, Michael Paquier  wrote:
> On Wed, Mar 28, 2018 at 04:13:25AM +0900, Fujii Masao wrote:
>> This code is almost ok in practice, but using the check of
>> "strstr(path, localpath) == path" is more robust here?
>
> No problems with that either.
>
>> Using the following code instead is more robust?
>> This original code is almost ok in practice, though.
>>
>> filename = last_dir_separator(path);
>> if (filename == NULL)
>> filename = path;
>> else
>> filename++;
>> if (strcmp(filename, excludeFiles[excludeIdx]) == 0)
>
> Indeed, using last_dir_separator is a better idea for files.  I was
> looking for something like that actually.
>
>> +  (everything except the relation files). Similarly to base backups,
>> +  the contents of the directories pg_dynshmem/,
>> +  pg_notify/, pg_replslot/,
>> +  pg_serial/, pg_snapshots/,
>> +  pg_stat_tmp/, and
>> +  pg_subtrans/ are omitted from the data copied
>> +  from the source cluster. Any file or directory beginning with
>> +  pgsql_tmp is omitted, as well as are
>> +  backup_label,
>> +  tablespace_map,
>> +  pg_internal.init,
>> +  postmaster.opts and
>> +  postmaster.pid.
>>
>> I don't think this description is necessary. The doc for pg_basebackup
>> also doesn't seem to have this kind of description.
>
> Those are listed in backup.sgml.  And I really think that we should at
> least document that the same type of exclusion filters as base backups
> are used in pg_rewind.

Okay, I revived that change in the doc.

>> So attached is the patch that I updated based on your patch and
>> am thinking to apply.
>
> Thanks.  Except for the documentation part, I am OK for the changes
> proposed.

Committed. Thanks!
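(For reference, the exclusion checks discussed above amount to something
like the following self-contained sketch; the helper names are hypothetical,
while last_dir_separator() is the existing src/port routine.)

#include <stdbool.h>
#include <string.h>

extern char *last_dir_separator(const char *filename);

/* true when "path" begins with the excluded directory's local path */
static bool
path_is_under(const char *path, const char *localpath)
{
	return strstr(path, localpath) == path;
}

/* true when the base file name of "path" matches an exclusion entry */
static bool
file_is_excluded(const char *path, const char *excludeFile)
{
	const char *filename = last_dir_separator(path);

	if (filename == NULL)
		filename = path;		/* no directory component at all */
	else
		filename++;				/* skip past the separator itself */

	return strcmp(filename, excludeFile) == 0;
}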

Regards,

-- 
Fujii Masao



Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-03-28 Thread Teodor Sigaev

Hi!

I slightly modified the test to cleanly demonstrate the difference between
fastupdate on and off. I also added CheckForSerializableConflictIn() to the
fastupdate=off codepath, but only in the case of a non-empty pending list.


The next question I see: why don't we lock entry leaf pages in some cases?
As I understand it, a scan should lock any visited page, but right now that's
true only for the posting tree. It seems it should also lock pages in the
entry tree, because concurrent processes could add new entries that could be
matched by a partial search, for example. BTW, partial search (see
collectMatchBitmap()) locks the entry tree correctly, but regular
startScanEntry() doesn't lock the entry page in the posting-tree case, only
in the posting-list case.
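For illustration, the missing entry-leaf lock could look roughly like this,
reusing the GinPredicateLockPage() helper from the patch below (the call
site and variable names here are assumed, not taken from the patch):

	/*
	 * Sketch: predicate-lock the visited entry-tree leaf before descending
	 * into a posting tree, mirroring what collectMatchBitmap() already does.
	 */
	GinPredicateLockPage(ginstate->index,
						 BufferGetBlockNumber(stackEntry->buffer),
						 snapshot);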



Dmitry Ivanov wrote:

I'd like to see fastupdate=on in the tests too; right now the tests cover
only the case without fastupdate. Please add them.


Here's a couple of tests for the pending list (fastupdate = on).



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 37070b3b72..095b1192cb 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/ginxlog.h"
 #include "access/xloginsert.h"
+#include "storage/predicate.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -515,6 +516,19 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			btree->fillRoot(btree, newrootpg,
 			BufferGetBlockNumber(lbuffer), newlpage,
 			BufferGetBlockNumber(rbuffer), newrpage);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(lbuffer));
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
+
 		}
 		else
 		{
@@ -524,6 +538,14 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			GinPageGetOpaque(newrpage)->rightlink = savedRightLink;
 			GinPageGetOpaque(newlpage)->flags |= GIN_INCOMPLETE_SPLIT;
 			GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);
+
+			if (GinPageIsLeaf(BufferGetPage(stack->buffer)))
+			{
+
+PredicateLockPageSplit(btree->index,
+		BufferGetBlockNumber(stack->buffer),
+		BufferGetBlockNumber(rbuffer));
+			}
 		}
 
 		/*
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index f9daaba52e..3fb4fc8264 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -19,6 +19,7 @@
 #include "access/xloginsert.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/rel.h"
 
 /*
@@ -1423,7 +1424,7 @@ disassembleLeaf(Page page)
  * Any segments that acquire new items are decoded, and the new items are
  * merged with the old items.
  *
- * Returns true if any new items were added. False means they were all
+ * Returns true if any new items were added. false means they were all
  * duplicates of existing items on the page.
  */
 static bool
@@ -1759,7 +1760,7 @@ leafRepackItems(disassembledLeaf *leaf, ItemPointer remaining)
  */
 BlockNumber
 createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
-  GinStatsData *buildStats)
+  GinStatsData *buildStats, Buffer entrybuffer)
 {
 	BlockNumber blkno;
 	Buffer		buffer;
@@ -1810,6 +1811,12 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
 	page = BufferGetPage(buffer);
 	blkno = BufferGetBlockNumber(buffer);
 
+	/*
+	 * Copy a predicate lock from entry tree leaf (containing posting list)
+	 * to  posting tree.
+	 */
+	PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
+
 	START_CRIT_SECTION();
 
 	PageRestoreTempPage(tmppage, page);
@@ -1904,6 +1911,7 @@ ginInsertItemPointers(Relation index, BlockNumber rootBlkno,
 		btree.itemptr = insertdata.items[insertdata.curitem];
		stack = ginFindLeafPage(&btree, false, NULL);
 
+		GinCheckForSerializableConflictIn(btree.index, NULL, stack->buffer);
		ginInsertValue(&btree, stack, &insertdata, buildStats);
 	}
 }
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 6fe67f346d..63603859bc 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -17,8 +17,10 @@
 #include "access/gin_private.h"
 #include "access/relscan.h"
 #include "miscadmin.h"
+#include "storage/predicate.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
 
 /* GUC parameter */
 int			GinFuzzySearchLimit = 0;
@@ -33,11 +35,18 @@ typedef struct pendingPosition
 } pendingPosition;
 
 
+static void
+GinPredicateLockPage(Relation index, BlockNumber blkno, Snapshot snapshot)
+{
	if (!GinGetUseFastUpdate(index))
		PredicateLockPage(index, blkno, snapshot);
}

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-03-28 Thread David Steele
On 2/27/18 2:21 AM, Masahiko Sawada wrote:
> 
> Hmm, although I've thought about the concern in the case where we don't
> consider local xids of unresolved fdwxacts in GetOldestXmin, I could not
> find a problem. Could you share your concern if you have one? I'll try to
> find a possibility based on it.

It appears that this entry should be marked Waiting on Author so I have
done that.

I also think it may be time to move this patch to the next CF.

Regards,
-- 
-David
da...@pgmasters.net



Re: PL/pgSQL nested CALL with transactions

2018-03-28 Thread Peter Eisentraut
On 3/28/18 09:00, Tomas Vondra wrote:
> I see. I thought "fixed" means you adopted the #define, but that's not
> the case. I don't think an extra comment is needed - the variable name
> is descriptive enough IMHO.

Committed as presented then.  Thanks.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Monday, March 26, 2018, Daniel Verite  wrote:

>
> We could even support only the comma and make it non-configurable
> based on the fact it's Comma-Separated-Values, not
> Whatever-Separated-Values, except that won't do much
> to serve the users interests, as the reality is that
> people use various separators.
>

I like to call it "Character Separated Values" now for just that reason.

David J.


Re: PostgreSQL crashes with SIGSEGV

2018-03-28 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Mar 26, 2018 at 5:14 AM, Kyotaro HORIGUCHI
>  wrote:
>> If no one objects, I'll mark this as Ready for Commit in a couple
>> of days.

> Thank you for the review, Horiguchi-san. It's hard to decide how
> important each goal is when coming up with a back-patchable fix like
> this. When the goals are somewhat in competition with each other, a
> second or a third opinion is particularly appreciated.

It looks good to me.  The only real objection would be if someone came
up with a test case proving that there's a significant performance
degradation from the extra copies.  But given that these are back
branches, it would take a pretty steep penalty for me to want to take
the risk of refactoring to avoid that.

I've pushed it with some cosmetic adjustments.

regards, tom lane



Re: disable SSL compression?

2018-03-28 Thread Konstantin Knizhnik



On 28.03.2018 20:26, Konstantin Knizhnik wrote:



On 17.03.2018 17:18, Peter Eisentraut wrote:

On 3/11/18 13:28, Tom Lane wrote:

My proposal is the attached patch that sets the default in libpq to off
and adjusts the documentation a bit so it doesn't sound like we have
missed the news altogether.

Seems reasonable as far as it goes, but do we need to make corresponding
server-side changes?

We could add a setting to disable SSL compression on the server, as some
web servers have been doing, but the niche of version combinations where
that would be applicable seems pretty low.




One of our customers managed to improve speed about 10 times by
using SSL compression, for a system where the client and servers are
located in different geographical regions
and query results are very large because of JSON columns. They
actually do not need encryption, just compression.
I expect that this is not the only case where compression of the libpq
protocol can be useful. Please note that logical replication also
uses the libpq protocol.


If SSL compression is deprecated, should we provide our own compression?
I have implemented a prototype of it (patch is attached).
I have added a compression=on/off parameter to the connection string and a
-Z option to the psql and pgbench utilities.
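For illustration, the new keyword would presumably be used from libpq like
any other connection option ("compression" below is the parameter added by
the patch, not an existing libpq keyword):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* "compression=on" is the option the patch adds */
	PGconn *conn = PQconnectdb("host=example.com dbname=postgres compression=on");

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
	PQfinish(conn);
	return 0;
}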

Below are some results:

Compression ratio (raw->compressed):


                     libz (level=1)      libzstd (level=1)
pgbench -i -s 10     16997209->2536330   16997209->268077
pgbench -t 10 -S     6289036->1523862    6288933->1777400
                     6600338<-900293     6600338<-1000318


There is no mistyping: libzstd compresses COPY data about 10 times
better than libz, with a wonderful compression ratio of 63.


The influence on execution time is minimal (I tested a local
configuration where client and server are on the same host):



                     no compression   libz (level=1)   libzstd (level=1)
pgbench -i -s 10     1.552            1.572            1.611
pgbench -t 10 -S     4.482            4.926            4.877





--
Konstantin Knizhnik
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company


Sorry, I attached the wrong patch.
To use zstd compression, Postgres should be configured with --with-zstd.
Otherwise compression will use zlib, unless it is disabled with the
--without-zlib option.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/configure b/configure
index 6f659a5..f2ef62d 100755
--- a/configure
+++ b/configure
@@ -698,6 +698,7 @@ ELF_SYS
 EGREP
 GREP
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 with_libxml
@@ -862,6 +863,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 enable_float4_byval
@@ -8016,6 +8018,86 @@ fi
 
 
 #
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test "${with_zstd+set}" = set; then :
+  withval=$with_zstd;
+  case $withval in
+yes)
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_zstd=no
+
+fi
+
+
+
+
+if test "$with_zstd" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5
+$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; }
+if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lzstd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char ZSTD_compress ();
+int
+main ()
+{
+return ZSTD_compress ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_zstd_ZSTD_compress=yes
+else
+  ac_cv_lib_zstd_ZSTD_compress=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5
+$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; }
+if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBZSTD 1
+_ACEOF
+
+  LIBS="-lzstd $LIBS"
+
+else
+  as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5
+fi
+
+fi
+
+
+
+#
 # Elf
 #
 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3bbdf17..08b8ab2 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -195,6 +195,7 @@ with_llvm	= @with_llvm@
 with_system_tzdata = @with_system_tzdata@
 with_uuid	= @with_uuid@
 with_zlib	= @with_zlib@
+with_zstd   = @with_zstd@
 enable_rpath	= @enable_rpath@
 enable_nls	= @enable_nls@
 enable_debug	= @enable_debug@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 6450c5a..1522c70 100644

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-28 Thread Tom Lane
Andres Freund  writes:
> On 2018-03-08 13:46:53 -0500, Tom Lane wrote:
>> ... Breaking fundamental invariants like
>> "ctid points to this tuple or its update successor" is going to cause
>> trouble.  There's a lot of code that knows that; more than knows the
>> details of what's in xmax, I believe.

> Given, as explained nearby, we already do store transient data in the
> ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
> even a whiff of trouble, I'm currently not inclined to see a huge issue
> here.  It'd be great if you could expand on your concerns here a bit, we
> gotta figure out a way forward.

Just what I said.  There's a lot of code that knows how to follow tuple
update chains, probably not all of it in core, and this will break it.
But only in seldom-exercised corner cases, which is the worst of all
possible worlds from a reliability standpoint.  (I don't think ON CONFLICT
is a counterexample because, IIUC, it's not a persistent state.)

Given that there are other ways we could attack it, I think throwing away
this particular invariant is an unnecessarily risky solution.

>> I would've been happier about expending an infomask bit towards this
>> purpose.  Just eyeing what we've got, I can't help noticing that
>> HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
>> in a partitioned table.  Perhaps making these tests depend on
>> partitioned-ness would be unworkably messy, but it's worth thinking
>> about.

> They previously couldn't be set together IIRC, so we could just (mask &
> (HEAP_MOVED_OFF |HEAP_MOVED_IN)) == (HEAP_MOVED_OFF |HEAP_MOVED_IN) but
> that'd be permanently eating two infomask bits.

Hmm.  That objection only matters if we have realistic intentions of
reclaiming those bits in future, which I've not heard anyone making
serious effort towards.  Rather than messing with the definition of ctid,
I'd be happier with saying that they're never going to be reclaimed, but
at least we're getting one bit's worth of real use out of them.
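For concreteness, the combined-bits test could look like this (the macro
names are hypothetical; HEAP_MOVED_OFF and HEAP_MOVED_IN are the existing
infomask bits):

/*
 * Both MOVED bits set together is a combination old-style VACUUM FULL
 * never produced, so it is free to mean "moved to another partition".
 * Using it this way permanently consumes the pair of bits.
 */
#define HEAP_MOVED_PARTITION	(HEAP_MOVED_OFF | HEAP_MOVED_IN)

#define HeapTupleHeaderIndicatesMovedPartitions(tup) \
	(((tup)->t_infomask & HEAP_MOVED_PARTITION) == HEAP_MOVED_PARTITION)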

regards, tom lane



Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Wednesday, March 28, 2018, Pavel Stehule  wrote:

>
> Are there some possible alternatives?
>>>
>>
>> Given the date and the fact that the cf end is 3 days away, the proposed
>> short term alternative is Daniel's version, that I feel is reasonable. Ok,
>> people have to do two pset to get comma-separated csv, otherwise they get
>> pipe-separated csv in one pset.
>>
>
>
Could someone post how captions, rows-only, and footer pset settings factor
into this?  Specifically are they fixed to on/off or will they hide/show if
users request them explicitly?

My take on this is that --csv mode is/should be an alternate output mode
from the existing pset controlled one, and functions basically like "\copy
to stdout" and all echoing and metadata outputs are disabled and only query
results, with header and the user specified delimiter, are output.  No
control other than the delimiter seems to be provided in the current design
but that could be expanded upon.  In that specification the existing
fieldsep argument that is tied to pset should not be used and something
like --csv-fieldsep should be provided (I like the prefixing to tie the
option lexically to the master --csv option).

David J.


Re: disable SSL compression?

2018-03-28 Thread Konstantin Knizhnik



On 17.03.2018 17:18, Peter Eisentraut wrote:

On 3/11/18 13:28, Tom Lane wrote:

My proposal is the attached patch that sets the default in libpq to off
and adjusts the documentation a bit so it doesn't sound like we have
missed the news altogether.

Seems reasonable as far as it goes, but do we need to make corresponding
server-side changes?

We could add a setting to disable SSL compression on the server, as some
web servers have been doing, but the niche of version combinations where
that would be applicable seems pretty low.




One of our customers managed to improve speed about 10 times by
using SSL compression, for a system where the client and servers are
located in different geographical regions
and query results are very large because of JSON columns. They actually
do not need encryption, just compression.
I expect that this is not the only case where compression of the libpq
protocol can be useful. Please note that logical replication also
uses the libpq protocol.


If SSL compression is deprecated, should we provide our own compression?
I have implemented a prototype of it (patch is attached).
I have added a compression=on/off parameter to the connection string and a
-Z option to the psql and pgbench utilities.

Below are some results:

Compression ratio (raw->compressed):


                     libz (level=1)      libzstd (level=1)
pgbench -i -s 10     16997209->2536330   16997209->268077
pgbench -t 10 -S     6289036->1523862    6288933->1777400
                     6600338<-900293     6600338<-1000318


There is no mistyping: libzstd compresses COPY data about 10 times better
than libz, with a wonderful compression ratio of 63.


The influence on execution time is minimal (I tested a local
configuration where client and server are on the same host):



                     no compression   libz (level=1)   libzstd (level=1)
pgbench -i -s 10     1.552            1.572            1.611
pgbench -t 10 -S     4.482            4.926            4.877






--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/configure b/configure
index 6f659a5..f2ef62d 100755
--- a/configure
+++ b/configure
@@ -698,6 +698,7 @@ ELF_SYS
 EGREP
 GREP
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 with_libxml
@@ -862,6 +863,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 enable_float4_byval
@@ -8016,6 +8018,86 @@ fi
 
 
 #
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test "${with_zstd+set}" = set; then :
+  withval=$with_zstd;
+  case $withval in
+yes)
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_zstd=no
+
+fi
+
+
+
+
+if test "$with_zstd" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5
+$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; }
+if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lzstd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char ZSTD_compress ();
+int
+main ()
+{
+return ZSTD_compress ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_zstd_ZSTD_compress=yes
+else
+  ac_cv_lib_zstd_ZSTD_compress=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5
+$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; }
+if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBZSTD 1
+_ACEOF
+
+  LIBS="-lzstd $LIBS"
+
+else
+  as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5
+fi
+
+fi
+
+
+
+#
 # Elf
 #
 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 3bbdf17..08b8ab2 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -195,6 +195,7 @@ with_llvm	= @with_llvm@
 with_system_tzdata = @with_system_tzdata@
 with_uuid	= @with_uuid@
 with_zlib	= @with_zlib@
+with_zstd   = @with_zstd@
 enable_rpath	= @enable_rpath@
 enable_nls	= @enable_nls@
 enable_debug	= @enable_debug@
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 6450c5a..1522c70 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -50,6 +50,14 @@ ifeq ($(with_systemd),yes)
 LIBS += -lsystemd
 endif
 
+ifeq ($(with_zstd),yes)
+LIBS += -lzstd
+endif
+
+ifeq ($(with_zlib),yes)
+LIBS += -lz
+endif
+
 ##
 
 all: submake-libpgport submake-schemapg postgres 

Re: [HACKERS] why not parallel seq scan for slow functions

2018-03-28 Thread Amit Kapila
On Thu, Mar 29, 2018 at 2:31 AM, Robert Haas  wrote:
> On Wed, Mar 28, 2018 at 3:06 AM, Amit Kapila  wrote:
>>
>> The above block takes 43700.0289 ms on Head and 45025.3779 ms with the
>> patch which is approximately 3% regression.
>
> Thanks for the analysis -- the observation that this seemed to affect
> cases where CP_LABEL_TLIST gets passed to create_projection_plan
> allowed me to recognize that I was doing an unnecessary copyObject()
> call.  Removing that seems to have reduced this regression below 1% in
> my testing.
>

I think that is within the acceptable range.  I am seeing a few regression
failures with the patch series.  The order of the targetlist seems to have
changed for Remote SQL.  Kindly find the failure report attached.  I
have requested my colleague Ashutosh Sharma to cross-verify this, and
he is also seeing the same failures.

Few comments/suggestions:

1.
typedef enum UpperRelationKind
 {
  UPPERREL_SETOP, /* result of UNION/INTERSECT/EXCEPT, if any */
+ UPPERREL_TLIST, /* result of projecting final scan/join rel */
  UPPERREL_PARTIAL_GROUP_AGG, /* result of partial grouping/aggregation, if
  * any */
  UPPERREL_GROUP_AGG, /* result of grouping/aggregation, if any */
...
...
  /*
  * Save the various upper-rel PathTargets we just computed into
@@ -2003,6 +2004,7 @@ grouping_planner(PlannerInfo *root, bool
inheritance_update,
  root->upper_targets[UPPERREL_FINAL] = final_target;
  root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
  root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;
+ root->upper_targets[UPPERREL_TLIST] = scanjoin_target;

It seems UPPERREL_TLIST is redundant in the patch now.  I think we can
remove it unless you have something else in mind.

2.
+ /*
+ * If the relation is partitioned, recursively apply the same changes to
+ * all partitions and generate new Append paths. Since Append is not
+ * projection-capable, that might save a separate Result node, and it also
+ * is important for partitionwise aggregate.
+ */
+ if (rel->part_scheme && rel->part_rels)
  {


I think the handling of partitioned rels looks okay, but we might want
to check its overhead once, unless you are sure that this
shouldn't be a problem.  If you think we should check it, then
is it possible to do that as a separate patch, as this doesn't
look to be directly linked to the main patch?  It can be treated as an
optimization for partitionwise aggregates.  I think we can treat it
along with the main patch as well, but it might be somewhat simpler to
verify it if we do it separately.


> Also, keep in mind that we're talking about extremely small amounts of
> time here.  On a trivial query that you're not even executing, you're
> seeing a difference of (45025.3779 - 43700.0289)/1000000 = 0.00132 ms
> per execution.  Sure, it's still 3%, but it's 3% of the time in an
> artificial case where you don't actually run the query.  In real life,
> nobody runs EXPLAIN in a tight loop a million times without ever
> running the query, because that's not a useful thing to do.
>

Agreed, but this was to ensure that we do not do this optimization
at the cost of adding significant cycles to planner time.

>  The
> overhead on a realistic test case will be smaller.  Furthermore, at
> least in my testing, there are now cases where this is faster than
> master.  Now, I welcome further ideas for optimization, but a patch
> that makes some cases slightly slower while making others slightly
> faster, and also improving the quality of plans in some cases, is not
> to my mind an unreasonable thing.
>

Agreed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


regression.diffs
Description: Binary data


Re: Problem while setting the fpw with SIGHUP

2018-03-28 Thread Kyotaro HORIGUCHI
At Wed, 28 Mar 2018 15:59:48 +0900, Michael Paquier  wrote 
in <20180328065948.gm1...@paquier.xyz>
> On Wed, Mar 28, 2018 at 03:40:59PM +0900, Kyotaro HORIGUCHI wrote:
> > The attached does that. I don't like that it uses ControlFileLock
> > to exclude concurrent UpdateFullPageWrites and StartupXLOG, but
> > WALInsertLock cannot be used since UpdateFullPageWrites may take
> > the same lock.
> 
> You visibly forgot your patch.

Mmm, someone must have eaten that. I'm sure it is attached this
time.

I don't like that UpdateFullPageWrites uses ControlFileLock for
exclusion...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cb9c2a29cb..d2bb5e16ac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7558,9 +7558,11 @@ StartupXLOG(void)
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-	 * record is written.
+	 * record is written, ignoring any change of the full_page_writes GUC so far.
 	 */
+	WALInsertLockAcquireExclusive();
 	Insert->fullPageWrites = lastFullPageWrites;
+	WALInsertLockRelease();
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7783,6 +7785,9 @@ StartupXLOG(void)
 	 * itself, we use the info_lck here to ensure that there are no race
 	 * conditions concerning visibility of other recent updates to shared
 	 * memory.
+	 *
+	 * ControlFileLock excludes concurrent update of shared fullPageWrites in
+	 * addition to its ordinary use.
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	ControlFile->state = DB_IN_PRODUCTION;
@@ -7790,11 +7795,25 @@ StartupXLOG(void)
 
	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryInProgress = false;
+	lastFullPageWrites = Insert->fullPageWrites;
	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Shared fullPageWrites at the end of recovery differs from our last
+	 * known value. The change was made by the checkpointer, but we haven't
+	 * written a corresponding XLOG_FPW_CHANGE record. Re-read the GUC in
+	 * case it changed, and try to update shared fullPageWrites ourselves.
+	 * This ends up doing nothing if the checkpointer already did the same.
+	 */
+	if (lastFullPageWrites != fullPageWrites)
+	{
+		HandleStartupProcInterrupts();
+		UpdateFullPageWrites();
+	}
+
 	/*
 	 * If there were cascading standby servers connected to us, nudge any wal
 	 * sender processes to notice that we've been promoted.
@@ -9525,8 +9544,10 @@ XLogReportParameters(void)
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
- * Note: this function assumes there is no other process running
- * concurrently that could update it.
+ * This function can be called concurrently by multiple processes, and
+ * concurrently with changes of shared fullPageWrites.  See
+ * StartupXLOG().
  */
 void
 UpdateFullPageWrites(void)
@@ -9536,13 +9557,25 @@ UpdateFullPageWrites(void)
 	/*
 	 * Do nothing if full_page_writes has not been changed.
 	 *
-	 * It's safe to check the shared full_page_writes without the lock,
-	 * because we assume that there is no concurrently running process which
-	 * can update it.
+	 * We acquire ControlFileLock to exclude other concurrent calls and
+	 * changes of shared fullPageWrites.
 	 */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	WALInsertLockAcquireExclusive();
 	if (fullPageWrites == Insert->fullPageWrites)
+	{
+		WALInsertLockRelease();
+		LWLockRelease(ControlFileLock);
 		return;
+	}
+	WALInsertLockRelease();
 
+	/*
+	 * Need to set up XLogInsert working area before entering critical section
+	 * if we are not in recovery mode.
+	 */
+	(void) RecoveryInProgress();
+		
 	START_CRIT_SECTION();
 
 	/*
@@ -9578,6 +9611,8 @@ UpdateFullPageWrites(void)
 		WALInsertLockRelease();
 	}
 	END_CRIT_SECTION();
+
+	LWLockRelease(ControlFileLock);
 }
 
 /*


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Thomas Munro
On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby  wrote:
> The retries are the source of the problem ; the first fsync() can return EIO,
> and also *clears the error* causing a 2nd fsync (of the same data) to return
> success.

What I'm failing to grok here is how that error flag even matters,
whether it's a single bit or a counter as described in that patch.  If
write back failed, *the page is still dirty*.  So all future calls to
fsync() need to try to try to flush it again, and (presumably) fail
again (unless it happens to succeed this time around).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Craig Ringer
On 29 March 2018 at 10:48, Thomas Munro 
wrote:

> On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier 
> wrote:
> > On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> >> Craig Ringer  writes:
> >>> TL;DR: Pg should PANIC on fsync() EIO return.
> >>
> >> Surely you jest.
> >
> > Any callers of pg_fsync in the backend code are careful enough to check
> > the returned status, sometimes doing retries like in mdsync, so what is
> > proposed here would be a regression.
>
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
>
> https://lwn.net/Articles/724307/


A variant of it, by the looks.

The problem in our case is that the kernel only tells us about the error
once. It then forgets about it. So yes, that seems like a variant of the
statement:


> "Current kernels might report a writeback error on an fsync() call,
> but there are a number of ways in which that can fail to happen."
>
> That's... I'm speechless.


Yeah.

It's a bit nuts.

I was astonished when I saw the behaviour, and that it appears undocumented.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Justin Pryzby
On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer  writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> > 
> > Surely you jest.
> 
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

The retries are the source of the problem ; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.

(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).
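A minimal sketch of that policy, using the existing backend reporting
facilities (the wrapper function itself is hypothetical, not proposed code,
and the includes are indicative):

#include "postgres.h"		/* ereport, errcode_for_file_access, errmsg */
#include "miscadmin.h"		/* pg_usleep() */
#include "storage/fd.h"		/* pg_fsync() */

/*
 * Hypothetical wrapper: PANIC on EIO (the write is unrecoverable and a
 * later retry would falsely succeed), retry on ENOSPC, plain ERROR for
 * anything else.
 */
static void
fsync_or_panic(int fd, const char *path)
{
	while (pg_fsync(fd) != 0)
	{
		int			elevel = (errno == EIO) ? PANIC : ERROR;

		if (errno != ENOSPC)
			ereport(elevel,
					(errcode_for_file_access(),
					 errmsg("could not fsync file \"%s\": %m", path)));

		pg_usleep(100000L);		/* out of space: wait briefly, then retry */
	}
}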

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
> https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to
change it:

 "Storing that value in the file structure has an important benefit: it makes
it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT
CALLS FSYNC()  In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR
OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using dmsetup "error" target, see
attached.

strace looks like this:

kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

 1  open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
 2  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 3  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 4  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 5  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 6  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 7  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
8192
 8  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
2560
 9  write(3, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 
-1 ENOSPC (No space left on device)
10  dup(2)  = 4
11  fcntl(4, F_GETFL)   = 0x8402 (flags 
O_RDWR|O_APPEND|O_LARGEFILE)
12  brk(NULL)   = 0x1299000
13  brk(0x12ba000)  = 0x12ba000
14  fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
15  write(4, "write(1): No space left on devic"..., 34write(1): No space 
left on device
16  ) = 34
17  close(4)= 0
18  fsync(3)= -1 EIO (Input/output error)
19  dup(2)  = 4
20  fcntl(4, F_GETFL)   = 0x8402 (flags 
O_RDWR|O_APPEND|O_LARGEFILE)
21  fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
22  write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output 
error
23  ) = 29
24  close(4)= 0
25  close(3)= 0
26  open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
27  fsync(3)= 0
28  write(3, "\0", 1)   = 1
29  fsync(3)= 0
30  exit_group(0)   = ?

2: EIO isn't seen initially due to writeback page cache;
9: ENOSPC due to small device
18: original IO error reported by fsync, good
25: the original FD is closed
26: ..and file reopened
27: fsync on file with still-dirty data+EIO returns success BAD

10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror
should write to a separate FD (?)

Also note that close() ALSO returned success... which you might think exonerates
the 2nd fsync(), but I think it may itself be problematic, no?  In any case, the
2nd byte certainly never got written to the dm error target, and the failure
status was lost following fsync().

I get the exact same behavior if I break after one write() loop, such as to
avoid ENOSPC.

Justin
/*
% make CFLAGS='-Wall -Wextra -O3' /tmp/eio
% sudo lvcreate -L 9M -n tst data

echo '
0 1 linear /dev/data/tst 0
1 1 error 1
2 99 linear /dev/data/tst 2' |sudo dmsetup create eio

*/

#include 

#include 
#include 
#include 

#include 
#include 

int main()
{
	char buf[8<<10];
	int fd;

	if (-1==(fd=open("/dev/mapper/eio", O_CREAT|O_RDWR, 00600))) {
	//if (-1==(fd=open("/mnt/t", O_CREAT|O_RDWR, 00600))) {
		perror("open");
		exit(1);
	}

	while (1) {
		// if (sizeof(buf)!=(write(fd, buf, sizeof(buf {
		if (-1==write(fd, buf, sizeof(buf))) {
			perror("write(1)");
			

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Craig Ringer
On 29 March 2018 at 13:06, Thomas Munro 
wrote:

> On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby 
> wrote:
> > The retries are the source of the problem ; the first fsync() can return
> EIO,
> > and also *clears the error* causing a 2nd fsync (of the same data) to
> return
> > success.
>
> What I'm failing to grok here is how that error flag even matters,
> whether it's a single bit or a counter as described in that patch.  If
> write back failed, *the page is still dirty*.  So all future calls to
> fsync() need to try to try to flush it again, and (presumably) fail
> again (unless it happens to succeed this time around).
> 
>

You'd think so. But it doesn't appear to work that way. You can see it
yourself with the "error" device-mapper target mapped over part of a
volume.

I wrote a test case here.

https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c

I don't pretend the kernel behaviour is sane. And it's possible I've made
an error in my analysis. But since I've observed this in the wild, and seen
it in a test case, I strongly suspect that's what I've described is just
what's happening, brain-dead or no.

Presumably the kernel marks the page clean when it dispatches it to the I/O
subsystem and doesn't dirty it again on I/O error? I haven't dug that deep
on the kernel side. See the stackoverflow post for details on what I found
in kernel code analysis.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Craig Ringer
On 29 March 2018 at 10:30, Michael Paquier  wrote:

> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer  writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> >
> > Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.



I covered this in my original post.

Yes, we check the return value. But what do we do about it? For fsyncs of
heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint
later, which will retry the fsync(). **Which will now appear to succeed**
because the kernel forgot that it lost our writes after telling us the
first time. So we check the error code, see success, and we complete the
checkpoint and move on.

But we only retried the fsync, not the writes before the fsync.

So we lost data. Or rather, we failed to detect that the kernel did so; our
checkpoint was bad and should not have been completed.

The problem is that we keep retrying checkpoints *without* repeating the
writes leading up to the checkpoint, and retrying fsync.

I don't pretend the kernel behaviour is sane, but we'd better deal with it
anyway.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pg_class.reltuples of brin indexes

2018-03-28 Thread Masahiko Sawada
On Tue, Mar 27, 2018 at 11:28 PM, Alvaro Herrera
 wrote:
> Tom Lane wrote:
>> Tomas Vondra  writes:
>> > I think number of index tuples makes sense, as long as that's what the
>> > costing needs. That is, it's up to the index AM to define it. But it
>> > clearly should not flap like this ...
>>
>> > And it's not just BRIN. This is what I get with a GIN index:
>>
>> Sounds like the same kind of thing we just fixed for SP-GiST :-(
>
> Most likely I modelled the BRIN code after GIN.
>

It's better to create a new index AM that estimates the number of
index tuples, and to update the index stats using that returned value?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Craig Ringer
On 28 March 2018 at 11:53, Tom Lane  wrote:

> Craig Ringer  writes:
> > TL;DR: Pg should PANIC on fsync() EIO return.
>
> Surely you jest.
>

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
well to avoid similar lost-page-write issues.

It's not necessary on ext3/ext4 with errors=remount-ro, but that's only
because the FS stops us dead in our tracks.

I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going
to lose a write, it should at minimum mark the FD as broken so no further
fsync() or anything else can succeed on the FD, and an app that cares about
durability must repeat the whole set of work since the prior succesful
fsync(). Just reporting it once and forgetting it is madness.

But even if we convince the kernel folks of that, how do other platforms
behave? And how long before these kernels are out of use? We'd better deal
with it, crazy or no.

Please see my StackOverflow post for the kernel-level explanation. Note
also the test case link there. https://stackoverflow.com/a/42436054/398670

> Retrying fsync() is not OK at
> > least on Linux. When fsync() returns success it means "all writes since
> the
> > last fsync have hit disk" but we assume it means "all writes since the
> last
> > SUCCESSFUL fsync have hit disk".
>
> If that's actually the case, we need to push back on this kernel brain
> damage, because as you're describing it fsync would be completely useless.
>

It's not useless, it's just telling us something other than what we think
it means. The promise it seems to give us is that if it reports an error
once, everything *after* that is useless, so we should throw our toys,
close and reopen everything, and redo from the last known-good state.

Though as Tomas posted below, it provides rather weaker guarantees than I
thought in some other areas too. See that lwn.net article he linked.


> Moreover, POSIX is entirely clear that successful fsync means all
> preceding writes for the file have been completed, full stop, doesn't
> matter when they were issued.
>

I can't find anything that says so to me. Please quote relevant spec.

I'm working from
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
states that

"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer is
implementation-defined. The fsync() function shall not return until the
system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is
detected. It doesn't say that error has to be persistent and that
subsequent calls must also report the error. It also says:

"If the fsync() function fails, outstanding I/O operations are not
guaranteed to have been completed."

but that doesn't clarify matters much either, because it can be read to
mean that once there's been an error reported for some IO operations
there's no guarantee those operations are ever completed even after a
subsequent fsync returns success.

I'm not seeking to defend what the kernel seems to be doing. Rather, saying
that we might see similar behaviour on other platforms, crazy or not. I
haven't looked past linux yet, though.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PQHost() undefined behavior if connecting string contains both host and hostaddr types

2018-03-28 Thread Kyotaro HORIGUCHI
At Wed, 28 Mar 2018 10:34:49 +0900, Michael Paquier  wrote 
in <20180328013449.gc1...@paquier.xyz>
> On Wed, Mar 28, 2018 at 11:06:23AM +1100, Haribabu Kommi wrote:
> > On Wed, Mar 28, 2018 at 3:35 AM, Peter Eisentraut <
> > peter.eisentr...@2ndquadrant.com> wrote:
> >>
> >> Committed after fixing up the documentation a bit as suggested by others.
> > 
> > Thanks.
> 
> +1.  Thanks for working on this Hari, Peter for the commit and all
> others for your input!

Me too.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: csv format for psql

2018-03-28 Thread David G. Johnston
On Wed, Mar 28, 2018 at 4:19 PM, Isaac Morland 
wrote:

> On 28 March 2018 at 15:43, Joshua D. Drake  wrote:
>
>> On 03/28/2018 12:35 PM, David G. Johnston wrote:
>>
>> I like to call it "Character Separated Values" now for just that reason.
>>
>>
>> Isn't the actual wording Character Delimited Values? I may be picking at
>> hairs here but every single time I use anything to import a CSV or other
>> delimited file (TAB or | usually) that is what the import screen says.
>>
>
> CSV stands for Comma Separated Values, and not anything else common as far
> as I can tell.
>

Language evolves - Wikipedia just hasn't recognized this particular
evolution yet :)

The actual reason I'm posting this is because some of the discussion has
> made me a bit confused: there is already a CSV format defined for the COPY
> command and used by the psql \copy. I just want to check that what is being
> discussed here would use the exact same format as the existing CSV COPY
> format; and the configurability of them should be the exact same options
> (which already includes being able to change the delimiter).
>

Nope, the \copy meta-command, and the corresponding COPY SQL command, have
additional options that I don't believe will be accessible here.
Specifically the "OIDS", "QUOTE", and various "FORCE_" options are not
controllable, nor are ESCAPE and ENCODING (this last I think, or at least
not as flexible).


> I think it's important that Postgres not have two CSVs with slightly
> different behaviours. Indeed, psql could use COPY behind the scenes to
> generate the CSV output, which would guarantee no nasty surprises.
>
> If this is already widely accepted or if I'm horribly misunderstanding the
> discussion then I'm sorry for the noise.
>

I agree this is a possible approach, but the present thinking is
that this is just another "\pset format" option with forced defaults,
excepting the delimiter (DELIMITER), the presence of the column names
(HEADER), and NULL (I think...)

Issuing \copy with the equivalent settings should result in the same
output...but the two mechanisms are distinct for better and worse.

David J.


Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Andrew Dunstan
On Wed, Mar 28, 2018 at 5:30 PM, Andres Freund  wrote:
> Hi,
>
> On 2018-03-26 09:02:10 +1030, Andrew Dunstan wrote:
>> Thanks for this, all looks good. Here is the consolidate patch
>> rebased. If there are no further comments I propose to commit this in
>> a few days time.
>
> Here's a bit of post commit review:
>
> @@ -1310,13 +1679,11 @@ slot_getsomeattrs(TupleTableSlot *slot, int attnum)
>
> /*
>  * If tuple doesn't have all the atts indicated by tupleDesc, read the
> -* rest as null
> +* rest as NULLs or missing values
>  */
> -   for (; attno < attnum; attno++)
> -   {
> -   slot->tts_values[attno] = (Datum) 0;
> -   slot->tts_isnull[attno] = true;
> -   }
> +   if (attno < attnum)
> +   slot_getmissingattrs(slot, attno, attnum);
> +
> slot->tts_nvalid = attnum;
>  }
>
> It's worthwhile to note that this'll re-process *all* missing values,
> even if they've already been deformed. I.e. if
> slot_getmissingattrs(.., first-missing)
> slot_getmissingattrs(.., first-missing + 1)
> slot_getmissingattrs(.., first-missing + 2)
> is called, all three missing values will be copied every time. That's
> because tts_nvalid isn't taken into account.  I wonder if slot_getmissingattrs
> could take tts_nvalid into account?
>
> I also wonder if this doesn't deserve an unlikely(), to avoid the cost
> of spilling registers in the hot branch of not missing any values.
>
>


One of us at least is very confused about this function.
slot_getmissingattrs() gets called at most once by slot_getsomeattrs
and never for any attribute that's already been deformed.
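For reference, the separate unlikely() suggestion from the review above
would be a one-line change, sketched here (unlikely() is the existing
branch-prediction hint from c.h):

	/* hint that the all-attributes-deformed fast path is the common case */
	if (unlikely(attno < attnum))
		slot_getmissingattrs(slot, attno, attnum);

	slot->tts_nvalid = attnum;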



> +   else
> +   {
> +   /* if there is a missing values array we must process them 
> one by one */
> +   for (missattnum = lastAttNum - 1;
> +missattnum >= startAttNum;
> +missattnum--)
> +   {
> +   slot->tts_values[missattnum] = 
> attrmiss[missattnum].ammissing;
> +   slot->tts_isnull[missattnum] =
> +   !attrmiss[missattnum].ammissingPresent;
> +   }
> +   }
> +}
>
> Why is this done backwards? It's noticeably slower to walk arrays
> backwards on some CPU microarchitectures.
>
>

I'll change it.
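Presumably along these lines, walking the array front to back (a sketch
derived from the quoted code):

	/* fill in values from the missing-values array, front to back */
	for (missattnum = startAttNum; missattnum < lastAttNum; missattnum++)
	{
		slot->tts_values[missattnum] = attrmiss[missattnum].ammissing;
		slot->tts_isnull[missattnum] =
			!attrmiss[missattnum].ammissingPresent;
	}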


>
> @@ -176,6 +179,23 @@ CreateTupleDescCopyConstr(TupleDesc tupdesc)
> }
> }
>
>
>
> @@ -469,6 +503,29 @@ equalTupleDescs(TupleDesc tupdesc1, TupleDesc tupdesc2)
> if (strcmp(defval1->adbin, defval2->adbin) != 0)
> return false;
> }
> +   if (constr1->missing)
> +   {
> +   if (!constr2->missing)
> +   return false;
> +   for (i = 0; i < tupdesc1->natts; i++)
> +   {
> +   AttrMissing *missval1 = constr1->missing + i;
> +   AttrMissing *missval2 = constr2->missing + i;
>
> It's a bit odd to not use array indexing here?
>


*shrug* Maybe. I'll change it if you like, doesn't seem that important or odd.
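With array indexing the loop would read as below (a sketch; the comparison
body is assumed from the fields used elsewhere in the patch):

	for (i = 0; i < tupdesc1->natts; i++)
	{
		AttrMissing *missval1 = &constr1->missing[i];
		AttrMissing *missval2 = &constr2->missing[i];

		if (missval1->ammissingPresent != missval2->ammissingPresent)
			return false;
	}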

>
> @@ -625,7 +625,15 @@ ExecCopySlotMinimalTuple(TupleTableSlot *slot)
> if (slot->tts_mintuple)
> return heap_copy_minimal_tuple(slot->tts_mintuple);
> if (slot->tts_tuple)
> -   return minimal_tuple_from_heap_tuple(slot->tts_tuple);
> +   {
> +   if (TTS_HAS_PHYSICAL_TUPLE(slot) &&
> +   HeapTupleHeaderGetNatts(slot->tts_tuple->t_data)
> +   < slot->tts_tupleDescriptor->natts)
> +   return minimal_expand_tuple(slot->tts_tuple,
> + 
>   slot->tts_tupleDescriptor);
> +   else
> +   return minimal_tuple_from_heap_tuple(slot->tts_tuple);
> +   }
>
>
> What's the TTS_HAS_PHYSICAL_TUPLE doing here? How can this be relevant
> given the previous tts_mintuple check? Am I missing something?
>
>

Hmm, that dates back to the original patch. My bad, I should have
picked that up. I'll just remove it.
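With that check gone, the block reduces to the following (a sketch of the
quoted diff minus TTS_HAS_PHYSICAL_TUPLE):

	if (slot->tts_tuple)
	{
		if (HeapTupleHeaderGetNatts(slot->tts_tuple->t_data) <
			slot->tts_tupleDescriptor->natts)
			return minimal_expand_tuple(slot->tts_tuple,
										slot->tts_tupleDescriptor);
		else
			return minimal_tuple_from_heap_tuple(slot->tts_tuple);
	}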


>
> @@ -563,10 +569,63 @@ RelationBuildTupleDesc(Relation relation)
> 
> MemoryContextAllocZero(CacheMemoryContext,
>   
>  relation->rd_rel->relnatts *
>   
>  sizeof(AttrDefault));
> -   attrdef[ndef].adnum = attp->attnum;
> +   attrdef[ndef].adnum = attnum;
> attrdef[ndef].adbin = NULL;
> +
> ndef++;
> }
> +
> +   /* Likewise for a missing value */
> +   if 

Re: Using base backup exclusion filters to reduce data transferred with pg_rewind

2018-03-28 Thread Michael Paquier
On Thu, Mar 29, 2018 at 05:01:40AM +0900, Fujii Masao wrote:
> Committed. Thanks!

Thanks for including as well the documentation changes and committing
it.  The result looks good to me.
--
Michael




Re: Proposal: http2 wire format

2018-03-28 Thread Peter Eisentraut
On 3/28/18 12:09, Andres Freund wrote:
> Yea, not the most descriptive... Returning multiple different resultsets
> from a function / procedure. Inability to do so is a serious limitation
> of postgres in comparison to some other language with procedures.

This is already possible as far as the protocol is concerned.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Peter Geoghegan
On Wed, Mar 28, 2018 at 5:04 AM, Pavan Deolasee
 wrote:
> Hmm Ok. It still sounds backwards to me, but then English is not my first or
> even second language. "A heap scan later verifies the presence in the heap
> of all index tuples fingerprinted" sounds as if we expect to find all
> fingerprinted index tuples in the heap. But in reality, we check if all heap
> tuples have a fingerprinted index tuple.

Why don't we leave this to the committer that picks the patch up? I
don't actually mind very much. I suspect that I am too close to the
problem to be sure that I've explained it in the clearest possible
way.

>> You're right - there is a narrow window for REPEATABLE READ and
>> SERIALIZABLE transactions. This is a regression in v6, the version
>> removed the TransactionXmin test.
>>
>> I am tempted to fix this by calling GetLatestSnapshot() instead of
>> GetTransactionSnapshot(). However, that has a problem of its own -- it
>> won't work in parallel mode, and we're dealing with parallel
>> restricted functions, not parallel unsafe functions. I don't want to
>> change that to fix such a narrow issue. IMV, a better fix is to treat
>> this as a serialization failure. Attached patch, which applies on top
>> of v7, shows what I mean.
>
>
> Ok. I am happy with that.

Cool.

> Well pg_index entry will be visible and should be visible. Otherwise how do
> you even maintain a newly created index. I am not sure, but I guess we take
> fresh MVCC snapshots for catalog searches, even in RR transactions.

True, but that isn't what would happen with an SQL query that queries
the system catalogs. That only applies to how the system catalogs are
accessed internally, not how they'd almost certainly be accessed when
using amcheck.

>> > Are there any interesting cases around
>> > INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
>> > tuples, especially if those tuples were inserted/deleted by our own
>> > transaction? It probably worth thinking.
>
>
> Anything here that you would like to check? I understand that you may see
> such tuples only for catalog tables.

I think that the WARNING ought to be fine. We shouldn't ever see it,
but if we do it's probably due to a bug in an extension or something.
It doesn't seem particularly likely that someone could ever insert
into the table concurrently despite our having a ShareLock on the
relation. Even with corruption.

>> IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
>> ShareUpdateExclusiveLock (it just says SharedLock), because that lock
>> strength doesn't have anything to do with IndexBuildHeapRangeScan()
>> when it operates with an MVCC snapshot. I think that this means that
>> this patch doesn't need to update comments within
>> IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
>> independent.
>
>
> Ok, I agree. But note that we are now invoking that code with
> AccessShareLock() whereas the existing callers either use ShareLock or
> ShareUpdateExclusiveLock. That probably does not matter, but it's a change
> worth noting.

Fair point, even though the ShareUpdateExclusiveLock case isn't
actually acknowledged by IndexBuildHeapRangeScan(). Can we leave this
one up to the committer, too? I find it very hard to argue either for
or against this, and I want to avoid "analysis paralysis" at this
important time.

> While that's true, I am wondering if there is anything we can do to stop
> a consistency-checking utility from creating non-trivial side effects on
> the state of the database, even if that means we cannot check all heap
> tuples. For example, can there be a way by which we allow concurrent vacuum
> or HOT prune to continue to prune away dead tuples, even if that means
> running bt_check_index() in some specialised way (such as not allowing it
> in a transaction block) and the heap scan might miss some tuples? I don't
> know the answer to that, but given that sometimes bt_check_index() may take
> hours if not days to finish, it seems worth thinking about or at least
> documenting the side-effects somewhere.

That seems like a worthwhile goal for a heap checker that only checks
the structure of the heap, rather than something that checks the
consistency of an index against its table. Especially one that can run
without a connection to the database, for things like backup tools,
where performance is really important. There is certainly room for
that. For this particular enhancement, the similarity to CREATE INDEX
seems like an asset.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Peter Geoghegan
On Wed, Mar 28, 2018 at 5:47 AM, Pavan Deolasee
 wrote:
> Mostly a nitpick, but I guess we should leave a comment after
> IndexBuildHeapScan() saying heap_endscan() is not necessary since
> IndexBuildHeapScan() does that internally. I stumbled upon that while
> looking for potential leaks. I know at least one other caller of
> IndexBuildHeapScan() doesn't bother to say anything either, but a
> comment would be helpful.

Fair point. Again, I'm going to suggest deferring to the committer. I
seem to have decision fatigue this week.
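
For illustration, the comment could look something like this (a sketch
only; the variable names here are placeholders rather than what the
patch actually uses):

    IndexBuildHeapScan(heaprel, indrel, indexinfo, true,
                       bt_tuple_present_callback, (void *) state);

    /*
     * No need for heap_endscan() here; IndexBuildHeapScan() ends the
     * scan it starts internally.
     */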

> FWIW I also looked at the 0001 patch and it looks fine to me.

I'm grateful that you didn't feel any need to encourage me to use
whatever the novel/variant filter du jour is! :-)

-- 
Peter Geoghegan



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Michael Paquier
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> Craig Ringer  writes:
>> TL;DR: Pg should PANIC on fsync() EIO return.
> 
> Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
--
Michael


signature.asc
Description: PGP signature


Re: Rangejoin rebased

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 14:18:42 -0400, David Steele wrote:
> It seems that a new patch is needed here but none has been presented.
> I've marked this Waiting on Author for the moment, but I really think it
> should be marked Returned with Feedback and submitted to the next CF
> when a new patch is ready.

I'd just do so now. There's not been any progress for months, and
there was an update request weeks ago...

Greetings,

Andres Freund



Re: Trigger file behavior with the standby

2018-03-28 Thread Keiko Oda
Thanks a lot for the answer, Michael (and sorry for the slow response)!

So, if I understand what you're saying correctly, I'm seeing this behavior
because wal-e keeps fetching WAL files from S3 regardless of the
trigger_file, and these fetched WAL files are in pg_wal (or pg_xlog), so
Postgres just tries to restore whatever is available in pg_wal
before the failover. Or, even if there is no file in pg_wal, it still tries
to fetch from the "archive" (S3).
In other words, if I would like to do an "immediate failover" (and do not
care about WAL files available in the archive or in pg_wal), I should be
tweaking restore_command so that no further fetching/restoring happens.
Is that... accurate?

Thanks,
Keiko

On Mon, Mar 19, 2018 at 9:28 PM, Michael Paquier 
wrote:

> On Mon, Mar 19, 2018 at 01:27:21PM -0700, Keiko Oda wrote:
> > I'm seeing the following behavior with a trigger file which is very
> > confusing to me; I'd like to get some advice about the expected
> > behavior of the trigger file with the standby.
>
> This portion from the docs includes your answer:
> https://www.postgresql.org/docs/devel/static/warm-standby.html#STANDBY-SERVER-OPERATION
> "Standby mode is exited and the server switches to normal operation when
> pg_ctl promote is run or a trigger file is found (trigger_file). Before
> failover, any WAL immediately available in the archive or in pg_wal will
> be restored, but no attempt is made to connect to the master."
>
> So when creating a trigger file or signaling for promotion, any WAL
> files available are first fetched, and then promotion happens.  In your
> case all the WAL segments from the archives are retrieved first.
> --
> Michael
>


Re: JIT compiling with LLVM v12

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 06:24:53PM -0400, David Steele wrote:
> On 3/28/18 6:09 PM, Andres Freund wrote:
>> Hah! Happy to, if there's enough people interested.  I've a talk about
>> it too (state of jit, 2018 edition), but I wasn't planning to go into
>> too low level details. More about what is good, what is bad, and how we
>> make it better ;)
> 
> +1 for an unconference session.  This is some seriously cool stuff.

Take room for two sessions then, with a break in between to give people
enough time to recover from the damage of the first session :)

Jokes aside, an unconference session at PGCon would be great.
--
Michael


signature.asc
Description: PGP signature


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 08:50:21PM -0400, Robert Haas wrote:
> On Tue, Mar 27, 2018 at 11:53 PM, Pavan Deolasee  
> wrote:
>> Yeah, we should not do that. The patch surely does not intend to replay any
>> more WAL than what we do today. The idea is to just use a different
>> mechanism to find the prior checkpoint. But we should surely find the latest
>> prior checkpoint. In the rare scenario that Tom showed, we should just throw
>> an error and fix the patch if it's not doing that already.
> 
> It's not clear to me that there is any reasonable fix short of giving
> up on this approach altogether.  But I might be missing something.

One way to get things done right is to scan segments recursively instead
of records.  Not impossible, but this introduces more code churn in
pg_rewind's findLastCheckpoint, which is now dead simple as it just needs
to look at things backwards.  This way you can know both what the real
last checkpoint is and the point where the WAL has forked.  So for
multiple checkpoints in the same segment, the lookup just needs to happen
up to the point where the WAL has forked, while keeping track of the last
checkpoint record found.
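
In rough pseudocode, the lookup could be shaped like this (an
illustrative sketch; scan_segment_for_last_checkpoint is a made-up
helper, not an existing routine):

    XLogSegNo   segno = fork_segno;

    for (;;)
    {
        /*
         * Scan this segment forward, up to the fork point, remembering
         * the newest checkpoint record found in it.
         */
        XLogRecPtr  ckpt = scan_segment_for_last_checkpoint(segno, forkptr);

        if (!XLogRecPtrIsInvalid(ckpt))
            return ckpt;        /* newest checkpoint before the fork */
        if (segno == 0)
            break;              /* ran out of WAL without a checkpoint */
        segno--;                /* step back one segment and retry */
    }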

If we would want to keep compatibility with backward searches, it may be
nice to keep a mostly-compatible API layer in xlogreader.c.  Now I have
honestly not thought about that much.  And there may be other tools
which rely on being able to do backward searches.  Breaking
compatibility for them could cause wrath, because that would mean less
availability when running a rewind-like operation.

This thread makes me wonder as well if we are overlooking other
things...
--
Michael


signature.asc
Description: PGP signature


Re: Flexible permissions for REFRESH MATERIALIZED VIEW

2018-03-28 Thread Isaac Morland
Thanks for taking the time to look at this. I think I was unclear in a
couple of places, so my proposal may have appeared worse than it is.
Details below:

On 18 March 2018 at 20:25, Tom Lane  wrote:

> Isaac Morland  writes:
> > The original idea was to allow access to REFRESH MATERIALIZED VIEW to be a
> > grantable permission, rather than being reserved to the table owner.
>
> I'm not really on board with making that a separately grantable
> permission.  You can do what you need today by having the matview be owned
> by a single-purpose group role and granting membership in that role as
> needed.  We do not have an infinite supply of available privilege type
> bits --- in fact, without breaking on-disk compatibility, there are only
> four free bits in AclMode.  So I can't see using one of them up for
> REFRESH, let alone using three of them up for REFRESH, CLUSTER and
> REINDEX.


I didn't mean to suggest introducing 3 new permissions, just one, which
would allow all three operations (clustering and reindexing for tables,
including matviews, and refreshing matviews). So that still leaves 3
remaining bits for future use. I recognize that using up the limited supply
of privilege bits is a legitimate concern. However, I think of "refresh" as
a general concept that may apply differently to different objects. A future
permissions proposal may well be able to use it, similar to how USAGE was
re-used when it was decided to have a permission on types. So we wouldn't
be fully using up even that bit: it would clearly be gone as far as
table-related permissions go, but it could still be used for non-table
objects, provided the operation can reasonably be described by the word
REFRESH.

One question I would have is: what proposals exist or have existed for
additional privilege bits? How much pressure is there to use some of the
remaining bits? I actually looked into the history of the permission bits
and found that we can summarize and approximate the history as 10 years of
expansion from 4 to 12, then nothing added in the last 10 years.

1996-07-09 - first git commit with append/read/write/rule - total 4
2001-05-27 - split write into update and delete, rename append to insert
and read to select, add references and trigger - total 7
2002-04-21 - add execute, usage, create, temp - total 11 - expand AclMode
from uint8 to uint32
2004-01-14 - add connect - total 12
2006-09-05 - remove rule, leaving gap - total 11
2008-09-08 - add truncate - total 12

If we do end up running out, how hard would it be to change AclItem to have
2 uint64 fields, one for the actual permissions and one for the grant
permissions? I haven't done a thorough study, but looking at the various
macros defined in utils/acl.h and where they are used, it looks to me like
it might not be too bad (e.g. the only direct references to
AclItem.ai_privs are in acl.c and acl.h). I'm concerned about changing the
disk format, however - I don't know enough to say how painful that would be.
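
For illustration, I'm imagining something along these lines (purely
hypothetical, not a worked-out proposal):

typedef struct AclItem
{
    Oid         ai_grantee;     /* ID that this item grants privs to */
    Oid         ai_grantor;     /* grantor of privs */
    uint64      ai_privs;       /* one bit per privilege */
    uint64      ai_grantopts;   /* one bit per grant option */
} AclItem;

Today ai_privs is a single AclMode (uint32) with the grant-option bits
packed into its upper half, which is where the 16-privilege ceiling
comes from.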

Also consider that these are the only non-DDL table-related actions that
are not controlled by permission bits but instead can only be done by owner
(at least I think they are - am I forgetting something?). Filling in that
gap feels to me like a reasonable thing to want to do.

> > The patch so far uses TRUNCATE permission to control access as a
> > proof-of-concept.
>
> I can't see doing that either, TBH.  It's just ugly, and it surely doesn't
> scale to cases where the conflated operations all apply to the same kind
> of object.  (For instance, including CLUSTER permissions in TRUNCATE
> wouldn't be very sane.)
>

My original idea was to say that some combination of existing privileges
would grant the ability to refresh a materialized view. But I came to
prefer introducing a new privilege, especially once I realized it would be
sensible to include clustering and reindexing. The only reason I used an
existing privilege for this initial trivial version of the patch was to
check that the concept actually works without going to the effort of
actually writing in a new privilege. So I agree that using TRUNCATE in
particular and any existing set of privileges more generally would not be
my preference.


> It's conceivable that we could somehow allow new bits to get overlaid
> onto bits currently used only for other object types, without that being
> user-visible.  But I believe there are significant implementation issues
> that'd have to be surmounted; for instance aclitemout has no idea what
> sort of object the ACL it's printing is for.  Also, the ACL machinery
> doesn't currently think that "matview" is a distinct type of object
> from "table" anyway; it just sees 'em all as "relation".
>

This sounds harder than changing AclItem!

> Or, maybe I'm wrong and people feel that REFRESH-by-non-owner is important
> enough to be worth consuming an AclMode bit for.  But I'm dubious.
>
> > I think backward compatibility is pretty good.
>
> 

Re: Flexible permissions for REFRESH MATERIALIZED VIEW

2018-03-28 Thread David G. Johnston
On Wed, Mar 28, 2018 at 6:38 PM, Isaac Morland 
wrote:

>
> One question I would have is: what proposals exist or have existed for
> additional privilege bits? How much pressure is there to use some of the
> remaining bits? I actually looked into the history of the permission bits
> and found that we can summarize and approximate the history as 10 years of
> expansion from 4 to 12, then nothing added in the last 10 years.
>

I made an argument for an "ANALYZE" grant a little while back, and it
kinda leads one to want one for VACUUM as well.

https://www.postgresql.org/message-id/CAKFQuwZ6dhjTFV7Bwmehe1N3%3Dk484y4mM22zuYjVEU2dq9V1aQ%40mail.gmail.com

David J.


Re: Parallel Aggregates for string_agg and array_agg

2018-03-28 Thread David Rowley
On 29 March 2018 at 03:05, Tomas Vondra  wrote:
> On 03/28/2018 03:54 PM, Tom Lane wrote:
>> I had in mind to look at exprType() of the argument.
>
> Right, I'm fine with that.

Attached is v9, which is based on Tom's v8 but includes the new tests
and I think the required fix to disable use of the serial/deserial
function for array_agg().

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


combinefn_for_string_and_array_aggs_v9.patch
Description: Binary data


Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Tom Lane
Andrew Dunstan  writes:
> On Wed, Mar 28, 2018 at 5:30 PM, Andres Freund  wrote:
>> +   /*
>> +	 * Missing value for added columns. This is a one element array which
>> +	 * lets us store a value of the attribute type here.
>> +	 */
>> +	anyarray	attmissingval BKI_DEFAULT(_null_);
>> #endif
>> } FormData_pg_attribute;
>> 
>> Still think this is a bad location, and it'll reduce cache hit ratio for
>> catalog lookups.

> As I think I mentioned before, this was discussed previously and as I
> understood it this was the consensus location for it.

I don't have a problem with putting that in pg_attribute (and I certainly
agree with not putting it in pg_attrdef).  But "anyarray" seems like a
damn strange, and bulky, choice.  Why not just make it a bytea holding the
bits for the value, nothing more?

regards, tom lane



Re: Trigger file behavior with the standby

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 12:23:31PM -0700, Keiko Oda wrote:
> Thanks a lot for the answer, Michael (and sorry for the slow response)!

No problem.

> So, if I understand what you're saying correctly, I'm seeing this behavior
> because wal-e keeps fetching WAL files from S3 regardless of the
> trigger_file, and these fetched WAL files are in pg_wal (or pg_xlog), so
> Postgres just tries to restore whatever is available in pg_wal
> before the failover. Or, even if there is no file in pg_wal, it still tries
> to fetch from the "archive" (S3).
> In other words, if I would like to do an "immediate failover" (and do not
> care about WAL files available in the archive or in pg_wal), I should be
> tweaking restore_command so that no further fetching/restoring happens.
> Is that... accurate?

Per the code and the documentation, the current behavior is clearly
intentional.  If you think about it, it can be relatively important
especially in the case of a base backup taken without WAL segments in
pg_wal while relying on a separate archive: this gives more guarantees
that the consistent point will be reached.  That also covers a bit what
people can look for in some cases with recovery_target = 'immediate'.

You could indeed tweak the restore command.  If a failure happens while
attempting to fetch a WAL segment, then the recovery would immediately
stop.  If you try to trigger a promotion without reaching a consistency
point, then you would get a complaint from the startup process.  There
are some safeguards for this purpose.

Please don't take me wrong.  There is room for a feature which does more
efficiently what you are looking for, but that would be a separate
problem.

(Sakura season here by the way, they are blooming this week)
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary

2018-03-28 Thread Michael Paquier
On Thu, Mar 29, 2018 at 04:04:58AM +0900, Fujii Masao wrote:
> Pushed. Thanks!

Nice, thanks.
--
Michael


signature.asc
Description: PGP signature


Re: Small proposal to improve out-of-memory messages

2018-03-28 Thread Craig Ringer
On 29 March 2018 at 09:07, Tom Lane  wrote:

> I was looking at
> https://www.postgresql.org/message-id/CAG_=8kAYKjhQX3FmAWQBC95Evh3+qszoqxknmm1q4w1qo7+...@mail.gmail.com
>
> in which the most salient issue is
>
> > 2018-03-28 19:20:33.264 UTC [10580] cory@match ERROR:  out of memory
> > 2018-03-28 19:20:33.264 UTC [10580] cory@match DETAIL:  Failed on
> > request of size 1610612736.
>
> and it suddenly struck me that this'd be noticeably more useful, at least
> for experts, if the errdetail included the name of the memory context
> we were trying to allocate in.  In this case it'd be nice to know right
> off the bat whether the failure occurred in MessageContext, which looked
> bloated already, or someplace else.
>
> In the wake of commit 442accc3f one might think that the message should
> also include the context "identifier" if any.  But I'm specifically not
> proposing that, because of the point made in the discussion of that patch
> that some identifier strings might contain security-sensitive query
> excerpts.  Now that that commit has required all context "names" to be
> compile-time constants, it's hard to see how there could be any security
> downside to showing them in a user-visible message.
>
> Of course, by the same token, maybe this change wouldn't be as useful
> as I'm thinking.  But I can think of past cases where being able to
> distinguish, say, allocation failures in a query's global ExecutorState
> versus ones in an ExprState would save some effort.
>
>
This would have been significantly useful to me in the past, so +1 from me.
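
Just to illustrate, I'd picture the detail line growing into something
like this (a sketch only; the exact wording is of course up for
discussion):

    errdetail("Failed on request of size %zu in memory context \"%s\".",
              size, context->name);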

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-03-28 Thread Masahiko Sawada
On Thu, Mar 29, 2018 at 2:27 AM, David Steele  wrote:
> On 2/27/18 2:21 AM, Masahiko Sawada wrote:
>>
>> Hmm, although I've thought about the concern in the case where we don't
>> consider local xids of unresolved fdwxacts in GetOldestXmin, I could not
>> find a problem. Could you share your concern if you have one? I'll try
>> to find a possible scenario based on it.
>
> It appears that this entry should be marked Waiting on Author so I have
> done that.
>
> I also think it may be time to move this patch to the next CF.
>

I agree to move this patch to the next CF.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: JIT compiling with LLVM v12

2018-03-28 Thread Amit Langote
On 2018/03/29 9:35, Michael Paquier wrote:
> On Wed, Mar 28, 2018 at 06:24:53PM -0400, David Steele wrote:
>> On 3/28/18 6:09 PM, Andres Freund wrote:
>>> Hah! Happy to, if there's enough people interested.  I've a talk about
>>> it too (state of jit, 2018 edition), but I wasn't planning to go into
>>> too low level details. More about what is good, what is bad, and how we
>>> make it better ;)
>>
>> +1 for an unconference session.  This is some seriously cool stuff.
> 
> Take room for two sessions then, with a break in between to give people
> enough time to recover from the damage of the first session :)
> 
> Jokes aside, an unconference session at PGCon would be great.

+1

Thanks,
Amit




Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Andrew Dunstan
On Thu, Mar 29, 2018 at 10:19 AM, Tom Lane  wrote:
> Andrew Dunstan  writes:
>> On Wed, Mar 28, 2018 at 5:30 PM, Andres Freund  wrote:
>>> +   /*
>>> +	 * Missing value for added columns. This is a one element array which
>>> +	 * lets us store a value of the attribute type here.
>>> +	 */
>>> +	anyarray	attmissingval BKI_DEFAULT(_null_);
>>> #endif
>>> } FormData_pg_attribute;
>>>
>>> Still think this is a bad location, and it'll reduce cache hit ratio for
>>> catalog lookups.
>
>> As I think I mentioned before, this was discussed previously and as I
>> understood it this was the consensus location for it.
>
> I don't have a problem with putting that in pg_attribute (and I certainly
> agree with not putting it in pg_attrdef).  But "anyarray" seems like a
> damn strange, and bulky, choice.  Why not just make it a bytea holding the
> bits for the value, nothing more?
>


That's what we started with, and Andres suggested we change it. One
virtue of the array is that it displays nicely when examining
pg_attribute. But changing it back would be easy enough.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-03-28 Thread Thomas Munro
On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier  wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
>> Craig Ringer  writes:
>>> TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

"Current kernels might report a writeback error on an fsync() call,
but there are a number of ways in which that can fail to happen."

That's... I'm speechless.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-28 Thread Tom Lane
Claudio Freire  writes:
> v10 counted the number of blocks with updated free space, so as to vacuum
> the FSM only after a lot of changes to it were made. This will vacuum the
> FSM after *scanning* a lot of pages, even if few modifications were
> made to them.

Yes, that's exactly the point.  We still need to update the upper pages of
the FSM, regardless of exactly how many pages were changed by VACUUM
underneath.  The pages VACUUM didn't change might still be out of date in
the upper pages, since retail updates from other operations don't
immediately propagate; so there's a fixed amount of work that needs to be
done there and we might as well do it in more-or-less-predictable quanta.
I don't see any point in adding counting overhead/complexity for that.
In fact, I seriously considered just making it update after every
VACUUM_FSM_EVERY_PAGES period, but decided that for the case with indexes,
doing it right after each cleanup cycle was sensible, and then we might as
well make the no-index case look as much like that as we conveniently can.
The no-index case seems vestigial anyway; how often is that really going
to apply?  So spending a lot of complexity on it doesn't seem warranted,
especially when the argument for more complexity is at best dubious.

> This doesn't seem correct.

[ thinks for a bit... ]  Yeah, you're right, we need to round up not down
when determining the last slot to scan on the upper level.

I wonder how hard it is to avoid rounding up when the stop point actually
does fall right at a FSM page boundary.  OTOH, that might not happen often
enough to be worth sweating over.
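
In sketch form the fix is just ceiling division when mapping the stop
point onto upper-level slots, along these lines (illustrative, not
necessarily the literal code):

    /* last upper-level slot to scan: round up, not down */
    last_slot = (end_block + SlotsPerFSMPage - 1) / SlotsPerFSMPage;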

regards, tom lane



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-03-28 Thread Dean Rasheed
On 28 March 2018 at 15:50, Tomas Vondra  wrote:
> After thinking about this a bit more, I'm not sure if updating the info
> based on recursive calls makes sense. The fullmatch flag was supposed to
> answer a simple question - can there be just a single matching item?
>
> If there are equality conditions on all columns, there can be just a
> single matching item - if we have found it in the MCV (i.e. s1 > 0.0),
> then we don't need to inspect the non-MCV part.
>
> But handling this in recursive manner breaks this entirely, because with
> something like
>
>(a=1) AND (b=1 OR b=2)
>
> you suddenly can have multiple matching items. Which makes the fullmatch
> flag somewhat useless.
>
> So I think we should be looking at top-level equality clauses only, just
> like number_of_groups() does.
>

I'm not quite sure what you mean by that, but it sounds a bit limiting
in terms of the kinds of user queries that would be supported.


> I think we can remove the fullmatch flag from mcv_update_bitmap
> entirely. All we need to know is the presence of equality clauses and
> whether there was a match in MCV (which we know from s1 > 0.0).
>

I agree with removing the fullmatch flag, but I don't think we
actually need to know about the presence of equality clauses:

The way that mcv_update_bitmap() recursively computes the set of
matching MCVs seems to be correct. That gives us a value (call it
mcv_matchsel) for the proportion of the table's rows that are in the
MCV list and satisfy the clauses in stat_clauses.

We can also estimate that there are (1-mcv_totalsel)*N rows that are
not in the MCV list, for which the MCV stats therefore tell us
nothing. The best way to estimate those rows would seem to be to use
the logic from the guts of clauselist_selectivity(), without
consulting any extended MCV stats (but still consulting other extended
stats, I think). Doing that would return a selectivity value (call it
nonmcv_sel) for those remaining rows. Then a reasonable estimate for
the overall selectivity would seem to be

  mcv_matchsel + (1-mcv_totalsel) * nonmcv_sel

and there would be no need for mcv_update_bitmap() to track eqmatches
or return fullmatch, and it wouldn't actually matter whether or not we
had equality clauses or if all the MCV columns were used.
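
As a worked example (with invented numbers): if the MCV list covers
mcv_totalsel = 0.8 of the table, the clauses match MCV items totalling
mcv_matchsel = 0.10, and plain clauselist_selectivity() gives
nonmcv_sel = 0.05 for the remaining rows, the overall estimate would be
0.10 + (1 - 0.8) * 0.05 = 0.11.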

Regards,
Dean



Re: Fix a typo in walsender.c

2018-03-28 Thread atorikoshi

On 2018/03/29 7:23, Bruce Momjian wrote:
> Thanks, patch applied.

Thank you for committing!





Re: [HACKERS] Pluggable storage

2018-03-28 Thread Michael Paquier
On Wed, Mar 28, 2018 at 12:23:56PM -0400, David Steele wrote:
> I think this entry should be moved the the next CF.  I'll do that
> tomorrow unless there are objections.

Instead of moving things to the next CF by default, perhaps it would
make more sense to mark things as returned with feedback, as this is the
last CF?  There is a 5-month gap between this commit fest and the next
one, I am getting afraid of flooding the beginning of v12 development
cycle with entries which keep rotting over time.  If the author(s) claim
that they will be able to work on it, then that's of course fine.

Sorry for the digression; patches ignored across CFs contribute to the
bloat we see, and they eat the CFM's time.
--
Michael


signature.asc
Description: PGP signature


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-28 Thread Robert Haas
On Tue, Mar 27, 2018 at 11:53 PM, Pavan Deolasee
 wrote:
> Yeah, we should not do that. The patch surely does not intend to replay any
> more WAL than what we do today. The idea is to just use a different
> mechanism to find the prior checkpoint. But we should surely find the latest
> prior checkpoint. In the rare scenario that Tom showed, we should just throw
> an error and fix the patch if it's not doing that already.

It's not clear to me that there is any reasonable fix short of giving
up on this approach altogether.  But I might be missing something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ALTER TABLE ADD COLUMN fast default

2018-03-28 Thread Haribabu Kommi
On Mon, Mar 26, 2018 at 9:32 AM, Andrew Dunstan  wrote:
>
>
> Thanks for this, all looks good. Here is the consolidated patch
> rebased. If there are no further comments I propose to commit this in
> a few days time.


I have some comments on the committed patch.

@@ -663,7 +671,23 @@ ExecFetchSlotTuple(TupleTableSlot *slot)
 	 * If we have a regular physical tuple then just return it.
 	 */
 	if (TTS_HAS_PHYSICAL_TUPLE(slot))
-		return slot->tts_tuple;
+	{
+		if (HeapTupleHeaderGetNatts(slot->tts_tuple->t_data) <
+			slot->tts_tupleDescriptor->natts)
+		{
+			MemoryContext oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
+
+			slot->tts_tuple = heap_expand_tuple(slot->tts_tuple,
+												slot->tts_tupleDescriptor);
+			slot->tts_shouldFree = true;
+			MemoryContextSwitchTo(oldContext);
+			return slot->tts_tuple;
+		}
+		else
+		{
+			return slot->tts_tuple;
+		}
+	}

In the above scenario, directly replacing slot->tts_tuple without freeing
the existing tuple will unnecessarily increase the slot context memory
size; this may lead to a problem if the same slot is used for many tuples.
It would be better to use ExecStoreTuple() to update the slot tuple.
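
For example, something like this (just a sketch):

    HeapTuple   newtuple;
    MemoryContext oldContext = MemoryContextSwitchTo(slot->tts_mcxt);

    newtuple = heap_expand_tuple(slot->tts_tuple,
                                 slot->tts_tupleDescriptor);
    MemoryContextSwitchTo(oldContext);

    /* frees the previous copy (if tts_shouldFree) instead of leaking it */
    ExecStoreTuple(newtuple, slot, InvalidBuffer, true);
    return slot->tts_tuple;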

Regards,
Hari Babu
Fujitsu Australia


Re: Small proposal to improve out-of-memory messages

2018-03-28 Thread Michael Paquier
On Thu, Mar 29, 2018 at 09:29:59AM +0800, Craig Ringer wrote:
> This would have been significantly useful to me in the past, so +1 from me.

As long as that does not cost more memory, +1.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2018-03-28 Thread Dmitry Dolgov
> On 22 March 2018 at 14:18, Ashutosh Bapat  
> wrote:
> On Thu, Mar 22, 2018 at 4:32 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>> On 12 March 2018 at 06:00, Ashutosh Bapat  
>>> wrote:
>>> Thanks for the note. Here are rebased patches.
>>
>> Since I started to look at this patch, I can share a few random notes
>> (although it's not a complete review, I'm still working through it);
>> maybe they can be helpful.
>>
>> In `partition_range_bounds_merge`
>>
>> + if (!merged)
>> + break;
>>
>> is a bit redundant I think, because every time `merged` is set to false
>> it is followed by a break.
>
> Yes, right now. Maybe I should turn it into Assert(merged); what do you
> think?

Thank you for the reply. Yes, that sounds good. But actually you also
mentioned another topic that bothers me about this patch. Different parts
of the algorithm implementation (at least the functions that build maps of
matching partitions) are quite dependent on each other in terms of shared
state. At first glance, in `partition_range_bounds_merge` we have about a
dozen variables of different mutability levels that affect the control flow:

outer_lb_index
inner_lb_index
merged
merged_index
overlap
merged_lb
merged_ub
finished_outer
finished_inner
ub_cmpval
lb_cmpval
inner_has_default
outer_has_default
jointype

It looks like a bit too much to me, and would require comments like "if you
change the logic here, also take a look there". But I'm not saying that I
have any specific suggestions for how to change it (although I'll definitely
try to come up with some, at least to get a better understanding of the
underlying algorithm).

>>
>> I've noticed that in some places `IS_PARTITIONED_REL` was replaced
>>
>> - if (!IS_PARTITIONED_REL(joinrel))
>> + if (joinrel->part_scheme == NULL)
>>
>> but I don't quite follow why. Is it because `boundinfo` is not available
>> anymore at this point? If so, maybe it makes sense to update the comment
>> for this macro and mention not to use it for joinrels.
>
>
> This is done in try_partitionwise_join(). As explained in the comment
>
>  * Get the list of matching partitions to be joined along with the
>  * partition bounds of the join relation. Because of the restrictions
>  * imposed by partition matching algorithm, not every pair of joining
>  * relations for this join will be able to use partition-wise join. But 
> all
>  * those pairs which can use partition-wise join will produce the same
>  * partition bounds for the join relation.
>
> boundinfo for the join relation is built in this function. So, we
> don't have join relation's partitioning information fully set up yet.
> So we can't use IS_PARTITIONED_REL() there. joinrel->part_scheme if
> set tells that the joining relations have matching partition schemes
> and thus the join relation can possibly use partition-wise join
> technique. If it's not set, then we can't use partition-wise join.
>
> But IS_PARTITIONED_REL() is still useful at a number of other places,
> where it's known to encounter a RelOptInfo whose partitioning
> properties are fully setup. So, I don't think we should change the
> macro or the comments above it.

Just to make myself clear, I wasn't suggesting changing the comment for
`IS_PARTITIONED_REL` significantly, just adding a sentence saying that you
need to check whether the given relation is fully set up.

Also, a few more random notes (mostly related to readability, since I found
some parts of the patch hard to read, but of course that's arguable).

```
PartitionRangeBound outer_lb;
PartitionRangeBound outer_ub;
PartitionRangeBound inner_lb;
PartitionRangeBound inner_ub;
PartitionRangeBound *merged_lb = NULL;
PartitionRangeBound *merged_ub = NULL;
```

Maybe it would be better to not repeat the type here? Let's say:

```
PartitionRangeBound outer_lb,
outer_ub,
...
```

It's just too long and distracting.

```
partition_range_bounds_merge(int partnatts, FmgrInfo *partsupfuncs,
 Oid *partcollations, PartitionBoundInfo outer_bi,
 int outer_nparts, PartitionBoundInfo inner_bi,
 int inner_nparts, JoinType jointype,
 List **outer_parts, List **inner_parts)
```

From what I see in `partition.c` there are a lot of functions that accept
`partnatts` and `partcollations` only to pass them down to, e.g.,
`partition_rbound_cmp`.
What do you think about introducing a data structure to keep these arguments,
and pass only an instance of this structure instead?
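
Something like this, say (just a sketch to illustrate the idea; the name
is made up):

```
typedef struct PartitionKeyCmpInfo
{
    int       partnatts;      /* number of partition key columns */
    FmgrInfo *partsupfuncs;   /* comparison support functions */
    Oid      *partcollations; /* collations to compare with */
} PartitionKeyCmpInfo;
```

Then `partition_range_bounds_merge` and friends could take a single
`PartitionKeyCmpInfo *` instead of three separate parameters.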



Re: JIT compiling with LLVM v12

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 14:27:51 -0700, Andres Freund wrote:
> > 7/10 committed. Inlining, Explain, Docs remain.
> 
> I've pushed these three.

One tiny pending commit I have is to add a few pg_noinline annotations
to slow-path functions, to avoid very common spurious inlines. I'll play
a little bit more with the set that I think makes sense there, and will
send a separate email about that.

Greetings,

Andres Freund



Re: JIT compiling with LLVM v12

2018-03-28 Thread Andres Freund
Hi,

On 2018-03-28 18:06:24 -0400, Peter Eisentraut wrote:
> On 3/28/18 17:27, Andres Freund wrote:
> > I've pushed these three.
> 
> Great, now the only thing remaining is to prepare an unconference
> session explaining all this to the rest of us. ;-)

Hah! Happy to, if there's enough people interested.  I've a talk about
it too (state of jit, 2018 edition), but I wasn't planning to go into
too low level details. More about what is good, what is bad, and how we
make it better ;)

Greetings,

Andres Freund



Re: [PROPOSAL] Shared Ispell dictionaries

2018-03-28 Thread Arthur Zakirov
Here is the new version of the patch.

Now RemoveTSDictionaryById() and AlterTSDictionary() unpin the
dictionary DSM segment. So once all attached backends disconnect, the
allocated DSM segments will be released.

lookup_ts_dictionary_cache() may unpin the DSM mapping for all invalid
dictionary cache entries.

I added xmax to DictPointerData. It is used as a lookup key now too. It
helps to reload a dictionary after a rolled-back DROP command.

There was a bug in ts_dict_shmem_location(); I fixed it.
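
The unpinning itself boils down to a dsm_unpin_segment() call on the
stored handle, roughly like this (a sketch; the field name is
illustrative):

    /* drop our pin; the segment goes away once the last backend detaches */
    dsm_unpin_segment(entry->dict_dsm_handle);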

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index b9fdd77e19..e071994523 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -78,6 +78,8 @@
 #define tmpalloc(sz)  MemoryContextAlloc(Conf->buildCxt, (sz))
 #define tmpalloc0(sz)  MemoryContextAllocZero(Conf->buildCxt, (sz))
 
+#define tmpstrdup(str) MemoryContextStrdup(Conf->buildCxt, (str))
+
 /*
  * Prepare for constructing an ISpell dictionary.
  *
@@ -498,7 +500,7 @@ NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
 	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
 	strcpy(Conf->Spell[Conf->nspell]->word, word);
 	Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0')
-		? cpstrdup(Conf, flag) : VoidString;
+		? tmpstrdup(flag) : VoidString;
 	Conf->nspell++;
 }
 
@@ -1040,7 +1042,7 @@ setCompoundAffixFlagValue(IspellDict *Conf, CompoundAffixFlag *entry,
 		entry->flag.i = i;
 	}
 	else
-		entry->flag.s = cpstrdup(Conf, s);
+		entry->flag.s = tmpstrdup(s);
 
 	entry->flagMode = Conf->flagMode;
 	entry->value = val;
@@ -1536,6 +1538,9 @@ nextline:
return;
 
 isnewformat:
+   pfree(recoded);
+   pfree(pstr);
+
if (oldformat)
ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/contrib/dict_int/dict_int.c b/contrib/dict_int/dict_int.c
index 56ede37089..8dd4959028 100644
--- a/contrib/dict_int/dict_int.c
+++ b/contrib/dict_int/dict_int.c
@@ -30,7 +30,7 @@ PG_FUNCTION_INFO_V1(dintdict_lexize);
 Datum
 dintdict_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
DictInt*d;
ListCell   *l;
 
@@ -38,7 +38,7 @@ dintdict_init(PG_FUNCTION_ARGS)
d->maxlen = 6;
d->rejectlong = false;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c
index a79ece240c..0b8a32d459 100644
--- a/contrib/dict_xsyn/dict_xsyn.c
+++ b/contrib/dict_xsyn/dict_xsyn.c
@@ -140,7 +140,7 @@ read_dictionary(DictSyn *d, const char *filename)
 Datum
 dxsyn_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
DictSyn*d;
ListCell   *l;
char   *filename = NULL;
@@ -153,7 +153,7 @@ dxsyn_init(PG_FUNCTION_ARGS)
d->matchsynonyms = false;
d->keepsynonyms = true;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index 247c202755..2a2fbee5fa 100644
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -267,12 +267,12 @@ PG_FUNCTION_INFO_V1(unaccent_init);
 Datum
 unaccent_init(PG_FUNCTION_ARGS)
 {
-   List   *dictoptions = (List *) PG_GETARG_POINTER(0);
+   DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
TrieChar   *rootTrie = NULL;
boolfileloaded = false;
ListCell   *l;
 
-   foreach(l, dictoptions)
+   foreach(l, init_data->dict_options)
{
DefElem*defel = (DefElem *) lfirst(l);
 
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index 3a843512d1..3753e32b2c 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -386,17 +386,25 @@ verify_dictoptions(Oid tmplId, List *dictoptions)
}
else
{
+   DictInitData init_data;
+
/*
 * Copy the options just in case init method thinks it can scribble on
 * them ...
 */
dictoptions = copyObject(dictoptions);
 
+   init_data.dict_options = dictoptions;
+   init_data.dict.id = InvalidOid;
+   init_data.dict.xmin = InvalidTransactionId;
+   init_data.dict.xmax = InvalidTransactionId;
+   

Re: WIP: Covering + unique indexes.

2018-03-28 Thread Peter Eisentraut
On 1/25/18 23:19, Thomas Munro wrote:
> +  PRIMARY KEY ( <replaceable class="parameter">column_name</replaceable> [, ... ] ) <replaceable class="parameter">index_parameters</replaceable> <optional>INCLUDE
> (<replaceable class="parameter">column_name</replaceable> [,
> ...])</optional> |
> 
> I hadn't seen that use of "<optional>" before.  Almost everywhere else
> we use explicit [ and ] characters, but I see that there are other
> examples, and it is rendered as [ and ] in the output.

I think this will probably not come out right in the generated psql
help.  Check that please.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: JIT compiling with LLVM v12

2018-03-28 Thread David Steele

On 3/28/18 6:09 PM, Andres Freund wrote:
> On 2018-03-28 18:06:24 -0400, Peter Eisentraut wrote:
>> On 3/28/18 17:27, Andres Freund wrote:
>>> I've pushed these three.
>>
>> Great, now the only thing remaining is to prepare an unconference
>> session explaining all this to the rest of us. ;-)
>
> Hah! Happy to, if there's enough people interested.  I've a talk about
> it too (state of jit, 2018 edition), but I wasn't planning to go into
> too low level details. More about what is good, what is bad, and how we
> make it better ;)


+1 for an unconference session.  This is some seriously cool stuff.

--
-David
da...@pgmasters.net



Re: csv format for psql

2018-03-28 Thread Isaac Morland
On 28 March 2018 at 15:43, Joshua D. Drake  wrote:

> On 03/28/2018 12:35 PM, David G. Johnston wrote:
>
> > I like to call it "Character Separated Values" now for just that reason.
>
>
> Isn't the actual wording Character Delimited Values? I may be picking at
> hairs here but every single time I use anything to import a CSV or other
> delimited file (TAB or | usually) that is what the import screen says.
>

CSV stands for Comma Separated Values, and not anything else common as far
as I can tell. A Google search for "csv" turns up the Wikipedia page
describing the file format as the first hit, followed by the Wikipedia
disambiguation page for CSV, which links to the aforementioned Wikipedia
page as the only file-format-related link.

The actual reason I'm posting this is that some of the discussion has made
me a bit confused: there is already a CSV format defined for the COPY
command and used by psql's \copy. I just want to check that what is being
discussed here would use the exact same format as the existing CSV COPY
format, and that the configurability would be the exact same set of options
(which already includes being able to change the delimiter). I think it's
important that Postgres not have two CSVs with slightly different
behaviours. Indeed, psql could use COPY behind the scenes to generate the
CSV output, which would guarantee no nasty surprises.

If this is already widely accepted or if I'm horribly misunderstanding the
discussion then I'm sorry for the noise.


Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-28 Thread Tom Lane
Claudio Freire  writes:
> Attached patches, rebased and modified as discussed:
> 1 no longer does tree pruning, it just vacuums a range of the FSM
> 2 reintroduces tree pruning for the initial FSM vacuum
> 3 and 4 remain as they were, but rebased

I reviewed and cleaned up 0001.  The API for FreeSpaceMapVacuumRange was
underspecified, and the use of it seemed to have off-by-one errors.  Also,
you still had vacuum doing a full FreeSpaceMapVacuum call at the end;
I thought the point was to get rid of that.  We then need to make sure
we clean up after a truncation, but we can do that by introducing a
FreeSpaceMapVacuumRange call into FreeSpaceMapTruncateRel.  I think the
attached 0001 is committable, but feel free to take another look.

I still don't like 0002.  It's adding a lot of complexity, and
not-negligible overhead, to solve yesterday's problem.  After 0001,
there's no reason to assume that vacuum is particularly likely to get
cancelled between having made cleanups and having updated the upper FSM
levels.  (Maybe the odds are a bit more for the no-indexes case, but
that doesn't seem like it's common enough to justify a special mechanism
either.)

Not sure what to think about 0003.  At this point I'd be inclined
to flush UpdateFreeSpaceMap altogether and use FreeSpaceMapVacuumRange
in its place.  I don't think the design of that function is any better
chosen than its name, and possible bugs in its subroutines don't make
it more attractive.

Not sure about 0004 either.  The fact that we can't localize what part of
the index needs to be updated means that repeated IndexFreeSpaceMapVacuum
calls are a roughly quadratic cost.  Maybe in proportion to the other work
we have to do, they're not much, but on the other hand how much benefit is
there?  Should we make the call conditional on how many index pages got
updated?  Also, I wonder why you only touched nbtree and spgist.  (For
that matter, I wonder why BRIN doesn't go through IndexFreeSpaceMapVacuum
like the rest of the index AMs.  Or perhaps it has the right idea and we
should get rid of IndexFreeSpaceMapVacuum as a useless layer.)

regards, tom lane

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index f9da24c..d2a0066 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***
*** 85,90 
--- 85,99 
  #define VACUUM_TRUNCATE_LOCK_TIMEOUT			5000	/* ms */
  
  /*
+  * When a table has no indexes, vacuum the FSM after every 8GB, approximately
+  * (it won't be exact because we only vacuum FSM after processing a heap page
+  * that has some removable tuples).  When there are indexes, this is ignored,
+  * and we vacuum FSM after each index/heap cleaning pass.
+  */
+ #define VACUUM_FSM_EVERY_PAGES \
+ 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
+ 
+ /*
   * Guesstimation of number of dead tuples per page.  This is used to
   * provide an upper limit to memory allocated when vacuuming small
   * tables.
*** lazy_vacuum_rel(Relation onerel, int opt
*** 285,293 
  	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
   PROGRESS_VACUUM_PHASE_FINAL_CLEANUP);
  
- 	/* Vacuum the Free Space Map */
- 	FreeSpaceMapVacuum(onerel);
- 
  	/*
  	 * Update statistics in pg_class.
  	 *
--- 294,299 
*** lazy_scan_heap(Relation onerel, int opti
*** 465,471 
  	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
  	TransactionId relminmxid = onerel->rd_rel->relminmxid;
  	BlockNumber empty_pages,
! vacuumed_pages;
  	double		num_tuples,		/* total number of nonremovable tuples */
  live_tuples,	/* live tuples (reltuples estimate) */
  tups_vacuumed,	/* tuples cleaned up by vacuum */
--- 471,478 
  	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
  	TransactionId relminmxid = onerel->rd_rel->relminmxid;
  	BlockNumber empty_pages,
! vacuumed_pages,
! next_fsm_block_to_vacuum;
  	double		num_tuples,		/* total number of nonremovable tuples */
  live_tuples,	/* live tuples (reltuples estimate) */
  tups_vacuumed,	/* tuples cleaned up by vacuum */
*** lazy_scan_heap(Relation onerel, int opti
*** 501,506 
--- 508,514 
  		relname)));
  
  	empty_pages = vacuumed_pages = 0;
+ 	next_fsm_block_to_vacuum = (BlockNumber) 0;
  	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
  
  	indstats = (IndexBulkDeleteResult **)
*** lazy_scan_heap(Relation onerel, int opti
*** 752,757 
--- 760,772 
  			vacrelstats->num_dead_tuples = 0;
  			vacrelstats->num_index_scans++;
  
+ 			/*
+ 			 * Vacuum the Free Space Map to make newly-freed space visible on
+ 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+ 			 */
+ 			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+ 			next_fsm_block_to_vacuum = blkno;
+ 
  			/* Report that we are once again scanning the heap 

Re: new buildfarm with gcc & clang "trunk" -Werror?

2018-03-28 Thread Peter Eisentraut
On 3/28/18 04:06, Fabien COELHO wrote:
> Would the project consider it appropriate to:
> 
>   - fix these warnings (other than putting -Wno-format-truncation or
> similar workarounds...).

I've been tracking gcc-8, and I have a patch ready, but these things
change a bit with every compiler snapshot, so I'm waiting until things
settle down.

>   - add permanent gcc/clang trunk beasts with -Werror
> (if so, I'd suggest cannonball & seanettle for the names).

I would not like that.  The build farm success should ideally be a
function of correct PostgreSQL code, not external factors.  If an
in-progress compiler does funny things, what are we supposed to do about
that?

Generally, fixing up PostgreSQL for a new compiler release isn't a major
effort and can be done briefly before the release of the compiler.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fix a typo in walsender.c

2018-03-28 Thread Bruce Momjian
On Tue, Feb 27, 2018 at 07:22:20PM +0900, atorikoshi wrote:
> Hi,
> 
> Attached a minor patch for a typo: s/log/lag
> 
> Regards,
> 
> -- 
> Atsushi Torikoshi
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
> index d46374d..8bb1919 100644
> --- a/src/backend/replication/walsender.c
> +++ b/src/backend/replication/walsender.c
> @@ -1245,7 +1245,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
>  /*
>   * LogicalDecodingContext 'update_progress' callback.
>   *
> - * Write the current position to the log tracker (see XLogSendPhysical).
> + * Write the current position to the lag tracker (see XLogSendPhysical).
>   */
>  static void
>  WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid)

Thanks, patch applied.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


