Re: [PATCH] Implement INSERT SET syntax

2019-07-16 Thread Marko Tiikkaja
On Wed, Jul 17, 2019 at 7:30 AM Gareth Palmer 
wrote:

> Attached is a patch that adds the option of using SET clause to specify
> the columns and values in an INSERT statement in the same manner as that
> of an UPDATE statement.
>

Cool!  Thanks for working on this, I'd love to see the syntax in PG.

> There was a brief discussion regarding INSERT SET on pgsql-hackers in late
> August 2009 [1].
>

There was also at least one slightly more recent adventure:
https://www.postgresql.org/message-id/709e06c0-59c9-ccec-d216-21e38cb5ed61%40joh.to

You might want to check that thread too, in case any of the criticism there
applies to this patch as well.


.m


Re: pg_receivewal documentation

2019-07-16 Thread Laurenz Albe
On Wed, 2019-07-17 at 10:38 +0900, Michael Paquier wrote:
> +   
> +    Note that while WAL will be flushed with this setting,
> +    pg_receivewal never applies it, so
> +    synchronous_commit must not be set to
> +    remote_apply if pg_receivewal
> +    is the only synchronous standby. Similarly, if
> +    pg_receivewal is part of a quorum-based
> +    set of synchronous standbys, it won't count towards the quorum if
> +    synchronous_commit is set to
> +    remote_apply.
> +   
> 
> I think we should really document the caveat with priority-based sets
> of standbys as much as quorum-based sets.  For example if a user sets
> synchronous_commit = remote_apply in postgresql.conf, and then sets
> s_s_names to '2(pg_receivewal, my_connected_standby)' to get a
> priority-based set, then you have the same problem, and pg_receivewal
> is not the only synchronous standby in this configuration.  The patch
> does not cover that case properly.

I understand the concern, I'm just worried that too much accuracy may
render the sentence hard to read.

How about adding "or priority-based" after "quorum-based"?

Yours,
Laurenz Albe





Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Dilip Kumar
On Wed, Jul 17, 2019 at 9:27 AM Thomas Munro  wrote:
>
> On Wed, Jul 17, 2019 at 3:44 PM Dilip Kumar  wrote:
> > Right, actually I got that point.  But, I was thinking that we are
> > wasting one logno from undo log addressing space no?.  Instead, if we
> > can keep it attached to the slot and somehow manage to add to the free
> > list then the same logno can be used by someone else?
>
> We can never reuse log numbers.  UndoRecPtr values containing that log
> number could exist in permanent storage anywhere (zheap, zedstore etc)
> and must appear to be discarded forever if anyone asks.

Yeah, right.  I knew that we cannot reuse an UndoRecPtr, but forgot to
consider that reusing a logno is the same as reusing an UndoRecPtr.
Sorry for the noise.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: psql ctrl+f skips displaying of one record and displays skipping one line

2019-07-16 Thread vignesh C
Thanks Tom.
That sounds good to me.

On Wed, Jul 17, 2019 at 10:17 AM Tom Lane  wrote:

> vignesh C  writes:
> > I'm able to get the same behaviour on CentOS as well.
> > Should we do anything to handle this in Postgres, or is any documentation
> > required?
>
> It already is documented:
>
> PSQL_PAGER
> PAGER
>
> If a query's results do not fit on the screen, they are piped
> through this command. Typical values are more or less. Use of the
> pager can be disabled by setting PSQL_PAGER or PAGER to an empty
> string, or by adjusting the pager-related options of the \pset
> command. These variables are examined in the order listed; the
> first that is set is used. If none of them is set, the default is
> to use more on most platforms, but less on Cygwin.
>
> We're certainly not going to copy four or five different versions
> of the "more" and "less" man pages into psql's man page, if that's
> what you're suggesting.  Nor is it our job to point out shortcomings
> in some versions of those man pages.
>
> regards, tom lane
>


-- 
Regards,
vignesh
  Have a nice day


Re: buildfarm's typedefs list has gone completely nutso

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-16 19:35:39 -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-07-10 17:24:41 -0700, Andres Freund wrote:
> > Not yet sure what's actually going on, but there's something odd with
> > debug information afaict:
> > 
> > objdump -W spits out warnings for a few files, all static libraries:
> > 
> > ../install/lib/libpgcommon.a
> > objdump: Warning: Location list starting at offset 0x0 is not terminated.
> > objdump: Warning: Hole and overlap detection requires adjacent view lists 
> > and loclists.
> > objdump: Warning: There are 3411 unused bytes at the end of section 
> > .debug_loc
> 
> ...
> 
> 
> Interestingly, for the same files, readelf spits out correct
> data. E.g. here's a short excerpt from libpq.a:
> 
> objdump -W src/interfaces/libpq/libpq.a
> ...
>  <0>: Abbrev Number: 1 (DW_TAG_compile_unit)
>DW_AT_producer: (indirect string, offset: 0x0): 
> SYNC_FILE_RANGE_WRITE 2
> <10>   DW_AT_language: 12   (ANSI C99)
> <11>   DW_AT_name: (indirect string, offset: 0x0): 
> SYNC_FILE_RANGE_WRITE 2
> <15>   DW_AT_comp_dir: (indirect string, offset: 0x0): 
> SYNC_FILE_RANGE_WRITE 2
> ...
>  <1><31>: Abbrev Number: 2 (DW_TAG_typedef)
> <32>   DW_AT_name: Oid
> <36>   DW_AT_decl_file   : 30
> <37>   DW_AT_decl_line   : 31
> <38>   DW_AT_decl_column : 22
> <39>   DW_AT_type: <0x3d>
>  <1><3d>: Abbrev Number: 3 (DW_TAG_base_type)
> <3e>   DW_AT_byte_size   : 4
> <3f>   DW_AT_encoding: 7(unsigned)
> <40>   DW_AT_name: (indirect string, offset: 0x0): 
> SYNC_FILE_RANGE_WRITE 2
>  <1><44>: Abbrev Number: 3 (DW_TAG_base_type)
> <45>   DW_AT_byte_size   : 8
> <46>   DW_AT_encoding: 5(signed)
> <47>   DW_AT_name: (indirect string, offset: 0x0): 
> SYNC_FILE_RANGE_WRITE 2
>  <1><4b>: Abbrev Number: 4 (DW_TAG_typedef)
> <4c>   DW_AT_name: (indirect string, offset: 0x0): 
> SYNC_FILE_RANGE_WRITE 2
> <50>   DW_AT_decl_file   : 2
> <51>   DW_AT_decl_line   : 216
> <52>   DW_AT_decl_column : 23
> <53>   DW_AT_type: <0x57>

> so it seems that objdump mis-parses all indirect strings - which IIRC is
> something like a pointer into a "global" string table, assuming the
> offset to be 0. That then just returns the first table entry.
> 
> It doesn't happen with a slightly older version of binutils. Bisecting
> now.

It's

commit 39f0547e554df96608dd041d2a7b3c72882fd515 (HEAD, refs/bisect/bad)
Author: Nick Clifton 
Date:   2019-02-25 12:15:41 +

Extend objdump's --dwarf=follow-links option so that separate debug info 
files will also be affected by other dump function, and symbol tables from 
separate debug info files will be used when disassembling the main file.

* objdump.c (sym_ok): New function.
(find_symbol_for_address): Use new function.
(disassemble_section): Compare sections by name, not pointer.
(dump_dwarf): Move code to initialise byte_get pointer and iterate
over separate debug files from here to ...
(dump_bfd): ... here.  Add parameter indicating that a separate
debug info file is being dumped.  For main file, pull in the
symbol tables from all separate debug info files.
(display_object): Update call to dump_bfd.
* doc/binutils.texi: Document extened behaviour of the
--dwarf=follow-links option.
* NEWS: Mention this new feature.
* testsuite/binutils-all/objdump.WK2: Update expected output.
* testsuite/binutils-all/objdump.exp (test_follow_debuglink): Add
options and dump file parameters.
Add extra test.
* testsuite/binutils-all/objdump.WK3: New file.
* testsuite/binutils-all/readelf.exp: Change expected output for
readelf -wKis test.
* testsuite/binutils-all/readelf.wKis: New file.

luckily that allowed me to find a workaround too. If objdump -W's k and K
parameters (or --dwarf=links,=follow-links) aren't enabled, the dump
comes out correctly:

objdump -WlLiaprmfFsoRtUuTgAckK /tmp/test.o|grep -A5 '(DW_TAG_compile_unit)'
 <0>: Abbrev Number: 1 (DW_TAG_compile_unit)
   DW_AT_producer: (indirect string, offset: 0x0): /home/andres
<10>   DW_AT_language: 12   (ANSI C99)
<11>   DW_AT_name: (indirect string, offset: 0x0): /home/andres
<15>   DW_AT_comp_dir: (indirect string, offset: 0x0): /home/andres
<19>   DW_AT_low_pc  : 0x0

objdump -WlLiaprmfFsoRtUuTgAc /tmp/test.o|grep -A5 '(DW_TAG_compile_unit)'
 <0>: Abbrev Number: 1 (DW_TAG_compile_unit)
   DW_AT_producer: (indirect string, offset: 0x2a): GNU C17 9.1.0 
-mtune=generic -march=x86-64 -ggdb -fasynchronous-
<10>   DW_AT_language: 12   (ANSI C99)
<11>   DW_AT_name: (indirect string, offset: 0x14): /tmp/test.c
<15>   DW_AT_comp_dir: (indirect 

Re: block-level incremental backup

2019-07-16 Thread Jeevan Chalke
On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

> Hi Anastasia,
>
> On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>> 23.04.2019 14:08, Anastasia Lubennikova wrote:
>> > I'm volunteering to write a draft patch or, more likely, set of
>> > patches, which
>> > will allow us to discuss the subject in more detail.
>> > And to do that I wish we agree on the API and data format (at least
>> > broadly).
>> > Looking forward to hearing your thoughts.
>>
>> Though the previous discussion stalled,
>> I still hope that we could agree on basic points such as a map file
>> format and protocol extension,
>> which is necessary to start implementing the feature.
>>
>
> It's great that you too have come up with a PoC patch. I didn't look at your
> changes in much detail, but we at EnterpriseDB are also working on this
> feature and have started implementing it.
>
> Attached is the series of patches I have so far... (which need further
> optimization and adjustments, though)
>
> Here is the overall design (as proposed by Robert) we are trying to
> implement:
>
> 1. Extend the BASE_BACKUP command that can be used with replication
> connections. Add a new [ LSN 'lsn' ] option.
>
> 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
> the option added to the server in #1.
>
> Here are the implementation details when we have a valid LSN
>
> sendFile() in basebackup.c is the function which mostly does the thing for
> us. If the filename looks like a relation file, then we'll need to consider
> sending only a partial file. The way to do that is probably:
>
> A. Read the whole file into memory.
>
> B. Check the LSN of each block. Build a bitmap indicating which blocks
> have an LSN greater than or equal to the threshold LSN.
>
> C. If more than 90% of the bits in the bitmap are set, send the whole file
> just as if this were a full backup. This 90% is a constant now; we might
> make it a GUC later.
>
> D. Otherwise, send a file with .partial added to the name. The .partial
> file contains an indication of which blocks were changed at the beginning,
> followed by the data blocks. It also includes a checksum/CRC.
> Currently, a .partial file format looks like:
>  - start with a 4-byte magic number
>  - then store a 4-byte CRC covering the header
>  - then a 4-byte count of the number of blocks included in the file
>  - then the block numbers, each as a 4-byte quantity
>  - then the data blocks
>
>
> We are also working on combining these incremental back-ups with the full
> backup and for that, we are planning to add a new utility called
> pg_combinebackup. Will post the details on that later once we are on the
> same page about taking the backup.
>
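
To make the .partial layout quoted above a bit more concrete, here is a
minimal sketch of how such a header could be declared in C; the struct and
field names are hypothetical and are not taken from the attached patches:

#include <stdint.h>

/*
 * Hypothetical on-disk header for a .partial file, following the layout
 * described above: a magic number, a CRC covering the header, and the
 * number of blocks included in the file.  The header is followed by
 * 'nblocks' 4-byte block numbers and then the data blocks themselves.
 */
typedef struct PartialFileHeader
{
	uint32_t	magic;			/* 4-byte magic number */
	uint32_t	header_crc;		/* 4-byte CRC covering the header */
	uint32_t	nblocks;		/* number of blocks included in the file */
	/* nblocks 4-byte block numbers and the data blocks follow */
} PartialFileHeader;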

For combining a full backup with one or more incremental backups, we are
adding a new utility called pg_combinebackup in src/bin.

Here is the overall design as proposed by Robert.

pg_combinebackup starts from the LAST backup specified and works backward. It
must NOT start with the full backup and work forward. This is important both
for reasons of efficiency and of correctness. For example, if you start by
copying over the full backup and then later apply the incremental backups on
top of it, you'll copy data and later end up overwriting it or removing it.
Any files that are left over at the end that aren't in the final incremental
backup, even as .partial files, need to be removed, or the result is wrong.
We should aim for a system where every block in the output directory is
written exactly once and nothing ever has to be created and then removed.

To make that work, we should start by examining the final incremental backup.
We should proceed one file at a time. For each file:

1. If the complete file is present in the incremental backup, then just copy
it to the output directory - and move on to the next file.

2. Otherwise, we have a .partial file. Work backward through the backup chain
until we find a complete version of the file. That might happen when we get
back to the full backup at the start of the chain, but it might also happen
sooner - at which point we do not need to and should not look at earlier
backups for that file. During this phase, we should read only the HEADER of
each .partial file, building a map of which blocks we're ultimately going to
need to read from each backup. We can also compute the offset within each
file where that block is stored at this stage, again using the header
information.

3. Now, we can write the output file - reading each block in turn from the
correct backup and writing it to the output file, using the map we
constructed in the previous step. We should probably keep all of the input
files open over steps 2 and 3 and then close them at the end, because
repeatedly closing and opening them is going to be expensive. When that's
done, go on to the next file and start over at step 1.


We have already started working on this design.

-- 
Jeevan 

Re: Minimal logical decoding on standbys

2019-07-16 Thread Amit Khandekar
On Tue, 16 Jul 2019 at 22:56, Andres Freund  wrote:
>
> Hi,
>
> On 2019-07-12 14:53:21 +0530, tushar wrote:
> > On 07/10/2019 05:12 PM, Amit Khandekar wrote:
> > > All right. Will do that in the next patch set. For now, I have quickly
> > > done the below changes in a single patch again (attached), in order to
> > > get early comments if any.
> > Thanks Amit for your patch. I am able to see 1 issue on the standby server
> > (where the logical replication slot was created):
> > a) the size of the pg_wal folder is NOT decreasing even after firing the
> > get_changes function
>
> Even after calling pg_logical_slot_get_changes() multiple times? What
> does
> SELECT * FROM pg_replication_slots; before and after multiple calls return?
>
> Does manually forcing a checkpoint with CHECKPOINT; first on the primary
> and then the standby "fix" the issue?

I independently tried to reproduce this issue on my machine yesterday.
I observed that:
sometimes, the files get cleaned up after two or more calls to
pg_logical_slot_get_changes().
Sometimes, I have to restart the server to see the pg_wal files cleaned up.
This happens more or less the same even for a logical slot on the *primary*.

Will investigate further with Tushar.


>
>
> > b) pg_wal files are not being recycled, and every time new files are
> > created after firing the get_changes function
>
> I'm not sure what you mean by this. Are you saying that
> pg_logical_slot_get_changes() causes WAL to be written?
>
> Greetings,
>
> Andres Freund



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: psql ctrl+f skips displaying of one record and displays skipping one line

2019-07-16 Thread Tom Lane
vignesh C  writes:
> I'm able to get the same behaviour on CentOS as well.
> Should we do anything to handle this in Postgres, or is any documentation
> required?

It already is documented:

PSQL_PAGER
PAGER

If a query's results do not fit on the screen, they are piped
through this command. Typical values are more or less. Use of the
pager can be disabled by setting PSQL_PAGER or PAGER to an empty
string, or by adjusting the pager-related options of the \pset
command. These variables are examined in the order listed; the
first that is set is used. If none of them is set, the default is
to use more on most platforms, but less on Cygwin.

We're certainly not going to copy four or five different versions
of the "more" and "less" man pages into psql's man page, if that's
what you're suggesting.  Nor is it our job to point out shortcomings
in some versions of those man pages.

regards, tom lane




Re: A little report on informal commit tag usage

2019-07-16 Thread Michael Paquier
On Tue, Jul 16, 2019 at 04:33:07PM -0700, Andres Freund wrote:
> On 2019-07-16 19:26:59 -0400, Tom Lane wrote:
>> I've wondered for some time what you think the "-" means in this.
> 
> Up to master. Occasionally there's bugs that only need to be fixed in
> some back branches etc.

Is "-" most common to define a range of branches?  I would have
imagined that "~" makes more sense here.
--
Michael




Re: psql ctrl+f skips displaying of one record and displays skipping one line

2019-07-16 Thread vignesh C
I'm able to get the same behaviour on CentOS as well.
Should we do anything to handle this in Postgres, or is any documentation
required?

On Wed, Jul 17, 2019 at 10:05 AM Tom Lane  wrote:

> Thomas Munro  writes:
> > Pretty sure this is coming from your system's pager.  You can see the
> > same thing when you run this on a RHEL box:
> > seq 1 1 | more
> > It skips a line each time you press ^F.
>
> Yeah, duplicated on RHEL6.  It seems to behave the same as the documented
> "s" command.  Not sure why it's not listed in the man page --- though
> there's a disclaimer saying that the man page was basically
> reverse-engineered, so maybe they just missed this synonym.
>
> > Doesn't happen on FreeBSD or macOS though.
>
> macOS's "more" is actually "less", so it's not surprising it's not
> bug-compatible.  Can't say about FreeBSD.
>
> regards, tom lane
>


-- 
Regards,
vignesh
  Have a nice day


Re: psql ctrl+f skips displaying of one record and displays skipping one line

2019-07-16 Thread Tom Lane
Thomas Munro  writes:
> Pretty sure this is coming from your system's pager.  You can see the
> same thing when you run this on a RHEL box:
> seq 1 1 | more
> It skips a line each time you press ^F.

Yeah, duplicated on RHEL6.  It seems to behave the same as the documented
"s" command.  Not sure why it's not listed in the man page --- though
there's a disclaimer saying that the man page was basically
reverse-engineered, so maybe they just missed this synonym.

> Doesn't happen on FreeBSD or macOS though.

macOS's "more" is actually "less", so it's not surprising it's not
bug-compatible.  Can't say about FreeBSD.

regards, tom lane




[PATCH] Implement INSERT SET syntax

2019-07-16 Thread Gareth Palmer
Hello,

Attached is a patch that adds the option of using SET clause to specify
the columns and values in an INSERT statement in the same manner as that
of an UPDATE statement.

A simple example that uses SET instead of a VALUES() clause:

INSERT INTO t SET c1 = 'foo', c2 = 'bar', c3 = 'baz';

Values may also be sourced from a CTE using a FROM clause:

WITH x AS (
  SELECT 'foo' AS c1, 'bar' AS c2, 'baz' AS c3
)
INSERT INTO t SET c1 = x.c1, c2 = x.c2, c3 = x.c3 FROM x;

The advantage of using the SET clause style is that the column and value
are kept together, which can make changing or removing a column or value from
a large list easier.

Internally the grammar parser converts INSERT SET without a FROM clause into
the equivalent INSERT with a VALUES clause. When using a FROM clause it becomes
the equivalent of INSERT with a SELECT statement.
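
For example, based on that conversion, the two statements above should be
equivalent to the following (illustrative only, not text from the patch):

INSERT INTO t (c1, c2, c3) VALUES ('foo', 'bar', 'baz');

WITH x AS (
  SELECT 'foo' AS c1, 'bar' AS c2, 'baz' AS c3
)
INSERT INTO t (c1, c2, c3) SELECT x.c1, x.c2, x.c3 FROM x;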

There was a brief discussion regarding INSERT SET on pgsql-hackers in late
August 2009 [1].

INSERT SET is not part of any SQL standard (that I am aware of), however this
syntax is also implemented by MySQL [2]. Their implementation does not support
specifying a FROM clause.

Patch also contains regression tests and documentation.


Regards,
Gareth


[1] 
https://www.postgresql.org/message-id/flat/2c5ef4e30908251010s46d9d566m1da21357891bab3d%40mail.gmail.com
[2] https://dev.mysql.com/doc/refman/8.0/en/insert.html



insert-set-v1.patch
Description: Binary data


Re: psql ctrl+f skips displaying of one record and displays skipping one line

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 4:07 PM vignesh C  wrote:
> One observation: when we execute a SELECT query whose results take up more
> than the available screen space and press ctrl+f to display the remaining
> records, one of the records is not displayed and the message "...skipping
> one line" is displayed.
>
> I'm not sure if this is intentional behaviour.

Pretty sure this is coming from your system's pager.  You can see the
same thing when you run this on a RHEL box:

seq 1 1 | more

It skips a line each time you press ^F.

Doesn't happen on FreeBSD or macOS though.


--
Thomas Munro
https://enterprisedb.com




psql ctrl+f skips displaying of one record and displays skipping one line

2019-07-16 Thread vignesh C
Hi,

One observation: when we execute a SELECT query whose results take up more
than the available screen space and press ctrl+f to display the remaining
records, one of the records is not displayed and the message "...skipping
one line" is displayed.

I'm not sure if this is intentional behaviour.

Steps for the same:
postgres=# create table psqltest as select generate_series(1,50);
SELECT 50
postgres=# select * from psqltest;
 generate_series
-----------------
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
  21
  22
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45

*...skipping one line*
  47
  48
  49
  50
(50 rows)

Is this intended?

-- 
Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 3:44 PM Dilip Kumar  wrote:
> Right, actually I got that point.  But, I was thinking that we are
> wasting one logno from undo log addressing space no?.  Instead, if we
> can keep it attached to the slot and somehow manage to add to the free
> list then the same logno can be used by someone else?

We can never reuse log numbers.  UndoRecPtr values containing that log
number could exist in permanent storage anywhere (zheap, zedstore etc)
and must appear to be discarded forever if anyone asks.  Now, it so
happens that the current coding in zheap has fxid + urp for each
transaction slot and always checks the fxid first so it probably
wouldn't ask about discarded urps too much, but I don't think that
policy is a requirement and the undo layer can't count on it.  I think
I heard that zedstore is planning to check urp only.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Dilip Kumar
On Wed, Jul 17, 2019 at 3:48 AM Thomas Munro  wrote:
>
> On Tue, Jul 16, 2019 at 11:33 PM Dilip Kumar  wrote:
> > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro  wrote:
> > /* If we discarded everything, the slot can be given up. */
> > + if (entirely_discarded)
> > + free_undo_log_slot(slot);
> >
> > I have noticed that when the undo log was detached and it was full
> > then if we discard complete log we release its slot.  But, what is
> > bothering me is should we add that log to the free list?  Or I am
> > missing something?
>
> Stepping back a bit:  The free lists are for undo logs that someone
> might want to attach to and insert into.  If it's full, we probably
> can't insert anything into it again (well, technically someone else
> who wants to insert something a bit smaller might be able to, but
> that's not an interesting case to worry about).  So it doesn't need to
> go back on a free list, but it still needs to exist (= occupy a slot)
> as long as there is undiscarded data in it, because that data is
> needed and we need to be able to test URPs against its discard
> pointer.  But once its data is entirely discarded, it ceases to exist
> -- there is no reason to waste a slot on it,

Right, actually I got that point.  But, I was thinking that we are
wasting one logno from undo log addressing space no?.  Instead, if we
can keep it attached to the slot and somehow manage to add to the free
list then the same logno can be used by someone else?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




[PATCH] Make configuration file "include" directive handling more robust

2019-07-16 Thread Ian Barwick

Hi

While poking about with [1], I noticed a few potential issues with the
inclusion handling for configuration files; another issue is demonstrated in 
[2].

[1] 
https://www.postgresql.org/message-id/aed6cc9f-98f3-2693-ac81-52bb0052307e%402ndquadrant.com
   ("Stop ALTER SYSTEM from making bad assumptions")
[2] 
https://www.postgresql.org/message-id/CY4PR1301MB22001D3FAAB3499C5D41DE23A9E50%40CY4PR1301MB2200.namprd13.prod.outlook.com
   ("First Time Starting Up PostgreSQL and Having Problems")

Specifically these are:

1) Provision of empty include directives
=========================================

The default postgresql.conf file includes these thusly:

#include_dir = ''                       # include files ending in '.conf' from
                                        # a directory, e.g., 'conf.d'
#include_if_exists = ''                 # include file only if it exists
#include = ''                           # include file

Currently, uncommenting them but leaving the value empty (as happened in [2] 
above) can
result in unexpected behaviour.

For "include" and "include_if_exists", it's a not critical issue as non-existent
files are, well, non-existent, however this will leave behind the cryptic
message "input in flex scanner failed" in pg_file_settings's "error" column, 
e.g.:

postgres=# SELECT sourceline, seqno, name, setting, applied, error
 FROM pg_file_settings
WHERE error IS NOT NULL;
 sourceline | seqno | name | setting | applied |            error
------------+-------+------+---------+---------+------------------------------
          1 |    45 |      |         | f       | input in flex scanner failed
          1 |    46 |      |         | f       | input in flex scanner failed
(2 rows)

However, an empty value for "include_dir" will result in the current 
configuration
file's directory being read, which can result in circular inclusion and 
triggering
of the nesting depth check.

Patch {1} makes providing an empty value for any of these directives cause
configuration file processing to report an appropriate error, e.g.:

postgres=# SELECT sourceline, seqno, name, setting, applied, error
 FROM pg_file_settings
WHERE error IS NOT NULL;
 sourceline | seqno | name | setting | applied |                 error
------------+-------+------+---------+---------+---------------------------------------
        757 |    45 |      |         | f       | "include" must not be empty
        758 |    46 |      |         | f       | "include_if_exists" must not be empty
        759 |    47 |      |         | f       | "include_dir" must not be empty


2) Circular inclusion of configuration files
============================================

Currently there is a simple maximum nesting threshold (currently 10) which
will stop runaway circular inclusions. However, if triggered, it's not
always obvious what triggered it, and sometimes resource exhaustion
might kick in beforehand (as appeared to be the case in [2] above).

Patch {2} attempts to handle this situation by keeping track of which
files have already been included (based on their absolute, canonical
path) and reporting an error if they were encountered again.

On server startup:

2019-07-11 09:13:25.610 GMT [71140] LOG:  configuration file 
"/var/lib/pgsql/data/postgresql.conf" was previously parsed
2019-07-11 09:13:25.610 GMT [71140] FATAL:  configuration file 
"/var/lib/pgsql/data/postgresql.conf" contains errors

After sending SIGHUP:

postgres=# SELECT sourceline, seqno, name, setting, applied, error FROM 
pg_file_settings WHERE error IS NOT NULL;
 sourceline | seqno | name | setting | applied |                                      error
------------+-------+------+---------+---------+---------------------------------------------------------------------------------
        757 |    45 |      |         | f       | configuration file "/var/lib/pgsql/data/postgresql.conf" was previously parsed
(1 row)
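
As a rough standalone illustration of the duplicate-detection idea (this is
not the code from patch {2}; the function name, data structure and limit
below are made up, and a POSIX environment is assumed for realpath() and
strdup()), the check boils down to canonicalising each path and refusing to
parse one that has already been seen:

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_INCLUDED_FILES 64		/* arbitrary for this sketch */

static char *included_files[MAX_INCLUDED_FILES];
static int	n_included_files = 0;

/*
 * Return true if the file may be parsed, false if its canonical path has
 * already been seen (i.e. a circular or duplicate inclusion).
 */
bool
register_config_file(const char *path)
{
	char		resolved[PATH_MAX];

	if (realpath(path, resolved) == NULL)
		return true;			/* let the normal open/parse error handle it */

	for (int i = 0; i < n_included_files; i++)
	{
		if (strcmp(included_files[i], resolved) == 0)
		{
			fprintf(stderr,
					"configuration file \"%s\" was previously parsed\n",
					resolved);
			return false;
		}
	}

	if (n_included_files < MAX_INCLUDED_FILES)
		included_files[n_included_files++] = strdup(resolved);
	return true;
}

Comparing canonical paths rather than the paths as written is what lets this
kind of check catch the same file being reached via symlinks or different
relative spellings.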

3) "include" directives in postgresql.auto.conf and extension control files
===========================================================================

Currently these are parsed and acted on, even though it makes no sense for 
further
config files to be included in either case.

With "postgresql.auto.conf", if a file is successfully included, its contents
will then be written to "postgresql.auto.conf" and the include directive will be
removed, which seems like a recipe for confusion.

These are admittedly unlikely corner cases, but it's easy enough to stop this
happening on the off chance someone tries to use this to solve some problem in
completely the wrong way.

Patch {3} implements this (note that this patch depends on patch {2}).

Extension example:

postgres=# CREATE EXTENSION repmgr;
ERROR:  "include" not permitted in file 

Re: SegFault on 9.6.14

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 12:57 PM Thomas Munro  wrote:
> On Wed, Jul 17, 2019 at 12:44 PM Thomas Munro  wrote:
> > > #11 0x55666e0359df in ExecShutdownNode 
> > > (node=node@entry=0x55667033a6c8)
> > > at 
> > > /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:830
> > > #12 0x55666e04d0ff in ExecLimit (node=node@entry=0x55667033a428)
> > > at 
> > > /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeLimit.c:139
> >
> > https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/executor/nodeLimit.c#L139
> >
> > Limit thinks it's OK to "shut down" the subtree, but if you shut down a
> > Gather node you can't rescan it later because it destroys its shared
> > memory.  Oops.  Not sure what to do about that yet.
>
> CCing Amit and Robert, authors of commits 19df1702 and 69de1718.

Here's a repro (I'm sure you can find a shorter one, this one's hacked
up from join_hash.sql, basically just adding LIMIT):

create table join_foo as select generate_series(1, 3000) as id,
'x'::text as t;
alter table join_foo set (parallel_workers = 0);
create table join_bar as select generate_series(0, 1) as id,
'x'::text as t;
alter table join_bar set (parallel_workers = 2);

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set max_parallel_workers_per_gather = 2;
set enable_material = off;
set enable_mergejoin = off;
set work_mem = '1GB';

select count(*) from join_foo
left join (select b1.id, b1.t from join_bar b1 join join_bar b2
using (id) limit 1000) ss
on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;

-- 
Thomas Munro
https://enterprisedb.com




Re: refactoring - share str2*int64 functions

2019-07-16 Thread Michael Paquier
On Tue, Jul 16, 2019 at 01:04:38PM -0700, Andres Freund wrote:
> There is the issue that there already is pg_strtoint16 and
> pg_strtoint32, which do not have the option to not raise an error.  I'd
> probably name the non-error throwing ones something like pg_strtointNN_e
> (for extended, or error handling), and have pg_strtointNN wrappers that
> just handle the errors, or reverse the naming (which'd cause a bit of
> churn, but not that much).
> 
> That'd also make the code for pg_strtointNN a bit nicer, because we'd
> not need the gotos anymore, they're just there to avoid redundant error
> messages - which'd not be an issue if the error handling were just a
> switch in a separate function. E.g.

Agreed on that.  I am wondering if we should use a common wrapper for
all the internal functions, taking as input a set of bits16 flags to
control its behavior, and put all that into common/script.c:
- Issue an error.
- Check for signedness.
- Base length: 16, 32 or 64.
This would have the advantage of moving the error string generation, the
trailing whitespace check, sign handling and such into a single place.
We could have the internal routine return uint64, which is cast
afterwards to a proper result depending on what we use.  (Perhaps
that's what you actually meant?)
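
For what it's worth, a standalone sketch of the kind of layering being
discussed might look like the following.  The names (my_strtoint32_e,
my_strtoint32) and the error handling are purely illustrative and are not
the existing pg_strtoint* functions; strtol() is used here only for brevity,
and the backend would report errors through ereport() rather than
fprintf()/exit():

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum
{
	STRTOINT_OK,
	STRTOINT_SYNTAX_ERROR,
	STRTOINT_RANGE_ERROR
} strtoint_status;

/*
 * "Extended" variant: never raises an error itself, just reports what went
 * wrong through the return value.
 */
strtoint_status
my_strtoint32_e(const char *str, int *result)
{
	char	   *end;
	long		val;

	errno = 0;
	val = strtol(str, &end, 10);
	if (end == str || *end != '\0')
		return STRTOINT_SYNTAX_ERROR;
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return STRTOINT_RANGE_ERROR;
	*result = (int) val;
	return STRTOINT_OK;
}

/*
 * Error-raising wrapper: the only place that knows how to turn a status
 * into a message.
 */
int
my_strtoint32(const char *str)
{
	int			result;

	switch (my_strtoint32_e(str, &result))
	{
		case STRTOINT_OK:
			return result;
		case STRTOINT_SYNTAX_ERROR:
			fprintf(stderr, "invalid input syntax for integer: \"%s\"\n", str);
			exit(1);
		case STRTOINT_RANGE_ERROR:
			fprintf(stderr, "value \"%s\" is out of range\n", str);
			exit(1);
	}
	return 0;					/* keep the compiler quiet */
}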

I would also rather not touch the strtol wrappers that we have which are
able to handle the base.  There is nothing in the tree using them, but
likely there are extensions relying on them.  Switching all the existing
callers in the tree to the new routines sounds good to me of course.

Consolidating all that still needs more work, so for now I am
switching the patch to waiting on author.
--
Michael




Re: refactoring - share str2*int64 functions

2019-07-16 Thread Michael Paquier
On Tue, Jul 16, 2019 at 01:18:38PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-07-16 16:11:44 +0900, Michael Paquier wrote:
> Yea, consistent naming seems like a strong requirement
> here. Additionally I think we should just provide a consistent set
> rather than what's needed just now. That'll just lead to people
> inventing their own again down the line.

Agreed.  The first versions of pg_rewind in the tree have been using
copy_file_range(), which has been introduced in Linux.

> > - The str->integer conversion routines, which actually have very
> > similar characteristics to the strtol families as they remove trailing
> > whitespaces first, check for a sign, etc, except that they work only
> > on base 10.
> 
> There's afaict neither a caller that needs the base argument at the
> moment, nor one in the tree previously. I'd argue for just making
> pg_strtouint64's API consistent.

Good point; indeed, this could be simplified much more.  I had not
paid attention to that part.

> I'd probably also just use the implementation we have for signed
> integers (minus the relevant negation and overflow checks, obviously) -
> it's a lot faster, and I think there's value in keeping the
> implementations in sync.

You mean that it is much faster than the set of wrappers for strtol
that we have?  Is that because we don't care about the base?
--
Michael




Re: POC: converting Lists into arrays

2019-07-16 Thread David Rowley
On Wed, 17 Jul 2019 at 11:06, Tom Lane  wrote:
> 0002 changes some additional places where it's maybe a bit less safe,
> ie there's a potential for user-visible behavioral change because
> processing will occur in a different order.  In particular, the proposed
> change in execExpr.c causes aggregates and window functions that are in
> the same plan node to be executed in a different order than before ---
> but it seems to me that this order is saner.  (Note the change in the
> expected regression results, in a test that's intentionally sensitive to
> the execution order.)  And anyway when did we guarantee anything about
> that?

I've only looked at 0002. Here are my thoughts:

get_tables_to_cluster:
Looks fine. It's a heap scan. Any previous order was accidental, so if
it causes issues then we might need to think of using a more
well-defined order for CLUSTER;

get_rels_with_domain:
This is a static function. Changing the order of the list seems to
only really affect the error message that a failed domain constraint
validation could emit. Perhaps this might break someone else's tests,
but they should just be able to update their expected results.

ExecInitExprRec:
As you mention, the order of aggregate evaluation is reversed. I agree
that the new order is saner. I can't think why we'd be doing it in
backwards order beforehand.

get_relation_statistics:
RelationGetStatExtList does not seem to pay much attention to the
order it returns its results, so I don't think the order we apply
extended statistics was that well defined before. We always attempt to
use the stats with the most matching columns in
choose_best_statistics(), so I think
for people to be affected they'd either multiple stats with the same
sets of columns or a complex clause that equally well matches two sets
of stats, and in that case the other columns would be matched to the
other stats later... I'd better check that... erm... actually that's
not true. I see statext_mcv_clauselist_selectivity() makes no attempt
to match the clause list to another set of stats after finding the
first best match. I think it likely should do that.
estimate_multivariate_ndistinct() seems to have an XXX comment
mentioning thoughts about the stability of which stats are used, but
nothing is done.

parseCheckAggregates:
I can't see any user-visible change to this one. Not even in error messages.

estimate_num_groups:
Similar to get_relation_statistics(), I see that
estimate_multivariate_ndistinct() is only called once and we don't
attempt to match up the remaining clauses with more stats. I can't
imagine swapping lcons for lappend here will upset anyone. The
behaviour does not look well defined already. I think we should likely
change the "if (estimate_multivariate_ndistinct(root, rel,
," to "while ...", then drop the else. Not for this patch
though...

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Adding SMGR discriminator to buffer tags

2019-07-16 Thread Thomas Munro
On Tue, Jul 16, 2019 at 10:49 AM Thomas Munro  wrote:
> I'll go and commit the simple refactoring bits of this work, which
> just move some stuff belonging to md.c out of smgr.c (see attached).

Pushed.  The rest of that earlier patch set is hereby abandoned (at
least for now).  I'll be posting a new-and-improved undo log patch set
soon, now a couple of patches smaller but back to magic database 9.  I
think I'll probably do that with a new catalog header file that
defines pseudo-database OIDs.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Amit Kapila
On Wed, Jul 17, 2019 at 3:53 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-07-15 12:26:21 -0400, Robert Haas wrote:
> > Yeah. I didn't understand that explanation.  It seems to me that one
> > of the fundamental design questions for this system is whether we
> > should allow there to be an unbounded number of transactions that are
> > pending undo application, or whether it's OK to enforce a hard limit.
> > Either way, there should certainly be pressure applied to try to keep
> > the number low, like forcing undo application into the foreground when
> > a backlog is accumulating, but the question is what to do when that's
> > insufficient.  My original idea was that we should not have a hard
> > limit, in which case the shared memory data on what is pending might
> > be incomplete, in which case we would need the discard workers to
> > discover transactions needing undo and add them to the shared memory
> > data structures, and if those structures are full, then we'd just skip
> > adding those details and rediscover those transactions again at some
> > future point.
> >
> > But, my understanding of the current design being implemented is that
> > there is a hard limit on the number of transactions that can be
> > pending undo and the in-memory data structures are sized accordingly.
>
> My understanding is that that's really just an outcome of needing to
> maintain oldestXidHavingUndo accurately, right?
>

Yes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




RE: Bug: ECPG: Cannot use CREATE AS EXECUTE statemnt

2019-07-16 Thread Matsumura, Ryo
Meskes-san

Thank you for your comment.

I attach a patch.
It doesn't include tests, but it passed some test(*1).

Explanation about the patch:

- Add a new statement type, ECPGst_exec_embedded_in_other_stmt, used
  whether the EXECUTE statement has an exprlist or not.

  This type name may not be good.
  It is a type for [CREATE TABLE ... AS EXECUTE ...].
  But it does not consider [EXPLAIN EXECUTE ...].

- If the statement type is the new one, ecpglib embeds the variables into
  the query in text format in ecpg_build_params().
  Even if the statement does not have an exprlist, ecpglib builds an
  exprlist and embeds the variables into it.
  The list is expanded incrementally in the loop of ecpg_build_params().

- ecpg_build_params() is difficult to read and to insert the above logic
  into. Therefore, I refactored it. The details are described in comments.

(*1) The following statements run as expected.
  exec sql create table if not exists foo (c1 int);
  exec sql insert into foo select generate_series(1, 20);
  exec sql prepare st as select * from foo where c1 % $1 = 0 and c1 % $2 = 0;

  exec sql execute st using :v1,:v2;
  exec sql execute st(:v1,:v2);
  exec sql create table if not exists bar (c1) as execute st(2, 3);
  exec sql create table if not exists bar (c1) as execute st using 2,3;
  exec sql create table if not exists bar (c1) as execute st using :v1,:v2;
  exec sql create table bar (c1) as execute st using :v1,:v2;

Regards
Ryo Matsumura


ecpg_createas_execute.v1.0.patch
Description: ecpg_createas_execute.v1.0.patch


Re: buildfarm's typedefs list has gone completely nutso

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-10 17:24:41 -0700, Andres Freund wrote:
> Not yet sure what's actually going on, but there's something odd with
> debug information afaict:
> 
> objdump -W spits out warnings for a few files, all static libraries:
> 
> ../install/lib/libpgcommon.a
> objdump: Warning: Location list starting at offset 0x0 is not terminated.
> objdump: Warning: Hole and overlap detection requires adjacent view lists and 
> loclists.
> objdump: Warning: There are 3411 unused bytes at the end of section .debug_loc

...


Interestingly, for the same files, readelf spits out correct
data. E.g. here's a short excerpt from libpq.a:

objdump -W src/interfaces/libpq/libpq.a
...
 <0>: Abbrev Number: 1 (DW_TAG_compile_unit)
   DW_AT_producer: (indirect string, offset: 0x0): 
SYNC_FILE_RANGE_WRITE 2
<10>   DW_AT_language: 12   (ANSI C99)
<11>   DW_AT_name: (indirect string, offset: 0x0): 
SYNC_FILE_RANGE_WRITE 2
<15>   DW_AT_comp_dir: (indirect string, offset: 0x0): 
SYNC_FILE_RANGE_WRITE 2
...
 <1><31>: Abbrev Number: 2 (DW_TAG_typedef)
<32>   DW_AT_name: Oid
<36>   DW_AT_decl_file   : 30
<37>   DW_AT_decl_line   : 31
<38>   DW_AT_decl_column : 22
<39>   DW_AT_type: <0x3d>
 <1><3d>: Abbrev Number: 3 (DW_TAG_base_type)
<3e>   DW_AT_byte_size   : 4
<3f>   DW_AT_encoding: 7(unsigned)
<40>   DW_AT_name: (indirect string, offset: 0x0): 
SYNC_FILE_RANGE_WRITE 2
 <1><44>: Abbrev Number: 3 (DW_TAG_base_type)
<45>   DW_AT_byte_size   : 8
<46>   DW_AT_encoding: 5(signed)
<47>   DW_AT_name: (indirect string, offset: 0x0): 
SYNC_FILE_RANGE_WRITE 2
 <1><4b>: Abbrev Number: 4 (DW_TAG_typedef)
<4c>   DW_AT_name: (indirect string, offset: 0x0): 
SYNC_FILE_RANGE_WRITE 2
<50>   DW_AT_decl_file   : 2
<51>   DW_AT_decl_line   : 216
<52>   DW_AT_decl_column : 23
<53>   DW_AT_type: <0x57>
...

readelf --debug-dump src/interfaces/libpq/libpq.a
...
 <0>: Abbrev Number: 1 (DW_TAG_compile_unit)
   DW_AT_producer: (indirect string, offset: 0x29268): GNU C17 8.3.0 
-march=skylake -mmmx -mno-3dnow -msse -msse2 -m
<10>   DW_AT_language: 12   (ANSI C99)
<11>   DW_AT_name: (indirect string, offset: 0x28ef3): 
/home/andres/src/postgresql/src/interfaces/libpq/fe-auth.c
<15>   DW_AT_comp_dir: (indirect string, offset: 0xf800): 
/home/andres/build/postgres/dev-assert/vpath/src/interfaces/l
...
 <1><31>: Abbrev Number: 2 (DW_TAG_typedef)
<32>   DW_AT_name: Oid
<36>   DW_AT_decl_file   : 30
<37>   DW_AT_decl_line   : 31
<38>   DW_AT_decl_column : 22
<39>   DW_AT_type: <0x3d>
 <1><3d>: Abbrev Number: 3 (DW_TAG_base_type)
<3e>   DW_AT_byte_size   : 4
<3f>   DW_AT_encoding: 7(unsigned)
<40>   DW_AT_name: (indirect string, offset: 0x4f12f): unsigned int
 <1><44>: Abbrev Number: 3 (DW_TAG_base_type)
<45>   DW_AT_byte_size   : 8
<46>   DW_AT_encoding: 5(signed)
<47>   DW_AT_name: (indirect string, offset: 0x57deb): long int
 <1><4b>: Abbrev Number: 4 (DW_TAG_typedef)
<4c>   DW_AT_name: (indirect string, offset: 0x26129): size_t
<50>   DW_AT_decl_file   : 2
<51>   DW_AT_decl_line   : 216
<52>   DW_AT_decl_column : 23
<53>   DW_AT_type: <0x57>
...

so it seems that objdump mis-parses all indirect strings - which IIRC is
something like a pointer into a "global" string table, assuming the
offset to be 0. That then just returns the first table entry.

It doesn't happen with a slightly older version of binutils. Bisecting
now.

Greetings,

Andres Freund




pg_stat_replication lag fields return non-NULL values even with NULL LSNs

2019-07-16 Thread Michael Paquier
Hi all,
(Thomas in CC as per 6912acc0)

I got surprised by the following behavior from pg_stat_get_wal_senders
when connecting for example pg_receivewal to a primary:
=# select application_name, flush_lsn, replay_lsn, flush_lag,
replay_lag from pg_stat_replication;
 application_name | flush_lsn | replay_lsn |    flush_lag    |   replay_lag
------------------+-----------+------------+-----------------+-----------------
 receivewal       | null      | null       | 00:09:13.578185 | 00:09:13.578185
(1 row)

It makes little sense to me, as we are reporting a replay lag on a
position which has never been reported yet, so it cannot actually be
used as a comparison base for the lag.  Am I missing something or
should we return NULL for those fields if we have no write, flush or
apply LSNs like in the attached?

Thoughts?
--
Michael
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e7a59b0a92..61fc9889db 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3346,17 +3346,17 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(flush) ? 0 : priority;
 
-            if (writeLag < 0)
+            if (writeLag < 0 || XLogRecPtrIsInvalid(write))
                 nulls[6] = true;
             else
                 values[6] = IntervalPGetDatum(offset_to_interval(writeLag));
 
-            if (flushLag < 0)
+            if (flushLag < 0 || XLogRecPtrIsInvalid(flush))
                 nulls[7] = true;
             else
                 values[7] = IntervalPGetDatum(offset_to_interval(flushLag));
 
-            if (applyLag < 0)
+            if (applyLag < 0 || XLogRecPtrIsInvalid(apply))
                 nulls[8] = true;
             else
                 values[8] = IntervalPGetDatum(offset_to_interval(applyLag));




Re: Extracting only the columns needed for a query

2019-07-16 Thread Melanie Plageman
We implemented Approach B in the attached patch set (patch 0001) and
then implemented Approach A (patch 0002) to sanity check the pruned
list of columns to scan we were getting at plan-time.
We emit a notice in SeqNext() if the two column sets differ.
Currently, for all of the queries in the regression suite, no
differences are found.
We did notice that none of the current regression tests for triggers
are showing a difference between columns that can be extracted at plan
time and those that can be extracted at execution time--though we
haven't dug into this at all.

In our implementation of Approach B, we extract the columns to scan in
make_one_rel() after set_base_rel_sizes() and before
set_base_rel_pathlists() so that the scan cols can be used in costing.
We do it after calling set_base_rel_sizes() because it eventually
calls set_append_rel_size() which adds PathTarget exprs for the
partitions with Vars having the correct varattno and varno.

We added the scanCols to RangeTblEntries because it seemed like the
easiest way to make sure the information was available at scan time
(as suggested by David).

A quirk in the patch for Approach B is that, in inheritance_planner(),
we copy over the scanCols which we populated in each subroot's rtable
to the final rtable.
The rtables for all of the subroots seem to be basically unused after
finishing with inheritance_planner(). That is, many other members of
the modified child PlannerInfos are copied over to the root
PlannerInfo, but the rtables seem to be an exception.
If we wanted to get at them later, we would have to go rooting
around in the pathlist of the RelOptInfo, which seemed very complex.

One note: we did not add any logic to make the extraction of scan
columns conditional on the AM. We have code to do it conditionally in
the zedstore patch, but we made it generic here.

While we were implementing this, we found that in the columns
extracted at plan-time, there were excess columns for
UPDATE/DELETE...RETURNING on partition tables.
Vars for these columns are being added to the targetlist in
preprocess_targetlist(). Eventually targetlist items are added to
RelOptInfo->reltarget exprs.
However, when result_relation is 0, this means all of the vars from
the returningList will be added to the targetlist, which seems
incorrect. We included a patch (0003) to only add the vars if
result_relation is not 0.

Adding the scanCols in RangeTblEntry, we noticed that a few other
members of RangeTblEntry are consistently initialized to NULL whenever
a new RangeTblEntry is being made.
This is confusing because makeNode() for a RangeTblEntry does
palloc0() anyway, so, why is this necessary?
If it is necessary, why not create some kind of generic initialization
function to do this?

Thanks,
Ashwin & Melanie
From 3b8515aaea164174d18ef2450b19478de8e64298 Mon Sep 17 00:00:00 2001
From: csteam 
Date: Tue, 16 Jul 2019 16:19:49 -0700
Subject: [PATCH v1 2/3] Execution-time scan col extraction and comparison with
 plan-time cols

Extract the columns needed to scan directly before scanning from the
ScanState and compare this set of columns with those extracted at
plan-time.

In regress, there are two main queries where there is a difference in
the columns extracted at plan time vs execution time. They are both due
to the fact that UPDATE/DELETE on partition tables adds the contents of
the returningList to the PathTargets in the RelOptInfos. This manifests
as a difference in the column sets.
---
 src/backend/executor/execScan.c| 63 ++
 src/backend/executor/nodeSeqscan.c | 24 
 src/include/executor/executor.h|  4 ++
 3 files changed, 91 insertions(+)

diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index c0e4a5376c..89146693a3 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -20,6 +20,7 @@
 
 #include "executor/executor.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "utils/memutils.h"
 
 
@@ -300,3 +301,65 @@ ExecScanReScan(ScanState *node)
 		}
 	}
 }
+
+typedef struct neededColumnContext
+{
+	Bitmapset **mask;
+	int n;
+} neededColumnContext;
+
+static bool
+neededColumnContextWalker(Node *node, neededColumnContext *c)
+{
+	if (node == NULL)
+		return false;
+
+	if (IsA(node, Var))
+	{
+		Var *var = (Var *)node;
+
+		if (var->varattno >= 0)
+		{
+			Assert(var->varattno <= c->n);
+			*(c->mask) = bms_add_member(*(c->mask), var->varattno);
+		}
+
+		return false;
+	}
+	return expression_tree_walker(node, neededColumnContextWalker, (void *) c);
+}
+
+/*
+ * n specifies the number of allowed entries in mask: we use
+ * it for bounds-checking in the walker above.
+ */
+void
+PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols)
+{
+	neededColumnContext c;
+
+	c.mask = scanCols;
+	c.n = n;
+
+	neededColumnContextWalker(expr, &c);
+}
+
+Bitmapset *
+PopulateNeededColumnsForScan(ScanState *scanstate, int ncol)
+{
+	Bitmapset *result = 

Re: pg_receivewal documentation

2019-07-16 Thread Michael Paquier
On Tue, Jul 16, 2019 at 01:03:12PM -0400, Jesper Pedersen wrote:
> Here is the patch for that.

+   
+    Note that while WAL will be flushed with this setting,
+    pg_receivewal never applies it, so
+    synchronous_commit must not be set to
+    remote_apply if pg_receivewal
+    is the only synchronous standby. Similarly, if
+    pg_receivewal is part of a quorum-based
+    set of synchronous standbys, it won't count towards the quorum if
+    synchronous_commit is set to
+    remote_apply.
+   

I think we should really document the caveat with priority-based sets
of standbys as much as quorum-based sets.  For example if a user sets
synchronous_commit = remote_apply in postgresql.conf, and then sets
s_s_names to '2(pg_receivewal, my_connected_standby)' to get a
priority-based set, then you have the same problem, and pg_receivewal
is not the only synchronous standby in this configuration.  The patch
does not cover that case properly.
--
Michael




Re: Parallel Append subplan order instability on aye-aye

2019-07-16 Thread David Rowley
On Wed, 17 Jul 2019 at 07:23, Andres Freund  wrote:
>
> Hi,
>
> On 2019-07-15 21:12:32 -0400, Tom Lane wrote:
> > But I bet that these tables forming
> > an inheritance hierarchy (with multiple inheritance even) does
> > have something to do with it somehow, because if this were a
> > generic VACUUM bug surely we'd be seeing it elsewhere.
>
> It's possible that it's hidden in other cases, because of
>
> void
> table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
>                                    BlockNumber *pages, double *tuples,
>                                    double *allvisfrac,
>                                    Size overhead_bytes_per_tuple,
>                                    Size usable_bytes_per_page)
> ...
>      * If the table has inheritance children, we don't apply this heuristic.
>      * Totally empty parent tables are quite common, so we should be willing
>      * to believe that they are empty.
>      */
>     if (curpages < 10 &&
>         relpages == 0 &&
>         !rel->rd_rel->relhassubclass)
>         curpages = 10;
>
> which'd not make us actually take a relpages=0 into account for tables
> without inheritance.  A lot of these tables never get 10+ pages long, so
> the heuristic would always apply...

Surely it can't be that, since that just determines what *pages gets set to.
Tom mentioned that the following was returning 0 pages and tuples:

-- Temporary hack to investigate whether extra vacuum/analyze is happening
select relname, relpages, reltuples
from pg_class
where relname like '__star' order by relname;
 relname | relpages | reltuples
---------+----------+-----------
 a_star  |        1 |         3

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: SegFault on 9.6.14

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 12:44 PM Thomas Munro  wrote:
> > #11 0x55666e0359df in ExecShutdownNode (node=node@entry=0x55667033a6c8)
> > at 
> > /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:830
> > #12 0x55666e04d0ff in ExecLimit (node=node@entry=0x55667033a428)
> > at 
> > /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeLimit.c:139
>
> https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/executor/nodeLimit.c#L139
>
> Limit thinks it's OK to "shut down" the subtree, but if you shut down a
> Gather node you can't rescan it later because it destroys its shared
> memory.  Oops.  Not sure what to do about that yet.

CCing Amit and Robert, authors of commits 19df1702 and 69de1718.

-- 
Thomas Munro
https://enterprisedb.com




Re: SegFault on 9.6.14

2019-07-16 Thread Jerry Sievers
Thomas Munro  writes:

> On Wed, Jul 17, 2019 at 12:26 PM Jerry Sievers  wrote:
>
>> Is this the right sequencing?
>>
>> 1. Start client and get backend pid
>> 2. GDB;  handle SIGUSR1, break, cont
>> 3. Run query
>> 4. bt
>
> Perfect, thanks.  I think I just spotted something:

Dig that!  Great big thanks to you and Tomas, et al for jumping on this.

Please let know if there's anything else I can submit that would be
helpful.


>
>> #11 0x55666e0359df in ExecShutdownNode (node=node@entry=0x55667033a6c8)
>> at 
>> /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:830
>> #12 0x55666e04d0ff in ExecLimit (node=node@entry=0x55667033a428)
>> at 
>> /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeLimit.c:139
>
> https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/executor/nodeLimit.c#L139
>
> Limit thinks it's OK to "shut down" the subtree, but if you shut down a
> Gather node you can't rescan it later because it destroys its shared
> memory.  Oops.  Not sure what to do about that yet.
>
>
> --
> Thomas Munro
> https://enterprisedb.com
>
>
>

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consult...@comcast.net




Re: SegFault on 9.6.14

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 12:26 PM Jerry Sievers  wrote:
> Is this the right sequencing?
>
> 1. Start client and get backend pid
> 2. GDB;  handle SIGUSR1, break, cont
> 3. Run query
> 4. bt

Perfect, thanks.  I think I just spotted something:

> #11 0x55666e0359df in ExecShutdownNode (node=node@entry=0x55667033a6c8)
> at 
> /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:830
> #12 0x55666e04d0ff in ExecLimit (node=node@entry=0x55667033a428)
> at 
> /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeLimit.c:139

https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/executor/nodeLimit.c#L139

Limit thinks it's OK to "shut down" the subtree, but if you shut down a
Gather node you can't rescan it later because it destroys its shared
memory.  Oops.  Not sure what to do about that yet.


--
Thomas Munro
https://enterprisedb.com




Re: SegFault on 9.6.14

2019-07-16 Thread Jerry Sievers
Thomas Munro  writes:

> On Wed, Jul 17, 2019 at 12:05 PM Jerry Sievers  wrote:
>
>> Program received signal SIGUSR1, User defined signal 1.
>
> Oh, we need to ignore those pesky signals with "handle SIGUSR1 noprint 
> nostop".

Is this the right sequencing?

1. Start client and get backend pid
2. GDB;  handle SIGUSR1, break, cont
3. Run query
4. bt

Thanks

Don't think I am doing this correctly.  Please advise.

handle SIGUSR1 noprint nostop
SignalStop  Print   Pass to program Description
SIGUSR1   NoNo  Yes User defined signal 1
(gdb) break munmap
Breakpoint 1 at 0x7f3d09331740: file ../sysdeps/unix/syscall-template.S, line 
84.
(gdb) cont
Continuing.

Breakpoint 1, munmap () at ../sysdeps/unix/syscall-template.S:84
84  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  munmap () at ../sysdeps/unix/syscall-template.S:84
#1  0x55666e12d7f4 in dsm_impl_posix (impl_private=0x22, elevel=19, 
mapped_size=0x556670205890, mapped_address=0x556670205888, request_size=0, 
handle=, op=DSM_OP_DETACH)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/storage/ipc/dsm_impl.c:259
#2  dsm_impl_op (op=op@entry=DSM_OP_DETACH, handle=, 
request_size=request_size@entry=0, 
impl_private=impl_private@entry=0x556670205880, 
mapped_address=mapped_address@entry=0x556670205888, 
mapped_size=mapped_size@entry=0x556670205890, elevel=19)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/storage/ipc/dsm_impl.c:176
#3  0x55666e12efb1 in dsm_detach (seg=0x556670205860)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/storage/ipc/dsm.c:738
#4  0x55666df31369 in DestroyParallelContext (pcxt=0x556670219b68)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/transam/parallel.c:750
#5  0x55666e0357bb in ExecParallelCleanup (pei=0x7f3d012218b0)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execParallel.c:575
#6  0x55666e047ca2 in ExecShutdownGather (node=node@entry=0x55667033bed0)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeGather.c:443
#7  0x55666e0359f5 in ExecShutdownNode (node=0x55667033bed0)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:820
#8  0x55666e0777e1 in planstate_tree_walker (planstate=0x55667033b2b0, 
walker=0x55666e0359a0 , context=0x0)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/nodes/nodeFuncs.c:3636
#9  0x55666e0777e1 in planstate_tree_walker (planstate=0x55667033b040, 
walker=0x55666e0359a0 , context=0x0)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/nodes/nodeFuncs.c:3636
#10 0x55666e0777e1 in planstate_tree_walker 
(planstate=planstate@entry=0x55667033a6c8, 
walker=walker@entry=0x55666e0359a0 , 
context=context@entry=0x0)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/nodes/nodeFuncs.c:3636
#11 0x55666e0359df in ExecShutdownNode (node=node@entry=0x55667033a6c8)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:830
#12 0x55666e04d0ff in ExecLimit (node=node@entry=0x55667033a428)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeLimit.c:139
#13 0x55666e035d28 in ExecProcNode (node=node@entry=0x55667033a428)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:531
#14 0x55666e051f69 in ExecNestLoop (node=node@entry=0x55667031c660)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeNestloop.c:174
#15 0x55666e035e28 in ExecProcNode (node=node@entry=0x55667031c660)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:476
#16 0x55666e054989 in ExecSort (node=node@entry=0x55667031c3f0)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeSort.c:103
#17 0x55666e035de8 in ExecProcNode (node=0x55667031c3f0)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execProcnode.c:495
#18 0x55666e041fe9 in fetch_input_tuple 
(aggstate=aggstate@entry=0x55667031ba18)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeAgg.c:598
#19 0x55666e043bb3 in agg_retrieve_direct (aggstate=0x55667031ba18)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeAgg.c:2078
#20 ExecAgg (node=node@entry=0x55667031ba18)
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeAgg.c:1903
#21 0x55666e035dc8 in ExecProcNode 

Re: SegFault on 9.6.14

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 12:05 PM Jerry Sievers  wrote:
> Program received signal SIGUSR1, User defined signal 1.

Oh, we need to ignore those pesky signals with "handle SIGUSR1 noprint nostop".

-- 
Thomas Munro
https://enterprisedb.com




Re: SegFault on 9.6.14

2019-07-16 Thread Jerry Sievers
Thomas Munro  writes:

> On Wed, Jul 17, 2019 at 11:33 AM Jerry Sievers  wrote:
>
>>  ->  Nested Loop Left Join  (cost=251621.81..12300177.37 rows=48 
>> width=44)
>>->  Gather  (cost=1001.55..270403.27 rows=48 width=40)
>
>>->  Limit  (cost=250620.25..250620.27 rows=1 width=20)
>
>>->  Gather  (cost=1000.00..250452.00 
>> rows=3706 width=4)
>
> One observation is that it's a rescan a bit like the one in the
> unsuccessful repro attempt I posted, but it has *two* Gather nodes in
> it (and thus two parallel query DSM segments), and only one of them
> should be rescanned, and from the backtrace we see that it is indeed
> the expected one, the one under the Limit operator.  Neither of them
> should be getting unmapped in the leader though and AFAIK nothing
> happening in the workers could cause this effect, the leader would
> have to explicitly unmap the thing AFAIK.
>
> On Wed, Jul 17, 2019 at 11:42 AM Jerry Sievers  wrote:
>> mmap(NULL, 287624, PROT_READ|PROT_WRITE, MAP_SHARED, 124, 0) = 0x7f3d011f2000
>> mmap(NULL, 262504, PROT_READ|PROT_WRITE, MAP_SHARED, 124, 0) = 0x7f3d011b1000
>> munmap(0x7f3d011b1000, 262504)  = 0
>
> Ok, there go our two parallel query DSM segments, and there it is
> being unmapped.  Hmm.  Any chance you could attach a debugger, and
> "break munmap", "cont", and then show us the backtrace "bt" when that
> is reached?

gdb /usr/lib/postgresql/9.6/bin/postgres 21640
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/9.6/bin/postgres...Reading symbols 
from 
/usr/lib/debug/.build-id/04/6f55a5ce6ce05064edfc8feee61c6cb039d296.debug...done.
done.
Attaching to program: /usr/lib/postgresql/9.6/bin/postgres, process 21640
Reading symbols from /usr/lib/x86_64-linux-gnu/libxml2.so.2...Reading symbols 
from 
/usr/lib/debug/.build-id/d3/57ce1dba1fab803eddf48922123ffd0a303676.debug...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libpam.so.0...(no debugging symbols 
found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.0.0...Reading symbols 
from 
/usr/lib/debug/.build-id/ff/69ea60ebe05f2dd689d2b26fc85a73e5fbc3a0.debug...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0...Reading symbols 
from 
/usr/lib/debug/.build-id/15/ffeb43278726b025f020862bf51302822a40ec.debug...done.
done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2...(no 
debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...Reading symbols from 
/usr/lib/debug//lib/x86_64-linux-gnu/librt-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from 
/usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...Reading symbols from 
/usr/lib/debug//lib/x86_64-linux-gnu/libm-2.23.so...done.
done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2...Reading 
symbols from 
/usr/lib/debug/.build-id/38/90d33727391e4a85dc0f819ab0aa29bb5dfc86.debug...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libsystemd.so.0...(no debugging 
symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from 
/usr/lib/debug//lib/x86_64-linux-gnu/libc-2.23.so...done.
done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libicuuc.so.55...Reading symbols 
from 
/usr/lib/debug/.build-id/46/3d8b610702d64ae0803c7dfcaa02cfb4c6477b.debug...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...Reading symbols from 
/usr/lib/debug/.build-id/8d/9bd4ce26e45ef16075c67d5f5eeafd8b562832.debug...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/liblzma.so.5...(no debugging symbols 
found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libaudit.so.1...(no debugging 
symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libkrb5.so.3...(no debugging 
symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libk5crypto.so.3...(no debugging 
symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libcom_err.so.2...Reading symbols 
from /usr/lib/debug//lib/x86_64-linux-gnu/libcom_err.so.2.1...done.
done.
Reading symbols from 

Re: SegFault on 9.6.14

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 11:33 AM Jerry Sievers  wrote:
>  ->  Nested Loop Left Join  (cost=251621.81..12300177.37 rows=48 
> width=44)
>->  Gather  (cost=1001.55..270403.27 rows=48 width=40)

>->  Limit  (cost=250620.25..250620.27 rows=1 width=20)

>->  Gather  (cost=1000.00..250452.00 
> rows=3706 width=4)

One observation is that it's a rescan a bit like the one in the
unsuccessful repro attempt I posted, but it has *two* Gather nodes in
it (and thus two parallel query DSM segments), and only one of them
should be rescanned, and from the backtrace we see that it is indeed
the expected one, the one under the Limit operator.  Neither of them
should be getting unmapped in the leader though and AFAIK nothing
happening in the workers could cause this effect, the leader would
have to explicitly unmap the thing AFAIK.

On Wed, Jul 17, 2019 at 11:42 AM Jerry Sievers  wrote:
> mmap(NULL, 287624, PROT_READ|PROT_WRITE, MAP_SHARED, 124, 0) = 0x7f3d011f2000
> mmap(NULL, 262504, PROT_READ|PROT_WRITE, MAP_SHARED, 124, 0) = 0x7f3d011b1000
> munmap(0x7f3d011b1000, 262504)  = 0

Ok, there go our two parallel query DSM segments, and there it is
being unmapped.  Hmm.  Any chance you could attach a debugger, and
"break munmap", "cont", and then show us the backtrace "bt" when that
is reached?




--
Thomas Munro
https://enterprisedb.com




Re: SegFault on 9.6.14

2019-07-16 Thread Jerry Sievers
Thomas Munro  writes:

> On Wed, Jul 17, 2019 at 11:06 AM Jerry Sievers  wrote:
>
>> (gdb) p *scan->rs_parallel
>> Cannot access memory at address 0x7fa673a54108
>
> So I guess one question is: was it a valid address that's been
> unexpectedly unmapped, or is the pointer corrupted?  Any chance you
> can strace the backend and pull out the map, unmap calls?

There were about 60k lines from strace including these few...


mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f3d0127a000
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f3d01239000
mmap(NULL, 287624, PROT_READ|PROT_WRITE, MAP_SHARED, 124, 0) = 0x7f3d011f2000
mmap(NULL, 262504, PROT_READ|PROT_WRITE, MAP_SHARED, 124, 0) = 0x7f3d011b1000
munmap(0x7f3d011b1000, 262504)  = 0

Thx





-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consult...@comcast.net




Re: SegFault on 9.6.14

2019-07-16 Thread Jerry Sievers
Thomas Munro  writes:

> On Wed, Jul 17, 2019 at 11:06 AM Jerry Sievers  wrote:
>
>> (gdb) p *scan->rs_parallel
>> Cannot access memory at address 0x7fa673a54108
>
> So I guess one question is: was it a valid address that's been
> unexpectedly unmapped, or is the pointer corrupted?  Any chance you
> can strace the backend and pull out the map, unmap calls?

I'll dig further.

Here is a sanitized look at the query and explain plan...

The segfault happens $immediately upon issuance of the query.





begin;

-- This setting makes the segfault go away
--set local max_parallel_workers_per_gather to 0;

explain
select v.account_id, COUNT(cnt.clicks), te.description, l.product_id
from thing3.thing10 te
join thing3.thing9 pv on pv.page_view_id = te.page_view_id
join thing3.thing11 v on v.visit_id = pv.visit_id
left join thing6.thing12 l on v.account_id = l.account_id
left join lateral (
    select MAX(v.visit_id)
          ,COUNT(*) as clicks
    from thing3.thing10 te
    join thing3.thing9 pv on pv.page_view_id = te.page_view_id
    join thing3.thing11 v on v.visit_id = pv.visit_id
    where te.description in ('thing7', 'thing8')
      and v.account_id = l.account_id
    GROUP BY v.account_id, v.visit_id
    order by v.account_id, v.visit_id desc
    limit 1
) cnt on true
where (te.description in ('thing4', 'thing5')
       or te.description like '%auto%')
  and te.created_at > '2019-06-24 00:00:00'
--and l.loan_status_id in (5,6)
group by v.account_id, te.description, l.product_id;

abort;
BEGIN

QUERY PLAN  

   
---
 GroupAggregate  (cost=12300178.71..12300179.79 rows=48 width=44)
   Group Key: v.account_id, te.description, l.product_id
   ->  Sort  (cost=12300178.71..12300178.83 rows=48 width=44)
 Sort Key: v.account_id, te.description, l.product_id
 ->  Nested Loop Left Join  (cost=251621.81..12300177.37 rows=48 
width=44)
   ->  Gather  (cost=1001.55..270403.27 rows=48 width=40)
 Workers Planned: 3
 ->  Nested Loop Left Join  (cost=1.56..269398.47 rows=15 
width=40)
   ->  Nested Loop  (cost=1.13..269391.71 rows=14 
width=32)
 ->  Nested Loop  (cost=0.57..269368.66 rows=39 
width=32)
   ->  Parallel Seq Scan on thing10 te  
(cost=0.00..269228.36 rows=39 width=32)
 Filter: ((created_at > '2019-06-24 
00:00:00'::timestamp without time zone) AND (((description)::text = ANY 
('{thing4,thing5}'::text[])) OR ((description)::text ~~ '%auto%'::text)))
   ->  Index Scan using page_views_pkey on 
thing9 pv  (cost=0.57..3.59 rows=1 width=8)
 Index Cond: (page_view_id = 
te.page_view_id)
 ->  Index Scan using visits_pkey on thing11 v  
(cost=0.56..0.58 rows=1 width=8)
   Index Cond: (visit_id = pv.visit_id)
   ->  Index Scan using index_loans_on_account_id on 
thing12 l  (cost=0.42..0.46 rows=2 width=8)
 Index Cond: (v.account_id = account_id)
   ->  Limit  (cost=250620.25..250620.27 rows=1 width=20)
 ->  GroupAggregate  (cost=250620.25..250620.27 rows=1 
width=20)
   Group Key: v_1.visit_id
   ->  Sort  (cost=250620.25..250620.26 rows=1 width=8)
 Sort Key: v_1.visit_id DESC
 ->  Hash Join  (cost=1154.34..250620.24 rows=1 
width=8)
   Hash Cond: (te_1.page_view_id = 
pv_1.page_view_id)
   ->  Gather  (cost=1000.00..250452.00 
rows=3706 width=4)
 Workers Planned: 3
 ->  Parallel Seq Scan on thing10 
te_1  (cost=0.00..249081.40 rows=1195 width=4)
   Filter: ((description)::text 
= ANY ('{thing7,thing8}'::text[]))
   ->  Hash  (cost=152.85..152.85 rows=119 
width=12)
 ->  Nested Loop  
(cost=1.01..152.85 rows=119 width=12)
   ->  Index Scan using 
index_visits_on_account_id on thing11 v_1  (cost=0.43..15.63 rows=18 

Re: A little report on informal commit tag usage

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-16 19:26:59 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > They don't preclude each other though. E.g. it'd be sensible to have both
> 
> >> Per gripe from Ken Tanzer.  Back-patch to 9.6.  The issue exists
> >> further back, but before 9.6 the code looks very different and it
> >> doesn't actually know whether the "var" name matches anything,
> >> so I desisted from trying to fix it.
> 
> > and "Backpatch: 9.6-" or such.
> 
> I've wondered for some time what you think the "-" means in this.

Up to master. Occasionally there's bugs that only need to be fixed in
some back branches etc.

Greetings,

Andres Freund




Re: A little report on informal commit tag usage

2019-07-16 Thread Tom Lane
Andres Freund  writes:
> They don't preclude each other though. E.g. it'd be sensible to have both

>> Per gripe from Ken Tanzer.  Back-patch to 9.6.  The issue exists
>> further back, but before 9.6 the code looks very different and it
>> doesn't actually know whether the "var" name matches anything,
>> so I desisted from trying to fix it.

> and "Backpatch: 9.6-" or such.

I've wondered for some time what you think the "-" means in this.

regards, tom lane




Re: A little report on informal commit tag usage

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-16 10:33:06 -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > As mentioned on different threads, "Discussion" is the only one we had
> > a strong agreement with.  Could it be possible to consider things like
> > Author, Reported-by, Reviewed-by or Backpatch-through for example and
> > extend to that?  The first three ones are useful for parsing the
> > commit logs.  The fourth one is handy so as there is no need to look
> > at a full log tree with git log --graph or such, which is something I
> > do from time to time to guess down to where a fix has been applied (I
> > tend to avoid git_changelog).
> 
> FWIW, I'm one of the people who prefer prose for this.  The backpatching
> bit is a good example of why, because my log messages typically don't
> just say "backpatch to 9.6" but something about why that was the cutoff.

They don't preclude each other though. E.g. it'd be sensible to have both

> Per gripe from Ken Tanzer.  Back-patch to 9.6.  The issue exists
> further back, but before 9.6 the code looks very different and it
> doesn't actually know whether the "var" name matches anything,
> so I desisted from trying to fix it.

and "Backpatch: 9.6-" or such.


> I am in favor of trying to consistently mention that a patch is being
> back-patched, rather than expecting people to rely on git metadata
> to find that out.  But I don't see that a rigid "Backpatch" tag format
> makes anything easier there.  If you need to know that mechanically,
> git_changelog is way more reliable.

I find it useful to have a quick place to scan in a commit message. It's
a lot quicker to focus on the last few lines with tags, and see a
'Backpatch: 9.6-' than to parse a potentially long commit message. If
I'm then still interested in the commit, I'll then read the commit.

Greetings,

Andres Freund




Re: SegFault on 9.6.14

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 11:11 AM Thomas Munro  wrote:
> map, unmap

mmap, munmap

-- 
Thomas Munro
https://enterprisedb.com




Re: SegFault on 9.6.14

2019-07-16 Thread Thomas Munro
On Wed, Jul 17, 2019 at 11:06 AM Jerry Sievers  wrote:
> (gdb) p *scan->rs_parallel
> Cannot access memory at address 0x7fa673a54108

So I guess one question is: was it a valid address that's been
unexpectedly unmapped, or is the pointer corrupted?  Any chance you
can strace the backend and pull out the map, unmap calls?

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: converting Lists into arrays

2019-07-16 Thread Tom Lane
I wrote:
> * Look at places using lcons/list_delete_first to maintain FIFO lists.
> The patch makes these O(N^2) for long lists.  If we can reverse the list
> order and use lappend/list_truncate instead, it'd be better.  Possibly in
> some places the list ordering is critical enough to make this impractical,
> but I suspect it's an easy win in most.

Attached are two patches that touch all the places where it seemed like
an easy win to stop using lcons and/or list_delete_first.

0001 adds list_delete_last() as a mirror image to list_delete_first(),
and changes all the places where it seemed 100% safe to do so (ie,
there's no behavioral change because the list order is demonstrably
immaterial).
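
A minimal sketch of the idea (the attached 0001 has the actual
implementation, which may differ in details):

List *
list_delete_last(List *list)
{
	check_list_invariants(list);

	if (list == NIL)
		return NIL;

	/*
	 * list_truncate doesn't free the List header when the list becomes
	 * empty, so handle the single-element case specially to preserve the
	 * "empty list is NIL" convention.
	 */
	if (list_length(list) <= 1)
	{
		list_free(list);
		return NIL;
	}

	return list_truncate(list, list_length(list) - 1);
}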

0002 changes some additional places where it's maybe a bit less safe,
ie there's a potential for user-visible behavioral change because
processing will occur in a different order.  In particular, the proposed
change in execExpr.c causes aggregates and window functions that are in
the same plan node to be executed in a different order than before ---
but it seems to me that this order is saner.  (Note the change in the
expected regression results, in a test that's intentionally sensitive to
the execution order.)  And anyway when did we guarantee anything about
that?

I refrained from changing lcons to lappend in get_relation_info, because
that demonstrably causes the planner to change its choices when two
indexes look equally attractive, and probably people would complain
about that.  I think that the other changes proposed in 0002 are pretty
harmless --- for example, in get_tables_to_cluster the order depends
initially on the results of a seqscan of pg_index, so anybody who's
expecting stability is in for rude surprises anyhow.  Also, the proposed
changes in plancat.c, parse_agg.c, selfuncs.c almost certainly have no
user-visible effect, but maybe there could be changes at the
roundoff-error level due to processing estimates in a different order?

There are a bunch of places that are using list_delete_first to remove
the next-to-process entry from a List used as a queue.  In principle,
we could invert the order of those queues and then use list_delete_last,
but I thought this would probably be too confusing: it's natural to
think of the front of the list as being the head of the queue.  I doubt
that any of those queues get long enough for it to be a serious
performance problem to leave them as-is.

(Actually, I doubt that any of these changes will really move the
performance needle in the real world.  It's more a case of wanting
the code to present good examples not bad ones.)

Thoughts?  Anybody want to object to any of the changes in 0002?

regards, tom lane

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index dfb51f6..169bf6f 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1323,8 +1323,6 @@ static void
 gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 GISTSTATE *giststate, List *splitinfo, bool unlockbuf)
 {
-	ListCell   *lc;
-	List	   *reversed;
 	GISTPageSplitInfo *right;
 	GISTPageSplitInfo *left;
 	IndexTuple	tuples[2];
@@ -1339,14 +1337,6 @@ gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 	 * left. Finally insert the downlink for the last new page and update the
 	 * downlink for the original page as one operation.
 	 */
-
-	/* for convenience, create a copy of the list in reverse order */
-	reversed = NIL;
-	foreach(lc, splitinfo)
-	{
-		reversed = lcons(lfirst(lc), reversed);
-	}
-
 	LockBuffer(stack->parent->buffer, GIST_EXCLUSIVE);
 	gistFindCorrectParent(state->r, stack);
 
@@ -1354,10 +1344,10 @@ gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 	 * insert downlinks for the siblings from right to left, until there are
 	 * only two siblings left.
 	 */
-	while (list_length(reversed) > 2)
+	for (int pos = list_length(splitinfo) - 1; pos > 1; pos--)
 	{
-		right = (GISTPageSplitInfo *) linitial(reversed);
-		left = (GISTPageSplitInfo *) lsecond(reversed);
+		right = (GISTPageSplitInfo *) list_nth(splitinfo, pos);
+		left = (GISTPageSplitInfo *) list_nth(splitinfo, pos - 1);
 
 		if (gistinserttuples(state, stack->parent, giststate,
 			 &right->downlink, 1,
@@ -1371,11 +1361,10 @@ gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 			gistFindCorrectParent(state->r, stack);
 		}
 		/* gistinserttuples() released the lock on right->buf. */
-		reversed = list_delete_first(reversed);
 	}
 
-	right = (GISTPageSplitInfo *) linitial(reversed);
-	left = (GISTPageSplitInfo *) lsecond(reversed);
+	right = (GISTPageSplitInfo *) lsecond(splitinfo);
+	left = (GISTPageSplitInfo *) linitial(splitinfo);
 
 	/*
 	 * Finally insert downlink for the remaining right page and update the
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 6c3ff76..032fab9 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -633,7 +633,7 @@ 

Re: SegFault on 9.6.14

2019-07-16 Thread Jerry Sievers
Tomas Vondra  writes:

> On Mon, Jul 15, 2019 at 08:20:00PM -0500, Jerry Sievers wrote:
>
>>Tomas Vondra  writes:
>>
>>> On Mon, Jul 15, 2019 at 07:22:55PM -0500, Jerry Sievers wrote:
>>>
Tomas Vondra  writes:

> On Mon, Jul 15, 2019 at 06:48:05PM -0500, Jerry Sievers wrote:
>
>>Greetings Hackers.
>>
>>We have a reproducible case of $subject that issues a backtrace such as
>>seen below.
>>
>>The query that I'd prefer to sanitize before sending is <30 lines of at
>>a glance, not terribly complex logic.
>>
>>It nonetheless dies hard after a few seconds of running and as expected,
>>results in an automatic all-backend restart.
>>
>>Please advise on how to proceed.  Thanks!
>>
>>bt
>>#0  initscan (scan=scan@entry=0x55d7a7daa0b0, key=0x0, 
>>keep_startblock=keep_startblock@entry=1 '\001')
>>at 
>> /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:233
>>#1  0x55d7a72fa8d0 in heap_rescan (scan=0x55d7a7daa0b0, 
>>key=key@entry=0x0) at 
>>/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:1529
>>#2  0x55d7a7451fef in ExecReScanSeqScan 
>>(node=node@entry=0x55d7a7d85100) at 
>>/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeSeqscan.c:280
>>#3  0x55d7a742d36e in ExecReScan (node=0x55d7a7d85100) at 
>>/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:158
>>#4  0x55d7a7445d38 in ExecReScanGather 
>>(node=node@entry=0x55d7a7d84d30) at 
>>/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeGather.c:475
>>#5  0x55d7a742d255 in ExecReScan (node=0x55d7a7d84d30) at 
>>/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:166
>>#6  0x55d7a7448673 in ExecReScanHashJoin 
>>(node=node@entry=0x55d7a7d84110) at 
>>/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeHashjoin.c:1019
>>#7  0x55d7a742d29e in ExecReScan (node=node@entry=0x55d7a7d84110) at 
>>/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:226
>>
>>
>
> Hmmm, that means it's crashing here:
>
>if (scan->rs_parallel != NULL)
>scan->rs_nblocks = scan->rs_parallel->phs_nblocks; <--- here
>else
>scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
>
> But clearly, scan is valid (otherwise it'd crash on the if condition),
> and scan->rs_parallel must me non-NULL. Which probably means the pointer
> is (no longer) valid.
>
> Could it be that the rs_parallel DSM disappears on rescan, or something
> like that?

No clue, but something I just tried was to disable parallelism by setting
max_parallel_workers_per_gather to 0; while the query has not
finished after a few minutes, there is no crash.

>>>
>>> That might be a hint my rough analysis was somewhat correct. The
>>> question is whether the non-parallel plan does the same thing. Maybe it
>>> picks a plan that does not require rescans, or something like that.
>>>
Please advise.

>>>
>>> It would be useful to see (a) exacution plan of the query, (b) full
>>> backtrace and (c) a bit of context for the place where it crashed.
>>>
>>> Something like (in gdb):
>>>
>>>bt full
>>>list
>>>p *scan
>>
>>The p *scan did nothing unless I ran it first; however, my gdb $foo isn't
>>strong presently.
>
> Hmm, the rs_parallel pointer looks sane (it's not obvious garbage). Can
> you try this?
>
>   p *scan->rs_parallel


$ gdb /usr/lib/postgresql/9.6/bin/postgres core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/9.6/bin/postgres...Reading symbols 
from 
/usr/lib/debug/.build-id/04/6f55a5ce6ce05064edfc8feee61c6cb039d296.debug...done.
done.
[New LWP 31654]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: foo_eis_segfault: jsievers staging 
10.220.22.26(57948) SELECT   '.
Program terminated 

Re: Allow simplehash to use already-calculated hash values

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-16 15:20:33 -0700, Jeff Davis wrote:
> The attached small patch adds new entry points to simplehash.h that
> allow the caller to pass in the already-calculated hash value, so that
> simplehash doesn't need to recalculate it.
> 
> This is helpful for Memory-Bounded Hash Aggregation[1], which uses the
> hash value for multiple purposes. For instance, if the hash table is
> full and the group is not already present in the hash table, it needs
> to spill the tuple to disk. In that case, it would use the hash value
> for the initial lookup, then to select the right spill partition.
> Later, when it processes the batch, it will again need the same hash
> value to perform a lookup. By separating the hash value calculation
> from where it's used, we can avoid needlessly recalculating it for each
> of these steps.

Makes sense to me.



> In theory, this could add overhead for "SH_SCOPE extern" for callers
> not specifying their own hash value, because it adds an extra external
> function call. I looked at the generated LLVM and it's a simple tail
> call, and I looked at the generated assembly and it's just an extra
> jmp.

How does it look for gcc? And was that with LTO enabled or not?

Is that still true when the hashtable is defined in a shared library, or
when you compile postgres as a PIE executable? I'm not sure that
compilers can optimize the external function call at least in the former
case, because the typical function resolution rules IIRC mean that
references to extern functions could be resolved to definitions in other
translation units, *even if* there's a definition in the same TU.

ISTM that it'd be best to just have a static inline helper function
employed by both the hash-passing and the "traditional" insertion routines?
Then that problem ought to not exist anymore.
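
Roughly like this (sketch only, the _internal name is invented):

#define SH_INSERT_HASH_INTERNAL SH_MAKE_NAME(insert_hash_internal)

static inline SH_ELEMENT_TYPE *
SH_INSERT_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash, bool *found)
{
	/* ... the existing insertion logic, unchanged ... */
}

SH_SCOPE SH_ELEMENT_TYPE *
SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found)
{
	return SH_INSERT_HASH_INTERNAL(tb, key, SH_HASH_KEY(tb, key), found);
}

SH_SCOPE SH_ELEMENT_TYPE *
SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash, bool *found)
{
	return SH_INSERT_HASH_INTERNAL(tb, key, hash, found);
}

That way the compiler always sees the body in the same translation unit,
and the same treatment would apply to the lookup path.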


Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-15 12:26:21 -0400, Robert Haas wrote:
> Yeah. I didn't understand that explanation.  It seems to me that one
> of the fundamental design questions for this system is whether we
> should allow there to be an unbounded number of transactions that are
> pending undo application, or whether it's OK to enforce a hard limit.
> Either way, there should certainly be pressure applied to try to keep
> the number low, like forcing undo application into the foreground when
> a backlog is accumulating, but the question is what to do when that's
> insufficient.  My original idea was that we should not have a hard
> limit, in which case the shared memory data on what is pending might
> be incomplete, in which case we would need the discard workers to
> discover transactions needing undo and add them to the shared memory
> data structures, and if those structures are full, then we'd just skip
> adding those details and rediscover those transactions again at some
> future point.
>
> But, my understanding of the current design being implemented is that
> there is a hard limit on the number of transactions that can be
> pending undo and the in-memory data structures are sized accordingly.

My understanding is that that's really just an outcome of needing to
maintain oldestXidHavingUndo accurately, right? I know I asked this
before, but I didn't feel like the answer was that clear (probably due
to my own haziness). To me it seems very important to understand whether
/ how much we can separate the queuing/worker logic from the question of
how to maintain oldestXidHavingUndo.


> In such a system, we cannot rely on the discard worker(s) to
> (re)discover transactions that need undo, because if there can be
> transactions that need undo that we don't know about, then we can't
> enforce a hard limit correctly.  The exception, I suppose, is that
> after a crash, we'll need to scan all the undo logs and figure out
> which transactions are pending, but that doesn't preclude using a
> single queue entry covering both the logged and the unlogged portion
> of a transaction that has written undo of both kinds.  We've got to
> scan all of the undo logs before we allow any new undo-using
> transactions to start, and so we can create one fully-up-to-date entry
> that reflects the data for both persistence levels before any
> concurrent activity happens.

Yea, that seems like a question independent of the "completeness"
requirement. If desirable, it seems trivial to either have
RollbackHashEntry have per-persistence level status (for one entry per
xid), or not (for per-persistence entries).
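
E.g. (rough sketch, most field names invented):

/* (a) one entry per xid, with per-persistence-level state inside: */
typedef struct RollbackHashEntry
{
	TransactionId xid;			/* hash key */
	/* one slot per persistence level (permanent, unlogged) */
	UndoRecPtr	start_urec_ptr[2];
	UndoRecPtr	end_urec_ptr[2];
	bool		valid[2];
} RollbackHashEntry;

/* (b) or one entry per (xid, persistence level) pair: */
typedef struct RollbackHashKey
{
	TransactionId xid;
	char		persistence;	/* RELPERSISTENCE_PERMANENT / _UNLOGGED */
} RollbackHashKey;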

Greetings,

Andres Freund




Allow simplehash to use already-calculated hash values

2019-07-16 Thread Jeff Davis
The attached small patch adds new entry points to simplehash.h that
allow the caller to pass in the already-calculated hash value, so that
simplehash doesn't need to recalculate it.

This is helpful for Memory-Bounded Hash Aggregation[1], which uses the
hash value for multiple purposes. For instance, if the hash table is
full and the group is not already present in the hash table, it needs
to spill the tuple to disk. In that case, it would use the hash value
for the initial lookup, then to select the right spill partition.
Later, when it processes the batch, it will again need the same hash
value to perform a lookup. By separating the hash value calculation
from where it's used, we can avoid needlessly recalculating it for each
of these steps.
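
To illustrate, the intended calling pattern looks roughly like the
following (all of the helper and variable names here are placeholders,
not actual executor functions):

	uint32		hash = compute_group_hash(hashtable, inputslot); /* once */
	GroupEntry *entry = groupht_lookup_hash(hashtable, inputslot, hash);
	bool		isnew;

	if (entry == NULL)
	{
		if (hashtable_has_space(hashtable))
			entry = groupht_insert_hash(hashtable, inputslot, hash, &isnew);
		else
		{
			/*
			 * No room: spill, reusing the hash to pick the partition and
			 * writing it out so it needn't be recomputed for the batch.
			 */
			int			partition = hash % num_spill_partitions;

			spill_tuple(spill_files[partition], inputslot, hash);
		}
	}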

There is already an option for simplehash to cache the calculated hash
value and return it with the entry, but that doesn't quite fit the
need. The hash value is needed in cases where the lookup fails, because
that is when the tuple must be spilled; but if the lookup fails, it
returns NULL, discarding the calculated hash value.

I am including this patch separately from Hash Aggregation because it
is a small and independently-reviewable change.

In theory, this could add overhead for "SH_SCOPE extern" for callers
not specifying their own hash value, because it adds an extra external
function call. I looked at the generated LLVM and it's a simple tail
call, and I looked at the generated assembly and it's just an extra
jmp. I tested by doing a hash aggregation of 30M zeroes, which should
exercise that path a lot, and I didn't see any difference. Also, once
we actually use this for hash aggregation, there will be no "SH_SCOPE
extern" callers that don't specify the hash value anyway.

Regards,
Jeff Davis

[1] 
https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel%40j-davis.com
From ac4877d1e7b64da6caead189b85813fb5fed81a0 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 16 Jul 2019 12:07:37 -0700
Subject: [PATCH] Allow simplehash to use already-calculated hash values.

Add _lookup_hash and _insert_hash functions for callers that have
already calculated the hash value of the key.

This is intended for use with hash algorithms that write to disk in
partitions. The hash value can be calculated once, used to perform a
lookup, used to select the partition, then written to the partition
along with the tuple. When the tuple is read back, the hash value does
not need to be recalculated.
---
 src/include/lib/simplehash.h | 36 ++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 5c6bd93bc7..065062ee96 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -74,8 +74,10 @@
 #define SH_DESTROY SH_MAKE_NAME(destroy)
 #define SH_RESET SH_MAKE_NAME(reset)
 #define SH_INSERT SH_MAKE_NAME(insert)
+#define SH_INSERT_HASH SH_MAKE_NAME(insert_hash)
 #define SH_DELETE SH_MAKE_NAME(delete)
 #define SH_LOOKUP SH_MAKE_NAME(lookup)
+#define SH_LOOKUP_HASH SH_MAKE_NAME(lookup_hash)
 #define SH_GROW SH_MAKE_NAME(grow)
 #define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
 #define SH_START_ITERATE_AT SH_MAKE_NAME(start_iterate_at)
@@ -144,7 +146,11 @@ SH_SCOPE void SH_DESTROY(SH_TYPE * tb);
 SH_SCOPE void SH_RESET(SH_TYPE * tb);
 SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
 SH_SCOPE	SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found);
+SH_SCOPE	SH_ELEMENT_TYPE *SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
+			uint32 hash, bool *found);
 SH_SCOPE	SH_ELEMENT_TYPE *SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key);
+SH_SCOPE	SH_ELEMENT_TYPE *SH_LOOKUP_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
+			uint32 hash);
 SH_SCOPE bool SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key);
 SH_SCOPE void SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
 SH_SCOPE void SH_START_ITERATE_AT(SH_TYPE * tb, SH_ITERATOR * iter, uint32 at);
@@ -499,7 +505,19 @@ SH_GROW(SH_TYPE * tb, uint32 newsize)
 SH_SCOPE	SH_ELEMENT_TYPE *
 SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found)
 {
-	uint32		hash = SH_HASH_KEY(tb, key);
+	uint32 hash = SH_HASH_KEY(tb, key);
+
+	return SH_INSERT_HASH(tb, key, hash, found);
+}
+
+/*
+ * Insert the key key into the hash-table using an already-calculated
+ * hash. Set *found to true if the key already exists, false
+ * otherwise. Returns the hash-table entry in either case.
+ */
+SH_SCOPE	SH_ELEMENT_TYPE *
+SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash, bool *found)
+{
 	uint32		startelem;
 	uint32		curelem;
 	SH_ELEMENT_TYPE *data;
@@ -669,7 +687,19 @@ restart:
 SH_SCOPE	SH_ELEMENT_TYPE *
 SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key)
 {
-	uint32		hash = SH_HASH_KEY(tb, key);
+	uint32 hash = SH_HASH_KEY(tb, key);
+
+	return SH_LOOKUP_HASH(tb, key, hash);
+}
+
+/*
+ * Lookup up entry in hash table using an already-calculated hash.
+ *
+ * Returns NULL if key not present.
+ */
+SH_SCOPE	SH_ELEMENT_TYPE 

Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Thomas Munro
On Tue, Jul 16, 2019 at 11:33 PM Dilip Kumar  wrote:
> On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro  wrote:
> /* If we discarded everything, the slot can be given up. */
> + if (entirely_discarded)
> + free_undo_log_slot(slot);
>
> I have noticed that when the undo log was detached and it was full
> then if we discard complete log we release its slot.  But, what is
> bothering me is should we add that log to the free list?  Or I am
> missing something?

Stepping back a bit:  The free lists are for undo logs that someone
might want to attach to and insert into.  If it's full, we probably
can't insert anything into it again (well, technically someone else
who wants to insert something a bit smaller might be able to, but
that's not an interesting case to worry about).

So it doesn't need to go back on a free list, but it still needs to
exist (= occupy a slot) as long as there is undiscarded data in it,
because that data is needed and we need to be able to test URPs
against its discard pointer.  But once its data is entirely discarded,
it ceases to exist -- there is no reason to waste a slot on it, and
any URP in this undo log will be considered to be discarded (because
we can't find a slot, and we also cache that fact in recent_discard so
lookups are fast and lock-free), and therefore it'll not be
checkpointed or reloaded at next startup.  At that point we couldn't
put it on a free list even if we wanted to, because there is nothing
left of it ("logs" don't really exist in memory, only "slots",
currently holding the meta-data for a log, which is why I renamed
UndoLog to UndoLogSlot to reduce confusion on that point).

One of the goals here is to make a system that doesn't require an
increasing amount of memory as time goes on -- hence the desire to
completely remove state relating to entirely discarded undo logs.
(You might point out that the recent_discard cache would get
arbitrarily large after we chew through millions of undo logs, but
there is another defence against that in the form of low_logno, which
isn't used in that test yet but could be used to minimise that
effect.)  Does this make sense, and do you see a problem?
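
In rough pseudo-code, the discard test described above is something
like this (simplified; a couple of these helper names don't literally
exist in the patch in this form):

bool
UndoRecPtrIsDiscarded(UndoRecPtr urp)
{
	UndoLogNumber	logno = UndoRecPtrGetLogNo(urp);
	UndoLogSlot	   *slot;

	/* fast, lock-free path using the recent_discard cache */
	if (urp < recent_discard_for(logno))
		return true;

	/* no slot left means the entire log has been discarded */
	slot = find_undo_log_slot(logno, false);
	if (slot == NULL)
		return true;

	/* otherwise compare against the slot's current discard pointer */
	return urp < slot_discard_pointer(slot);
}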

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-13 15:55:51 +0530, Amit Kapila wrote:
> On Fri, Jul 12, 2019 at 7:08 PM Robert Haas  wrote:
> > > I think even if we currently go with a binary heap, it will be
> > > possible to change it to rbtree later, but I am fine either way.
> >
> > Well, I don't see much point in revising all of this logic twice. We
> > should pick the way we want it to work and make it work that way.
> >
>
> Yeah, I agree.  So, I am assuming here that as you have discussed this
> idea with Andres offlist, he is on board with changing it as he has
> originally suggested using binary_heap.  Andres, do let us know if you
> think differently here.  It would be good if anyone else following the
> thread can also weigh in.

Yes, I think using an rbtree makes sense.

I'm not yet sure whether we'd want the rbtree nodes being pointed to
directly by the hashtable, or whether we'd want one indirection.

e.g. either something like:


typedef struct UndoWorkerQueue
{
/* priority ordered tree */
RBTree *tree;

}

typedef struct UndoWorkerQueueEntry
{
 RBTNode tree_node;

 /*
  * Reference hashtable via key, not pointers, entries might be
  * moved.
  */
 RollbackHashKey rollback_key;
 ...
} UndoWorkerQueueEntry;

typedef struct RollbackHashEntry
{
 ...
 UndoWorkerQueueEntry *queue_memb_size;
 UndoWorkerQueueEntry *queue_memb_age;
 UndoWorkerQueueEntry *queue_memb_error;
}

and call rbt_delete() for any non-NULL queue_memb_* whenever an entry is
dequeued via one of the queues (after setting the one already dequeued
from to NULL, of course).  Which requires - as Robert mentioned - that
rbtree pointers remain stable after insertions.
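
I.e. roughly (helper and queue handle names invented):

static void
RemoveFromOtherQueues(RollbackHashEntry *ent)
{
	/* caller has already NULLed the member for the queue it dequeued from */
	if (ent->queue_memb_size != NULL)
		rbt_delete(size_queue->tree, &ent->queue_memb_size->tree_node);
	if (ent->queue_memb_age != NULL)
		rbt_delete(age_queue->tree, &ent->queue_memb_age->tree_node);
	if (ent->queue_memb_error != NULL)
		rbt_delete(error_queue->tree, &ent->queue_memb_error->tree_node);

	ent->queue_memb_size = NULL;
	ent->queue_memb_age = NULL;
	ent->queue_memb_error = NULL;
}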


Alternatively we can have a more complicated arrangement without the
"stable pointer" requirement (which'd also similarly work for a binary
heap):

typedef struct UndoWorkerQueue
{
/* information about work needed, not meaningfully ordered */
UndoWorkerQueueEntry *entries;

/* priority ordered references into ->entries */
RBTree *tree;

/* freelist of currently unused ->entries members */
slist_head freelist;

/*
 * Number of entries in ->entries and tree that can be pruned by
 * doing a scan of both.
 */
int num_prunable_entries;
}

typedef struct UndoWorkerQueueEntry
{
 /*
  * Reference hashtable via key, not pointers, entries might be
  * moved.
  */
 RollbackHashKey rollback_key;

 /*
  * As members of UndoWorkerQueue->tree can be moved in memory,
  * RollbackHashEntry cannot directly point to them. Instead,
  * already_processed below marks entries whose work has already been
  * done via another queue.
  */
 bool already_processed;
 ...

 slist_node freelist_node;
} UndoWorkerQueueEntry;


typedef struct UndoWorkerQueueTreeEntry
{
RBTNode tree_node;

/* offset into UndoWorkerQueue->entries */
int off;
} UndoWorkerQueueTreeEntry;

and again

typedef struct RollbackHashEntry
{
 RBTNode tree_node;
 ...
 UndoWorkerQueueEntry *queue_memb_size;
 UndoWorkerQueueEntry *queue_memb_age;
 UndoWorkerQueueEntry *queue_memb_error;
}


Because the tree entries are not members of the tree itself, pointers to
them would be stable, regardless of rbtree (or binary heap) moving them
around.   The cost of that would be more complicated datastructures, and
insertion/deletion/dequeing operations:

insertion:
if (slist_is_empty(&queue->freelist))
   prune();
if (slist_is_empty(&queue->freelist))
   elog(ERROR, "full")

UndoWorkerQueueEntry *entry = slist_pop_head_node(&queue->freelist)
UndoWorkerQueueTreeEntry tree_entry;

entry->already_processed = false;
entry->... = ...;

tree_entry.off = entry - queue->entries; // calculate offset
rbt_insert(queue->tree, &tree_entry, NULL);


prune:
if (queue->num_prunable_entries > 0)
RBTreeIterator iter;
slist_node *pending_freelist;
rbt_begin_iterate(queue->tree, &iter, LeftRightWalk);
while ((tnode = rbt_iterate(&iter)) != 0)
node = (UndoWorkerQueueTreeEntry *) tnode;
if (queue->entries[node->off].already_processed)
rbt_delete(queue->tree, tnode);
/* XXX: Have to stop here, the iterator is invalid -
 * probably should add a rbt_delete_current(iterator);
 */
break;

dequeue:
while ((tnode = rbt_leftmost(queue->tree)) != NULL)
node = (UndoWorkerQueueTreeEntry *) tnode;
entry = &queue->entries[node->off];

rbt_delete(queue->tree, tnode);

/* check if the entry has already been processed via another queue */
if (entry->already_processed)
slist_push_head(&queue->freelist, &entry->freelist_node);
else
/* found it */
return entry;
return NULL;

delete (i.e. processed in another queue):

/*
 * Queue entry will only be reusable when the corresponding tree
 * entry has been removed. That'll happen either when new entries
 * are needed (cf prune), or when the entry is dequeued (cf dequeue).
 */
entry->already_processed = true;


I think the first approach is 

Re: Detailed questions about pg_xact_commit_timestamp

2019-07-16 Thread Morris de Oryx
Adrien, thanks a lot for taking the time to try and explain all of these
details to me. I'm looking at incremental rollups, and thinking through
various alternative designs. It sounds like pg_xact_commit_timestamp just
isn't the right tool for my purposes, so I'll go in another direction.

All the same, I've learned a _lot_ of important points about Postgres from
trying to sort all of this out. Your messages have been a real help.


On Tue, Jul 16, 2019 at 7:03 PM Adrien Nayrat 
wrote:

> On 7/12/19 2:50 PM, Morris de Oryx wrote:
> > Adrien, thanks very much for answering my question. Just a couple of
> follow-up
> > points, if you don't mind.
> >
> > In our answer, you offer an example of pg_xact_commit_timestamp showing
> > out-of-sequence commit times:
> >
> > Session xid  pg_xact_commit_timestamp
> > A   34386826 2019-07-11 09:32:38.994440+00  Started earlier,
> > committed later
> > B   34386827 2019-07-11 09:32:29.806183+00
> >
> > I may not have asked my question clearly, or I may not understand the
> answer
> > properly. Or both ;-) If I understand it correctly, an xid is assigned
> when a
> > transaction starts.
>
> It is a little bit more complicated :) When a transaction starts, a
> *virtual* xid
> is assigned. It is when the transaction changes the state of the database
> that an xid
> is assigned:
> > Throughout running a transaction, a server process holds an exclusive
> lock on the transaction's virtual transaction ID. If a permanent ID is
> assigned to the transaction (which normally happens only if the transaction
> changes the state of the database), it also holds an exclusive lock on the
> transaction's permanent transaction ID until it ends.
>
> https://www.postgresql.org/docs/current/view-pg-locks.html
>
> (It shouldn't change anything for you)
>
>
> > One transaction might take a second, another might take ten
> > minutes. So, the xid sequence doesn't imply anything at all about commit
> > sequence. What I'm trying to figure out is if it is possible for the
> commit
> > timestamps to somehow be out of order.
>
> I am sorry but I don't understand what you mean by "commit timestamps to
> somehow
> be out of order"?
>
> > What I'm looking for is a way of finding
> > changes committed since a specific moment. When the transaction started
> doesn't
> > matter in my case.
>
>
> Yes, the commit timestamp is the time when the transaction is committed :
> postgres=# begin;
> BEGIN
> postgres=# select now();
>  now
> --
>  2019-07-16 08:46:59.64712+00
> (1 row)
>
> postgres=# select txid_current();
>  txid_current
> --
>  34386830
> (1 row)
>
> postgres=# commit;
> COMMIT
> postgres=# select pg_xact_commit_timestamp('34386830'::xid);
>pg_xact_commit_timestamp
> ---
>  2019-07-16 08:47:30.238746+00
> (1 row)
>
>
> >
> > Is pg_xact_commit_timestamp suitable for this? I'm getting the
> impression that
> > it isn't. But I don't understand quite how. And if it isn't suited to
> this
> > purpose, does anyone know what pg_xact_commit_timestamp is for? What I'm
> after
> > is something like a "xcommitserial" that increases reliably, and
> monotonically
> > on transaction commit. That's how I'm hoping that
> pg_xact_commit_timestamp
> > functions.
>
> I don't think so. pg_xact_commit_timestamp returns the timestamp. If you
> want
> some kind of ordering you have to fetch all commit timestamps (with their
> respective xid) and order them.
>
> You also can implement this tracking by yourself with triggers which
> insert a
> row containing xid and timestamp in a tracking table. You can add an index
> on the
> timestamp column. With this approach you don't have to worry about vacuum
> freeze
> which removes old timestamps. As you add more writes, it could be more
> expensive
> than track_commit_timestamp.
>
> >
> > Thanks also for making me understand that pg_xact_commit_timestamp
> applies to a
> > *transaction*, not to each row. That makes it a lot lighter in the
> database. I
> > was thinking 12 bytes+ per row, which is completely off for my case. (I
> tend to
> > insert thousands of rows in a transaction.)
> >
> >> Yes timestamp are stored in pg_commit_ts directory. Old timestamp are
> removed
> > after freeze has explained in
> >> https://www.postgresql.org/docs/current/routine-vacuuming.html
> >
> > Thanks for the answer, and for kindly pointing me to the right section
> of the
> > documentation. It's easy to get impatient with new(er) users. I'm _not_
> lazy
> > about reading manuals and researching but, well, the Postgres
> documentation is
> > over 3,000 pages long when you download it. So, I may have missed a
> detail or
> > two. If I read that correctly, the ~4 billion number range is made
> into an
> > endless circle by keeping ~2 billion numbers in the past, and 2 billion
> in the
> > future. If that's right, I'm never going to be so out of data that the ~2
> > billion number 

Re: refactoring - share str2*int64 functions

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-16 16:11:44 +0900, Michael Paquier wrote:

> numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations
> become rather inconsistent with inconsistent APIs for the manipulation
> of int2 and int4 fields, and scanint8 is just a derivative of the same
> logic.  We have two categories of routines here:

> - The wrappers on top of strtol and strtoul & co, which are named
> respectively strtoint and pg_strtouint64 with the patch.  The naming
> part is inconsistent, and we only handle uint64 and int32.  We don't
> need to worry about int64 and uint32 because they are not used?
> That's fine by me, but at least let's have a consistent naming.

Yea, consistent naming seems like a strong requirement
here. Additionally I think we should just provide a consistent set
rather than only what's needed just now; otherwise that'll just lead to
people inventing their own again down the line.


> Prefixing the functions with pg_* is a better practice in my opinion
> as we will unlikely run into conflicts this way.

+1


> - The str->integer conversion routines, which actually have very
> similar characteristics to the strtol families as they remove trailing
> whitespaces first, check for a sign, etc, except that they work only
> on base 10.

There's afaict neither a caller that needs the base argument at the
moment, nor one in the tree previously. I'd argue for just making
pg_strtouint64's API consistent.

I'd probably also just use the implementation we have for signed
integers (minus the relevant negation and overflow checks, obviously) -
it's a lot faster, and I think there's value in keeping the
implementations in sync.
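
I.e. something along these lines (untested sketch; error handling is left
to the caller, and the function name is just for illustration):

#include <ctype.h>

bool
pg_strtouint64_base10(const char *str, uint64 *result)
{
	const char *ptr = str;
	uint64		tmp = 0;

	/* skip leading whitespace */
	while (*ptr && isspace((unsigned char) *ptr))
		ptr++;

	/* allow an explicit '+', but obviously no '-' for an unsigned type */
	if (*ptr == '+')
		ptr++;

	/* require at least one digit */
	if (unlikely(!isdigit((unsigned char) *ptr)))
		return false;

	/* accumulate digits, checking for overflow without helper functions */
	while (isdigit((unsigned char) *ptr))
	{
		uint64		digit = (uint64) (*ptr++ - '0');

		if (unlikely(tmp > (PG_UINT64_MAX - digit) / 10))
			return false;
		tmp = tmp * 10 + digit;
	}

	/* allow trailing whitespace, but nothing else */
	while (*ptr)
	{
		if (unlikely(!isspace((unsigned char) *ptr)))
			return false;
		ptr++;
	}

	*result = tmp;
	return true;
}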


Greetings,

Andres Freund




Re: refactoring - share str2*int64 functions

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-15 07:08:42 +0200, Fabien COELHO wrote:
> I do not think that changing the error handling capability is appropriate,
> it is really a feature of the function. The function could try to use an
> internal pg_strtoint64 which would look like the other unsigned version, but
> then it would not differentiate the various error conditions (out of range
> vs syntax error).

> The compromise I can offer is to change the name of the first one, say to
> "pg_scanint8" to reflect its former backend name. Attached a v4 which does a
> renaming so as to avoid the name similarity but signature difference. I also
> made both error messages identical.

I think the interface of that function is not that good, and the "scan"
in the name isn't great for discoverability (for one it's a different
naming than pg_strtoint16 etc). Also, the *8 meaning 64bit is confusing
enough in the backend; we definitely shouldn't extend that to frontend
code.

Referencing "bigint" and "input syntax" from frontend code imo doesn't
make a lot of sense. And int8in is the only caller that uses
errorOK=False anyway, so there's currently no need for the frontend
error strings afaict.

ISTM that something like

typedef enum StrToIntConversion
{
STRTOINT_OK = 0,
STRTOINT_SYNTAX_ERROR = 1,
STRTOINT_OUT_OF_RANGE = 2
} StrToIntConversion;
StrToIntConversion pg_strtoint64(const char *str, int64 *result);

would make more sense.

There is the issue that there already is pg_strtoint16 and
pg_strtoint32, which do not have the option to not raise an error.  I'd
probably name the non-error throwing ones something like pg_strtointNN_e
(for extended, or error handling), and have pg_strtointNN wrappers that
just handle the errors, or reverse the naming (which'd cause a bit of
churn, but not that much).

That'd also make the code for pg_strtointNN a bit nicer, because we'd
not need the gotos anymore, they're just there to avoid redundant error
messages - which'd not be an issue if the error handling were just a
switch in a separate function. E.g.

int32
pg_strtoint32(const char *s)
{
int32 result;

switch (pg_strtoint32_e(s, ))
{
case STRTOINT_OK:
 return result;

case STRTOINT_SYNTAX_ERROR:
ereport(ERROR,
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 errmsg("invalid input syntax for type %s: \"%s\"",
"integer", s)));

case STRTOINT_OUT_OF_RANGE:
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
 errmsg("value \"%s\" is out of range for type %s",
s, "integer")));

}

return 0;   /* keep compiler quiet */
}

which does seem nicer than what we have right now.


Greetings,

Andres Freund




Re: Parallel Append subplan order instability on aye-aye

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-15 21:12:32 -0400, Tom Lane wrote:
> But I bet that these tables forming
> an inheritance hierarchy (with multiple inheritance even) does
> have something to do with it somehow, because if this were a
> generic VACUUM bug surely we'd be seeing it elsewhere.

It's possible that it's hidden in other cases, because of

void
table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
                                   BlockNumber *pages, double *tuples,
                                   double *allvisfrac,
                                   Size overhead_bytes_per_tuple,
                                   Size usable_bytes_per_page)
...
 * If the table has inheritance children, we don't apply this heuristic.
 * Totally empty parent tables are quite common, so we should be willing
 * to believe that they are empty.
 */
if (curpages < 10 &&
relpages == 0 &&
!rel->rd_rel->relhassubclass)
curpages = 10;

which'd not make us actually take a relpages=0 into account for tables
without inheritance.  A lot of these tables never get 10+ pages long, so
the heuristic would always apply...

Greetings,

Andres Freund




Re: rebased background worker reimplementation prototype

2019-07-16 Thread Tomas Vondra

On Tue, Jul 16, 2019 at 10:53:46AM -0700, Andres Freund wrote:

Hi,

On 2019-07-12 15:47:02 +0200, Tomas Vondra wrote:

I've done a bit of benchmarking / testing on this, so let me report some
basic results. I haven't done any significant code review, I've simply
run a bunch of pgbench runs on different systems with different scales.


Thanks!



System #1
-
* CPU: Intel i5
* RAM: 8GB
* storage: 6 x SATA SSD RAID0 (Intel S3700)
* autovacuum_analyze_scale_factor = 0.1
* autovacuum_vacuum_cost_delay = 2
* autovacuum_vacuum_cost_limit = 1000
* autovacuum_vacuum_scale_factor = 0.01
* bgwriter_delay = 100
* bgwriter_lru_maxpages = 1
* checkpoint_timeout = 30min
* max_wal_size = 64GB
* shared_buffers = 1GB


What's the controller situation here? Can the full SATA3 bandwidth on
all of those drives be employed concurrently?



There's just an on-board SATA controller, so it might be a bottleneck.

A single drive can do ~440 MB/s reads sequentially, and the whole RAID0
array (Linux sw raid) does ~1.6GB/s, so not exactly 6x that. But I don't
think we're generating that many writes during the test.




System #2
-
* CPU: 2x Xeon E5-2620v5
* RAM: 64GB
* storage: 3 x 7.2k SATA RAID0, 1x NVMe
* autovacuum_analyze_scale_factor = 0.1
* autovacuum_vacuum_cost_delay = 2
* autovacuum_vacuum_cost_limit = 1000
* autovacuum_vacuum_scale_factor = 0.01
* bgwriter_delay = 100
* bgwriter_lru_maxpages = 1
* checkpoint_completion_target = 0.9
* checkpoint_timeout = 15min
* max_wal_size = 32GB
* shared_buffers = 8GB


What type of NVMe disk is this? I'm mostly wondering whether it's fast
enough that there's no conceivable way that IO scheduling is going to
make a meaningful difference, given other bottlenecks in postgres.

In some preliminary benchmark runs I've seen fairly significant gains on
SATA and SAS SSDs, as well as spinning rust, but I've not yet
benchmarked on a decent NVMe SSD.



Intel Optane 900P 280GB (model SSDPED1D280GA) [1].

[1] https://ssd.userbenchmark.com/SpeedTest/31/INTEL-SSDPED1D280GA

I think one of the main improvements in this generation of drives is
good performance with low queue depth.  See for example [2].

[2] 
https://www.anandtech.com/show/12136/the-intel-optane-ssd-900p-480gb-review/5

Not sure if that plays a role here, but I've seen this affect prefetch
and similar things.




For each config I've done tests with three scales - small (fits into
shared buffers), medium (fits into RAM) and large (at least 2x the RAM).
Aside from the basic metrics (throughput etc.) I've also sampled data
about 5% of transactions, to be able to look at latency stats.

The tests were done on master and patched code (both in the 'legacy' and
new mode).





I haven't done any temporal analysis yet (i.e. I'm only looking at global
summaries, not tps over time etc).


FWIW, I'm working on a tool that generates correlated graphs of OS, PG,
pgbench stats. Especially being able to correlate the kernel's
'Writeback' stats (grep Writeback: /proc/meminfo) and latency is very
valuable. Sampling wait events over time also is worthwhile.



Good to know, although I don't think it's difficult to fetch the data
from sar and plot it. I might even already have ugly bash scripts doing
that, somewhere.




When running on the 7.2k SATA RAID, the throughput improves with the
medium scale - from ~340tps to ~439tps, which is a pretty significant
jump. But on the large scale this disappears (in fact, it seems to be a
bit lower than master/legacy cases). Of course, all this is just from a
single run (although 4h, so noise should even out).


Any chance there's an order-of-test factor here? In my tests I found two
related issues very important: 1) the first few tests are slower,
because WAL segments don't yet exist. 2) Some poor bugger of a later
test will get hit with anti-wraparound vacuums, even if otherwise not
necessary.



Not sure - I'll check, but I find it unlikely. I need to repeat the
tests to have multiple runs.


The fact that the master and "legacy" numbers differ significantly
e.g. in the "xeon sata scale 1000" latency CDF does make me wonder
whether there's an effect like that. While there might be some small
performance difference due to different stats message sizes, and a few
additional branches, I don't see how it could be that noticable.



That's about the one case where things like anti-wraparound are pretty
much impossible, because the SATA storage is so slow ...




I've also computed latency CDF (from the 5% sample) - I've attached this
for the two interesting cases mentioned in the previous paragraph. This
shows that with the medium scale the latencies move down (with the patch,
both in the legacy and "new" modes), while on large scale the "new" mode
moves a bit to the right (to higher values).


Hm. I can't yet explain that.



And finally, I've looked at buffer stats, i.e. number of buffers written
in various ways (checkpoint, bgwriter, backends) etc. Interestingly
enough, these numbers did not change very much - especially on the flash
storage. Maybe that's expected, though.

Re: POC: converting Lists into arrays

2019-07-16 Thread Tom Lane
I wrote:
> * Rationalize places that are using combinations of list_copy and
> list_concat, probably by inventing an additional list-concatenation
> primitive that modifies neither input.

I poked around to see what we have in this department.  There seem to
be several identifiable use-cases:

* Concat two Lists that are freshly built, or at least not otherwise
referenced.  In the old code, list_concat serves fine, leaking the
second List's header but not any of its cons cells.  As of 1cff1b95a,
the second List's storage is leaked completely.  We could imagine
inventing a list_concat variant that list_free's its second input,
but I'm unconvinced that that's worth the trouble.  Few if any
callers can't stand to leak any storage, and if there are any where
it seems worth the trouble, adding an explicit list_free seems about
as good as calling a variant of list_concat.  (If we do want such a
variant, we need a name for it.  list_join, maybe, by analogy to
bms_join?)

* Concat two lists where there exist other pointers to the second list,
but it's okay if the lists share cons cells afterwards.  As of the
new code, they don't actually share any storage, which seems strictly
better.  I don't think we need to do anything here, except go around
and adjust the comments that explain that that's what's happening.

* Concat two lists where there exist other pointers to the second list,
and it's not okay to share storage.  This is currently implemented by
list_copy'ing the second argument, but we can just drop that (and
adjust comments where needed).

* Concat two lists where we mustn't modify either input list.
This is currently implemented by list_copy'ing both arguments.
I'm inclined to replace this pattern with a function like
"list_concat_copy(const List *, const List *)", although settling
on a suitable name might be difficult.

* There's a small number of places that list_copy the first argument
but not the second.  I believe that all of these are either of the form
"x = list_concat(list_copy(y), x)", ie replacing the only reference to
the second argument, or are relying on the "it's okay to share storage"
assumption to not copy a second argument that has other references.
I think we can just replace these with list_concat_copy.  We'll leak
the second argument's storage in the cases where another list is being
prepended onto a working list, but I doubt it's worth fussing over.
(But, if this is repeated a lot of times, maybe it is worth fussing
over?  Conceivably you could leak O(N^2) storage while building a
long working list, if you prepend many shorter lists onto it.)

* Note that some places are applying copyObject() not list_copy().
In these places the idea is to make duplicates of pointed-to structures
not just the list proper.  These should be left alone, I think.
When the copyObject is applied to the second argument, we're leaking
the top-level List in the copy result, but again it's not worth
fussing over.
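
As a purely illustrative sketch of the "modify neither input" case above
(assuming the array-based Lists from 1cff1b95a, where list_concat() already
copies the second argument's data rather than sharing cells):

static List *
list_concat_copy_sketch(List *list1, List *list2)
{
	/*
	 * list_copy() gives us a scratch copy of list1, so neither input is
	 * touched; list_concat() then appends list2's members into that copy.
	 */
	return list_concat(list_copy(list1), list2);
}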

Comments?

regards, tom lane




Re: A little report on informal commit tag usage

2019-07-16 Thread Daniel Gustafsson
> On 16 Jul 2019, at 16:33, Tom Lane  wrote:
> 
> Michael Paquier  writes:
>> As mentioned on different threads, "Discussion" is the only one we had
>> a strong agreement with.  Could it be possible to consider things like
>> Author, Reported-by, Reviewed-by or Backpatch-through for example and
>> extend to that?  The first three ones are useful for parsing the
>> commit logs.  The fourth one is handy so as there is no need to look
>> at a full log tree with git log --graph or such, which is something I
>> do from time to time to guess down to where a fix has been applied (I
>> tend to avoid git_changelog).
> 
> FWIW, I'm one of the people who prefer prose for this.  The backpatching
> bit is a good example of why, because my log messages typically don't
> just say "backpatch to 9.6" but something about why that was the cutoff.

Wearing my $work-hat where I regularly perform interesting merges of postgres
releases as an upstream, these detailed commit messages are very valuable and
much appreciated.  The wealth of (human readable) information stored in the
commit logs makes tracking postgres as an upstream quite a lot easier.

> I'm also skeptical of the argument that machine-parseable Reported-by
> and so forth are useful to anybody.  Who'd use them, and for what?

The green gamification dot on people’s Github profiles might light up if the
machine readable format with email address was used (and the user has that
specific email connected to their Github account unless it’s a primary email).
Looking at commit 1c9bb02d8ec1d5b1b319e4fed70439a403c245b1 I can see that for
August 2018 Amit’s Github profile lists “Created 1 commit in 1 repository
postgres/postgres 1 commit”, which is likely from this commit message being
parsed in the mirror.

cheers ./daniel



Re: SQL/JSON path issues/questions

2019-07-16 Thread Thom Brown
On Thu, 11 Jul 2019 at 16:23, Alexander Korotkov
 wrote:
>
> On Thu, Jul 11, 2019 at 5:10 PM Thom Brown  wrote:
> > On Wed, 10 Jul 2019 at 05:58, Alexander Korotkov
> >  wrote:
> > >
> > > On Mon, Jul 8, 2019 at 12:30 AM Alexander Korotkov
> > >  wrote:
> > > > On Thu, Jul 4, 2019 at 4:38 PM Liudmila Mantrova
> > > >  wrote:
> > > > > Thank  you!
> > > > >
> > > > > I think we can make this sentence even shorter, the fix is attached:
> > > > >
> > > > > "To refer to a JSON element stored at a lower nesting level, add one 
> > > > > or
> > > > > more accessor operators after @."
> > > >
> > > > Thanks, looks good to me.  Attached revision of patch contains commit
> > > > message.  I'm going to commit this on no objections.
> > >
> > > So, pushed!
> >
> > I've just noticed the >= operator shows up as just > in the "jsonpath
> > Filter Expression Elements" table, and the <= example only shows <.
>
> Thank you for catching this!  Fix just pushed.

Thanks.

Now I'm looking at the @? and @@ operators, and getting a bit
confused.  This following query returns true, but I can't determine
why:

# SELECT '{"a":[1,2,3,4,5]}'::jsonb @? '$.b == "hello"'::jsonpath;
 ?column?
--
 t
(1 row)

"b" is not a valid item, so there should be no match.  Perhaps it's my
misunderstanding of how these operators are supposed to work, but the
documentation is quite terse on the behaviour.
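
One way to poke at the difference between the two operators on the same
document (shown only as something to try; per the docs, @? tests whether a
path returns any item, while @@ evaluates a jsonpath predicate itself):

SELECT '{"a":[1,2,3,4,5]}'::jsonb @? '$.b';
SELECT '{"a":[1,2,3,4,5]}'::jsonb @@ '$.b == "hello"';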

Thom




Re: rebased background worker reimplementation prototype

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-12 15:47:02 +0200, Tomas Vondra wrote:
> I've done a bit of benchmarking / testing on this, so let me report some
> basic results. I haven't done any significant code review, I've simply
> ran a bunch of pgbench runs on different systems with different scales.

Thanks!


> System #1
> -
> * CPU: Intel i5
> * RAM: 8GB
> * storage: 6 x SATA SSD RAID0 (Intel S3700)
> * autovacuum_analyze_scale_factor = 0.1
> * autovacuum_vacuum_cost_delay = 2
> * autovacuum_vacuum_cost_limit = 1000
> * autovacuum_vacuum_scale_factor = 0.01
> * bgwriter_delay = 100
> * bgwriter_lru_maxpages = 1
> * checkpoint_timeout = 30min
> * max_wal_size = 64GB
> * shared_buffers = 1GB

What's the controller situation here? Can the full SATA3 bandwidth on
all of those drives be employed concurrently?


> System #2
> -
> * CPU: 2x Xeon E5-2620v5
> * RAM: 64GB
> * storage: 3 x 7.2k SATA RAID0, 1x NVMe
> * autovacuum_analyze_scale_factor = 0.1
> * autovacuum_vacuum_cost_delay = 2
> * autovacuum_vacuum_cost_limit = 1000
> * autovacuum_vacuum_scale_factor = 0.01
> * bgwriter_delay = 100
> * bgwriter_lru_maxpages = 1
> * checkpoint_completion_target = 0.9
> * checkpoint_timeout = 15min
> * max_wal_size = 32GB
> * shared_buffers = 8GB

What type of NVMe disk is this? I'm mostly wondering whether it's fast
enough that there's no conceivable way that IO scheduling is going to
make a meaningful difference, given other bottlenecks in postgres.

In some preliminary benchmark runs I've seen fairly significant gains on
SATA and SAS SSDs, as well as spinning rust, but I've not yet
benchmarked on a decent NVMe SSD.


> For each config I've done tests with three scales - small (fits into
> shared buffers), medium (fits into RAM) and large (at least 2x the RAM).
> Aside from the basic metrics (throughput etc.) I've also sampled data
> about 5% of transactions, to be able to look at latency stats.
> 
> The tests were done on master and patched code (both in the 'legacy' and
> new mode).



> I haven't done any temporal analysis yet (i.e. I'm only looking at global
> summaries, not tps over time etc).

FWIW, I'm working on a tool that generates correlated graphs of OS, PG,
pgbench stats. Especially being able to correlate the kernel's
'Writeback' stats (grep Writeback: /proc/meminfo) and latency is very
valuable. Sampling wait events over time also is worthwhile.


> When running on the 7.2k SATA RAID, the throughput improves with the
> medium scale - from ~340tps to ~439tps, which is a pretty significant
> jump. But on the large scale this disappears (in fact, it seems to be a
> bit lower than master/legacy cases). Of course, all this is just from a
> single run (although 4h, so noise should even out).

Any chance there's an order-of-test factor here? In my tests I found two
related issues very important: 1) the first few tests are slower,
because WAL segments don't yet exist. 2) Some poor bugger of a later
test will get hit with anti-wraparound vacuums, even if otherwise not
necessary.

The fact that the master and "legacy" numbers differ significantly
e.g. in the "xeon sata scale 1000" latency CDF does make me wonder
whether there's an effect like that. While there might be some small
performance difference due to different stats message sizes, and a few
additional branches, I don't see how it could be that noticable.


> I've also computed latency CDF (from the 5% sample) - I've attached this
> for the two interesting cases mentioned in the previous paragraph. This
> shows that with the medium scale the latencies move down (with the patch,
> both in the legacy and "new" modes), while on large scale the "new" mode
> moves a bit to the right (to higher values).

Hm. I can't yet explain that.


> And finally, I've looked at buffer stats, i.e. number of buffers written
> in various ways (checkpoint, bgwriter, backends) etc. Interestingly
> enough, these numbers did not change very much - especially on the flash
> storage. Maybe that's expected, though.

Some of that is expected, e.g. because file extensions count as backend
writes, and are going to roughly correlate with throughput, and not
much else. But they're more similar than I'd actually expect.

I do see a pretty big difference in the number of bgwriter written
backends in the "new" case for scale 1, on the nvme?

For the SATA SSD case, I wonder if the throughput bottleneck is WAL
writes. I see much more noticable differences if I enable
wal_compression or disable full_page_writes, because otherwise the bulk
of the volume is WAL data.  But even in that case, I see a latency
stddev reduction with the new bgwriter around checkpoints.


> The one case where it did change is the "medium" scale on SATA storage,
> where the throughput improved with the patch. But the change is kinda
> strange, because the number of buffers evicted by the bgwriter decreased
> (and instead it got evicted by the checkpointer). Which might explain the
> higher throughput, because 

Re: Minimal logical decoding on standbys

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-12 14:53:21 +0530, tushar wrote:
> On 07/10/2019 05:12 PM, Amit Khandekar wrote:
> > All right. Will do that in the next patch set. For now, I have quickly
> > done the below changes in a single patch again (attached), in order to
> > get early comments if any.
> Thanks Amit for your patch. i am able to see 1 issues  on Standby server -
> (where  logical replication slot created ) ,
> a)size of  pg_wal folder  is NOT decreasing even after firing get_changes
> function

Even after calling pg_logical_slot_get_changes() multiple times? What
does
SELECT * FROM pg_replication_slots; before and after multiple calls return?

Does manually forcing a checkpoint with CHECKPOINT; first on the primary
and then the standby "fix" the issue?


> b)pg_wal files are not recycling  and every time it is creating new files
> after firing get_changes function

I'm not sure what you mean by this. Are you saying that
pg_logical_slot_get_changes() causes WAL to be written?

Greetings,

Andres Freund




Re: heapam_index_build_range_scan's anyvisible

2019-07-16 Thread Andres Freund
Hi,

On 2019-07-11 17:27:46 -0700, Ashwin Agrawal wrote:
> Please find attached the patch to remove IndexBuildCallback's dependency on
> HeapTuple, as discussed. Changed to have the argument as ItemPointer
> instead of HeapTuple. Other larger refactoring if feasible for
> index_build_range_scan API itself can be performed as follow-up changes.

> From f73b0061795f0c320f96ecfed0c0602ae318d73e Mon Sep 17 00:00:00 2001
> From: Ashwin Agrawal 
> Date: Thu, 11 Jul 2019 16:58:50 -0700
> Subject: [PATCH v1] Remove IndexBuildCallback's dependency on HeapTuple.
>
> With IndexBuildCallback taking input as HeapTuple, all table AMs
> irrespective of storing the tuples in HeapTuple form or not, are
> forced to construct HeapTuple, to insert the tuple in Index. Since,
> only thing required by the index callbacks is TID and not really the
> full tuple, modify callback to only take ItemPointer.

Looks good to me. Planning to apply this unless somebody wants to argue
against it soon?

- Andres




Re: pg_receivewal documentation

2019-07-16 Thread Jesper Pedersen

Hi,

On 7/16/19 12:28 PM, Laurenz Albe wrote:

This is not true in all cases as since 9.6 it is possible to specify
multiple synchronous standbys.  So if for example pg_receivewal and
another synchronous standby are set in s_s_names and that the number
of a FIRST (priority-based) or ANY (quorum set) is two, then the same
issue exists, but this documentation is incorrect.  I think that we
should have a more extensive wording  here, like "if pg_receivewal is
part of a quorum-based or priority-based set of synchronous standbys."


I think this would be overly complicated.
The wording above seems to cover the priority-based case sufficiently
in my opinion.
Maybe a second sentence with more detail would be better:

   ... must not be set to remote_apply if
   pg_receivewal is the only synchronous standby.
   Similarly, if pg_receivewal is part of
   a quorum-based set of synchronous standbys, it won't count towards
   the quorum if  is set to
   remote_apply.



Here is the patch for that.

Best regards,
 Jesper


>From 4fb28d6fe08ddea5a9740c9a81e8e00b94283d78 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Authors: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
Review-by: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 12 
 1 file changed, 12 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..03b20ecd38 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -207,6 +207,18 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

+
+   
+Note that while WAL will be flushed with this setting,
+pg_receivewal never applies it, so
+ must not be set to
+remote_apply if pg_receivewal
+is the only synchronous standby. Similarly, if
+pg_receivewal is part of a quorum-based
+set of synchronous standbys, it won't count towards the quorum if
+ is set to
+remote_apply.
+   
   
  
 
-- 
2.21.0



Re: Index Skip Scan

2019-07-16 Thread Jesper Pedersen

Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:

The UniqueKeys idea is quite separate from pathkeys.  Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Paths UniqueKey column/exprs.

As of now, if you look at how standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().

As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c

With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.

It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.

I think a UniqueKey can just be a struct similar to PathKey, e.g be
located in pathnodes.h around where PathKey is defined.  Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which return
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }.  That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version.  This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).

I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.



Here is a patch more in that direction.
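
For reference, the minimal shape David sketches above would be roughly the
following (illustration only; the attached patch may differ):

typedef struct UniqueKey
{
	NodeTag		type;

	EquivalenceClass *eclass;	/* expression(s) known unique in the Path's output */
} UniqueKey;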

Some questions:

1) Do we really need the UniqueKey struct ?  If it only contains the 
EquivalenceClass field then we could just have a list of those instead. 
That would make the patch simpler.


2) Do we need both canon_uniquekeys and query_uniquekeys ?  Currently 
the patch only uses canon_uniquekeys because the we attach the list 
directly on the Path node.


I'll clean the patch up based on your feedback, and then start to rebase 
v21 on it.


Thanks in advance !

Best regards,
 Jesper
From 174a6425036e2d4ca7d3d68c635cd55a58a9b9e6 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 06:44:57 -0400
Subject: [PATCH] UniqueKey

---
 src/backend/nodes/print.c  |  39 +++
 src/backend/optimizer/path/Makefile|   2 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/costsize.c  |   5 +
 src/backend/optimizer/path/indxpath.c  |  39 +++
 src/backend/optimizer/path/uniquekey.c | 149 +
 src/backend/optimizer/plan/planner.c   |  12 +-
 src/backend/optimizer/util/pathnode.c  |  12 ++
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  18 +++
 src/include/nodes/print.h  |   2 +-
 src/include/optimizer/pathnode.h   |   1 +
 src/include/optimizer/paths.h  |   8 ++
 13 files changed, 293 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4ecde6b421..ed5684bf19 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
printf(")\n");
 }
 
+/*
+ * print_unique_keys -
+ *   unique_key an UniqueKey
+ */
+void

Re: pg_receivewal documentation

2019-07-16 Thread Laurenz Albe
On Tue, 2019-07-16 at 14:05 +0900, Michael Paquier wrote:
> >> How about
> >>  Note that while WAL will be flushed with this setting,
> >>  pg_receivewal never applies it, so
> >>   must not be set to
> >>  remote_apply if 
> >> pg_receivewal
> >>  is the only synchronous standby.
> 
> This is not true in all cases as since 9.6 it is possible to specify
> multiple synchronous standbys.  So if for example pg_receivewal and
> another synchronous standby are set in s_s_names and that the number
> of a FIRST (priority-based) or ANY (quorum set) is two, then the same
> issue exists, but this documentation is incorrect.  I think that we
> should have a more extensive wording  here, like "if pg_receivewal is
> part of a quorum-based or priority-based set of synchronous standbys."

I think this would be overly complicated.
The wording above seems to cover the priority-based case sufficiently
in my opinion.
Maybe a second sentence with more detail would be better:

  ... must not be set to remote_apply if
  pg_receivewal is the only synchronous standby.
  Similarly, if pg_receivewal is part of
  a quorum-based set of synchronous standbys, it won't count towards
  the quorum if  is set to
  remote_apply.

Yours,
Laurenz Albe





Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Robert Haas
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro  wrote:
> Here's a new version.

Here's a relatively complete review of 0019 and 0020 and a remark or
two on the beginning of 0003.

Regarding 0020:

The documentation claims that undo data exists in a 64-bit address
space divided into 2^34 undo logs, each with a theoretical capacity of
1TB, but that would require 74 bits.

I am mildly suspicious that, on a busy system, the use of 1MB segment
files could result in slowdowns due to frequent filesystem operations.
We just recently made it more convenient to change the WAL segment
size, mostly so that people on very busy systems could crank it up
from 16MB to, say, 64MB or 256MB. It's true that the considerations
are a bit different here, because undo logs don't have to be archived,
and because we might be using many undo logs simultaneously rather
than only 1 for the whole system, but it's still true that if you've
got a bunch of backends blasting out undo at top speed, you're going
to have to recycle files *extremely* quickly.  How much performance
testing have you done to assess the effect of segment size? Do you
think there's an argument for making this 1MB size configurable at
initdb-time? Or even variable at runtime, so that we use larger files
if we're filling them up in < 100ms or whatever?
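(For instance, at a hypothetical 100MB/s of undo per backend, a 1MB segment
fills in about 10ms, i.e. on the order of a hundred segment creations and
recyclings per second per backend.)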

I don't think the last paragraph is entirely accurate.  The access
method gets to control what records are written, but the general
format of the records is fixed by the undo system. Perhaps the undo
log code isn't what cares about that, but whether it's the undo log
code or the undo access code or the undo processing code isn't likely
to seem relevant to developers.

Regarding 0019:

I think there's a substantial amount of duplication between 0019 and
0020, and I'm not sure that we ought to have both.  They both talk
about the purpose of undo, the way the address space is divided, etc.
I understand that it would be a little weird to include all of the
information from 0019 in the user-facing documentation, and I also
understand that it won't work to have no user-facing documentation at
all, but it still seems a little odd to me.  Possibly 0019 could refer
to the SGML documentation for preliminaries and then add only those
details that are not covered there.

How could we avoid the limit on the total size of an active
transaction mentioned here? And what would be the cost of such a
scheme? If we've filled an undo log and moved on to another one, why
can't we evict the one that's full and reuse the shared memory slot,
bringing it back in later when required?  I suspect the answer is that
there is a locking rule involved. I think this README would be a good
place to document things like locking rules, or a least to refer to
where they are documented. I also think we should mull over whether we
could relax the rule without too much pain.  I expect that at least
part of the problem is that somebody might have a pointer to an
UndoLogSlot which could become stale if we recycle a slot, but that
can already happen at least when the log is fully discarded, so maybe
allowing it to happen in other cases wouldn't be too bad.

I know you're laughing at me on the inside, worrying about a
transaction that touches so many TB of data that it manages to exhaust
all the undo log slots, but I don't think that's a completely crazy
scenario. There are PB-scale databases out there, and it would be nice
to think that PostgreSQL could capture more of those workloads.  They
will probably become more common over time.

Reading the section on persistence levels and tablespaces makes me
wonder what happens to address space that gets allocated to temporary
and unlogged undo logs. It seems pretty important to make sure that we
at least don't leak anything significant, and maybe that we actually
recycle the address space or share it across backends. That is, if
several backends are all writing temporary undo, there's no intrinsic
reason why they can't all be using the same temporary undo logs, as
long as the file naming works OK for that (e.g. if it follows the same
pattern we use for relation names). Any undo logs that get allocated
to unlogged undo can be recycled - either for unlogged undo or
otherwise - after a crash, and any that are partially filled can be
rewound. I don't know how much effort we're expending on any of that
right now, but it seems like it would be worth discussing in this
README, and possibly improving.

When the undo log contents section mentions that "client code is
responsible for stepping over the page headers and advancing to the
next page," that's again a somewhat middle-of-the-patch stack
perspective. I am not sure exactly how this should be phrased, but the
point is that the client code we're talking about is not the AM but
the next patch in the stack. I think developers will view the AM as
the client and our wording probably ought to reflect that.

"keepign" is not spelled correctly. A little later on, "checkpoin" is

Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Robert Haas
On Tue, Jul 16, 2019 at 7:13 AM Amit Kapila  wrote:
> >  I also strongly suspect it is altogether wrong to do
> > this before CommitSubTransaction sets s->state to TRANS_COMMIT; what
> > if a subxact callback throws an error?
>
> Are you worried that it might lead to the execution of actions twice?

No, I'm worried that you are running code that is part of the commit
path before the transaction has actually committed.
CommitSubTransaction() is full of stuff which basically propagates
whatever the subtransaction did out to the parent transaction, and all
of that code runs after we've ruled out the possibility of an abort,
but this very-similar-looking code runs while it's still possible for
an abort to happen. That seems unlikely to be correct, and even if it
is, it seems needlessly inconsistent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Robert Haas
On Tue, Jul 16, 2019 at 12:32 AM Amit Kapila  wrote:
> The idea is that the queues can get full, but not rollback hash table.
> In the case where the error queue gets full, we mark the entry as
> Invalid in the hash table and later when discard worker again
> encounters this request, it adds it to the queue if there is a space
> available and marks the entry in the hash table as valid.  This allows
> us to keep the information of all xacts having pending undo in shared
> memory.

I don't understand.  How is it OK to have entries in the hash table
but not the queues?  And why would that ever happen, anyway?  If you
make the queues as big as the hash table is, then they should never
fill up (or, if using binary heaps with lazy removal rather than
rbtrees, they might fill up, but if they do, you can always make space
by cleaning out the stale entries).

> I think this can regress the performance when there are many
> concurrent sessions unless there is a way to add/remove request
> without a lock.  As of now, we don't enter any request or block any
> space in shared memory related to pending undo till there is an error
> or user explicitly Rollback the transaction.  We can surely do some
> other way as well, but this way we won't have any overhead in the
> commit or successful transaction's path.

Well, we're already incurring some overhead to attach to an undo log,
and that probably involves some locking.  I don't see why this would
be any worse, and maybe it could piggyback on the existing work.
Anyway, if you don't like this solution, propose something else. It's
impossible to correctly implement a hard limit unless the number of
aborted-but-not-yet-undone transaction is bounded to (HARD_LIMIT -
ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW).
If there are 100 transactions each bound to 2 undo logs, and you
crash, you will need to (as you have it designed now) add another 200
transactions to the hash table upon recovery, and that will make you
exceed the hard limit unless you were at least 200 transactions below
the limit before the crash.  Have you handled that somehow?  If so,
how?  It seems to me that you MUST - at a minimum - keep a count of
undo logs attached to in-progress transactions, if not the actual hash
table entries.

> Again coming to question of whether we need single or multiple entries
> for one-request-per-persistence level, the reason for the same we have
> discussed so far is that discard worker can register the requests for
> them while scanning undo logs at different times.

Yeah, but why do we need that in the first place?  I wrote something
about that in a previous email, but you haven't responded to it here.

> However, there are
> few more things like what if while applying the actions, the actions
> for logged are successful and unlogged fails, keeping them separate
> allows better processing.  If one fails, register its request in error
> queue and try to process the request for another persistence level.  I
> think the requests for the different persistence levels are kept in a
> separate log which makes their processing separately easier.

I don't find this convincing. It's not really an argument, just a
vague list of issues. If you want to convince me, you'll need to be
much more precise.

It seems to me that it is generally undesirable to undo the unlogged
part of a transaction separately from the logged part of the
transaction.  But even if we want to support that, having one entry
per XID rather than one entry per <xid, persistence level> doesn't
preclude that.  Even if you discover the entries at different times,
you can still handle that by updating the existing entry rather than
making a new one.

There might be a good reason to do it the way you are describing, but
I don't see that you've made the argument for it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: converting Lists into arrays

2019-07-16 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jul 16, 2019 at 10:44 AM Tom Lane  wrote:
>> OK, I'm outvoted, will do it that way.

> I cast my vote in the other direction i.e. for sticking with qsort.

Didn't see this until after pushing a commit that uses "list_sort".

While composing that commit message another argument occurred to me,
which is that renaming makes it absolutely sure that any external
callers will notice they have an API change to deal with, no matter
how forgiving their compiler is.  Also, if somebody really really
doesn't want to cope with the change, they can now make their own
version of list_qsort (stealing it out of 1cff1b95a) and the core
code won't pose a conflict.

So I'm good with "list_sort" at this point.

regards, tom lane




Re: POC: converting Lists into arrays

2019-07-16 Thread Peter Geoghegan
On Tue, Jul 16, 2019 at 9:01 AM Robert Haas  wrote:
> I cast my vote in the other direction i.e. for sticking with qsort.

I do too.


-- 
Peter Geoghegan




Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Robert Haas
On Tue, Jul 16, 2019 at 10:02 AM Amit Kapila  wrote:
> On Tue, Jul 16, 2019 at 4:43 PM Amit Kapila  wrote:
> > On Tue, Jul 16, 2019 at 2:09 AM Robert Haas  wrote:
> > > This patch has some problems with naming consistency.  There's a
> > > function called PushUndoRequest() which calls a function called
> > > RegisterRollbackReq() to do the heart of the work.  So, is it undo or
> > > rollback?  Are we pushing or registering?  Is it a request or a req?
> >
> > I think we can rename PushUndoRequest as RegisterUndoRequest and
> > RegisterRollbackReq as RegisterUndoRequestGuts.
>
> One thing I am not sure about the above suggestion is whether it is a
> good idea to expose a function which ends with 'Guts'.  I have checked
> and found that there are a few similar precedents like
> ExecuteTruncateGuts.  Another idea could be to rename
> RegisterRollbackReq as RegisterUndoRequestInternal.  We have few
> precedents for that as well.

I don't personally like Guts, not only because bringing human (or
animal) body parts into this seems unnecessary, but more importantly
because it's not at all descriptive. Internal is no better. The point
is that you need to give the functions names that make it clear how
what one function does is different from what another function does,
and neither Guts nor Internal is going to help with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: converting Lists into arrays

2019-07-16 Thread Robert Haas
On Tue, Jul 16, 2019 at 10:44 AM Tom Lane  wrote:
> David Steele  writes:
> > On 7/15/19 11:07 PM, Tom Lane wrote:
> >> David Rowley  writes:
> >>> The only thoughts I have so far here are that it's a shame that the
> >>> function got called list_qsort() and not just list_sort().
>
> > I agree with David -- list_sort() is better.  I don't think "sort" is
> > such a common stem that searching is a big issue, especially with modern
> > code indexing tools.
>
> OK, I'm outvoted, will do it that way.

I cast my vote in the other direction i.e. for sticking with qsort.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: fix for BUG #3720: wrong results at using ltree

2019-07-16 Thread Nikita Glukhov


On 09.07.2019 17:57, Oleg Bartunov wrote:

On Mon, Jul 8, 2019 at 7:22 AM Thomas Munro  wrote:

On Sun, Apr 7, 2019 at 3:46 AM Tom Lane  wrote:

Filip Rembiałkowski  writes:

Here is my attempt to fix a 12-years old ltree bug (which is a todo item).
I see it's not backward-compatible, but in my understanding that's
what is documented. Previous behavior was inconsistent with
documentation (where single asterisk should match zero or more
labels).
http://archives.postgresql.org/pgsql-bugs/2007-11/msg00044.php

[...]


In short, I'm wondering if we should treat this as a documentation
bug not a code bug.  But to do that, we'd need a more accurate
description of what the code is supposed to do, because the statement
quoted above is certainly not a match to the actual behavior.

This patch doesn't apply.  More importantly, it seems like we don't
have a consensus on whether we want it.

Teodor, Oleg, would you like to offer an opinion here?  If I
understand correctly, the choices are doc change, code/comment change
or WONT_FIX.  This seems to be an entry that we can bring to a
conclusion in this CF with some input from the ltree experts.

We are currently very busy and will look at the problem (and dig into
our memory) later.  There is also another ltree patch
(https://commitfest.postgresql.org/23/1977/), it would be nice if
Filip try it.


I looked at "ltree syntax improvement" patch and found two more very
old bugs in ltree/lquery (fixes are attached):

1.  ltree/lquery level counter overflow is wrongly checked:

SELECT nlevel((repeat('a.', 65534) || 'a')::ltree);
 nlevel

  65535
(1 row)

-- expected 65536 or error
SELECT nlevel((repeat('a.', 65535) || 'a')::ltree);
 nlevel

  0
(1 row)

-- expected 65537 or error
SELECT nlevel((repeat('a.', 65536) || 'a')::ltree);
 nlevel

  1
(1 row)

-- expected 'a...' or error
SELECT (repeat('a.', 65535) || 'a')::ltree;
 ltree
---
 
(1 row)


-- expected 'a...' or error
SELECT (repeat('a.', 65536) || 'a')::ltree;
 ltree
---
 a
(1 row)


2.  '*{a}.*{b}.*{c}' is not equivalent to '*{a+b+c}' (as I expect):

SELECT ltree '1.2' ~ '*{2}';
 ?column?
--
 t
(1 row)

-- expected true
SELECT ltree '1.2' ~ '*{1}.*{1}';
 ?column?
--
 f
(1 row)


Maybe these two bugs need a separate thread?


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 5334c92406fd281409594a8861cbb40ca9a8c0a9 Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Thu, 7 Mar 2019 17:45:44 +0300
Subject: [PATCH 1/2] Fix max size checking for ltree and lquery

---
 contrib/ltree/expected/ltree.out | 16 
 contrib/ltree/ltree.h|  2 ++
 contrib/ltree/ltree_io.c | 12 ++--
 contrib/ltree/sql/ltree.sql  |  5 +
 4 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/contrib/ltree/expected/ltree.out b/contrib/ltree/expected/ltree.out
index 8226930..30b8247 100644
--- a/contrib/ltree/expected/ltree.out
+++ b/contrib/ltree/expected/ltree.out
@@ -457,6 +457,22 @@ SELECT nlevel('1.2.3.4');
   4
 (1 row)
 
+SELECT nlevel(('1' || repeat('.1', 65534))::ltree);
+ nlevel 
+
+  65535
+(1 row)
+
+SELECT nlevel(('1' || repeat('.1', 65535))::ltree);
+ERROR:  number of ltree levels (65536) exceeds the maximum allowed (65535)
+SELECT ('1' || repeat('.1', 65534))::lquery IS NULL;
+ ?column? 
+--
+ f
+(1 row)
+
+SELECT ('1' || repeat('.1', 65535))::lquery IS NULL;
+ERROR:  number of lquery levels (65536) exceeds the maximum allowed (65535)
 SELECT '1.2'::ltree  < '2.2'::ltree;
  ?column? 
 --
diff --git a/contrib/ltree/ltree.h b/contrib/ltree/ltree.h
index 366e580..0e843e3 100644
--- a/contrib/ltree/ltree.h
+++ b/contrib/ltree/ltree.h
@@ -25,6 +25,7 @@ typedef struct
 
 #define LTREE_HDRSIZE	MAXALIGN( offsetof(ltree, data) )
 #define LTREE_FIRST(x)	( (ltree_level*)( ((char*)(x))+LTREE_HDRSIZE ) )
+#define LTREE_MAX_LEVELS	Min(PG_UINT16_MAX, MaxAllocSize / sizeof(nodeitem))
 
 
 /* lquery */
@@ -77,6 +78,7 @@ typedef struct
 
 #define LQUERY_HDRSIZE	 MAXALIGN( offsetof(lquery, data) )
 #define LQUERY_FIRST(x)   ( (lquery_level*)( ((char*)(x))+LQUERY_HDRSIZE ) )
+#define LQUERY_MAX_LEVELS	Min(PG_UINT16_MAX, MaxAllocSize / ITEMSIZE)
 
 #define LQUERY_HASNOT		0x01
 
diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index f54f037..460ed1a 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -58,11 +58,11 @@ ltree_in(PG_FUNCTION_ARGS)
 		ptr += charlen;
 	}
 
-	if (num + 1 > MaxAllocSize / sizeof(nodeitem))
+	if (num + 1 > LTREE_MAX_LEVELS)
 		ereport(ERROR,
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("number of levels (%d) exceeds the maximum allowed (%d)",
-		num + 1, (int) (MaxAllocSize / sizeof(nodeitem);
+			

Re: POC: converting Lists into arrays

2019-07-16 Thread Tom Lane
David Steele  writes:
> On 7/15/19 11:07 PM, Tom Lane wrote:
>> David Rowley  writes:
>>> The only thoughts I have so far here are that it's a shame that the
>>> function got called list_qsort() and not just list_sort().

> I agree with David -- list_sort() is better.  I don't think "sort" is 
> such a common stem that searching is a big issue, especially with modern 
> code indexing tools.

OK, I'm outvoted, will do it that way.

regards, tom lane




Re: A little report on informal commit tag usage

2019-07-16 Thread Tom Lane
Michael Paquier  writes:
> As mentioned on different threads, "Discussion" is the only one we had
> a strong agreement with.  Could it be possible to consider things like
> Author, Reported-by, Reviewed-by or Backpatch-through for example and
> extend to that?  The first three ones are useful for parsing the
> commit logs.  The fourth one is handy so as there is no need to look
> at a full log tree with git log --graph or such, which is something I
> do from time to time to guess down to where a fix has been applied (I
> tend to avoid git_changelog).

FWIW, I'm one of the people who prefer prose for this.  The backpatching
bit is a good example of why, because my log messages typically don't
just say "backpatch to 9.6" but something about why that was the cutoff.
For instance in 0ec3e13c6,

Per gripe from Ken Tanzer.  Back-patch to 9.6.  The issue exists
further back, but before 9.6 the code looks very different and it
doesn't actually know whether the "var" name matches anything,
so I desisted from trying to fix it.

I am in favor of trying to consistently mention that a patch is being
back-patched, rather than expecting people to rely on git metadata
to find that out.  But I don't see that a rigid "Backpatch" tag format
makes anything easier there.  If you need to know that mechanically,
git_changelog is way more reliable.

I'm also skeptical of the argument that machine-parseable Reported-by
and so forth are useful to anybody.  Who'd use them, and for what?
Also, it's not always clear how to apply such a format to a real
situation --- eg, what do you do if the reporter is also the patch
author, or a co-author?  I'm not excited about redundantly entering
somebody's name several times.

regards, tom lane




Re: [proposal] de-TOAST'ing using a iterator

2019-07-16 Thread Binguo Bao
Hi, John

First, I'd like to advocate for caution when using synthetic
> benchmarks involving compression. Consider this test:
> insert into detoast_c (a)
> select
> 'abc'||
> repeat(
> (SELECT string_agg(md5(chr(i)), '')
> FROM generate_series(1,127) i)
> , 1)
> ||'xyz'
> from generate_series(1,100);
> The results for the uncompressed case were not much different then
> your test. However, in the compressed case the iterator doesn't buy us
> much with beginning searches since full decompression is already fast:
>               master    patch
> comp. beg.     869ms    837ms
> comp. end    14100ms  16100ms
> uncomp. beg.  6360ms    800ms
> uncomp. end  21100ms  21400ms
> and with compression it's 14% slower searching to the end. This is
> pretty contrived, but I include it for demonstration.


I've reproduced the test case with test scripts in the attachment on my
laptop:

              master        patch
comp. beg.    2686.77 ms    1532.79 ms
comp. end     17971.8 ms    21206.3 ms
uncomp. beg.  8358.79 ms    1556.93 ms
uncomp. end   23559.7 ms    22547.1 ms

In the compressed beginning case, the test result is different from yours
since the patch is ~1.75x faster
rather than no improvement. The interesting thing is that the patch is 4%
faster than master in the uncompressed end case.
I can't figure out the reason now.

Reading the thread where you're working on optimizing partial
> decompression [1], it seems you have two separate solutions for the
> two problems. Maybe this is fine, but I'd like to bring up the
> possibility of using the same approach for both kinds of callers.



> I'm not an expert on TOAST, but maybe one way to solve both problems
> is to work at the level of whole TOAST chunks. In that case, the
> current patch would look like this:
> 1. The caller requests more of the attribute value from the de-TOAST
> iterator.
> 2. The iterator gets the next chunk and either copies or decompresses
> the whole chunk into the buffer. (If inline, just decompress the whole
> thing)


Thanks for your suggestion. It is indeed possible to implement
PG_DETOAST_DATUM_SLICE using the de-TOAST iterator.
IMO the iterator is more suitable for situations where the caller doesn't
know the slice size. If the caller knows the slice size,
it is reasonable to fetch enough chunks at once and then decompress it at
once.
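
As a toy, PostgreSQL-independent sketch of the pull model John describes
above - every name here is invented for illustration and none of it is the
patch's actual API - the caller-driven loop might look like:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_CHUNK_SIZE 16

typedef struct ToyIterator
{
	const char *src;		/* stand-in for the stored TOAST chunks */
	size_t		src_len;
	size_t		src_pos;	/* how much of the source is consumed */
	char		buf[1024];	/* de-TOASTed bytes produced so far */
	size_t		buf_len;
	bool		done;
} ToyIterator;

/* Process one more chunk (a plain copy stands in for decompression). */
static void
toy_iterate(ToyIterator *it)
{
	size_t		n = it->src_len - it->src_pos;
	size_t		room = sizeof(it->buf) - it->buf_len;

	if (n > TOY_CHUNK_SIZE)
		n = TOY_CHUNK_SIZE;
	if (n > room)
		n = room;
	if (n == 0)
	{
		it->done = true;	/* source exhausted (or toy buffer full) */
		return;
	}
	memcpy(it->buf + it->buf_len, it->src + it->src_pos, n);
	it->src_pos += n;
	it->buf_len += n;
}

/* Caller side: pull chunks only until 'needed' bytes are available. */
static size_t
toy_fetch(ToyIterator *it, size_t needed)
{
	while (!it->done && it->buf_len < needed)
		toy_iterate(it);
	return it->buf_len;
}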
 --
Best regards,
Binguo Bao


init-test.sh
Description: application/shellscript


iterator-test.sh
Description: application/shellscript


Re: make \d pg_toast.foo show its indices ; and, \d toast show its main table ; and \d relkind=I show its partitions

2019-07-16 Thread Justin Pryzby
I realized that the test added to the show-childs patch was listing partitioned
tables, not indices; fixed.
>From 237f0bb2a048aa71726eff2580d01404ae3a98b4 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 30 Apr 2019 19:05:53 -0500
Subject: [PATCH v5] print table associated with given TOAST table

---
 src/bin/psql/describe.c| 28 
 src/test/regress/expected/psql.out | 10 ++
 src/test/regress/sql/psql.sql  |  3 +++
 3 files changed, 41 insertions(+)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3ee9c82..13ed2e1 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2153,6 +2153,34 @@ describeOneTableDetails(const char *schemaname,
 		}
 	}
 
+	/* print table associated with given TOAST table */
+	if (tableinfo.relkind == RELKIND_TOASTVALUE)
+	{
+		PGresult   *result = NULL;
+		printfPQExpBuffer(&buf,
+		  "SELECT c.relnamespace::pg_catalog.regnamespace, c.relname FROM pg_catalog.pg_class c WHERE reltoastrelid = '%s'",
+		  oid);
+		result = PSQLexec(buf.data);
+		if (!result)
+		{
+			goto error_return;
+		}
+		else if (PQntuples(result) != 1)
+		{
+			PQclear(result);
+			goto error_return;
+		}
+		else
+		{
+			char	   *schemaname = PQgetvalue(result, 0, 0);
+			char	   *relname = PQgetvalue(result, 0, 1);
+			appendPQExpBuffer(&tmpbuf, _("For table: \"%s.%s\""),
+		  schemaname, relname);
+			printTableAddFooter(&cont, tmpbuf.data);
+			PQclear(result);
+		}
+	}
+
 	if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		/* Get the partition key information  */
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 9021c80..5c8e439 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -4748,3 +4748,13 @@ drop schema testpart;
 set search_path to default;
 set role to default;
 drop role regress_partitioning_role;
+-- slash dee on toast table:
+\d pg_toast.pg_toast_2619
+TOAST table "pg_toast.pg_toast_2619"
+   Column   |  Type   
++-
+ chunk_id   | oid
+ chunk_seq  | integer
+ chunk_data | bytea
+For table: "pg_catalog.pg_statistic"
+
diff --git a/src/test/regress/sql/psql.sql b/src/test/regress/sql/psql.sql
index cefe41b..b4a232d 100644
--- a/src/test/regress/sql/psql.sql
+++ b/src/test/regress/sql/psql.sql
@@ -1131,3 +1131,6 @@ set search_path to default;
 
 set role to default;
 drop role regress_partitioning_role;
+
+-- slash dee on toast table:
+\d pg_toast.pg_toast_2619
-- 
2.7.4

>From 38928b346dc2cc264bb2a7581f1214f14b1bb89a Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 3 May 2019 09:24:51 -0500
Subject: [PATCH v5] make \d pg_toast.foo show its indices

---
 src/bin/psql/describe.c| 1 +
 src/test/regress/expected/psql.out | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 13ed2e1..86a7610 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2303,6 +2303,7 @@ describeOneTableDetails(const char *schemaname,
 	else if (tableinfo.relkind == RELKIND_RELATION ||
 			 tableinfo.relkind == RELKIND_MATVIEW ||
 			 tableinfo.relkind == RELKIND_FOREIGN_TABLE ||
+			 tableinfo.relkind == RELKIND_TOASTVALUE ||
 			 tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		/* Footer information about a table */
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 5c8e439..d53dbb0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -4757,4 +4757,6 @@ TOAST table "pg_toast.pg_toast_2619"
  chunk_seq  | integer
  chunk_data | bytea
 For table: "pg_catalog.pg_statistic"
+Indexes:
+"pg_toast_2619_index" PRIMARY KEY, btree (chunk_id, chunk_seq)
 
-- 
2.7.4

>From ae748eda1460e6da37a6c5a1e1168a5c7c18639b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 19 Jun 2019 15:41:25 -0500
Subject: [PATCH v6] show childs of partitioned indices

---
 src/bin/psql/describe.c   | 57 ++-
 src/test/regress/input/tablespace.source  |  1 +
 src/test/regress/output/tablespace.source | 31 +
 3 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 86a7610..6d136ba 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3069,6 +3069,7 @@ describeOneTableDetails(const char *schemaname,
 	if (tableinfo.relkind == RELKIND_RELATION ||
 		tableinfo.relkind == RELKIND_MATVIEW ||
 		tableinfo.relkind == RELKIND_FOREIGN_TABLE ||
+		tableinfo.relkind == RELKIND_PARTITIONED_INDEX ||
 		tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PGresult   *result;
@@ -3120,6 +3121,7 @@ describeOneTableDetails(const char *schemaname,
 		  " FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i"
 		  " WHERE c.oid=i.inhparent AND i.inhrelid = '%s'"
 		  " AND c.relkind != " CppAsString2(RELKIND_PARTITIONED_TABLE)
+		  " AND c.relkind 

Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Amit Kapila
On Tue, Jul 16, 2019 at 4:43 PM Amit Kapila  wrote:
>
> On Tue, Jul 16, 2019 at 2:09 AM Robert Haas  wrote:
> >
> > This patch has some problems with naming consistency.  There's a
> > function called PushUndoRequest() which calls a function called
> > RegisterRollbackReq() to do the heart of the work.  So, is it undo or
> > rollback?  Are we pushing or registering?  Is it a request or a req?
> >
>
> I think we can rename PushUndoRequest as RegisterUndoRequest and
> RegisterRollbackReq as RegisterUndoRequestGuts.
>

One thing I am not sure about the above suggestion is whether it is a
good idea to expose a function which ends with 'Guts'.  I have checked
and found that there are a few similar precedents like
ExecuteTruncateGuts.  Another idea could be to rename
RegisterRollbackReq as RegisterUndoRequestInternal.  We have few
precedents for that as well.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: getting ERROR "relation 16401 has no triggers" with partition foreign key alter

2019-07-16 Thread Tom Lane
Rajkumar Raghuwanshi  writes:
> I am getting ERROR:  relation 16401 has no triggers error while executing
> below query.

Yeah, I can reproduce that back to v11.  If you try the same scenario
with a non-partitioned table you get

ERROR:  55006: cannot ALTER TABLE "tbl2" because it has pending trigger events
LOCATION:  CheckTableNotInUse, tablecmds.c:3436

but that test evidently fails to detect pending events for a partition
child table.

regards, tom lane




Re: refactoring - share str2*int64 functions

2019-07-16 Thread Tom Lane
Robert Haas  writes:
> On Jul 16, 2019, at 3:30 AM, Fabien COELHO  wrote:
>>> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
>>> names.

>> I added the pg_ prefix as a poor man's namespace because the function can be 
>> used by external tools (eg contribs), so as to avoid potential name 
>> conflicts.

> Yeah, I think if we are going to expose it to front end code there is a good 
> argument for some kind of prefix that makes it sound PostgreSQL-related.

Yeah, I'd tend to err in favor of including "pg_".  We might get away
without that as long as the name is never exposed to non-PG code, but
for stuff that's going into src/common/ or src/port/ I think that's
a risky assumption to make.

I'm also in agreement with Michael's comments in
<20190716071144.gf1...@paquier.xyz> that this would be a good time
to bring some consistency to the naming of related functions.

regards, tom lane




Re: [HACKERS] proposal: schema variables

2019-07-16 Thread Pavel Stehule
Hi

ne 30. 6. 2019 v 5:10 odesílatel Pavel Stehule 
napsal:

>
>
> pá 24. 5. 2019 v 19:12 odesílatel Pavel Stehule 
> napsal:
>
>> Hi
>>
>> čt 9. 5. 2019 v 6:34 odesílatel Pavel Stehule 
>> napsal:
>>
>>> Hi
>>>
>>> rebased patch
>>>
>>
>> rebase after pgindent
>>
>
> fresh rebase
>

just rebase again

Regards

Pavel



> Regards
>
> Pavel
>
>
>> Regards
>>
>> Pavel
>>
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>>
>>>


schema-variables-20190716.patch.gz
Description: application/gzip


Re: refactoring - share str2*int64 functions

2019-07-16 Thread Robert Haas
On Jul 16, 2019, at 3:30 AM, Fabien COELHO  wrote:

>> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
>> names.  It seems to be used for functions/macros that wrap or replace
>> something else with a similar name, like pg_pwrite(),
>> pg_attribute_noreturn(), ...  In this case it's just our own code that
>> we're moving, so I'm wondering if we should just call it scanint8().
> 
> I added the pg_ prefix as a poor man's namespace because the function can be 
> used by external tools (eg contribs), so as to avoid potential name conflicts.

Yeah, I think if we are going to expose it to front end code there is a good 
argument for some kind of prefix that makes it sound PostgreSQL-related.

...Robert




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-16 Thread Julien Rouhaud
On Fri, Jul 12, 2019 at 11:47 AM Julien Rouhaud  wrote:
>
> I didn't change the behavior wrt. possible deadlock if user specify
> catalog objects using --index or --table and ask for multiple
> connection, as I'm afraid that it'll add too much code for a little
> benefit.  Please shout if you think otherwise.

Sorry I meant schemas, not indexes.

After thinking more about schemas and multiple jobs, I think that
erroring out is quite user-unfriendly, as it's entirely ok to ask for
multiple indexes and multiple objects that do support parallelism in a
single call.  So I think it's better to remove the error, ignore the
given --jobs option for indexes, and document this behavior.




Re: Resume vacuum and autovacuum from interruption and cancellation

2019-07-16 Thread Masahiko Sawada
On Wed, Jun 12, 2019 at 1:30 PM Masahiko Sawada  wrote:
>
> Hi all,
>
> A long-running vacuum can sometimes be cancelled by the administrator,
> and autovacuums can be cancelled by concurrent processes. Even if it
> retries after cancellation, since it always restarts from the first
> block of the table, it can vacuum blocks again that were already
> vacuumed last time. We have the visibility map to skip scanning
> all-visible blocks, but when the table is large and often modified,
> we're more likely to reclaim garbage from blocks other than the ones
> we processed last time than by scanning again from the first block.
>
> So I'd like to propose making vacuum save its progress and resume
> vacuuming based on it. The mechanism I'm thinking of is simple: vacuum
> periodically reports the current block number to the stats collector.
> If the table has indexes, it reports after each heap vacuum pass,
> whereas it reports every certain number of blocks (e.g. 1024 blocks =
> 8MB) if there are no indexes. We can see that value in the new column
> vacuum_resume_block of pg_stat_all_tables. I'm going to add one VACUUM
> command option, RESUME, and one new reloption, vacuum_resume. If the
> option is true, vacuum fetches the block number from the stats
> collector before starting and starts vacuuming from that block. I
> wonder if we could make it true by default for autovacuums, but it
> must be false for an aggressive vacuum.
>
> If we start vacuuming from a block other than the first, we can update
> neither relfrozenxid nor relminmxid. And we might not be able to
> update even the relation statistics.
>

Attached is the first version of the patch. I have also registered this
item for the next commit fest.
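
For reference, here is a minimal usage sketch of the interface the
attached patch proposes (the table name is just a placeholder):

    -- opt the table in via the new reloption
    ALTER TABLE mytab SET (vacuum_resume = true);

    -- or request it explicitly for a single run
    VACUUM (VERBOSE, RESUME) mytab;

    -- the block to resume from, as reported to the stats collector
    SELECT relname, vacuum_resume_block
    FROM pg_stat_all_tables
    WHERE relname = 'mytab';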

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From 74c035d887c8f76b43414ce80559794a399ebaf7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 10 Jul 2019 19:02:56 +0900
Subject: [PATCH] Add RESUME option to VACUUM and autovacuum.

This commit adds a new reloption, vacuum_resume, which controls
whether vacuum attempts to resume vacuuming from the last vacuumed
block saved in the vacuum_resume_block column of pg_stat_all_tables.
It also adds a new option to the VACUUM command, RESUME, which can be
used to override the reloption.
---
 doc/src/sgml/monitoring.sgml   |  5 ++
 doc/src/sgml/ref/vacuum.sgml   | 18 +
 src/backend/access/common/reloptions.c | 13 +++-
 src/backend/access/heap/vacuumlazy.c   | 91 --
 src/backend/catalog/system_views.sql   |  1 +
 src/backend/commands/vacuum.c  | 13 
 src/backend/postmaster/pgstat.c| 42 
 src/backend/utils/adt/pgstatfuncs.c| 14 
 src/include/catalog/pg_proc.dat|  5 ++
 src/include/commands/vacuum.h  |  5 +-
 src/include/pgstat.h   | 14 
 src/include/utils/rel.h|  2 +
 src/test/regress/expected/rules.out|  3 +
 src/test/regress/expected/vacuum.out   | 20 ++
 src/test/regress/sql/vacuum.sql| 21 ++
 15 files changed, 258 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf72d0c303..fe68113b02 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2784,6 +2784,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
  bigint
  Estimated number of rows modified since this table was last analyzed
 
+
+ vacuum_resume_block
+ bigint
+ Block number to resume vacuuming from
+
 
  last_vacuum
  timestamp with time zone
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index f9b0fb8794..0b8733d555 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -34,6 +34,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean ]
 INDEX_CLEANUP [ boolean ]
 TRUNCATE [ boolean ]
+RESUME [ boolean ]
 
 and table_and_columns is:
 
@@ -223,6 +224,23 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ . This behavior is helpful
+  when to resume vacuuming from interruption and cancellation.The default
+  is false unless the vacuum_resume option has been
+  set to true. This option is ignored if either the FULL,
+  the FREEZE or DISABLE_PAGE_SKIPPING
+  option is used.
+ 
+
+   
+

 boolean
 
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 5773021499..6494c3bdfd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"vacuum_resume",
+			"Enables vacuum to resume from the last vacuumed block",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		false
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1412,7 +1421,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		{"vacuum_index_cleanup", 

Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Dilip Kumar
On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro  wrote:
>
>
> 1.  Renamed UndoPersistence to UndoLogCategory everywhere, and add a
> fourth category UNDO_SHARED where transactions can write 'out of band'
> data that relates to more than one transaction.
>
> 2.  Introduced a new RMGR callback rm_undo_status.  It is used to
> decide when record sets in the UNDO_SHARED category should be
> discarded (instead of the usual single xid-based rules).  The possible
> answers are "discard me now!", "ask me again when a given XID is all
> visible", and "ask me again when a given XID is no longer running".
>
> 3.  Recognise UNDO_SHARED record set boundaries differently.  Whereas
> undolog.c recognises transaction boundaries automatically for the
> other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for
> UNDO_SHARED the
>
> 4.  Add some quick-and-dirty throw-away test stuff to demonstrate
> that.  SELECT test_multixact([1234, 2345]) will create a new record
> set that will survive until the given array of transactions is no
> longer running, and then it'll be discarded.  You can see that with
> SELECT * FROM undoinspect('shared').  Or look at SELECT
> pg_stat_undo_logs.  This test simply writes all the xids into its
> payload, and then has an rm_undo_status function that returns the
> first xid it finds in the list that is still running, or if none are
> running returns UNDO_STATUS_DISCARD.
>
> Currently you can only return UNDO_STATUS_WAIT_XMIN so wait for an xid
> to be older than the oldest xmin; presumably it'd be useful to be able
> to discard as soon as an xid is no longer active, which could be a bit
> sooner.
>
> Another small change: several people commented that
> UndoLogIsDiscarded(ptr) ought to have some kind of fast path that
> doesn't acquire locks since it'll surely be hammered.  Here's an
> attempt at that that provides an inlined function that uses a
> per-backend recent_discard to avoid doing more work in the (hopefully)
> common case that you mostly encounter discarded undo pointers.  I hope
> this change will show up in profilers in some zheap workloads but this
> hasn't been tested yet.
>
> Another small change/review: the function UndoLogGetNextInsertPtr()
> previously took a transaction ID, but I'm not sure if that made sense,
> I need to think about it some more.
>
> I pulled the latest patches pulled in from the "undoprocessing" branch
> as of late last week, and most of the above is implemented as fixup
> commits on top of that.
>
> Next I'm working on DBA facilities for forcing undo records to be
> discarded (which consists mostly of sorting out the interlocking to
> make that work safely).  And also testing facilities for simulating
> undo log switching (when you fill up each log and move to another one,
> which are rare code paths run, so we need a good way to make them not
> rare).
>

In 0003-Add-undo-log-manager

/* If we discarded everything, the slot can be given up. */
+ if (entirely_discarded)
+ free_undo_log_slot(slot);

I have noticed that when the undo log was detached and it was full,
then if we discard the complete log we release its slot.  But what is
bothering me is whether we should add that log to the free list.  Or
am I missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: d25ea01275 and partitionwise join

2019-07-16 Thread Etsuro Fujita
On Tue, Jul 2, 2019 at 6:29 PM Amit Langote  wrote:
> 0001 - fix partitionwise join to work correctly with n-way joins of
> which some are full joins (+ cosmetic improvements around the code
> that was touched)

Here are my comments about the cosmetic improvements: they seem pretty
large to me, so I'd make a separate patch for that.  In addition, I'd
move have_partkey_equi_join() and match_expr_to_partition_keys() to
relnode.c, because these functions are only used in that file.

Best regards,
Etsuro Fujita




Re: Change ereport level for QueuePartitionConstraintValidation

2019-07-16 Thread Sergei Kornilov
Hello

Here are two patches with NOTICE ereport: one for partition operations and one
for "set not null" (for consistency).

regards, Sergei

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2a2c161695..4ec61d4833 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6329,7 +6329,7 @@ NotNullImpliedByRelConstraints(Relation rel, Form_pg_attribute attr)
 
 	if (ConstraintImpliedByRelConstraint(rel, list_make1(nnulltest), NIL))
 	{
-		ereport(DEBUG1,
+		ereport(NOTICE,
 (errmsg("existing constraints on column \"%s\".\"%s\" are sufficient to prove that it does not contain nulls",
 		RelationGetRelationName(rel), NameStr(attr->attname;
 		return true;
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 95d5bee029..fb8b3583f1 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1089,12 +1089,14 @@ alter table atacc1 drop constraint atacc1_constr_invalid;
 update atacc1 set test_a = 1;
 alter table atacc1 add constraint atacc1_constr_a_valid check(test_a is not null);
 alter table atacc1 alter test_a set not null;
+NOTICE:  existing constraints on column "atacc1"."test_a" are sufficient to prove that it does not contain nulls
 delete from atacc1;
 insert into atacc1 values (2, null);
 alter table atacc1 alter test_a drop not null;
 -- test multiple set not null at same time
 -- test_a checked by atacc1_constr_a_valid, test_b should fail by table scan
 alter table atacc1 alter test_a set not null, alter test_b set not null;
+NOTICE:  existing constraints on column "atacc1"."test_a" are sufficient to prove that it does not contain nulls
 ERROR:  column "test_b" contains null values
 -- commands order has no importance
 alter table atacc1 alter test_b set not null, alter test_a set not null;
@@ -1106,6 +1108,8 @@ alter table atacc1 alter test_a drop not null, alter test_b drop not null;
 -- both column has check constraints
 alter table atacc1 add constraint atacc1_constr_b_valid check(test_b is not null);
 alter table atacc1 alter test_b set not null, alter test_a set not null;
+NOTICE:  existing constraints on column "atacc1"."test_b" are sufficient to prove that it does not contain nulls
+NOTICE:  existing constraints on column "atacc1"."test_a" are sufficient to prove that it does not contain nulls
 drop table atacc1;
 -- test inheritance
 create table parent (a int);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2e792edbcf..2a2c161695 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15336,11 +15336,11 @@ QueuePartitionConstraintValidation(List **wqueue, Relation scanrel,
 	if (PartConstraintImpliedByRelConstraint(scanrel, partConstraint))
 	{
 		if (!validate_default)
-			ereport(INFO,
+			ereport(NOTICE,
 	(errmsg("partition constraint for table \"%s\" is implied by existing constraints",
 			RelationGetRelationName(scanrel;
 		else
-			ereport(INFO,
+			ereport(NOTICE,
 	(errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
 			RelationGetRelationName(scanrel;
 		return;
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 7d8907b2b4..34a325bce8 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1253,7 +1253,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
 	 */
 	if (PartConstraintImpliedByRelConstraint(default_rel, def_part_constraints))
 	{
-		ereport(INFO,
+		ereport(NOTICE,
 (errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
 		RelationGetRelationName(default_rel;
 		return;
@@ -1304,7 +1304,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
 			if (PartConstraintImpliedByRelConstraint(part_rel,
 	 def_part_constraints))
 			{
-ereport(INFO,
+ereport(NOTICE,
 		(errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
 RelationGetRelationName(part_rel;
 
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 3578b8009b..95d5bee029 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3633,11 +3633,11 @@ ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
 ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
 ALTER TABLE part_3_4 ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
-INFO:  partition constraint for table "part_3_4" is implied by existing constraints
+NOTICE:  partition constraint for table "part_3_4" is implied by existing constraints
 -- check if default partition scan skipped
 ALTER TABLE 

Re: POC: Cleaning up orphaned files using undo logs

2019-07-16 Thread Amit Kapila
On Tue, Jul 16, 2019 at 2:09 AM Robert Haas  wrote:
>
> On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro  wrote:
>
> Reviewing Amit's 0016:
>
> performUndoActions appears to be badly-designed.  For starters, it's
> sometimes wrong: the only place it gets set to true is in
> UndoActionsRequired (which is badly named, because from the name you
> expect it to return a Boolean and to not have side effects, but
> instead it doesn't return anything and does have side effects).
> UndoActionsRequired() only gets called from selected places, like
> AbortCurrentTransaction(), so the rest of the time it just returns a
> wrong answer.  Now maybe it's never called at those times, but there's
> no guard to prevent a function like CanPerformUndoActions() (which is
> also badly named, because performUndoActions tells you whether you
> need to perform undo actions, not whether it's possible to perform
> undo actions) from being called before the flag is set.  I think that
> this flag should be either (1) maintained eagerly - so that wherever
> we set start_urec_ptr we also set the flag right away or (2) removed -
> so when we need to know, we just loop over all of the undo categories
> on the spot, which is not that expensive because there aren't that
> many of them.
>

I would prefer to go with (2).  So, I will change the function
CanPerformUndoActions() to loop over the categories and return whether
there is a need to perform undo actions.  Also, I will rename
CanPerformUndoActions to NeedToPerformUndoActions or
UndoActionsRequired; any better suggestions?

> It seems pointless to make PrepareTransaction() take undo pointers as
> arguments, because those pointers are just extracted from the
> transaction state, to which PrepareTransaction() has a pointer.
>

Agreed, will remove.

> Thomas has already objected to another proposal to add functions that
> turn 32-bit XIDs into 64-bit XIDs.  Therefore, I feel confident in
> predicting that he will likewise object to GetEpochForXid. I think
> this needs to be changed somehow, maybe by doing what the XXX comment
> you added suggests.
>

We can do what the comment says, but there is one more similar usage
in undodiscard.c as well, so I'm not sure that is the right thing.  I
think Thomas is suggesting open-coding its usage where it is safe to
do so and required.  I have responded to his email; let us see what he
has to say, and based on that we can modify this patch.

> This patch has some problems with naming consistency.  There's a
> function called PushUndoRequest() which calls a function called
> RegisterRollbackReq() to do the heart of the work.  So, is it undo or
> rollback?  Are we pushing or registering?  Is it a request or a req?
>

I think we can rename PushUndoRequest as RegisterUndoRequest and
RegisterRollbackReq as RegisterUndoRequestGuts.

> For bonus points, the flag that the function sets is called
> undo_req_pushed, which is halfway in between the two competing
> terminologies.  Other gripes about PushUndoRequest: push is vague and
> doesn't really explain what's happening, "apllying" is a typo,
> per_level is a poor variable name and shouldn't be declared volatile.
> This function has problems with naming in other places, too; please go
> through all of the names carefully and make them consistent and
> adequately descriptive.
>

Okay, will change as per suggestion.

> I am not a fan of applying_subxact_undo.  I think we should look for a
> better design there.  A couple of things occur to me.  One is that we
> don't necessarily need to go to FATAL; we could just force the current
> transaction and all of its subtransactions fail all the way out to the
> top level, but then perhaps allow new transactions to be started
> afterwards.  I'm not sure that's worth it, but it would work, and I
> think it has precedent in SxactIsDoomed. Assuming we're going to stick
> with the current FATAL plan, I think we should do something like
> invent a new kind of critical section that forces ERROR to be promoted
> to FATAL and then use it here.  We could call it a semi-critical or
> locally-critical section, and the undo machinery could use it, but
> then also so could other things.  I've wanted that sort of concept
> before, so I think it's a good idea to try to have something general
> and independent of undo.  The same concept could be used in
> PerformUndoActions() instead of having to invent
> pg_rethrow_as_fatal(), so we'd have two uses for this mechanism right
> away.
>

Okay, I will investigate along the lines of the semi-critical section.

> FinishPreparedTransactions() tries to apply undo actions while
> interrupts are still held.  Is that necessary?
>

I don't think so.  I'll think some more and report back if I see any
problem; otherwise, I will do RESUME_INTERRUPTS before performing the
actions.

>  Can we avoid it?
>
> It seems highly likely that the logic added to the TBLOCK_SUBCOMMIT
> case inside CommitTransactionCommand and also into
> ReleaseCurrentSubTransaction should have been 

Re: SegFault on 9.6.14

2019-07-16 Thread Thomas Munro
On Tue, Jul 16, 2019 at 8:22 PM Tomas Vondra
 wrote:
> On Mon, Jul 15, 2019 at 08:20:00PM -0500, Jerry Sievers wrote:
> >We have a reproduceable case of $subject that issues a backtrace such as
> >seen below.

> >#0  initscan (scan=scan@entry=0x55d7a7daa0b0, key=0x0, 
> >keep_startblock=keep_startblock@entry=1 '\001')
> >at 
> > /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:233
> >#1  0x55d7a72fa8d0 in heap_rescan (scan=0x55d7a7daa0b0, 
> >key=key@entry=0x0) at 
> >/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:1529
> >#2  0x55d7a7451fef in ExecReScanSeqScan 
> >(node=node@entry=0x55d7a7d85100) at 
> >/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeSeqscan.c:280
> >#3  0x55d7a742d36e in ExecReScan (node=0x55d7a7d85100) at 
> >/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:158
> >#4  0x55d7a7445d38 in ExecReScanGather 
> >(node=node@entry=0x55d7a7d84d30) at 
> >/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeGather.c:475
> >#5  0x55d7a742d255 in ExecReScan (node=0x55d7a7d84d30) at 
> >/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:166

Here's a query that rescans a gather node repeatedly on 9.6 in case it
helps someone build a repro, but it works fine here.

-- 
Thomas Munro
https://enterprisedb.com


test.sql
Description: application/sql


Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-07-16 Thread Michael Paquier
Hi David,

On Wed, Jul 10, 2019 at 09:40:59PM +1200, David Rowley wrote:
> On Wed, 3 Jul 2019 at 19:35, Michael Paquier  wrote:
>> This has been reverted as of f5db56f, still it seems to me that this
>> was moving in the right direction.
> 
> I've pushed this again, this time with the cleanup code done in the
> right order. 

I have spent some time lately analyzing f7c830f as I was curious about
the logic behind it, and FWIW the result looks good.  Thanks!
--
Michael




Re: Comment fix of config_default.pl

2019-07-16 Thread Kyotaro Horiguchi
At Sat, 13 Jul 2019 16:53:45 +0900, Michael Paquier  wrote 
in <20190713075345.gc2...@paquier.xyz>
> On Fri, Jul 12, 2019 at 05:01:41PM +0900, Michael Paquier wrote:
> > I would also patch GetFakeConfigure in Solution.pm (no need to send a
> > new patch), and I thought that you'd actually do the change.  What do
> > you think?
> 
> OK, applied as I have been able to look at it again, and after fixing
> the portion for GetFakeConfigure.  Thanks! 

Thanks for committing this and for your additional change.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Check-out mutable functions in check constraints

2019-07-16 Thread Kyotaro Horiguchi
Hello, Thanks all!

At Sat, 13 Jul 2019 11:17:32 -0400, Tom Lane  wrote in 
<18372.1563031...@sss.pgh.pa.us>
> Tomas Vondra  writes:
> > On Fri, Jul 12, 2019 at 07:59:13PM -0400, Tom Lane wrote:
> >> I'm pretty sure this change has been proposed before, and rejected before.
> >> Has anybody excavated in the archives for prior discussions?
> 
> > Yes, I've done some quick searches like "volatile constraint" and so on.
> > There are a couple of relevant discussions:
> > 2004: 
> > https://www.postgresql.org/message-id/flat/0C3A1AEC-6BE4-11D8-9224-000A95C88220%40myrealbox.com
> > 2010: 
> > https://www.postgresql.org/message-id/flat/12849.1277918175%40sss.pgh.pa.us#736c8ef9d7810c0bb85f495490fd40f5
> > But I don't think the conclusions are particularly clear.
> > In the first thread you seem to agree with requiring immutable functions
> > for check constraints (and triggers for one-time checks). The second
> > thread ended up discussing some new related stuff in SQL standard.
> 
> Well, I think that second thread is very relevant here, because
> it correctly points out that we are *required by spec* to allow
> check constraints of the form CHECK(datecol <= CURRENT_DATE) and
> related tests.  See the stuff about "retrospectively deterministic"
> predicates in SQL:2003 or later.
> 
> I suppose you could imagine writing some messy logic that allowed the
> specific cases called out by the spec but not any other non-immutable
> function calls.  But that just leaves us with an inconsistent
> restriction.  If the spec is allowing this because it can be seen
> to be safe, why should we not allow other cases that the user has
> taken the trouble to prove to themselves are safe?  (If their proof is
> wrong, well, it wouldn't be the first bug in anyone's SQL application.)
> 
>   regards, tom lane

If we had a CURRENT_DATE() that always returned a UTC timestamp
(or something like that), then CURRENT_DATE()::text would give a local
representation. We could then have constraints using CURRENT_DATE(),
since it would be truly immutable. I think the spec can be interpreted
that way.
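
For reference, the spec-sanctioned form Tom mentions above looks like
this (a hypothetical table; PostgreSQL accepts it today, and a blanket
mutability check would reject it):

    CREATE TABLE events (
        d date,
        CHECK (d <= CURRENT_DATE)  -- "retrospectively deterministic" per SQL:2003
    );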

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Detailed questions about pg_xact_commit_timestamp

2019-07-16 Thread Adrien Nayrat
On 7/12/19 2:50 PM, Morris de Oryx wrote:
> Adrien, thanks very much for answering my question. Just a couple of follow-up
> points, if you don't mind.
> 
> In our answer, you offer an example of pg_xact_commit_timestamp showing
> out-of-sequence commit times:
> 
> Session     xid          pg_xact_commit_timestamp
> A           34386826     2019-07-11 09:32:38.994440+00  Started earlier,
> committed later
> B           34386827     2019-07-11 09:32:29.806183+00
> 
> I may not have asked my question clearly, or I may not understand the answer
> properly. Or both ;-) If I understand it correctly, an xid is assigned when a
> transaction starts.

It is a little bit more complicated :) When a transaction starts, a *virtual* xid
is assigned. It is only when the transaction changes the state of the database
that a permanent xid is assigned:
> Throughout running a transaction, a server process holds an exclusive lock on 
> the transaction's virtual transaction ID. If a permanent ID is assigned to 
> the transaction (which normally happens only if the transaction changes the 
> state of the database), it also holds an exclusive lock on the transaction's 
> permanent transaction ID until it ends.

https://www.postgresql.org/docs/current/view-pg-locks.html

(It shouldn't change anything for you)


> One transaction might take a second, another might take ten
> minutes. So, the xid sequence doesn't imply anything at all about commit
> sequence. What I'm trying to figure out is if it is possible for the commit
> timestamps to somehow be out of order. 

I am sorry but I don't understand what you mean by "commit timestamps to somehow
be out of order"?

> What I'm looking for is a way of finding
> changes committed since a specific moment. When the transaction started 
> doesn't
> matter in my case. 


Yes, the commit timestamp is the time when the transaction is committed:
postgres=# begin;
BEGIN
postgres=# select now();
 now
--
 2019-07-16 08:46:59.64712+00
(1 row)

postgres=# select txid_current();
 txid_current
--
 34386830
(1 row)

postgres=# commit;
COMMIT
postgres=# select pg_xact_commit_timestamp('34386830'::xid);
   pg_xact_commit_timestamp
---
 2019-07-16 08:47:30.238746+00
(1 row)


> 
> Is pg_xact_commit_timestamp suitable for this? I'm getting the impression that
> it isn't. But I don't understand quite how. And if it isn't suited to this
> purpose, does anyone know what pg_xact_commit_timestamp is for? What I'm after
> is something like a "xcommitserial" that increases reliably, and monotonically
> on transaction commit. That's how I'm hoping that pg_xact_commit_timestamp
> functions. 

I don't think so. pg_xact_commit_timestamp returns the timestamp. If you want
some kind of ordering you have to fetch all commit timestamps (with their
respective xid) and order them.
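
For example, applied to the rows of one table (my_table is a
placeholder, and track_commit_timestamp must be enabled):

    SELECT xmin, pg_xact_commit_timestamp(xmin) AS committed_at
    FROM my_table
    ORDER BY committed_at;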

You can also implement this tracking yourself with triggers that insert a
row containing the xid and a timestamp into a tracking table. You can add an
index on the timestamp column. With this approach you don't have to worry
about vacuum freeze removing old timestamps. As you add more writes, it could
be more expensive than track_commit_timestamp.
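
A minimal sketch of that trigger-based approach (all names here are
illustrative, and tracked_table stands for whatever table you want to
watch; note that now() is the transaction start time, not the exact
commit instant):

    CREATE TABLE commit_log (
        txid bigint NOT NULL,
        ts   timestamptz NOT NULL DEFAULT now()
    );
    CREATE INDEX ON commit_log (ts);

    CREATE FUNCTION track_commit() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- one row per statement; the row only becomes visible once the
        -- transaction actually commits
        INSERT INTO commit_log (txid) VALUES (txid_current());
        RETURN NULL;
    END $$;

    CREATE TRIGGER track_commit
        AFTER INSERT OR UPDATE OR DELETE ON tracked_table
        FOR EACH STATEMENT EXECUTE PROCEDURE track_commit();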

> 
> Thanks also for making me understand that pg_xact_commit_timestamp applies to 
> a
> *transaction*, not to each row. That makes it a lot lighter in the database. I
> was thinking 12 bytes+ per row, which is completely off for my case. (I tend 
> to
> insert thousands of rows in a transaction.)
> 
>> Yes timestamp are stored in pg_commit_ts directory. Old timestamp are removed
> after freeze has explained in
>> https://www.postgresql.org/docs/current/routine-vacuuming.html
> 
> Thanks for the answer, and for kindly pointing me to the right section of the
> documentation. It's easy to get impatient with new(er) users. I'm _not_ lazy
> about reading manuls and researching but, well, the Postgres documentation is
> over 3,000 pages long when you download it. So, I may have missed a detail or
> two If I read that correctly, the ~4 billion number range is made into an
> endless circle by keeping ~2 billions numbers in the past, and 2 billion in 
> the
> future. If that's right, I'm never going to be so out of data that the ~2
> billion number window is too small.
> 

Yes, it is a circular counter because xids are stored in 32 bits. However, you
have to keep in mind that vacuum freezes old visible rows (the default threshold
is 200 million transactions) and you then lose the timestamp information.

Sawada-san made a good presentation on freezing:
https://www.slideshare.net/masahikosawada98/introduction-vauum-freezing-xid-wraparound

You can also look at this website:
http://www.interdb.jp/pg/pgsql05.html#_5.1.
http://www.interdb.jp/pg/pgsql06.html#_6.3.

Regards,

-- 
Adrien





Re: Check-out mutable functions in check constraints

2019-07-16 Thread Kyotaro Horiguchi
Mmm.

# I eventually found messages sent to me stashed in an unexpected
# place. I felt I was in a void space for these days... That's
# silly!

Thank you for the comment.

# Putting aside the applicability of this check...

At Fri, 12 Jul 2019 13:14:57 +0200, Tomas Vondra  
wrote in <20190712111457.ekkcgx5mpkxl2ooh@development>
> On Fri, Jul 12, 2019 at 03:44:58PM +0900, Kyotaro Horiguchi wrote:
> >Hello.
> >
> >As mentioned in the following message:
> >
> >https://www.postgresql.org/message-id/20190712.150527.145133646.horikyota.ntt%40gmail.com
> >
> >Mutable function are allowed in check constraint expressions but
> >it is not right. The attached is a proposed fix for it including
> >regression test.
> >
> 
> I think the comment in parse_expr.c is wrong:
> 
>/*
> * All SQL value functions are stable so we reject them in check
> * constraint expressions.
> */
>if (pstate->p_expr_kind == EXPR_KIND_CHECK_CONSTRAINT)
>ereport(ERROR,
>(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> errmsg("mutable functions are not allowed in check
> constraints")));
> 
> At first it claims SQL value functions are stable, but then rejects
> them
> with a message that they're mutable.

Isn't stable mutable? By definition, stable functions can return
different values for the same input. But the message may be
somewhat confusing for unaccustomed users.

> Also, the other places use "cannot ..." messages:
> 
>case EXPR_KIND_COLUMN_DEFAULT:
>err = _("cannot use column reference in DEFAULT expression");
>break;
> 
> so maybe these new checks should use the same style.

It follows the style of messages like the following:

parse_func.c:2497
|case EXPR_KIND_CHECK_CONSTRAINT:
|case EXPR_KIND_DOMAIN_CHECK:
|  err = _("set-returning functions are not allowed in check constraints");

Should we unify them? "are not allowed in" is used in
parse_func.c and parse_agg.c, and "cannot use" is used in
parse_expr.c for the same instruction.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SegFault on 9.6.14

2019-07-16 Thread Tomas Vondra

On Mon, Jul 15, 2019 at 08:20:00PM -0500, Jerry Sievers wrote:

Tomas Vondra  writes:


On Mon, Jul 15, 2019 at 07:22:55PM -0500, Jerry Sievers wrote:


Tomas Vondra  writes:


On Mon, Jul 15, 2019 at 06:48:05PM -0500, Jerry Sievers wrote:


Greetings Hackers.

We have a reproduceable case of $subject that issues a backtrace such as
seen below.

The query that I'd prefer to sanitize before sending is <30 lines of at
a glance, not terribly complex logic.

It nonetheless dies hard after a few seconds of running and as expected,
results in an automatic all-backend restart.

Please advise on how to proceed.  Thanks!

bt
#0  initscan (scan=scan@entry=0x55d7a7daa0b0, key=0x0, 
keep_startblock=keep_startblock@entry=1 '\001')
   at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:233
#1  0x55d7a72fa8d0 in heap_rescan (scan=0x55d7a7daa0b0, key=key@entry=0x0) 
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:1529
#2  0x55d7a7451fef in ExecReScanSeqScan (node=node@entry=0x55d7a7d85100) at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeSeqscan.c:280
#3  0x55d7a742d36e in ExecReScan (node=0x55d7a7d85100) at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:158
#4  0x55d7a7445d38 in ExecReScanGather (node=node@entry=0x55d7a7d84d30) at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeGather.c:475
#5  0x55d7a742d255 in ExecReScan (node=0x55d7a7d84d30) at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:166
#6  0x55d7a7448673 in ExecReScanHashJoin (node=node@entry=0x55d7a7d84110) 
at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeHashjoin.c:1019
#7  0x55d7a742d29e in ExecReScan (node=node@entry=0x55d7a7d84110) at 
/build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:226




Hmmm, that means it's crashing here:

   if (scan->rs_parallel != NULL)
   scan->rs_nblocks = scan->rs_parallel->phs_nblocks; <--- here
   else
   scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

But clearly, scan is valid (otherwise it'd crash on the if condition),
and scan->rs_parallel must be non-NULL. Which probably means the pointer
is (no longer) valid.

Could it be that the rs_parallel DSM disappears on rescan, or something
like that?


No clue, but something I just tried was to disable parallelism by setting
max_parallel_workers_per_gather to 0, and while the query has not
finished after a few minutes, there is no crash.



That might be a hint my rough analysis was somewhat correct. The
question is whether the non-parallel plan does the same thing. Maybe it
picks a plan that does not require rescans, or something like that.


Please advise.



It would be useful to see (a) the execution plan of the query, (b) a full
backtrace and (c) a bit of context for the place where it crashed.

Something like (in gdb):

   bt full
   list
   p *scan


The p *scan did nothing unless I ran it first; however, my gdb $foo isn't
strong presently.


Hmm, the rs_parallel pointer looks sane (it's not obvious garbage). Can
you try this?

  p *scan->rs_parallel

Another question - are you sure this is not an OOM issue? That might
sometimes look like SIGSEGV due to overcommit. What's the memory
consumption / is there anything in dmesg?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: GiST VACUUM

2019-07-16 Thread Amit Kapila
On Fri, Jun 28, 2019 at 3:32 AM Thomas Munro  wrote:
>
> On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas  wrote:
> > * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> > function in varsup.c.
>
> I'm a bit uneasy about making this into a generally available function
> for reuse.  I think we should instead come up with a very small number
> of sources of fxids that known to be free of races because of some
> well explained interlocking.
>

I have two more cases in the undo patch series where the same function is
needed and it is safe to use it there.  The first place is twophase.c,
for rolling back prepared transactions, where we know that we don't
support aborted xacts that are older than 2 billion, so we can rely on
such a function.  We also need it in undodiscard.c to compute the
value of oldestFullXidHavingUnappliedUndo.  See the usage of
GetEpochForXid in the undo-processing patches.  Now, we might find a
way to avoid using it in one of these places by doing some more work,
but I'm not sure we can avoid it in all three places (one proposed by
this patch and two by the undo patchset).

> For example, I believe it is safe to convert an xid obtained from a
> WAL record during recovery, because (for now) recovery is a single
> thread of execution and the next fxid can't advance underneath you.
> So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
> to propose in another thread (though I've forgotten who wrote it,
> maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
> would have written) is a safe blessed source of fxids.  The rationale
> for writing this function instead of an earlier code that looked much
> like your proposed helper function, during EDB-internal review of
> zheap stuff, was this: if we provide a general purpose xid->fxid
> facility, it's virtually guaranteed that someone will eventually use
> it to shoot footwards, by acquiring an xid from somewhere, and then
> some arbitrary time later extending it to a fxid when no interlocking
> guarantees that the later conversion has the correct epoch.
>
> I'd like to figure out how to maintain full versions of the
> procarray.c-managed xid horizons, without widening the shared memory
> representation.  I was planning to think harder about for 13.
> Obviously that doesn't help you now.  So I'm wondering if you should
> just open-code this for now, and put in a comment about why it's safe
> and a note that there'll hopefully be a fxid horizon available in a
> later release?
>

Do you suggest open-coding this for all three places for now?  I am
not against open-coding the logic for now, but I'm not sure we can
eliminate the need for it, because even if we can do what you are
saying with the procarray.c-managed xid horizons, I think we still
need to do something about clog.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: getting ERROR "relation 16401 has no triggers" with partition foreign key alter

2019-07-16 Thread Michael Paquier
On Tue, Jul 16, 2019 at 01:07:45PM +0530, Rajkumar Raghuwanshi wrote:
> I am getting ERROR:  relation 16401 has no triggers error while executing
> below query.

Indeed, confirmed.  I can reproduce that down to v11, so that's not an
open item.  I have added an entry in the section for older issues
though.
--
Michael




Re: pgbench - add option to show actual builtin script code

2019-07-16 Thread Fabien COELHO




Committed, after pgindent.  Thanks Fabien and Ibrar.


Thanks for the commit.

--
Fabien.




Re: pgbench - extend initialization phase control

2019-07-16 Thread Fabien COELHO


Hello Ibrar,


The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

Other than that, the patch looks good to me.

The new status of this patch is: Ready for Committer


Thanks for the review.

Attached v2 is a rebase after ce8f9467.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 816f9cc4c7..bcdac60ade 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -167,7 +167,7 @@ pgbench  options  d
 init_steps specifies the
 initialization steps to be performed, using one character per step.
 Each step is invoked in the specified order.
-The default is dtgvp.
+The default is dt(g)vp.
 The available steps are:
 
 
@@ -193,12 +193,31 @@ pgbench  options  d
   
  
  
-  g (Generate data)
+  ( (begin transaction)
+  
+   
+Emit a BEGIN.
+   
+  
+ 
+ 
+  g or G (Generate data, client or server side)
   

 Generate data and load it into the standard tables,
 replacing any data already present.

+   
+When data is generated server side, there is no progress output.
+   
+  
+ 
+ 
+  ) (commit transaction)
+  
+   
+Emit a COMMIT.
+   
   
  
  
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 570cf3306a..a990eb6f21 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -130,7 +130,13 @@ static int	pthread_join(pthread_t th, void **thread_return);
 /
  * some configurable parameters */
 
-#define DEFAULT_INIT_STEPS "dtgvp"	/* default -I setting */
+/*
+ * we do all data generation in one transaction to enable the backend's
+ * data-loading optimizations
+ */
+#define DEFAULT_INIT_STEPS "dt(g)vp"	/* default -I setting */
+
+#define ALL_INIT_STEPS "dtgGvpf()"		/* all possible steps */
 
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
@@ -608,7 +614,7 @@ usage(void)
 		   "  %s [OPTION]... [DBNAME]\n"
 		   "\nInitialization options:\n"
 		   "  -i, --initialize invokes initialization mode\n"
-		   "  -I, --init-steps=[dtgvpf]+ (default \"dtgvp\")\n"
+		   "  -I, --init-steps=[" ALL_INIT_STEPS "]+ (default \"" DEFAULT_INIT_STEPS "\")\n"
 		   "   run selected initialization steps\n"
 		   "  -F, --fillfactor=NUM set fill factor\n"
 		   "  -n, --no-vacuum  do not run VACUUM during initialization\n"
@@ -3689,10 +3695,23 @@ initCreateTables(PGconn *con)
 }
 
 /*
- * Fill the standard tables with some data
+ * truncate away any old data, in one command in case there are foreign keys
  */
 static void
-initGenerateData(PGconn *con)
+initTruncateTables(PGconn *con)
+{
+	executeStatement(con, "truncate table "
+	 "pgbench_accounts, "
+	 "pgbench_branches, "
+	 "pgbench_history, "
+	 "pgbench_tellers");
+}
+
+/*
+ * Fill the standard tables with some data, from the client side
+ */
+static void
+initGenerateDataClientSide(PGconn *con)
 {
 	char		sql[256];
 	PGresult   *res;
@@ -3706,23 +3725,9 @@ initGenerateData(PGconn *con)
 remaining_sec;
 	int			log_interval = 1;
 
-	fprintf(stderr, "generating data...\n");
+	fprintf(stderr, "generating data client side...\n");
 
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, "begin");
-
-	/*
-	 * truncate away any old data, in one command in case there are foreign
-	 * keys
-	 */
-	executeStatement(con, "truncate table "
-	 "pgbench_accounts, "
-	 "pgbench_branches, "
-	 "pgbench_history, "
-	 "pgbench_tellers");
+	initTruncateTables(con);
 
 	/*
 	 * fill branches, tellers, accounts in that order in case foreign keys
@@ -3822,8 +3827,37 @@ initGenerateData(PGconn *con)
 		fprintf(stderr, "PQendcopy failed\n");
 		exit(1);
 	}
+}
 
-	executeStatement(con, "commit");
+/*
+ * Fill the standard tables with some data, from the server side
+ */
+static void
+initGenerateDataServerSide(PGconn *con)
+{
+	char		sql[256];
+
+	fprintf(stderr, "generating data server side...\n");
+
+	initTruncateTables(con);
+
+	snprintf(sql, sizeof(sql),
+			 "insert into pgbench_branches(bid,bbalance,filler) "
+			 "select bid, 0, '' "
+			 "from generate_series(1, %d) as bid", scale);
+	executeStatement(con, sql);
+
+	snprintf(sql, sizeof(sql),
+			 "insert into pgbench_tellers(tid,bid,tbalance,filler) "
+			 "select tid, (tid - 1) / %d + 1, 0, '' "
+			 "from generate_series(1, %d) as tid", 

Re: A little report on informal commit tag usage

2019-07-16 Thread Michael Paquier
On Mon, Jul 15, 2019 at 05:49:26PM +1200, Thomas Munro wrote:
> I would have tried to exclude the first line messages if I'd thought
> of that.  But anyway, the reason for the low Doc number is case
> sensitivity. I ran that on a Mac and its lame collation support failed
> me in the "sort" step (also -i didn't do what I wanted, but that
> wasn't the issue).  Trying again on FreeBSD box and explicitly setting
> LANG for the benefit of anyone else wanting to run this (see end), and
> then removing a few obvious false matches, I now get similar numbers
> in most fields but a higher "doc" number:
> 
>  767 Author
>9 Authors
>  144 Backpatch-through
>   55 Backpatch
>   14 Bug
>   14 Co-authored-by
>   27 Diagnosed-by
> 1599 Discussion
>  119 doc
>   36 docs
>  284 Reported-by
>5 Review
>8 Reviewed by
>  460 Reviewed-by
>7 Security
>9 Tested-by

Thanks for those numbers.  I am wondering if we could do a bit of
consolidation here and write a page about this stuff on the wiki.
Getting the "Discussion" field most of the time is really cool.

I think that we could get some improvements on the following things.
Here is a set of ideas:
- Avoid "Authors" and replace it with "Author" even if there are
multiple authors.
- Avoid having multiple entries for each one of them?  For example, we
have a couple of commits listing more than one "Reviewed-by" field,
each with a single name.
- Most commit entries do not use the email address with the name of
the author, reviewer, tester or reporter.  Perhaps we should give up
on that?
- Keep "Backpatch-through", not "Backpatch".
- Keep "Reviewed-by", not "Reviewed by" nor "Review".

"Security" is a special case, we append it to all the CVE-related
commits.

That is mainly a matter of taste, but I tend to prefer the following
format, usually preserving this order:
- Diagnosed-by
- Author
- Reviewed-by
- Discussion
- Backpatch-through
- I tend to have only one "Reviewed-by" entry with a list of names,
same for "Author" and "Reported-by".
- Only names, no emails.

As mentioned on different threads, "Discussion" is the only one we have
strong agreement on.  Could we consider extending that to things like
Author, Reported-by, Reviewed-by or Backpatch-through, for example?
The first three are useful for parsing the commit logs.  The fourth
one is handy because there is no need to look at a full log tree with
git log --graph or such, which is something I do from time to time to
figure out down to where a fix has been applied (I tend to avoid
git_changelog).
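
Put together, a trailer block following those preferences could look
like this (names and the message link are placeholders):

    Diagnosed-by: Alice Hacker
    Author: Bob Hacker, Carol Hacker
    Reviewed-by: Dave Reviewer
    Discussion: https://postgr.es/m/<message-id>
    Backpatch-through: 9.4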
--
Michael




getting ERROR "relation 16401 has no triggers" with partition foreign key alter

2019-07-16 Thread Rajkumar Raghuwanshi
Hi,

I am getting ERROR:  relation 16401 has no triggers error while executing
below query.

postgres=# create table tbl1(f1 int primary key);
CREATE TABLE
postgres=# create table tbl2(f1 int references tbl1 deferrable initially
deferred) partition by range(f1);
CREATE TABLE
postgres=# create table tbl2_p1 partition of tbl2 for values from
(minvalue) to (maxvalue);
CREATE TABLE
postgres=# insert into tbl1 values(1);
INSERT 0 1
postgres=# begin;
BEGIN
postgres=# insert into tbl2 values(1);
INSERT 0 1
postgres=# alter table tbl2 drop constraint tbl2_f1_fkey;
ALTER TABLE
postgres=# commit;
ERROR:  relation 16395 has no triggers

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

