Re: Have an encrypted pgpass file

2018-07-20 Thread Tom Lane
Isaac Morland  writes:
>>> It would also provide a *very* fertile source of shell-script-injection
>>> vulnerabilities.  (Whaddya mean, you tried to use a user name with a
>>> quote mark in it?)

> If I understand the proposal correctly, the pgpass program would run on the
> client, invoked by libpq when a password is needed for a connection. So the
> risk relates to strange things happening on the client when the client
> attempts to connect as a strangely-named user or to a strangely-named
> database or host, not to being able to break into the server.

Yeah.  The most obvious scenario for trouble is that somebody enters
a crafted user name on a website, and that results in bad things happening
on an application-server machine that tried to pass that user name to
a database server.  The DB server itself isn't compromised, but the app
server could be.

If we were putting this sort of feature into psql, it wouldn't be such
a risk, but if it's in libpq then I fear it is.  libpq underlies a lot
of client-side code.

regards, tom lane



Re: small development tip: Consider using the gold linker

2018-07-20 Thread Peter Geoghegan
On Mon, Jul 9, 2018 at 4:40 PM, Andres Freund  wrote:
> FWIW, I've lately noticed that I spend a fair time waiting for the
> linker during edit-compile-test cycles.  Due to an independent issue I
> just used the gold linker, and the speedup is quite noticeable.
>
> Just relinking the backend, without rebuilding anything else, goes from
> 0m3.975s to 0m1.585s. Given ccache commonly prevents actually having to
> recompile files, that works out to a noticeable benefit.
>
> For me just adding '-fuse-ld=gold' to CFLAGS works.

I tried this out today. It makes quite a noticeable difference for me.
Thanks for the tip.

> Unfortunately I get some spurious warnings, but I hope that's just a
> debian unstable issue: /usr/bin/ld.gold: warning: discarding version
> information for __cxa_finalize@GLIBC_2.2.5, defined in unused shared
> library /lib/x86_64-linux-gnu/libc.so.6 (linked with --as-needed)

I didn't have this problem.

-- 
Peter Geoghegan



Re: Have an encrypted pgpass file

2018-07-20 Thread Isaac Morland
On 20 July 2018 at 17:22, Tels  wrote:

> Moin,
>
> > It would also provide a *very* fertile source of shell-script-injection
> > vulnerabilities.  (Whaddya mean, you tried to use a user name with a
> > quote mark in it?)
>
> Little Bobby Tables, we call him. :)
>
> I'm also concerned that that would let anybody who could alter the
> environment then let arbitrary code be run as user postgres. Is this
> something that poses a risk in addition to the current situation?
>

If I understand the proposal correctly, the pgpass program would run on the
client, invoked by libpq when a password is needed for a connection. So the
risk relates to strange things happening on the client when the client
attempts to connect as a strangely-named user or to a strangely-named
database or host, not to being able to break into the server.
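
(For illustration, a minimal standalone C sketch of the injection risk being
discussed. The "getpass" helper name is invented; the point is only the
contrast between splicing a connection parameter into a shell command line
and passing it as a separate argv element.)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Attacker-chosen user name containing shell metacharacters. */
    const char *user = "alice'; rm -rf $HOME; echo '";

    /* UNSAFE pattern: splice the name into a shell command line.  The
     * embedded quotes end the quoting, and the rest runs as commands.
     * (Only printed here; never pass such a string to system().) */
    char cmd[1024];
    snprintf(cmd, sizeof(cmd), "getpass --user '%s'", user);
    printf("the shell would see: %s\n", cmd);

    /* SAFER pattern: skip the shell and pass the value as one argv
     * element, so no shell quoting rules apply at all. */
    if (fork() == 0)
    {
        execlp("getpass", "getpass", "--user", user, (char *) NULL);
        _exit(1);               /* exec failed (no such program) */
    }
    return 0;
}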


Re: patch to allow disable of WAL recycling

2018-07-20 Thread Jerry Jelinek
Thomas,

Thanks for your offer to run some tests on different OSes and filesystems
that you have. Anything you can provide here would be much appreciated. I
don't have anything other than our native SmartOS/ZFS based configurations,
but I might be able to setup some VMs and get results that way. I should be
able to setup a VM running FreeBSD. If you have a chance to collect some
data, just let me know the exact benchmarks you ran and I'll run the same
things on the FreeBSD VM. Obviously you're under no obligation to do any of
this, so if you don't have time, just let me know and I'll see what I can
do on my own.

Thanks again,
Jerry


On Tue, Jul 17, 2018 at 2:47 PM, Tomas Vondra wrote:

> On 07/17/2018 09:12 PM, Peter Eisentraut wrote:
> > On 17.07.18 00:04, Jerry Jelinek wrote:
> >> There have been quite a few comments since last week, so at this point I
> >> am uncertain how to proceed with this change. I don't think I saw
> >> anything concrete in the recent emails that I can act upon.
> >
> > The outcome of this could be multiple orthogonal patches that affect the
> > WAL file allocation behavior somehow.  I think your original idea of
> > skipping recycling on a COW file system is sound.  But I would rather
> > frame the option as "preallocating files is obviously useless on a COW
> > file system" rather than "this will make things mysteriously faster with
> > uncertain trade-offs".
> >
>
> Makes sense, I guess. But I think many claims made in this thread are
> mostly just assumptions at this point, based on our beliefs how CoW or
> non-CoW filesystems work. The results from ZFS (showing positive impact)
> are an exception, but that's about it. I'm sure those claims are based
> on real-world experience and are likely true, but it'd be good to have
> data from a wider range of filesystems / configurations etc. so that we
> can give better recommendations to users, for example.
>
> That's something I can help with, assuming we agree on what tests we
> want to do. I'd say the usual battery of write-only pgbench tests with
> different scales (fits into s_b, fits into RAM, larger than RAM) on
> common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and
> different types of storage would be enough. I don't have any freebsd box
> available, unfortunately.
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack

2018-07-20 Thread Marko Tiikkaja
On Fri, Jul 20, 2018 at 2:17 AM, Jeremy Schneider wrote:

> I'd like to bump this old bug that Lloyd filed for more discussion. It
> seems serious enough to me that we should at least talk about it.
>
> Anyone with simply the login privilege and the ability to run SQL can
> instantly block all new incoming connections to a DB including new
> superuser connections.
>

So..  don't VACUUM FULL pg_authid without lock_timeout?

I can come up with dozens of ways to achieve the same effect, all of them
silly.


.m


Re: patch to allow disable of WAL recycling

2018-07-20 Thread Jerry Jelinek
Peter,

Thanks for your feedback. I'm happy to change the name of the tunable or to
update the man page in any way.  I have already posted an updated patch
with changes to the man page which I think may address your concerns there,
but please let me know if that still needs more work. It looks like Kyotaro
already did some exploration, and tuning the min/max for the WAL size won't
solve this problem.  Just let me know if there is anything else here which
you think I should look into.

Thanks again,
Jerry


On Tue, Jul 17, 2018 at 1:12 PM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:

> On 17.07.18 00:04, Jerry Jelinek wrote:
> > There have been quite a few comments since last week, so at this point I
> > am uncertain how to proceed with this change. I don't think I saw
> > anything concrete in the recent emails that I can act upon.
>
> The outcome of this could be multiple orthogonal patches that affect the
> WAL file allocation behavior somehow.  I think your original idea of
> skipping recycling on a COW file system is sound.  But I would rather
> frame the option as "preallocating files is obviously useless on a COW
> file system" rather than "this will make things mysteriously faster with
> uncertain trade-offs".
>
> The actual implementation could use another round of consideration.  I
> wonder how this should interact with min_wal_size.  Wouldn't
> min_wal_size = 0 already do what we need (if you could set it to 0,
> which is currently not possible)?  Should the new setting be something
> like min_wal_size = -1?  Or even if it's a new setting, it might be
> better to act on it in XLOGfileslop(), so these things are kept closer
> together.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
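
(To make the trade-off concrete, a standalone toy model of the
recycle-vs-delete decision under discussion; wal_recycle and every other
name here is illustrative, not the actual patch or PostgreSQL's code.)

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long XLogSegNo;   /* stand-in for the real typedef */

static bool wal_recycle = true;    /* proposed GUC: off on COW filesystems */

static void
remove_old_segment(const char *path, XLogSegNo segno, XLogSegNo recycle_upto)
{
    if (wal_recycle && segno <= recycle_upto)
        printf("rename %s for reuse as a future segment\n", path);
    else
        printf("unlink %s (recycling disabled or enough spares)\n", path);
}

int main(void)
{
    wal_recycle = false;           /* e.g. running on ZFS */
    remove_old_segment("pg_wal/000000010000000000000001", 1, 10);
    return 0;
}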


Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-20 Thread Ashwin Agrawal
On Fri, Jul 20, 2018 at 1:54 AM Heikki Linnakangas  wrote:

> On 19/07/18 23:13, Andres Freund wrote:
> > On 2018-07-19 14:54:52 -0500, Nico Williams wrote:
> >> On Thu, Jul 19, 2018 at 12:20:53PM -0700, Andres Freund wrote:
> >>> On 2018-07-19 11:57:25 +0300, Heikki Linnakangas wrote:
>  Ugh. Yeah, in wal_quickdie, and other aux process *_quickdie()
> handlers, I
>  agree we should just _exit(2). All we want to do is to exit the
> process
>  immediately.
> >>>
> >>> Indeed.
> >>>
>  bgworker_quickdie() makes some effort to block SIGQUIT during the
> exit()
>  processing, but that doesn't solve the whole problem. The process
> could've
>  been in the middle of a malloc/free when the signal arrived, for
> example.
>  exit() is simply not safe to call from a signal handler.
> >>>
> >>> Yea. I really don't understand why we have't been able to agree on this
> >>> for *years* now.
> >>
> >> I mean, only calling async-signal-safe functions from signal handlers is
> >> a critical POSIX requirement, and exit(3) is NOT async-signal-safe.
> >
> > There's a few standard requirements that aren't particularly relevant in
> > practice (including some functions not being listed as signal
> > safe). Just arguing from that isn't really helpful. But there's plenty
> > hard evidence, including a few messages upthread!, of it being
> > practically problematic. Just looking at the reasoning at why exit (and
> > malloc) aren't signal safe however, makes it clear that this particular
> > restriction should be adhered to, however.
>
> ISTM that no-one has any great ideas on what to do about the ereport()
> in quickdie(). But I think we have consensus on replacing the exit(2)
> calls with _exit(2). If we do just that, it would be better than the
> status quo, even if it doesn't completely fix the problem. This would
> prevent the case that Asim reported, for starters.
>
> Any objections to the attached?
>
> With _exit(), I think we wouldn't really need the
> "PG_SETMASK();" calls in the aux process *_quickdie handlers
> that don't do anything else than call _exit(). But I didn't dare to
> remove them yet.
>

Seems there is an opportunity to actually consolidate all those quickdie
handlers as well, as I see no difference between the auxiliary process
quickdie implementations. Only the backend `quickdie()` has special code
for ereport and the like.
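
(A minimal standalone illustration of the pattern being agreed on: a SIGQUIT
handler that only does async-signal-safe work and terminates with _exit().
This is a toy, not PostgreSQL's actual quickdie code.)

#include <signal.h>
#include <string.h>
#include <unistd.h>

static void
quickdie_handler(int signo)
{
    (void) signo;
    /* Only async-signal-safe calls here: no ereport(), no exit(), which
     * could run atexit callbacks, flush stdio, or re-enter malloc/free
     * that the interrupted code may be in the middle of. */
    _exit(2);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = quickdie_handler;
    sigaction(SIGQUIT, &sa, NULL);
    for (;;)
        pause();                /* kill -QUIT <pid> terminates immediately */
}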


Re: patch to allow disable of WAL recycling

2018-07-20 Thread Jerry Jelinek
Hi Robert,

I'm new to the Postgresql community, so I'm not familiar with how patches
are accepted here. Thanks for your detailed explanation. I do want to keep
pushing on this. I'll respond separately to Peter and to Tomas regarding
their emails.

Thanks again,
Jerry


On Wed, Jul 18, 2018 at 1:43 PM, Robert Haas  wrote:

> On Wed, Jul 18, 2018 at 3:22 PM, Jerry Jelinek 
> wrote:
> > I've gotten a wide variety of feedback on the proposed patch. The
> comments
> > range from rough approval through various discussion about alternative
> > solutions. At this point I am unsure if this patch is rejected or if it
> > would be accepted once I had the updated man page changes that were
> > discussed last week.
> >
> > I have attached an updated patch which does incorporate man page
> changes, in
> > case that is the blocker. However, if this patch is simply rejected, I'd
> > appreciate it if I could get a definitive statement to that effect.
>
> 1. There's no such thing as a definitive statement of the community's
> opinion, generally speaking, because as a rule the community consists
> of many different people who rarely all agree on anything but the most
> uncontroversial of topics.  We could probably all agree that the sun
> rises in the East, or at least has historically done so, and that,
> say, typos are bad.
>
> 2. You can't really expect somebody else to do the work of forging
> consensus on your behalf.  Sure, that may happen, if somebody else
> takes an interest in the problem.  But, really, since you started the
> thread, most likely you're the one most interested.  If you're not
> willing to take the time to discuss the issues with the individual
> people who have responded, promote your own views, investigate
> proposed alternatives, etc., it's unlikely anybody else is going to do
> it.
>
> 3. It's not unusual for a patch of this complexity to take months to
> get committed; it's only been a few weeks.  If it's important to you,
> don't give up now.
>
> It seems to me that there are several people in favor of this patch,
> some others with questions and concerns, and pretty much nobody
> adamantly opposed.  So I would guess that this has pretty good odds in
> the long run.  But you're not going to get anywhere by pushing for a
> commit-or-reject-right-now.  It's been less than 24 hours since Tomas
> proposed to do further benchmarking if we could agree on what to test
> (you haven't made any suggestions in response) and it's also been less
> than 24 hours since Peter and I both sent emails about whether it
> should be controlled by its own GUC or in some other way.  The
> discussion is very much actively continuing.  It's too soon to decide
> on the conclusion, but it would be a good idea for you to keep
> participating.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Two pass CheckDeadlock in contentent case

2018-07-20 Thread Yura Sokolov
11.07.2018 17:01, Ashutosh Bapat writes:
> The patch still applies and it's part of this commitfest.
> 
> On Tue, Oct 3, 2017 at 8:36 PM, Sokolov Yura  wrote:
>> On 2017-10-03 17:30, Sokolov Yura wrote:
>>>
>>> Good day, hackers.
>>>
>>> During hard workload sometimes process reaches deadlock timeout
>>> even if no real deadlock occurred. It is easily reproducible with
>>> pg_xact_advisory_lock on same value + some time consuming
>>> operation (update) and many clients.
>>>
>>> When backend reaches deadlock timeout, it calls CheckDeadlock,
>>> which locks all partitions of lock hash in exclusive mode to
>>> walk through graph and search for deadlock.
>>>
>>> If hundreds of backends reaches this timeout trying to acquire
>>> advisory lock on a same value, it leads to hard-stuck for many
>>> seconds, cause they all traverse same huge lock graph under
>>> exclusive lock.
>>> During this stuck there is no possibility to do any meaningful
>>> operations (no new transaction can begin).
>>>
>>> Attached patch makes CheckDeadlock to do two passes:
>>> - first pass uses LW_SHARED on partitions of lock hash.
>>>   DeadLockCheck is called with flag "readonly", so it doesn't
>>>   modify anything.
>>> - If there is possibility of "soft" or "hard" deadlock detected,
>>>   ie if there is need to modify lock graph, then partitions
>>>   relocked with LW_EXCLUSIVE, and DeadLockCheck is called again.
>>>
>>> It fixes hard-stuck, cause backends walk lock graph under shared
>>> lock, and found that there is no real deadlock.
>>>
>>> Test on 4 socket xeon machine:
>>> pgbench_zipf -s 10 -c 800  -j 100 -M prepared -T 450 -f
>>> ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5
>>>
>>> ycsb_read_zipf.sql:
>>> \set i random_zipfian(1, 10 * :scale, 1.01)
>>> SELECT abalance FROM pgbench_accounts WHERE aid = :i;
>>> ycsb_update_lock2_zipf.sql:
>>> \set i random_zipfian(1, 10 * :scale, 1.01)
>>> select lock_and_update( :i );
>>>
>>> CREATE OR REPLACE FUNCTION public.lock_and_update(i integer)
>>>  RETURNS void
>>>  LANGUAGE sql
>>> AS $function$
>>> SELECT pg_advisory_xact_lock( $1 );
>>> UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1;
>>> $function$
>>>
>>> Without attached patch:
>>>
>>> progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757
>>> progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353
>>> progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017
>>> progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199
>>> progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570
>>> progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697
>>> progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116
>>> progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166
>>> progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625
>>> progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453
>>> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741
>>>
>>> (autovacuum led to trigger deadlock timeout,
>>>  and therefore, to stuck)
>>>
>>> Patched:
>>>
>>> progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980
>>> progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997
>>> progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222
>>> progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296
>>> progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720
>>> progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554
>>> progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787
>>> progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465
>>> progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114
>>> progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281
>>> progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487
>>> progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141
>>> progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080
>>> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482
> 
> I am confused about interpreting these numbers. Does it mean that
> without the patch at the end of 70s, we have completed 40553.2 * 70
> transactions and with the patch there are 70 * 38387.9 transactions
> completed in 70 seconds?

It is regular pgbench output, so there should be no confusion:
the tps numbers are the transaction rates over each particular
5-second interval. That is why there are 0 tps and 1 tps intervals
without the patch. How could there be 0 tps and 1 tps intervals if
tps were an average since the start?

It means that without the patch there were
46k+51k+52k+47k+42k+12k+0k+0k+0k+0k+27k+39k+41k = 357 thousand
transactions completed.

And with the patch there were
56k+55k+51k+51k+48k+45k+40k+14k+30k+39k+41k+38k+40k+38k = 586 thousand
transactions completed.

> I agree that with the patch there is smoother degradation when a lot
> of backends 
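
(A standalone sketch of the two-pass scheme described above; the lock and
check functions are stand-ins for LWLock partition locking and
DeadLockCheck(), not PostgreSQL's real API.)

#include <stdbool.h>
#include <stdio.h>

typedef enum { LW_SHARED, LW_EXCLUSIVE } LWLockMode;

static void lock_all_partitions(LWLockMode mode) { (void) mode; }
static void unlock_all_partitions(void) { }

/* readonly=true: traverse the wait-for graph only; false: may edit it. */
static bool
deadlock_check(bool readonly)
{
    (void) readonly;
    return false;               /* toy: pretend no deadlock was found */
}

static bool
check_deadlock_two_pass(void)
{
    /* Pass 1: shared locks, read-only traversal.  Many backends whose
     * deadlock_timeout fires at once can run this concurrently instead
     * of serializing on exclusive partition locks. */
    lock_all_partitions(LW_SHARED);
    bool suspect = deadlock_check(true);
    unlock_all_partitions();

    if (!suspect)
        return false;           /* the common, formerly-stalling case */

    /* Pass 2: a soft/hard deadlock may require rearranging wait queues,
     * so retake the partitions exclusively and re-run the check (the
     * graph can have changed in between). */
    lock_all_partitions(LW_EXCLUSIVE);
    bool found = deadlock_check(false);
    unlock_all_partitions();
    return found;
}

int main(void)
{
    printf("deadlock found: %d\n", check_deadlock_two_pass());
    return 0;
}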

Re: Have an encrypted pgpass file

2018-07-20 Thread Tels
Moin,

On Wed, July 18, 2018 7:25 pm, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Seems to me that passing %-specifiers to the command would make it more
>> useful (%u for "user", "host" etc) -- your command could refuse to give
>> you a password for the superuser account for instance but grant one for
>> a read-only user.
>
> It would also provide a *very* fertile source of shell-script-injection
> vulnerabilities.  (Whaddya mean, you tried to use a user name with a
> quote mark in it?)

Little Bobby Tables, we call him. :)

I'm also concerned that that would let anybody who could alter the
environment then let arbitrary code be run as user postgres. Is this
something that poses a risk in addition to the current situation?

Best regards,

Tels



Re: Non-portable shell code in pg_upgrade tap tests

2018-07-20 Thread Tels
Moin,

On Fri, July 20, 2018 10:55 am, Victor Wagner wrote:
> On Fri, 20 Jul 2018 10:25:47 -0400
> Tom Lane  wrote:
>
>> Victor Wagner  writes:
>> > I've discovered that in the branch REL_11_STABLE there is shell
>> > script src/bin/pg_upgrade/test.sh which doesn't work under Solaris
>> > 10. (it uses $(command) syntax with is not compatible with original
>> > Solaris /bin/sh)
>
>>
>> Please send a patch.  Most of us do not have access to old shells
>
> Here it goes. Previous letter was written before fixed tests were
> completed, because this old machine is slow.


+   *)  if [ `find ${PGDATA} -type f ! -perm 640 | wc -l` -ne 0 ]; then

Shouldn't ${PGDATA} in the above as argument to find be quoted, otherwise
the shell would get confused if it contains spaces or other special
characters?

Regards,

Tels




Re: pread() and pwrite()

2018-07-20 Thread Daniel Gustafsson
> On 20 Jul 2018, at 17:34, Tom Lane  wrote:
> 
> Heikki Linnakangas  writes:
>> No objections, if you want to make the effort. But IMHO the lseek+read 
>> fallback is good enough on Windows. Unless you were thinking that we 
>> could then remove the !HAVE_PREAD fallback altogether. Are there any 
>> other platforms out there that don't have pread/pwrite that we care about?
> 
> AFAICT, macOS has them as far back as we care about (prairiedog does).
> HPUX 10.20 (gaur/pademelon) does not, so personally I'd like to keep
> the lseek+read workaround.  Don't know about the oldest Solaris critters
> we have in the buildfarm.  FreeBSD has had 'em at least since 4.0 (1994);
> didn't check the other BSDen.

The OpenBSD box I have access to has pwrite/pread, and has had them for some
time, if I read the manpage right.

cheers ./daniel
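
(For reference, a minimal sketch of the lseek+read fallback being discussed
for platforms without pread(); my_pread is an invented name, not the actual
src/port implementation.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t
my_pread(int fd, void *buf, size_t nbyte, off_t offset)
{
    /* Unlike real pread(), this moves the file position, so it is not
     * safe against concurrent users of the same file descriptor. */
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, nbyte);
}

int main(void)
{
    char buf[16];
    int  fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
    if (fd >= 0 && my_pread(fd, buf, sizeof(buf), 0) > 0)
        printf("read ok\n");
    return 0;
}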



Re: Possible performance regression in version 10.1 with pgbench read-write tests.

2018-07-20 Thread Andres Freund
Hi,

On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
> >> In any case, I strongly resist making performance-based changes on
> >> the basis of one test on one kernel and one hardware platform.
> 
> > Sure, it'd be good to do more of that. But from a theoretical POV it's
> > quite logical that posix semas sharing cachelines is bad for
> > performance, if there's any contention. When backed by futexes -
> > i.e. all non ancient linux machines - the hot path just does a cmpxchg
> > of the *userspace* data (I've copied the relevant code below).
> 
> Here's the thing: the hot path is of little or no interest, because
> if we are in the sema code at all, we are expecting to block.

Note that we're also using semas for ProcArrayGroupClearXid(), which is
pretty commonly hot for pgbench style workloads, and where the expected
wait times are very short.


> It's possible that the bigger picture here is that the kernel boys
> optimized for the "uncontended" path to the point where they broke
> performance of the blocking path.  It's hard to see how they could
> have broke it to the point of being slower than the SysV sema API,
> though.

I don't see how this is a likely proposition, given that adding padding
to the *userspace* portion of futexes increased the performance quite
significantly.


> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
> only got 8 cores) but other parameters matching Mithun's example,
> I just got

It's *really* common to have more actual clients than CPUs for OLTP
workloads, so I don't think it's insane to test with more clients.

Greetings,

Andres Freund



Re: Possible performance regression in version 10.1 with pgbench read-write tests.

2018-07-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
>> In any case, I strongly resist making performance-based changes on
>> the basis of one test on one kernel and one hardware platform.

> Sure, it'd be good to do more of that. But from a theoretical POV it's
> quite logical that posix semas sharing cachelines is bad for
> performance, if there's any contention. When backed by futexes -
> i.e. all non ancient linux machines - the hot path just does a cmpxchg
> of the *userspace* data (I've copied the relevant code below).

Here's the thing: the hot path is of little or no interest, because
if we are in the sema code at all, we are expecting to block.  The
only case where we wouldn't block is if the lock manager decided the
current process needs to sleep, but some other process already released
us by the time we reach the futex/kernel call.  Certainly that will happen
some of the time, but it's not likely to be the way to bet.  So I'm very
dubious of any arguments based on the speed of the "uncontended" path.

It's possible that the bigger picture here is that the kernel boys
optimized for the "uncontended" path to the point where they broke
performance of the blocking path.  It's hard to see how they could
have broke it to the point of being slower than the SysV sema API,
though.

Anyway, I think we need to test first and patch second.  I'm working
on getting some numbers on my own machines now.

On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
only got 8 cores) but other parameters matching Mithun's example,
I just got

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 8
number of threads: 8
duration: 1800 s
number of transactions actually processed: 29001016
latency average = 0.497 ms
tps = 16111.575661 (including connections establishing)
tps = 16111.623329 (excluding connections establishing)

which is interesting because vmstat was pretty consistently reporting
around 500000 context swaps/second during the run, or circa 30
cs/transaction.  We'd have a minimum of 14 cs/transaction just between
client and server (due to seven SQL commands per transaction in TPC-B)
so that seems on the low side; not a lot of lock contention here it
seems.  I wonder what the corresponding ratio was in Mithun's runs.

regards, tom lane



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-07-20 Thread Andres Freund
On 2018-07-20 16:15:14 -0400, Tom Lane wrote:
> We've seen several occurrences of $subject in the buildfarm in the past
> month or so.  Scraping the logs, I find
> 
>  coypu| 2018-06-14 21:17:49 | HEAD  | Check | 2018-06-14 
> 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
>  coypu| 2018-06-14 21:17:49 | HEAD  | Check | 2018-06-14 
> 23:31:44.653 CEST [5b22deb8.6d4d:132] ERROR:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
>  gull | 2018-06-21 04:02:11 | HEAD  | Check | 2018-06-20 
> 21:27:06.843 PDT [31138:3] FATAL:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
>  mandrill | 2018-06-22 16:14:16 | HEAD  | Check | 2018-06-22 
> 16:46:24.138 UTC [14353240:43] pg_regress/create_table_like ERROR:  could not 
> read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  mandrill | 2018-06-22 16:14:16 | HEAD  | Check | 2018-06-22 
> 16:46:24.137 UTC [20710034:1] ERROR:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
>  lapwing  | 2018-07-05 07:20:02 | HEAD  | Check | 2018-07-05 
> 07:21:45.585 UTC [24022:1] ERROR:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
>  lapwing  | 2018-07-05 07:20:02 | HEAD  | Check | 2018-07-05 
> 07:21:45.591 UTC [23941:39] pg_regress/roleattributes ERROR:  could not read 
> block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 
> 21:21:32.557 EDT [26072:1] ERROR:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
>  mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 
> 21:21:32.557 EDT [25990:82] ERROR:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
>  coypu| 2018-07-17 01:27:39 | HEAD  | Check | 2018-07-17 
> 03:47:26.264 CEST [5b4d4aa4.3ac4:71] ERROR:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
>  coypu| 2018-07-20 11:18:02 | HEAD  | Check | 2018-07-20 
> 13:32:14.104 CEST [5b51c833.4884:131] ERROR:  could not read block 3 in file 
> "base/16384/2662": read only 0 of 8192 bytes
> 
> While I didn't go back indefinitely far, there are no other occurrences of
> "could not read block" failures in the last three months.  This suggests
> strongly that we broke something in early June.  Don't know what though.
> Ideas?
> 
> In case anyone's wondering, 2662 is pg_class_oid_index.

While I do not immediately see how, and the affected versions don't
really point that way, the timing does make me think of
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a54e1f1587793b5bf926630ec9ce142e4578d7a0

HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
but I don't immediately know where.

Greetings,

Andres Freund



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-07-20 Thread Heikki Linnakangas

On 01/03/18 12:40, Heikki Linnakangas wrote:
> On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
>> These patches enable to use Persistent Memory Development Kit(PMDK)[1]
>> for reading/writing WAL logs on persistent memory(PMEM).
>> PMEM is next generation storage and it has a number of nice features:
>> fast, byte-addressable and non-volatile.
>
> Interesting. How does this compare with using good old mmap()? I think
> just doing that would allow eliminating much of the complexity around
> managing the shared_buffers. And if the OS is smart about persistent
> memory (I don't know what the state of the art on that is), presumably
> msync() and fsync() on a file that lives in persistent memory is
> lightning fast.

I briefly looked at the docs at pmem.io. pmem_map_file() uses mmap()
under the hood, but it does some extra checks to test if the file is on
a persistent memory device, and makes a note of it.

I think the way forward with this patch would be to map WAL segments
with plain old mmap(), and use msync(). If that's faster than the status
quo, great. If not, it would still be a good stepping stone for actually
using PMDK. If nothing else, it would provide a way to test most of the
code paths, without actually having a persistent memory device, or
libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually suggest
doing exactly that: use libpmem to map a file to memory, and check if it
lives on persistent memory using libpmem's pmem_is_pmem() function. If
it returns yes, use pmem_drain(), if it returns false, fall back to
using msync().


- Heikki



buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-07-20 Thread Tom Lane
We've seen several occurrences of $subject in the buildfarm in the past
month or so.  Scraping the logs, I find

 coypu| 2018-06-14 21:17:49 | HEAD  | Check | 2018-06-14 
23:31:44.505 CEST [5b22deb8.30e1:124] ERROR:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes
 coypu| 2018-06-14 21:17:49 | HEAD  | Check | 2018-06-14 
23:31:44.653 CEST [5b22deb8.6d4d:132] ERROR:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes
 gull | 2018-06-21 04:02:11 | HEAD  | Check | 2018-06-20 
21:27:06.843 PDT [31138:3] FATAL:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes
 mandrill | 2018-06-22 16:14:16 | HEAD  | Check | 2018-06-22 
16:46:24.138 UTC [14353240:43] pg_regress/create_table_like ERROR:  could not 
read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 mandrill | 2018-06-22 16:14:16 | HEAD  | Check | 2018-06-22 
16:46:24.137 UTC [20710034:1] ERROR:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes
 lapwing  | 2018-07-05 07:20:02 | HEAD  | Check | 2018-07-05 
07:21:45.585 UTC [24022:1] ERROR:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes
 lapwing  | 2018-07-05 07:20:02 | HEAD  | Check | 2018-07-05 
07:21:45.591 UTC [23941:39] pg_regress/roleattributes ERROR:  could not read 
block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 
21:21:32.557 EDT [26072:1] ERROR:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes
 mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 
21:21:32.557 EDT [25990:82] ERROR:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes
 coypu| 2018-07-17 01:27:39 | HEAD  | Check | 2018-07-17 
03:47:26.264 CEST [5b4d4aa4.3ac4:71] ERROR:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes
 coypu| 2018-07-20 11:18:02 | HEAD  | Check | 2018-07-20 
13:32:14.104 CEST [5b51c833.4884:131] ERROR:  could not read block 3 in file 
"base/16384/2662": read only 0 of 8192 bytes

While I didn't go back indefinitely far, there are no other occurrences of
"could not read block" failures in the last three months.  This suggests
strongly that we broke something in early June.  Don't know what though.
Ideas?

In case anyone's wondering, 2662 is pg_class_oid_index.

regards, tom lane



Re: Possible performance regression in version 10.1 with pgbench read-write tests.

2018-07-20 Thread Andres Freund
On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
> >> I did a quick test applying the patch with same settings as initial mail I
> >> have reported  (On postgresql 10 latest code)
> >> 72 clients
> >> 
> >> CASE 1:
> >> Without Patch : TPS 29269.823540
> >> 
> >> With Patch : TPS 36005.544960.--- 23% jump
> >> 
> >> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959
> 
> >> So it seems that is the issue as the test is being run on 8 node numa
> >> machine.
> 
> > Cool. I think we should just backpatch that then.  Does anybody want to
> > argue against?
> 
> Not entirely clear to me what change is being proposed here?

Adding padding to struct PGSemaphoreData, so multiple semas don't share
a cacheline.


> In any case, I strongly resist making performance-based changes on
> the basis of one test on one kernel and one hardware platform.
> We should reproduce the results on a few different machines before
> we even think of committing anything.  I'm happy to test on what
> I have, although I'd be the first to agree that what I'm checking
> is relatively low-end cases.  (Too bad hydra's gone.)

Sure, it'd be good to do more of that. But from a theoretical POV it's
quite logical that posix semas sharing cachelines is bad for
performance, if there's any contention. When backed by futexes -
i.e. all non ancient linux machines - the hot path just does a cmpxchg
of the *userspace* data (I've copied the relevant code below).  Given
that we don't have a large number of semas these days, that there's
reasons to make the change even without measuring it, that we have
benchmark results, and that it's hard to see how it'd cause regressions,
I don't think going for a fix quickly is unreasonable.

You could argue it'd be better to make semaphores be embeddable in
bigger structures like PGPROC, rather than allocated in an array. While
I suspect you'd get a bit of a performance benefit from that, it seems
like a far bigger change we'd want to do in a minor release.

int
__new_sem_wait (sem_t *sem)
{
  /* We need to check whether we need to act upon a cancellation request here
 because POSIX specifies that cancellation points "shall occur" in
 sem_wait and sem_timedwait, which also means that they need to check
 this regardless whether they block or not (unlike "may occur"
 functions).  See the POSIX Rationale for this requirement: Section
 "Thread Cancellation Overview" [1] and austin group issue #1076 [2]
 for thoughts on why this may be a suboptimal design.

 [1] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html
 [2] http://austingroupbugs.net/view.php?id=1076 */
  __pthread_testcancel ();

  if (__new_sem_wait_fast ((struct new_sem *) sem, 0) == 0)
return 0;
  else
return __new_sem_wait_slow((struct new_sem *) sem, NULL);
}

/* Fast path: Try to grab a token without blocking.  */
static int
__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
{
  /* We need acquire MO if we actually grab a token, so that this
 synchronizes with all token providers (i.e., the RMW operation we read
 from or all those before it in modification order; also see sem_post).
 We do not need to guarantee any ordering if we observed that there is
 no token (POSIX leaves it unspecified whether functions that fail
 synchronize memory); thus, relaxed MO is sufficient for the initial load
 and the failure path of the CAS.  If the weak CAS fails and we need a
 definitive result, retry.  */
#if __HAVE_64B_ATOMICS
  uint64_t d = atomic_load_relaxed (&sem->data);
  do
{
  if ((d & SEM_VALUE_MASK) == 0)
break;
  if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1))
return 0;
}
  while (definitive_result);
  return -1;
#else
  unsigned int v = atomic_load_relaxed (&sem->value);
  do
{
  if ((v >> SEM_VALUE_SHIFT) == 0)
break;
  if (atomic_compare_exchange_weak_acquire (&sem->value,
  &v, v - (1 << SEM_VALUE_SHIFT)))
return 0;
}
  while (definitive_result);
  return -1;
#endif
}

/* See sem_wait for an explanation of the algorithm.  */
int
__new_sem_post (sem_t *sem)
{
  struct new_sem *isem = (struct new_sem *) sem;
  int private = isem->private;

#if __HAVE_64B_ATOMICS
  /* Add a token to the semaphore.  We use release MO to make sure that a
 thread acquiring this token synchronizes with us and other threads that
 added tokens before (the release sequence includes atomic RMW operations
 by other threads).  */
  /* TODO Use atomic_fetch_add to make it scale better than a CAS loop?  */
  uint64_t d = atomic_load_relaxed (&isem->data);
  do
{
  if ((d & SEM_VALUE_MASK) == SEM_VALUE_MAX)
{
  __set_errno (EOVERFLOW);
  return -1;
}
}
  while (!atomic_compare_exchange_weak_release (&isem->data, &d, d + 1));

  /* If there is any potentially 
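
(A self-contained illustration of the padding fix under discussion; it mirrors
the PGSemaphoreData change shown downthread, but the PG_CACHE_LINE_SIZE value
and array size are assumptions, and it relies on sizeof(sem_t) being smaller
than a cache line, as it is on Linux/amd64 where it is 32 bytes.)

#include <assert.h>
#include <semaphore.h>

#define PG_CACHE_LINE_SIZE 128

typedef struct PaddedSem
{
    sem_t pgsem;
    /* Keep neighbouring semaphores on distinct cache lines so that the
     * futex fast path's cmpxchg on one backend's semaphore doesn't
     * ping-pong the line holding its neighbours'. */
    char  pad[PG_CACHE_LINE_SIZE - sizeof(sem_t)];
} PaddedSem;

static PaddedSem semas[64];

int main(void)
{
    /* Adjacent array elements are now at least one cache line apart. */
    assert((char *) &semas[1] - (char *) &semas[0] >= PG_CACHE_LINE_SIZE);
    return 0;
}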

Re: Possible performance regression in version 10.1 with pgbench read-write tests.

2018-07-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
>> I did a quick test applying the patch with same settings as initial mail I
>> have reported  (On postgresql 10 latest code)
>> 72 clients
>> 
>> CASE 1:
>> Without Patch : TPS 29269.823540
>> 
>> With Patch : TPS 36005.544960.--- 23% jump
>> 
>> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959

>> So it seems that is the issue as the test is being run on 8 node numa
>> machine.

> Cool. I think we should just backpatch that then.  Does anybody want to
> argue against?

Not entirely clear to me what change is being proposed here?

In any case, I strongly resist making performance-based changes on
the basis of one test on one kernel and one hardware platform.
We should reproduce the results on a few different machines before
we even think of committing anything.  I'm happy to test on what
I have, although I'd be the first to agree that what I'm checking
is relatively low-end cases.  (Too bad hydra's gone.)

regards, tom lane



Re: Possible performance regression in version 10.1 with pgbench read-write tests.

2018-07-20 Thread Mithun Cy
On Fri, Jul 20, 2018 at 10:52 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:

> On Fri, Jul 20, 2018 at 7:56 AM, Tom Lane  wrote:
> >
> > It's not *that* noticeable, as I failed to demonstrate any performance
> > difference before committing the patch.  I think some more investigation
> > is warranted to find out why some other people are getting different
> > results.
> Maybe false sharing is a factor, since sizeof(sem_t) is 32 bytes on
> Linux/amd64 and we're probably hitting elements clustered at one end
> of the array?  Let's see... I tried sticking padding into
> PGSemaphoreData and I got ~8% more TPS (72 clients on a multi-socket
> box, pgbench scale 100, only running for a minute but otherwise the
> same settings that Mithun showed).
>
> --- a/src/backend/port/posix_sema.c
> +++ b/src/backend/port/posix_sema.c
> @@ -45,6 +45,7 @@
>  typedef struct PGSemaphoreData
>  {
>  sem_t   pgsem;
> +   char    padding[PG_CACHE_LINE_SIZE - sizeof(sem_t)];
>  } PGSemaphoreData;
>
> That's probably not the right idiom and my tests probably weren't long
> enough, but there seems to be some effect here.
>

I did a quick test applying the patch with the same settings as the initial
mail I reported (on PostgreSQL 10 latest code)
72 clients

CASE 1:
Without Patch : TPS 29269.823540

With Patch : TPS 36005.544960 --- 23% jump

Just Disabling using unnamed POSIX semaphores: TPS 34481.207959

So it seems that is the issue, as the test is being run on an 8-node NUMA
machine.
I also came across a presentation [1] (slide 20) which says one of those
futex architectures is bad for NUMA machines. I am not sure whether the
corresponding fix is included in Linux version 3.10.0-693.5.2.el7.x86_64,
which is on my test machine.


[1] https://www.slideshare.net/davidlohr/futex-scaling-for-multicore-systems


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: Segfault logical replication PG 10.4

2018-07-20 Thread Alvaro Herrera
On 2018-Jul-19, Quan TRAN wrote:

> Hello,
> 
> In worker.c:
> 
>     - in apply_handle_insert, slot_store_cstrings is called before
> PushActiveSnapshot
>     - in apply_handle_update & apply_handle_delete, slot_store_cstrings is
> called after PushActiveSnapshot
> 
>     /* Process and store remote tuple in the slot */
>     oldctx = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
>     slot_store_cstrings(remoteslot, rel, newtup.values);
>     slot_fill_defaults(rel, estate, remoteslot);
>     MemoryContextSwitchTo(oldctx);
> 
>     PushActiveSnapshot(GetTransactionSnapshot());
> 
> Should this be the cause?

Yeah, moving the PushActiveSnapshot call to just before
slot_store_cstrings sounds right.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add constraint in a Materialized View

2018-07-20 Thread Nico Williams
On Wed, Jul 18, 2018 at 09:28:19AM +0200, Kaye Ann Ignacio wrote:
> I'm trying to add a foreign constraint in my local table to reference a
> column in a materialized view, is it possible to alter this materialized
> view by adding a primary key constraint?

It's not, but I'm working on a patch for that and much more.  Basically,
I want to integrate something like
https://github.com/twosigma/postgresql-contrib/blob/master/mat_views.sql

I currently use that code instead of PG native materialized views.

Nico
-- 



Re: Making "COPY partitioned_table FROM" faster

2018-07-20 Thread Karen Huddleston
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

This patch applied cleanly and worked as expected.

Patch description:
This patch implements the use of heap_multi_insert() for partitioned tables when 
using the COPY FROM command, benefiting the performance of COPY FROM in cases 
in which multiple tuples in the file are destined for the same partition.

Tests:
All tests passed make check-world and TAP tests

Functionality:
Specifically, we tried the cases in the bottom section "Exercising the code" 
and found all behavior as described and expected.
Note that we did conduct performance testing for our test cases enumerated in 
this section using COPY FROM PROGRAM comparing the patched and unpatched code 
using one example with tuples all destined for the same partition and one 
example with tuples destined for alternating partitions. We saw a similar 
performance improvement to the one recorded by David in his email, including a 
decrease in performance for copying data with tuples alternating between 
destination partitions as compared to the unpatched code. 
However, in our cases below, we focus on exercising the code added as opposed 
to on performance.

Feature feedback from a user perspective:
There will be two differences in performance that may be perceptible to the 
user after the inclusion of this patch:
1) the relatively small decrease in performance (as compared to master) for 
copying data from a file or program in which the destination partition changes 
frequently
2) the stark contrast between the performance of copying data destined for the 
same partition and data destined for alternating partitions
Based on the numbers in David's email, the performance difference is
27721.054 ms (patched) for copying data destined for the same partition vs.
46151.844 ms (patched) for copying data destined for two different partitions,
alternating each row.
Because both differences could impact users in data loading, it is worth 
considering making this transparent to the user in some way. Because this will 
be noticeable to the user, and, furthermore, because it is potentially within 
the power of the user to make changes to their data copying strategy given this 
information, it might be prudent to either create a GUC allowing the user to 
disable heap_multi_insert for COPY FROM (if the user knows the data he/she is 
copying over is very varied) or to add a comment to the docs that when using 
COPY FROM for a partition table, it is advisable to sort the data in some 
manner which would optimize the number of consecutive tuples destined for the 
same partition.

Code Review:
Naming of newly introduced enum:
The CopyInsertMethod enum value CIM_MULTI_PARTITION is potentially confusing, 
since, for a partitioned table with BEFORE or INSTEAD OF INSERT triggers on 
child partitions, only heap_insert is valid, requiring the additional variable 
part_can_multi_insert to determine if heap_multi_insert can be used or not.
It might be more clear to name it something that indicates it is a candidate 
for multi-insertion, pending further review. This decouples the state of 
potentially applicable multi-insertion from the table being a partition table. 
In the future, perhaps, some other feature might make a table conditionally 
eligible for multi-insertion of tuples, so, it may be desirable to use this 
intermediate state for other kinds of tables as well.
Alternatively, it might make it more clear if the variable 
part_can_multi_insert was clearly for a leaf partition table, since this can 
change from leaf to leaf, maybe named leaf_part_can_multi_insert.

Code comment potential typo corrections:
In the following comment, in the patched code line 2550
     /* ... 
     * Note: It does not matter if any partitions have any volatile default
     * expression as we use the defaults from the target of the COPY command.
     */
It seems like "volatile default expression" should be "volatile default 
expressions"

In the following comment, in the patched code starting on line 2582

        /* ...
         * the same partition as the previous tuple, and since we flush the
         * previous partitions buffer once the new tuple has already been
         * built, we're unable to reset the estate since we'd free the memory
         * which the new tuple is stored.  To work around this we maintain a
         * secondary expression context and alternate between these when the
         ... */
"previous partitions" should probably be "previous partition's" and 
"since we'd free the memory which the new tuple is stored" should be "since 
we'd free the memory in which the new tuple is stored"

In the following comment, in the patched code starting on line 2769

                /*
                 * We'd better make the bulk insert mechanism gets a new
                 * 
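
(A toy model of the buffering strategy the review describes, showing why
alternating destination partitions force extra flushes; every name here is
illustrative, not the patch's code.)

#include <stdio.h>

#define BUF_MAX 1000

static int buffered;
static int current_part = -1;

static void
flush_buffer(void)
{
    if (buffered > 0)
        printf("multi-insert %d rows into partition %d\n",
               buffered, current_part);
    buffered = 0;
}

static void
copy_one_row(int target_part)
{
    if (target_part != current_part)
    {
        flush_buffer();             /* destination changed: flush the run */
        current_part = target_part;
    }
    if (++buffered == BUF_MAX)      /* buffer full: flush anyway */
        flush_buffer();
}

int main(void)
{
    int rows[] = {0, 0, 0, 1, 1, 0};    /* alternation costs extra flushes */
    for (int i = 0; i < 6; i++)
        copy_one_row(rows[i]);
    flush_buffer();
    return 0;
}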

Add constraint in a Materialized View

2018-07-20 Thread Kaye Ann Ignacio
Hi,

I'm trying to add a foreign constraint in my local table to reference a
column in a materialized view, is it possible to alter this materialized
view by adding a primary key constraint?

Thank you.

-- 
Kaye Ann Ignacio, Programmer
proceedit "the BPaaS company"
kaye.igna...@proceedit.com +34 679188011 (mobile)


Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-07-20 Thread Peter Geoghegan
On Thu, Jul 19, 2018 at 4:29 AM, Masahiko Sawada  wrote:
>> One area that might be worth investigating is retail index tuple
>> deletion performed within the executor in the event of non-HOT
>> updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record",
>> at least in unique index tuples with no NULL values. The idea is that
>> MVCC index scans can skip over those if they've already found a
>> visible tuple with the same value.
>
> I think that's a good idea. The overhead of marking it as ghost seems
> small and it would speed up index scans. If MVCC index scans have
> already found a visible tuple with the same value, they can not only
> skip scanning but also kill them? If can, we can kill index tuples
> without checking the heap.

I think you're talking about making LP_REDIRECT marking in nbtree
represent a "recently dead" hint: the deleting transaction has
committed, and so we are 100% sure that the tuple is about to become
garbage, but it cannot be LP_DEAD just yet because it needs to stay
around for the benefit of at least one existing snapshot/long running
transaction.

That's a different idea to what I talked about, since it could perhaps
work in a way that's much closer to LP_DEAD/kill_prior_tuple (no extra
executor stuff is required). I'm not sure if your idea is better or
worse than what I suggested, though. It would certainly be easier to
implement.

-- 
Peter Geoghegan



Re: Negotiating the SCRAM channel binding type

2018-07-20 Thread Heikki Linnakangas

On 12/07/18 16:08, Michael Paquier wrote:
> On Thu, Jul 12, 2018 at 12:34:51PM +0300, Heikki Linnakangas wrote:
>> Hmm. That is actually in a section called "Default Channel Binding". And the
>> first paragraph says "A default channel binding type agreement process for
>> all SASL application protocols that do not provide their own channel binding
>> type agreement is provided as follows". Given that, it's not entirely clear
>> to me if the requirement to support tls-unique is for all implementations of
>> SCRAM, or only those applications that don't provide their own channel
>> binding type agreement.
>
> I am not sure, but I get that as tls-unique must be the default if
> available, so if it is technically possible to have it we should have it
> in priority.  If not, then a channel binding type which is supported by
> both the server and the client can be chosen.


Another interesting piece of legalese is in RFC 5929 Channel Bindings 
for TLS:



   For many applications, there may be two or more potentially
   applicable TLS channel binding types.  Existing security frameworks
   (such as the GSS-API [RFC2743] or the SASL [RFC4422] GS2 framework
   [RFC5801]) and security mechanisms generally do not support
   negotiation of channel binding types.  Therefore, application peers
   need to agree a priori as to what channel binding type to use (or
   agree to rules for deciding what channel binding type to use).

   The specifics of whether and how to negotiate channel binding types
   are beyond the scope of this document.  However, it is RECOMMENDED
   that application protocols making use of TLS channel bindings, use
   'tls-unique' exclusively, except, perhaps, where server-side proxies
   are common in deployments of an application protocol.  In the latter
   case an application protocol MAY specify that 'tls-server-end-point'
   channel bindings must be used when available, with 'tls-unique' being
   used when 'tls-server-end-point' channel bindings are not available.
   Alternatively, the application may negotiate which channel binding
   type to use, or may make the choice of channel binding type
   configurable.


In any case, I'm quite convinced now that we should just remove 
tls-unique, and stick to tls-server-end-point. Patch attached, please 
take a look.


- Heikki
From 4e2f010692aaef97d4e16822359e86073a53e96b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 20 Jul 2018 20:02:17 +0300
Subject: [PATCH 1/2] Remove support for tls-unique channel binding.

There are some problems with the tls-unique channel binding type. It's not
supported by all SSL libraries, and strictly speaking it's not defined for
TLS 1.3 at all, even though at least in OpenSSL, the functions used for it
still seem to work with TLS 1.3 connections. And since we had no
mechanism to negotiate what channel binding type to use, there would be
awkward interoperability issues if a server only supported some channel
binding types. tls-server-end-point seems feasible to support with any SSL
library, so let's just stick to that.

This removes the scram_channel_binding libpq option altogether, since there
is now only one supported channel binding type.

This also removes all the channel binding tests from the SSL test suite.
It would be good to actually have some tests for channel binding, but these
tests were really just testing the scram_channel_binding option, which is
now gone.

I also removed the SCRAM_CHANNEL_BINDING_TLS_END_POINT constant. This is a
matter of taste, but IMO it's more readable to just use the
"tls-server-end-point" string.

Refactor the checks on whether the SSL library supports the functions
needed for tls-server-end-point channel binding. Now the server won't
advertise, and the client won't choose, the SCRAM-SHA-256-PLUS variant, if
compiled with an OpenSSL version too old to support it.

In the passing, add some sanity checks to check that the chosen SASL
mechanism, SCRAM-SHA-256 or SCRAM-SHA-256-PLUS, matches whether the SCRAM
exchange used channel binding or not. For example, if the client selects
the non-channel-binding variant SCRAM-SHA-256, but in the SCRAM message
uses channel binding anyway. It's harmless from a security point of view,
I believe, and I'm not sure if there are some other conditions that would
cause the connection to fail, but it seems better to be strict about these
things and check explicitly.

Discussion: https://www.postgresql.org/message-id/ec787074-2305-c6f4-86aa-6902f98485a4%40iki.fi
---
 doc/src/sgml/libpq.sgml  |  28 
 doc/src/sgml/protocol.sgml   |  28 ++--
 doc/src/sgml/release-11.sgml |   5 +-
 src/backend/libpq/auth-scram.c   | 218 +--
 src/backend/libpq/auth.c |  61 +++--
 src/backend/libpq/be-secure-openssl.c|  27 +---
 src/include/common/scram-common.h|   4 -
 src/include/libpq/libpq-be.h |  12 +-
 src/include/libpq/scram.h|   4 +-
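
(For reference, a rough sketch of deriving tls-server-end-point channel
binding data with OpenSSL: RFC 5929 says to hash the server certificate with
its signature hash algorithm, upgrading MD5/SHA-1 to SHA-256; this simplified
fragment just assumes SHA-256 rather than inspecting the algorithm.)

#include <openssl/evp.h>
#include <openssl/x509.h>

static int
cert_hash_for_channel_binding(X509 *server_cert,
                              unsigned char out[EVP_MAX_MD_SIZE],
                              unsigned int *outlen)
{
    const EVP_MD *md = EVP_sha256();    /* assumption: SHA-256 applies */
    return X509_digest(server_cert, md, out, outlen) ? 0 : -1;
}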
 

Re: Faster str to int conversion (was Table with large number of int columns, very slow COPY FROM)

2018-07-20 Thread Andres Freund
Hi,

On 2018-07-20 08:27:34 -0400, Robert Haas wrote:
> On Thu, Jul 19, 2018 at 4:32 PM, Andres Freund  wrote:
> >> 1. Why the error message changes?  If there's a good reason, it should
> >> be done as a separate commit, or at least well-documented in the
> >> commit message.
> >
> > Because there's a lot of "invalid input syntax for type %s: \"%s\"",
> > error messages, and we shouldn't force translators to have separate
> > version that inlines the first %s.  But you're right, it'd be worthwhile
> > to point that out in the commit message.
> 
> It just seems weird that they're bundled together in one commit like this.

I'll push it separately.

> Nothing else from me...

Thanks for looking!

Greetings,

Andres Freund



Re: [bug fix] Produce a crash dump before main() on Windows

2018-07-20 Thread Robert Haas
On Wed, Jul 18, 2018 at 4:38 AM, Tsunakawa, Takayuki wrote:
> IIRC, the pop up doesn't appear under Windows service.  If you start the 
> database server with pg_ctl start on the command prompt, the pop up will 
> appear, which I think is not bad.

Hmm.  I think that might be bad.  What makes you think it isn't?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-20 Thread Robert Haas
On Fri, Jul 20, 2018 at 8:38 AM, Robert Haas  wrote:
> This looks like a terrible design to me.  If somebody in future
> invents a new plan type that is not projection-capable, then this
> could fail an assertion here and there won't be any simple fix.  And
> if you reach here and the child targetlists somehow haven't been
> adjusted, then you won't even fail an assertion, you'll just crash (or
> maybe return wrong answers).

To elaborate on this thought a bit, it is currently the case that all
scan and join types are projection-capable, and most likely we would
make any such node types added in the future projection-capable as
well, but it still doesn't seem like a good idea to me to have
tangentially-related parts of the system depend on that in subtle
ways.  Also, even today, this would fail if the subplan happened to be
a Sort, and it's not very obvious why that couldn't happen.  If it
can't happen today, it's not obvious why a future code change couldn't
introduce cases where it can happen.

More broadly, the central idea of this patch is that we can set the
target list for a child rel to something other than the target list
that we expect it will eventually produce, and then at plan creation
time fix it.  I think that's a bad idea.  The target list affects lots
of things, like costing.  If we don't insert a ConvertRowTypeExpr into
the child's target list, the costing will be wrong to the extent that
ConvertRowTypeExpr has any cost, which it does.  Any other current or
future property that is computed based on the target list --
parallel-safety, width, security-level, whatever -- could also
potentially end up with a wrong value if the target list as written
does not match the target list as will actually be produced.  It's
true that, in all of these areas, the ConvertRowTypeExpr isn't likely
to have a lot of impact; it probably won't have a big effect on
costing and may not actually cause a problem for the other things that
I mentioned.  On the other hand, if it does cause a serious problem,
what options would we have for fixing it other than to back out your
entire patch and solve the problem some other way?  The idea that
derived properties won't be correct unless they are based on the real
target list is pretty fundamental, and I think deviating from the idea
that we get the correct target list first and then compute the derived
properties afterward is likely to cause real headaches for future
developers.

In short, plan creation time is just the wrong time to change the
plan.  It's just a time to translate the plan from the form needed by
the planner to the form needed by the executor.  It would be fine if
the change you were making were only cosmetic, but it's not.

After looking over both patches, I think Ashutosh Bapat has basically
the right idea.  What he is proposing is a localized fix that doesn't
seem to make any changes to the way things work overall.  I do think
his patches need to be fixed up a bit to avoid conflating
ConvertRowtypeExpr with "converted whole-row reference."  The two are
certainly not equivalent; the latter is a narrow special case.  Some
of the comments could use some more wordsmithing, too, I think.  Apart
from those gripes, though, I think it's solving the problem the
correct way: don't build the wrong plan and try to fix it, just build
the right plan from the beginning.

There are definitely some things not to like about this approach.  In
particular, I definitely agree that treating a converted whole-row
reference specially is not very nice.  It feels like it might be
substantially cleaner to be able to represent this idea as a single
node rather than a combination of ConvertRowtypeExpr and var with
varattno = 0.  Perhaps in the future we ought to consider either (a)
trying to use a RowExpr for this rather than ConvertRowtypeExpr or (b)
inventing a new WholeRowExpr node that stores two RTIs, one for the
relation that's generating the whole-row reference and the other for
the relation that is controlling the column ordering of the result or
(c) allowing a Var to represent a whole-row expression that has to
produce its outputs according to the ordering of some other RTE.  But
I don't think it's wise or necessary to whack that around just to fix
this bug; it is refactoring or improvement work best left to a future
release.
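
To make that concrete, the representation being discussed amounts to
roughly the following (a sketch only; the function name and variables
here are made up for illustration):

    #include "postgres.h"
    #include "nodes/makefuncs.h"
    #include "nodes/primnodes.h"

    /*
     * Sketch: a "converted whole-row reference" is a whole-row Var
     * (varattno = 0) emitting the child's row type, wrapped in a
     * ConvertRowtypeExpr that rearranges the columns into the parent's
     * row type.
     */
    static Expr *
    make_converted_whole_row_ref(Index child_rti,
                                 Oid child_rowtype,
                                 Oid parent_rowtype)
    {
        Var        *v = makeVar(child_rti,
                                0,      /* varattno 0 = whole row */
                                child_rowtype, -1, InvalidOid, 0);
        ConvertRowtypeExpr *cre = makeNode(ConvertRowtypeExpr);

        cre->arg = (Expr *) v;
        cre->resulttype = parent_rowtype;
        cre->convertformat = COERCE_IMPLICIT_CAST;
        cre->location = -1;
        return (Expr *) cre;
    }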

Also, it is definitely a real shame that we have to build attr_needed
data for each child separately.  Right now, we've got to build a whole
lot of things for each child individually, and that's slow and
inefficient.  We're not going to scale nicely to large partitioning
hierarchies unless we can figure out some way of minimizing the work
we do for each partition.  So, the fact that Fujii-san's approach
avoids needing to compute attr_needed in some cases is very appealing.
However, in my opinion, it's nowhere near appealing enough to justify
trying to do surgery on the target list at plan-creation time.  I
think we need to leave optimizing this to 

Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-07-20 Thread David Rowley
On 20 July 2018 at 21:44, Amit Langote  wrote:
> But I don't think the result of make_partition_pruneinfo itself has to be
> List of PartitionedRelPruneInfo nested under PartitionPruneInfo.  I gather
> that each PartitionPruneInfo corresponds to each root partitioned table
> and a PartitionedRelPruneInfo contains the actual pruning information,
> which is created for every partitioned table (non-leaf tables), including
> the root tables.  I don't think such nesting is necessary.  I think that
> just like flattened partitioned_rels list, we should put flattened list of
> PartitionedRelPruneInfo into the Append or MergeAppend plan.  No need for
> nesting PartitionedRelPruneInfo under PartitionPruneInfo.

To do that properly you'd need to mark the target partitioned table of
each hierarchy. Your test of pg_class.relispartition is not going to
work as you're assuming the query is always going to the root. The
user might query some intermediate partitioned table (which will have
relispartition = true). Your patch will fall flat in that case.

You could work around that by having some array that points to the
target partitioned table of each hierarchy, but I don't see why that's
better than having the additional struct. There's also some code
inside ExecFindInitialMatchingSubPlans() which does a backward scan
over the partitions. This must process children before their parents.
Unsure how well that's going to work if we start mixing the
hierarchies. I'm sure it can be made to work providing each hierarchy
is stored together consecutively in the array, but it just seems
pretty fragile to me. That code is already pretty hard to follow.

What's the reason you don't like the additional level to represent
multiple hierarchies?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Non-portable shell code in pg_upgrade tap tests

2018-07-20 Thread Tom Lane
Victor Wagner  writes:
> Tom Lane  wrote:
>> Please send a patch.  Most of us do not have access to old shells

> Here it goes. The previous letter was written before the fixed tests had
> completed, because this old machine is slow.

Thanks.  Will check on my own dinosaurs, and push if I don't find
a problem.

regards, tom lane



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-20 Thread Andres Freund
On 2018-07-20 11:53:32 +0300, Heikki Linnakangas wrote:
> ISTM that no-one has any great ideas on what to do about the ereport() in
> quickdie().

I think we actually have some decent ideas about how to make that less
problematic in a number of cases. Avoid translation (thereby avoiding
direct malloc()) and rely on the fact that the error message evaluation
happens in a memory context with pre-allocated memory.  That still
leaves openssl, but is better than nothing.  If we wanted, we could
additionally force ssl allocations to happen through another context
with preallocated memory.
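
As a rough sketch of the openssl part, assuming the OpenSSL >= 1.1.0
callback signatures (the pre-allocated pool itself is elided here):

    #include <stdlib.h>
    #include <openssl/crypto.h>

    /*
     * Route OpenSSL's allocations through our own callbacks, which
     * could carve from a pre-allocated pool instead of calling
     * malloc() directly.
     */
    static void *
    pool_malloc(size_t num, const char *file, int line)
    {
        return malloc(num);     /* real code: allocate from a static pool */
    }

    static void *
    pool_realloc(void *addr, size_t num, const char *file, int line)
    {
        return realloc(addr, num);
    }

    static void
    pool_free(void *addr, const char *file, int line)
    {
        free(addr);
    }

    static void
    setup_ssl_allocator(void)
    {
        /* Must run before OpenSSL allocates anything. */
        CRYPTO_set_mem_functions(pool_malloc, pool_realloc, pool_free);
    }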


> But I think we have consensus on replacing the exit(2) calls
> with _exit(2). If we do just that, it would be better than the status quo,
> even if it doesn't completely fix the problem. This would prevent the case
> that Asim reported, for starters.

Right. You're planning to backpatch?


> With _exit(), I think we wouldn't really need the "PG_SETMASK(&BlockSig);"
> calls in the aux process *_quickdie handlers that do nothing other
> than call _exit(). But I didn't dare to remove them yet.

I think we should remove it at least in master.
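
For reference, the shape we'd converge on, as a self-contained POSIX
sketch (no PostgreSQL specifics; _exit() is on the async-signal-safe
list, exit() is not):

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* The whole handler: no ereport(), no malloc(), no stdio -- just leave. */
    static void
    quickdie_handler(int signo)
    {
        _exit(2);
    }

    int
    main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = quickdie_handler;
        sigaction(SIGQUIT, &sa, NULL);
        for (;;)
            pause();
    }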


> diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
> index 960d3de204..010d53a6ce 100644
> --- a/src/backend/postmaster/bgwriter.c
> +++ b/src/backend/postmaster/bgwriter.c
> @@ -404,22 +404,21 @@ bg_quickdie(SIGNAL_ARGS)
>  	/*
>  	 * We DO NOT want to run proc_exit() callbacks -- we're here because
>  	 * shared memory may be corrupted, so we don't want to try to clean up our
> -	 * transaction.  Just nail the windows shut and get out of town.  Now that
> -	 * there's an atexit callback to prevent third-party code from breaking
> -	 * things by calling exit() directly, we have to reset the callbacks
> -	 * explicitly to make this work as intended.
> -	 */
> -	on_exit_reset();
> -
> -	/*
> -	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
> -	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
> -	 * backend.  This is necessary precisely because we don't clean up our
> -	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
> -	 * should ensure the postmaster sees this as a crash, too, but no harm in
> -	 * being doubly sure.)
> +	 * transaction.  Just nail the windows shut and get out of town.
> +	 *
> +	 * There's an atexit callback to prevent third-party code from breaking
> +	 * things by calling exit() directly.  We don't want to trigger that, so
> +	 * we use _exit(), which doesn't run atexit callbacks, rather than exit().
> +	 * And exit() wouldn't be safe to run from a signal handler, anyway.

It's much less the exit() that's unsafe, than the callbacks themselves,
right?  Perhaps just restate that we wouldn't want to trigger atexit
processing due to signal safety?

> +	 * Note we use _exit(2) not _exit(0).  This is to force the postmaster
> +	 * into a system reset cycle if some idiot DBA sends a manual SIGQUIT to a
> +	 * random backend.  This is necessary precisely because we don't clean up
> +	 * our shared memory state.  (The "dead man switch" mechanism in
> +	 * pmsignal.c should ensure the postmaster sees this as a crash, too, but
> +	 * no harm in being doubly sure.)
>  	 */
> -	exit(2);
> +	_exit(2);
>  }

Greetings,

Andres Freund



Re: [HACKERS] logical decoding of two-phase transactions

2018-07-20 Thread Andres Freund
On 2018-07-20 12:13:19 +0530, Nikhil Sontakke wrote:
> Hi Andres,
>
>
> > So what if we, at the begin / end of cache miss handling, re-check if
> > the to-be-decoded transaction is still in-progress (or has
> > committed). And we throw an error if that happened. That error is then
> > caught in reorderbuffer, the in-progress-xact aborted callback is
> > called, and processing continues (there's a couple nontrivial details
> > here, but it should be doable).
> >
> > The biggest issue is what constitutes a "cache miss". It's fairly
> > trivial to do this for syscache / relcache, but that's not sufficient:
> > there's plenty cases where catalogs are accessed without going through
> > either. But as far as I can tell if we declared that all historic
> > accesses have to go through systable_beginscan* - which'd imo not be a
> > crazy restriction - we could put the checks at that layer.
> >
>
> Documenting that historic accesses go through systable_* APIs does
> seem reasonable. In our earlier discussions, we felt asking plugin
> writers to do anything along these lines was too onerous and
> cumbersome to expect.

But they don't really need to do that - in just about all cases access
"automatically" goes through systable_* or layers above. If you call
output functions, do syscache lookups, etc., you're good.


> > That'd require that an index lookup can't crash if the corresponding
> > heap entry doesn't exist (etc), but that's something we need to handle
> > anyway.  The issue that multiple separate catalog lookups need to be
> > coherent (say Robert's pg_class exists, but pg_attribute doesn't
> > example) is solved by virtue of the the pg_attribute lookups failing if
> > the transaction aborted.
> >
> > Am I missing something here?
> >
>
> Are you suggesting we have a:
>
> PG_TRY()
> {
> Catalog_Access();
> }
> PG_CATCH()
> {
> Abort_Handling();
> }
>
> here?

Not quite, no. Basically, in a simplified manner, the logical decoding
loop is like:

while (true)
    record = readRecord()
    logical = decodeRecord()

    PG_TRY():
        StartTransactionCommand();

        switch (TypeOf(logical))
            case INSERT:
                insert_callback(logical);
                break;
            ...

        CommitTransactionCommand();

    PG_CATCH():
        AbortCurrentTransaction();
        PG_RE_THROW();

what I'm proposing is that various catalog access functions throw a
new class of error, something like "decoding aborted transactions". The
PG_CATCH() above would then not unconditionally re-throw, but set a flag
and continue iff that class of error was detected.

while (true)
    if (in_progress_xact_abort_pending)
        StartTransactionCommand();
        in_progress_xact_abort_callback(made_up_record);
        in_progress_xact_abort_pending = false;
        CommitTransactionCommand();

    record = readRecord()
    logical = decodeRecord()

    PG_TRY():
        StartTransactionCommand();

        switch (TypeOf(logical))
            case INSERT:
                insert_callback(logical);
                break;
            ...

        CommitTransactionCommand();

    PG_CATCH():
        AbortCurrentTransaction();
        if (errclass == DECODING_ABORTED_XACT)
            in_progress_xact_abort_pending = true;
            continue;
        else
            PG_RE_THROW();

Now obviously that's just pseudo code with lotsa things missing, but I
think the basic idea should come through?

Greetings,

Andres Freund



Re: Making "COPY partitioned_table FROM" faster

2018-07-20 Thread David Rowley
 (This email seemed to come only to me and somehow missed the mailing
list. Including the message here in its entirety.)

On 20 July 2018 at 13:26, Karen Huddleston  wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:not tested
>
> This patch applied cleanly and worked as expected.
>
> Patch description:
> This patch implements the use of heap_multi_insert() for partition tables 
> when using the COPY FROM command, benefiting the performance of COPY FROM in 
> cases in which multiple tuples in the file are destined for the same 
> partition.
>
> Tests:
> All tests passed make check-world and TAP tests
>
> Functionality:
> Specifically, we tried the cases in the bottom section "Exercising the code" 
> and found all behavior as described and expected.
> Note that we did conduct performance testing for our test cases enumerated in 
> this section using COPY FROM PROGRAM comparing the patched and unpatched code 
> using one example with tuples all destined for the same partition and one 
> example with tuples destined for alternating partitions. We saw a similar 
> performance improvement to the one recorded by David in his email, including 
> a decrease in performance for copying data with tuples alternating between 
> destination partitions as compared to the unpatched code.
> However, in our cases below, we focus on exercising the code added as opposed 
> to on performance.
>
> Feature feedback from a user perspective:
> There will be two differences in performance that may be perceptible to the 
> user after the inclusion of this patch:
> 1) the relatively small decrease in performance (as compared to master) for 
> copying data from a file or program in which the destination partition 
> changes frequently
> 2) the stark contrast between the performance of copying data destined for 
> the same partition and data destined for alternating partitions
> Based on the numbers in David's email, the performance difference is 
> 27721.054 ms patched for copying data destined for the same partition vs 
> 46151.844 ms patched for copying data destined for two different partitions 
> alternating each row.
> Because both differences could impact users in data loading, it is worth 
> considering making this transparent to the user in some way. Because this 
> will be noticeable to the user, and, furthermore, because it is potentially 
> within the power of the user to make changes to their data copying strategy 
> given this information, it might be prudent to either create a GUC allowing 
> the user to disable heap_multi_insert for COPY FROM (if the user knows the 
> data he/she is copying over is very varied) or to add a comment to the docs 
> that when using COPY FROM for a partition table, it is advisable to sort the 
> data in some manner which would optimize the number of consecutive tuples 
> destined for the same partition.

I'd rather steer clear of any new GUCs. Instead, in the patch I've
attached, I've included some adaptive code which chooses whether to use
multi-inserts based on the average number of tuples that have been
in the batches. I also did some tests and it does appear that the
benefits of multi-inserts come pretty early. So I ended up coding it
to enable multi-inserts when the average tuples per batch is at
least 1.3.  I originally guessed 1.5, but one of my tests was faster
with multi-inserts and the average batch size there was 1.33.
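
In rough C, the heuristic amounts to something like this (the names
and exact bookkeeping here are assumptions, not the patch's actual
code):

    #include <stdbool.h>

    /*
     * Track tuples and batches seen so far, and keep using
     * multi-inserts only while the average batch size stays at or
     * above 1.3 tuples.
     */
    typedef struct CopyBatchStats
    {
        double      ntuples;    /* tuples inserted so far */
        double      nbatches;   /* batches flushed so far */
    } CopyBatchStats;

    static bool
    use_multi_insert(const CopyBatchStats *stats)
    {
        if (stats->nbatches == 0)
            return true;        /* no history yet, be optimistic */
        return stats->ntuples / stats->nbatches >= 1.3;
    }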

> Code Review:
> Naming of newly introduced enum:
> The CopyInsertMethod enum value CIM_MULTI_PARTITION is potentially confusing, 
> since, for a partition table with BEFORE or INSTEAD INSERT triggers on child 
> partitions only heap_insert is valid, requiring the additional variable 
> part_can_multi_insert to determine if heap_multi_insert can be used or not.
> It might be more clear to name it something that indicates it is a candidate 
> for multi-insertion, pending further review. This decouples the state of 
> potentially applicable multi-insertion from the table being a partition 
> table. In the future, perhaps, some other feature might make a table 
> conditionally eligible for multi-insertion of tuples, so, it may be desirable 
> to use this intermediate state for other kinds of tables as well.

I've renamed this to CIM_MULTI_CONDITIONAL.

> Alternatively, it might make it more clear if the variable 
> part_can_multi_insert was clearly for a leaf partition table, since this can 
> change from leaf to leaf, maybe named leaf_part_can_multi_insert

Renamed to leafpart_use_multi_insert. I changed from "can" to "use"
because the adaptive code might just choose not to, even if it actually
"can" use multi-inserts.

> Code comment potential typo corrections:
> In the following comment, in the patched code line 2550
>  /* ...
>  * Note: It does not matter if any partitions 

Re: Non-portable shell code in pg_upgrade tap tests

2018-07-20 Thread Victor Wagner
On Fri, 20 Jul 2018 10:25:47 -0400
Tom Lane  wrote:

> Victor Wagner  writes:
> > I've discovered that in the branch REL_11_STABLE there is shell
> > script src/bin/pg_upgrade/test.sh which doesn't work under Solaris
> > 10. (it uses $(command) syntax with is not compatible with original
> > Solaris /bin/sh)  

> 
> Please send a patch.  Most of us do not have access to old shells

Here it goes. The previous letter was written before the fixed tests had
completed, because this old machine is slow.



> we could test this on.  (The oldest machine I have, and it's old
> enough to vote, does run that script ... I doubt very many other
> developers have anything older.)
> 
>   regards, tom lane

commit efb0bbcefeccd826ba951ac31057689e752b79d0
Author: Victor Wagner 
Date:   Fri Jul 20 16:06:50 2018 +0300

Fixed incompatibility of src/bin/pg_upgrade/test.sh with Solaris /bin/sh

diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index 45ccd8fa66..775dd5729d 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -234,7 +234,7 @@ pg_upgrade $PG_UPGRADE_OPTS -d "${PGDATA}.old" -D "${PGDATA}" -b "$oldbindir" -B
 # Windows hosts don't support Unix-y permissions.
 case $testhost in
 	MINGW*) ;;
-	*)	if [ $(find ${PGDATA} -type f ! -perm 640 | wc -l) -ne 0 ]; then
+	*)	if [ `find ${PGDATA} -type f ! -perm 640 | wc -l` -ne 0 ]; then
 			echo "files in PGDATA with permission != 640";
 			exit 1;
 		fi ;;
@@ -242,7 +242,7 @@ esac
 
 case $testhost in
 	MINGW*) ;;
-	*)	if [ $(find ${PGDATA} -type d ! -perm 750 | wc -l) -ne 0 ]; then
+	*)	if [ `find ${PGDATA} -type d ! -perm 750 | wc -l` -ne 0 ]; then
 			echo "directories in PGDATA with permission != 750";
 			exit 1;
 		fi ;;


Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

2018-07-20 Thread Tom Lane
Yugo Nagata  writes:
> Recently, one of our clients reported a problem where Windows 10 sometimes
> (approximately once in 300 tries) hangs at OS startup while the PostgreSQL
> 9.3.x service is starting. My co-worker analyzed this and found that
> PostgreSQL's auxiliary processes and Windows' logon processes are in a
> deadlock situation.

Really?  What would they deadlock on?  Why is there any connection
whatsoever?  Why has nobody else run into this?

> He reported this problem to the pgsql-general list as below. Also, he
> created a patch to add a build-time option for adding a 0.5 or 3.0 second
> delay after each subprocess starts.

This seems like an ugly hack that probably doesn't reliably resolve
whatever the problem is, but does manage to kill postmaster
responsiveness :-(.  It'd be especially awful to insert such a delay
after forking parallel worker processes, which would be a problem in
anything much newer than 9.3.

I think we need more investigation; and to start with, reproducing
the problem in a branch that's not within hailing distance of its EOL
would be a good idea.  (Not that I have reason to think PG's behavior
has changed much here ... but 9.3 is just not a good basis for asking
us to do anything now.)

regards, tom lane



Re: Non-portable shell code in pg_upgrade tap tests

2018-07-20 Thread Tom Lane
Victor Wagner  writes:
> I've discovered that in the branch REL_11_STABLE there is shell script
> src/bin/pg_upgrade/test.sh which doesn't work under Solaris 10.
> (it uses $(command) syntax, which is not compatible with the original
> Solaris /bin/sh)

OK ...

> It is quite easy to replace $() syntax with backticks. Backticks are
> not nestable and considered unsafe by modern shell scripting style
> guides, but they do work with older shells.

Please send a patch.  Most of us do not have access to old shells
we could test this on.  (The oldest machine I have, and it's old
enough to vote, does run that script ... I doubt very many other
developers have anything older.)

regards, tom lane



Non-portable shell code in pg_upgrade tap tests

2018-07-20 Thread Victor Wagner
Colleagues,

I've discovered that in the branch REL_11_STABLE there is a shell script,
src/bin/pg_upgrade/test.sh, which doesn't work under Solaris 10
(it uses $(command) syntax, which is not compatible with the original
Solaris /bin/sh).

I was quite surprised that this problem went unnoticed on the big buildfarm,
but after some investigation I found out that both Solaris machines in
that buildfarm don't configure postgres with --enable-tap-tests.

Offending code is:

# make sure all directories and files have group permissions, on Unix hosts
# Windows hosts don't support Unix-y permissions.
case $testhost in
MINGW*) ;;
*)  if [ $(find ${PGDATA} -type f ! -perm 640 | wc -l) -ne 0 ]; then
echo "files in PGDATA with permission != 640";
exit 1;
fi ;;
esac

case $testhost in
MINGW*) ;;
*)  if [ $(find ${PGDATA} -type d ! -perm 750 | wc -l) -ne 0 ]; then
echo "directories in PGDATA with permission != 750";
exit 1;
fi ;;
esac


It is quite easy to replace $() syntax with backticks. Backticks are
not nestable and considered unsafe by modern shell scripting style
guides, but they do work with older shells.

--






Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-20 Thread Robert Haas
On Wed, Jul 18, 2018 at 8:00 AM, Etsuro Fujita
 wrote:
> [ new patch ]

+ /*
+ * If the child plan is an Append or MergeAppend, the targetlists of its
+ * subplans would have already been adjusted before we get here, so no need
+ * to work here.
+ */
+ if (IsA(subplan, Append) || IsA(subplan, MergeAppend))
+ return;
+
+ /* The child plan should be a scan or join */
+ Assert(is_projection_capable_plan(subplan));

This looks like a terrible design to me.  If somebody in future
invents a new plan type that is not projection-capable, then this
could fail an assertion here and there won't be any simple fix.  And
if you reach here and the child targetlists somehow haven't been
adjusted, then you won't even fail an assertion, you'll just crash (or
maybe return wrong answers).

I'm going to study this some more now, but I really think this is
going in the wrong direction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Faster str to int conversion (was Table with large number of int columns, very slow COPY FROM)

2018-07-20 Thread Robert Haas
On Thu, Jul 19, 2018 at 4:32 PM, Andres Freund  wrote:
>> 1. Why the error message changes?  If there's a good reason, it should
>> be done as a separate commit, or at least well-documented in the
>> commit message.
>
> Because there's a lot of "invalid input syntax for type %s: \"%s\"",
> error messages, and we shouldn't force translators to have a separate
> version that inlines the first %s.  But you're right, it'd be worthwhile
> to point that out in the commit message.

It just seems weird that they're bundled together in one commit like this.

>> 2. Does the likely/unlikely stuff make a noticeable difference?
>
> Yes. It's also largely a copy from existing code (scanint8), so I don't
> really want to differ here.

OK.

>> 3. If this is a drop-in replacement for pg_atoi, why not just recode
>> pg_atoi this way -- or have it call this -- and leave the callers
>> unchanged?
>
> Because pg_atoi supports a variable 'terminator'.

OK.

>> 4. Are we sure this is faster on all platforms, or could it work out
>> the other way on, say, BSD?
>
> I'd be *VERY* surprised if any would be faster. It's not easy to write a
> faster implementation than what I've proposed, and especially not so if
> you use strtol() as the API (variable bases, a bit of locale support).

OK.
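
For illustration, the general shape of such a branch-annotated parse
loop looks something like this (a sketch only, not the actual patch;
GCC-style __builtin_expect assumed):

    #include <limits.h>
    #include <stdbool.h>

    #define unlikely(x) __builtin_expect((x) != 0, 0)

    /*
     * Parse a base-10 int, accumulating in a wider type and marking
     * the rare paths with unlikely().
     */
    static bool
    parse_int32(const char *s, int *result)
    {
        int         sign = 1;
        long long   acc = 0;

        if (unlikely(*s == '-'))
        {
            sign = -1;
            s++;
        }
        if (unlikely(*s < '0' || *s > '9'))
            return false;       /* no digits at all */
        while (*s >= '0' && *s <= '9')
        {
            acc = acc * 10 + (*s - '0');
            if (unlikely(acc > (long long) INT_MAX + 1))
                return false;   /* overflow */
            s++;
        }
        if (unlikely(*s != '\0'))
            return false;       /* trailing junk */
        if (sign < 0)
            acc = -acc;
        if (unlikely(acc > INT_MAX))
            return false;       /* 2147483648 without the minus sign */
        *result = (int) acc;
        return true;
    }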

Nothing else from me...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: partition tree inspection functions

2018-07-20 Thread Jesper Pedersen

Hi Amit,

On 07/19/2018 10:27 PM, Amit Langote wrote:

On 2018/07/19 23:18, Jesper Pedersen wrote:

I'm thinking about how to best use these functions to generate a graph
that represents the partition hierarchy.

What about renaming pg_partition_tree_tables() to pg_partition_children(),
and have it work like

select * from pg_partition_children('p', true);
---------
  p
  p0
  p1
  p00
  p01
  p10
  p11
(7 rows)

select * from pg_partition_children('p', false);
-----
  p0
  p1
(2 rows)

e.g. if 'bool include_all' is true all nodes under the node, including
itself, are fetched. With false only nodes directly under the node,
excluding itself, are returned. If there are no children NULL is returned.


That's a big change to make to what this function does, but if that's
what's useful we could make it.  As an alternative, wouldn't it help to
implement the idea that Dilip mentioned upthread of providing a function
to report the level of a given table in the partition hierarchy -- 0 for
root, 1 for its partitions and so on?



Yes, Dilip's idea could work. I just don't think that 
pg_partition_tree_tables() as is would have a benefit over time.



Basically, as also discussed before, users can already use SQL to get the
information they want out of the relevant catalogs (pg_inherits, etc.).
But, such user queries might not be very future-proof as we might want to
change the catalog organization in the future, so we'd like to provide
users a proper interface to begin with.  Keeping that in mind, it'd be
better to think carefully about what we ought to be doing here.  Input
like yours is greatly helpful for that.



We could have the patch include pg_partition_root_parent and 
pg_partition_parent, and leave the rest for a future CommitFest such 
that more people could provide feedback on what they would like to see 
in this space.



Maybe a function like pg_partition_number_of_partitions() could be of
benefit to count the number of actual partitions in a tree. Especially
useful in complex scenarios,

   select pg_partition_number_of_partitions('p') as number;

     number
   --------
    4
   (1 row)


Okay, adding one more function at this point may not be asking for too
much.  Although, select count(*) from pg_partition_tree_tables('p') would
give you the count, a special function seems nice.


Yeah, but I was thinking that the function would only return the number of
actual tables that contain data, i.e. not include 'p', 'p0' and 'p1' in
the count; otherwise you could use 'select count(*) from
pg_partition_children('p', true)' like you said.


Maybe call it pg_partition_tree_leaf_count() or some such then?



That could work.

Best regards,
 Jesper



Re: Flexible configuration for full-text search

2018-07-20 Thread Alexander Korotkov
Hi, Aleksandr!

On Mon, Jul 9, 2018 at 10:26 AM Aleksandr Parfenov <
a.parfe...@postgrespro.ru> wrote:

> A new version of the patch in the attachment. There are no changes since
> the last version except refreshing it to current HEAD.
>

I took a look at this patch.  It applied cleanly, but didn't pass
regression tests.

*** /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/misc_sanity.out	2018-07-20 13:44:54.0 +0300
--- /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/misc_sanity.out	2018-07-20 13:47:00.0 +0300
***************
*** 105,109 ****
   pg_index                | indpred   | pg_node_tree
   pg_largeobject          | data      | bytea
   pg_largeobject_metadata | lomacl    | aclitem[]
! (11 rows)
--- 105,110 ----
   pg_index                | indpred   | pg_node_tree
   pg_largeobject          | data      | bytea
   pg_largeobject_metadata | lomacl    | aclitem[]
!  pg_ts_config_map        | mapdicts  | jsonb
! (12 rows)

It seems to be related to recent patches which add toast tables to the
majority of system tables with varlena columns.  The regression diff
indicates that the mapdicts field of pg_ts_config_map isn't toastable.  I
think we should add a toast table to pg_ts_config_map.

Also, I see that you add extra grammar rules for ALTER TEXT SEARCH
CONFIGURATION to the documentation, which allow specifying config instead
of dictionary_name.

+ALTER TEXT SEARCH CONFIGURATION name
+ADD MAPPING FOR token_type [, ... ] WITH config
 ALTER TEXT SEARCH CONFIGURATION name
 ADD MAPPING FOR token_type [, ... ] WITH dictionary_name [, ... ]
+ALTER TEXT SEARCH CONFIGURATION name
+ALTER MAPPING FOR token_type [, ... ] WITH config
 ALTER TEXT SEARCH CONFIGURATION name
 ALTER MAPPING FOR token_type [, ... ] WITH dictionary_name [, ... ]
 ALTER TEXT SEARCH CONFIGURATION name

At the same time, you declare config as follows.

+   
+Formally config is one of:
+   
+   
+* dictionary_name
+
+* config { UNION | INTERSECT | EXCEPT | MAP } config
+
+* CASE config
+WHEN [ NO ] MATCH THEN { KEEP | config }
+[ ELSE config ]
+  END
+   

That is, config itself could be a dictionary_name.  I think this makes
the extra grammar rules for ALTER TEXT SEARCH CONFIGURATION redundant.  We
can specify those rules to always expect config, given that it can
actually be a dictionary name.

+ if (fout->remoteVersion >= 11)

PostgreSQL 11 has already passed feature freeze.  Thus, we should aim for
PostgreSQL 12.

That's all for now, but I'm going to do a more detailed code review.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

2018-07-20 Thread Michael Paquier
On Fri, Jul 20, 2018 at 05:58:13PM +0900, Yugo Nagata wrote:
> He reported this problem to the pgsql-general list as below. Also, he
> created a patch to add a build-time option for adding a 0.5 or 3.0 second
> delay after each subprocess starts.  The attached is the same one.  Our
> client confirmed that this patch resolves the deadlock problem. Is it
> acceptable to add this option to PostgreSQL?  Any comment would be
> appreciated.

If the OS startup gets slower, then an arbitrary delay is not going to
solve things and you would end up facing the same problem.  It seems
to me that we need to understand which low-level locks get stuck, whether
it is possible to make their acquisition conditional, and then loop
conditionally with multiple retries.
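
Purely as a sketch of that last idea (a made-up helper; the real fix
would depend on which locks are implicated):

    #include <windows.h>
    #include <stdbool.h>

    /*
     * Try to take a lock without blocking, backing off between
     * attempts, rather than waiting indefinitely during OS startup.
     */
    static bool
    acquire_with_retries(CRITICAL_SECTION *cs, int max_tries, DWORD backoff_ms)
    {
        int         i;

        for (i = 0; i < max_tries; i++)
        {
            if (TryEnterCriticalSection(cs))
                return true;    /* caller must LeaveCriticalSection() */
            Sleep(backoff_ms);
        }
        return false;
    }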
--
Michael


signature.asc
Description: PGP signature


Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-07-20 Thread Amit Langote
On 2018/07/19 22:03, David Rowley wrote:
> v3-0001-Fix-run-time-partition-pruning-for-UNION-ALL-pare.patch

Thanks for updating the patch.

I studied this patch today and concluded that it's going a bit too far by
carrying the nested partition pruning info structures from the planner all
the way into the executor.

I understand the root cause of this issue to be that make_partition_pruneinfo
trips when UNION ALL's parent subquery, instead of the actual individual
partitioned root tables, is treated as the root parent rel when converting
prunequals using appendrel_adjust_*.  That happens because of a flattened
partitioned_rels list whose members are all assumed by
make_partition_pruneinfo to have the same root parent, which is assumed to
be an actual partitioned table.  That assumption fails in this case because
the parent is really the UNION ALL subquery rel.

I think the fix implemented in the patch by modifying allpaths.c is
correct, whereby the partition hierarchies are preserved by having nested
Lists of partitioned rels.  So, the partitioned_rels List corresponding to
UNION ALL subquery parent itself contains Lists corresponding to
partitioned tables appearing under it.  With that,
make_partition_pruneinfo (actually, make_partitionedrel_pruneinfo in the
patch) can correctly perform its work for every sub-List, because for each
sub-List, it can identify the correct root partitioned table parent to use.

But I don't think the result of make_partition_pruneinfo itself has to be
List of PartitionedRelPruneInfo nested under PartitionPruneInfo.  I gather
that each PartitionPruneInfo corresponds to each root partitioned table
and a PartitionedRelPruneInfo contains the actual pruning information,
which is created for every partitioned table (non-leaf tables), including
the root tables.  I don't think such nesting is necessary.  I think that
just like flattened partitioned_rels list, we should put flattened list of
PartitionedRelPruneInfo into the Append or MergeAppend plan.  No need for
nesting PartitionedRelPruneInfo under PartitionPruneInfo.

We create a relid_subplan_map from the flattened list of sub-plans, where
sub-plans of leaf partitions of different partitioned tables appear in the
same list.  Similarly, I think, we should create relid_subpart_map from
the flattened list of partitioned_rels, where partitioned rel RT indexes
coming from different partitioned tables appear in the same list.
Currently relid_subpart_map seems to be constructed separately for each
sub-List of nested partitioned_rels list, so while subplan_map of each
PartitionedRelPruneInfo contains indexes into a global array of leaf
partition sub-plans, subpart_map contains indexes into local array of
PartitionedRelPruneInfo for that partitioned table.  But, I think there is
not a big hurdle in making even the latter contain indexes into global
array of PartitionedRelPruneInfos of *all* partitioned tables.

On the executor side, instead of having PartitionedRelPruningData be
nested under PartitionPruningData, which in turn are stored in the
top-level PartitionPruneState, store them directly in PartitionPruneState,
since we're making planner put global indexes into subpart_map.  Slight
adjustment seems to be needed to make ExecInitFindMatchingSubPlans and
ExecFindMatchingSubPlans skip the PartitionedRelPruningData of non-root
tables, because find_matching_subplans_recurse takes care of recursing to
non-root ones.  Actually, not skipping them seems to cause wrong results.

To verify that such an approach would actually work, I modified the
relevant parts of your patch and confirmed that it does.  See the attached
delta patch.

Thanks,
Amit

PS: Other than the main patch, I think it would be nice to add an RT index
field to PartitionedRelPruneInfo in addition to the existing Oid field.
It would help to identify and fetch the Relation from a hypothetical
executor-local array of Relation pointers which is addressable by RT indexes.
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index dac789d414..f81ff672ca 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -48,7 +48,7 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 						  bool *isnull,
 						  int maxfieldlen);
 static List *adjust_partition_tlist(List *tlist, TupleConversionMap *map);
-static void find_matching_subplans_recurse(PartitionPruningData *pprune,
+static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
 					   PartitionedRelPruningData *prelprune,
 					   bool initial_prune,
 					   Bitmapset **validsubplans);
@@ -1396,14 +1396,10 @@ adjust_partition_tlist(List *tlist, TupleConversionMap 

Re: project updates

2018-07-20 Thread Aleksander Alekseev
Hello Charles,

> Here is my current working status.
> 1. Completed the thrift_binary_in and thrift_binary_out functions, so
> that users can express their thrift structs using JSON. These two
> functions support both simple data structures and complex data structures
> like struct and map. 2. Added test cases and documentation to cover
> thrift_binary_in and thrift_binary_out. 3. Made the code compilable
> from 9.4 to 11.0.

I see multiple warnings during compilation with 11.0, e.g.:

```
pg_thrift.c:1120:72: warning: implicit declaration of function
‘JsonbGetDatum’; did you mean ‘JsonbPGetDatum’?
[-Wimplicit-function-declaration]

pg_thrift.c:1206:13: warning: the address of ‘is_big_endian’ will
always evaluate as ‘true’ [-Waddress]

pg_thrift.c:1227:18: warning: implicit declaration of function
‘PG_GETARG_JSONB’; did you mean ‘PG_GETARG_JSONB_P’?
[-Wimplicit-function-declaration]

pg_thrift.c:1227:18: warning: initialization of ‘Jsonb *’ {aka ‘struct
 *’} from ‘int’ makes pointer from integer without a cast
[-Wint-conversion]
```

Also tests (make clean && make && make install && make installcheck)
don't pass.

> One question though, for custom types, it seems rich operations are
> also very important besides storing and expressing using thrift type.
> What are the most important operators should I support? Any
> suggestions? Here is the updated repo
> https://github.com/charles-cui/pg_thrift

I believe get field / set field are the most common ones. Naturally there
are also enumerations, projections, you name it. You can check out the list
of operations supported by JSONB (or the list of operations on dicts in
Python) to get a full picture.

-- 
Best regards,
Aleksander Alekseev



Adding TCP_USER_TIMEOUT support for libpq/psqlodbc

2018-07-20 Thread AYahorau
Hello PostgreSQL Community!

Not long ago I faced a situation where an ODBC/libpq client hangs in 
case of some network problems.
I had a discussion regarding this issue within  pgsql-o...@postgresql.org 
and got some suggestions.
Here is this discussion:
https://www.postgresql.org/message-id/OF33DF00A3.D6444835-ON432582C3.003EA7C5-432582C3.0045562B%40iba.by

In a few words, the suggestion was to use pqopt keepalive options in the ODBC 
configuration, for example as follows:
pqopt = keepalives=1 keepalives_idle=5 keepalives_count=1 
keepalives_interval=1

But under some circumstances this is not reliable: the connection can 
lose its keepalive properties and remain hanging. 

Here is a quote from our discussion:
Hmm, it seems keepalive stops while waiting for an ack.
Therefore it's a matter of retransmission control.
You can use TCP_USER_TIMEOUT on Linux, but the option is not used in libpq.

In my opinion it makes sense to add support for the TCP_USER_TIMEOUT socket 
option to libpq/psqlodbc connections.
The attachment contains a patch with the source code changes regarding 
this issue.
In my implementation it can be configured by a new keepalives_user_timeout 
option within the pqopt parameter.
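
At the socket level the option boils down to something like this (a
Linux-specific sketch; the wrapper name is made up):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /*
     * TCP_USER_TIMEOUT caps how long transmitted data may remain
     * unacknowledged before the kernel drops the connection.  The
     * value is in milliseconds.
     */
    static int
    set_user_timeout(int sockfd, unsigned int timeout_ms)
    {
        return setsockopt(sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                          &timeout_ms, sizeof(timeout_ms));
    }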



Best regards, 
Andrei Yahorau

0001_TCP_USER_TIMEOUT_libpq-int.patch
Description: Binary data


0001_TCP_USER_TIMEOUT_fe-connect.patch
Description: Binary data


Fw: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

2018-07-20 Thread Yugo Nagata
Hi,

Recently, one of our clients reported a problem where Windows 10 sometimes
(approximately once in 300 tries) hangs at OS startup while the PostgreSQL
9.3.x service is starting. My co-worker analyzed this and found that
PostgreSQL's auxiliary processes and Windows' logon processes are in a
deadlock situation.

Although this problem has been found only with PostgreSQL 9.3.x and Windows 10
in our client's environment so far, the same problem may occur with other 
versions of PostgreSQL.

He reported this problem to the pgsql-general list as below. Also, he created
a patch to add a build-time option for adding a 0.5 or 3.0 second delay after
each subprocess starts.  The attached is the same one.  Our client confirmed
that this patch resolves the deadlock problem. Is it acceptable to add this
option to PostgreSQL?  Any comment would be appreciated.

Regards,




Begin forwarded message:

Date: Fri, 29 Jun 2018 15:03:10 +0900
From: TAKATSUKA Haruka 
To: pgsql-gene...@postgresql.org
Subject: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets 
it avoid.


I ran into trouble with PostgreSQL 9.3.x on Windows 10.
I would like to add new delay code as an official build option.

Windows 10 sometimes (approximately once in 300 tries) hangs 
at OS startup. The logs say it happened while the PostgreSQL 
service was starting. When the OS froze, some postgres auxiliary 
processes had started and some had not started yet. 

The Windows dump says some threads of the postgres auxiliary processes
are waiting on OS-level locks and the logon processes' threads are
also waiting on a lock. The MS help desk said that PostgreSQL's OS-level 
deadlock caused the OS freeze. I think that is a strange story. But, 
in fact, it did not happen in repeated tests when I removed 
PostgreSQL from the initial auto-starting services.

I tweaked PostgreSQL 9.3.x (the newest from the repository) to add 
a 0.5 or 3.0 second delay after each subprocess starts, 
and then the hang was gone. The test patch is attached. 
It is only implemented for Windows. Also, I did not use the existing 
pg_usleep because it contains locking code (e.g. WaitForSingleObject
and Enter/LeaveCriticalSection).

Although the Windows OS may have some problems, I think we should have
a means to avoid them. Could PostgreSQL accept such delay code
as a build-time option controlled by preprocessor variables?


Thanks,
Takatsuka Haruka


-- 
Yugo Nagata 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d6fc2ed..ff03ebd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -398,6 +398,30 @@ extern int optreset;			/* might not be declared by system headers */
 static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
+#define USE_AFTER_AUX_FORK_SLEEP 3000
+
+#ifdef USE_AFTER_AUX_FORK_SLEEP
+#ifndef WIN32
+#define AFTER_AUX_FORK_SLEEP()
+#else
+#define AFTER_AUX_FORK_SLEEP() do { SleepEx(USE_AFTER_AUX_FORK_SLEEP, FALSE); } while(0)
+#endif
+#else
+#define AFTER_AUX_FORK_SLEEP()
+#endif
+
+#define USE_AFTER_BACKEND_FORK_SLEEP 500
+
+#ifdef USE_AFTER_BACKEND_FORK_SLEEP
+#ifndef WIN32
+#define AFTER_BACKEND_FORK_SLEEP()
+#else
+#define AFTER_BACKEND_FORK_SLEEP() do { SleepEx(USE_AFTER_BACKEND_FORK_SLEEP, FALSE); } while(0)
+#endif
+#else
+#define AFTER_BACKEND_FORK_SLEEP()
+#endif
+
 /*
  * postmaster.c - function prototypes
  */
@@ -1709,6 +1733,7 @@ ServerLoop(void)
 				 */
 				StreamClose(port->sock);
 				ConnFree(port);
+				AFTER_BACKEND_FORK_SLEEP();
 			}
 		}
 	}
@@ -2801,11 +2826,20 @@ reaper(SIGNAL_ARGS)
 		 * situation, some of them may be alive already.
 		 */
 		if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
+		{
 			AutoVacPID = StartAutoVacLauncher();
+			AFTER_AUX_FORK_SLEEP();
+		}
 		if (XLogArchivingActive() && PgArchPID == 0)
+		{
 			PgArchPID = pgarch_start();
+			AFTER_AUX_FORK_SLEEP();
+		}
 		if (PgStatPID == 0)
+		{
 			PgStatPID = pgstat_start();
+			AFTER_AUX_FORK_SLEEP();
+		}
 
 		/* some workers may be scheduled to start now */
 		maybe_start_bgworker();
@@ -5259,6 +5293,7 @@ StartChildProcess(AuxProcType type)
 	/*
 	 * in parent, successful fork
 	 */
+	AFTER_AUX_FORK_SLEEP();
 	return pid;
 }
 


Re: de-deduplicate code in DML execution hooks in postgres_fdw

2018-07-20 Thread Etsuro Fujita

(2018/07/20 13:49), Michael Paquier wrote:

On Thu, Jul 19, 2018 at 05:35:11PM +0900, Etsuro Fujita wrote:

+1 for the general idea.  (Actually, I also thought the same thing before.)
But since this is definitely a matter of PG12, ISTM that it's wise to work
on this after addressing the issue in [1].  My concern is: if we do this
refactoring now, we might need two patches for fixing the issue in case of
backpatching as the fix might need to change those executor functions.


FWIW, I would think that if some cleanup of the code is obvious, we
should make it without waiting for the other issues to settle down
because there is no way to know when those are done,


I would agree to that if we were late in the development cycle for PG12.


and this patch
could be forgotten.


I won't forget this patch.  :)


Looking at the proposed patch, moving the new routine closer to
execute_dml_stmt and renaming it execute_dml_single_row would be nicer.


Or execute_parameterized_dml_stmt?

Best regards,
Etsuro Fujita



Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-20 Thread Heikki Linnakangas

On 19/07/18 23:13, Andres Freund wrote:

On 2018-07-19 14:54:52 -0500, Nico Williams wrote:

On Thu, Jul 19, 2018 at 12:20:53PM -0700, Andres Freund wrote:

On 2018-07-19 11:57:25 +0300, Heikki Linnakangas wrote:

Ugh. Yeah, in wal_quickdie, and other aux process *_quickdie() handlers, I
agree we should just _exit(2). All we want to do is to exit the process
immediately.


Indeed.


bgworker_quickdie() makes some effort to block SIGQUIT during the exit()
processing, but that doesn't solve the whole problem. The process could've
been in the middle of a malloc/free when the signal arrived, for example.
exit() is simply not safe to call from a signal handler.


Yea. I really don't understand why we haven't been able to agree on this
for *years* now.


I mean, only calling async-signal-safe functions from signal handlers is
a critical POSIX requirement, and exit(3) is NOT async-signal-safe.


There are a few standard requirements that aren't particularly relevant in
practice (including some functions not being listed as signal
safe). Just arguing from that isn't really helpful. But there's plenty of
hard evidence, including a few messages upthread, of it being
practically problematic. Just looking at the reasoning for why exit() (and
malloc()) aren't signal safe, however, makes it clear that this particular
restriction should be adhered to.


ISTM that no-one has any great ideas on what to do about the ereport() 
in quickdie(). But I think we have consensus on replacing the exit(2) 
calls with _exit(2). If we do just that, it would be better than the 
status quo, even if it doesn't completely fix the problem. This would 
prevent the case that Asim reported, for starters.


Any objections to the attached?

With _exit(), I think we wouldn't really need the 
"PG_SETMASK(&BlockSig);" calls in the aux process *_quickdie handlers 
that do nothing other than call _exit(). But I didn't dare to 
remove them yet.


- Heikki
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f651bb49b1..ac8693fd61 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -650,22 +650,21 @@ bgworker_quickdie(SIGNAL_ARGS)
 	/*
 	 * We DO NOT want to run proc_exit() callbacks -- we're here because
 	 * shared memory may be corrupted, so we don't want to try to clean up our
-	 * transaction.  Just nail the windows shut and get out of town.  Now that
-	 * there's an atexit callback to prevent third-party code from breaking
-	 * things by calling exit() directly, we have to reset the callbacks
-	 * explicitly to make this work as intended.
-	 */
-	on_exit_reset();
-
-	/*
-	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
-	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
-	 * backend.  This is necessary precisely because we don't clean up our
-	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
-	 * should ensure the postmaster sees this as a crash, too, but no harm in
-	 * being doubly sure.)
+	 * transaction.  Just nail the windows shut and get out of town.
+	 *
+	 * There's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly.  We don't want to trigger that, so
+	 * we use _exit(), which doesn't run atexit callbacks, rather than exit().
+	 * And exit() wouldn't be safe to run from a signal handler, anyway.
+	 *
+	 * Note we use _exit(2) not _exit(0).  This is to force the postmaster
+	 * into a system reset cycle if some idiot DBA sends a manual SIGQUIT to a
+	 * random backend.  This is necessary precisely because we don't clean up
+	 * our shared memory state.  (The "dead man switch" mechanism in
+	 * pmsignal.c should ensure the postmaster sees this as a crash, too, but
+	 * no harm in being doubly sure.)
 	 */
-	exit(2);
+	_exit(2);
 }
 
 /*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 960d3de204..010d53a6ce 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -404,22 +404,21 @@ bg_quickdie(SIGNAL_ARGS)
 	/*
 	 * We DO NOT want to run proc_exit() callbacks -- we're here because
 	 * shared memory may be corrupted, so we don't want to try to clean up our
-	 * transaction.  Just nail the windows shut and get out of town.  Now that
-	 * there's an atexit callback to prevent third-party code from breaking
-	 * things by calling exit() directly, we have to reset the callbacks
-	 * explicitly to make this work as intended.
-	 */
-	on_exit_reset();
-
-	/*
-	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
-	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
-	 * backend.  This is necessary precisely because we don't clean up our
-	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
-	 * should ensure the postmaster sees this as a crash, too, but no harm in
-	 * being doubly sure.)
+	 * transaction.  Just nail 

Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-07-20 Thread Heikki Linnakangas

On 19/07/18 23:04, Nico Williams wrote:

On Thu, Jun 22, 2017 at 03:10:31PM -0400, Tom Lane wrote:

Andres Freund  writes:

Or, probably more robust: Simply _exit(2) without further ado, and rely
on postmaster to output an appropriate error message. Arguably it's not
actually useful to see hundreds of "WARNING: terminating connection because of
crash of another server process" messages in the log anyway.


At that point you might as well skip the entire mechanism and go straight
to SIGKILL.  IMO the only reason quickdie() exists at all is to try to
send a helpful message to the client.  And it is helpful --- your attempt
to claim that it isn't sounds very much like wishful thinking.


I dunno if it is or isn't helpful.  But I do know that this must be done
in an async-signal-safe way.

Besides making ereport() async-signal-safe, which is tricky, you could
write(2) the arguments to a pipe that another thread in the same process
is reading from and which will then call ereport() and exit(3).


I don't see how that helps. It still wouldn't be safe for the other 
thread to call ereport(), because the main thread might be in the middle 
of calling ereport() itself.


- Heikki



Re: pread() and pwrite()

2018-07-20 Thread Oskari Saarenmaa
On Thu, Jul 12, 2018 at 01:55:31PM +1200, Thomas Munro wrote:
> A couple of years ago, Oskari Saarenmaa proposed a patch[1] to adopt
> $SUBJECT.  Last year, Tobias Oberstein argued again that we should do
> that[2].  At the end of that thread there was a +1 from multiple
> committers in support of getting it done for PostgreSQL 12.  Since
> Oskari hasn't returned, I decided to dust off his patch and start a
> new thread.

Thanks for picking this up - I was meaning to get back to this, but have
unfortunately been too busy with other projects.

> 1.  FileRead() and FileWrite() are replaced with FileReadAt() and
> FileWriteAt(), taking a new offset argument.  Now we can do one
> syscall instead of two for random reads and writes.

> [...]  I'm not even sure I'd bother adding "At" to the
> function names.  If there are any extensions that want the old
> interface they will fail to compile either way.  Note that the BufFile
> interface remains the same, so hopefully that covers many use cases.

IIRC I used the "At" suffixes in my first version of the patch, before
completely removing the functions which didn't take an offset argument.
Now that they're gone I agree that we could just drop the "At" suffix;
the "at" suffix is also used by various POSIX functions to operate in a
specific directory, which may just add to confusion.

> I guess the only remaining reason to use FileSeek() is to get the file
> size?  So I wonder why SEEK_SET remains valid in the patch... if my
> suspicion is correct that only SEEK_END still has a reason to exist,
> perhaps we should just kill FileSeek() and add FileSize() or something
> instead?

I see you did this in your updated patch :+1:
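
For illustration, once reads and writes carry their own offsets, a
FileSize-style helper reduces to roughly this sketch (the name is
assumed):

    #include <sys/types.h>
    #include <unistd.h>

    /*
     * With pread()/pwrite() carrying their own offsets, the remaining
     * use of the seek pointer is reporting the file's size, which
     * lseek(SEEK_END) gives directly.
     */
    static off_t
    file_size(int fd)
    {
        return lseek(fd, 0, SEEK_END);
    }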

Happy to see this patch move forward.

/ Oskari



Re: pread() and pwrite()

2018-07-20 Thread Heikki Linnakangas

On 20/07/18 01:50, Thomas Munro wrote:

An idea for how to handle Windows, in a follow-up patch: add a file
src/backend/port/win32/file.c that defines pgwin32_pread() and
pgwin32_pwrite(), wrapping WriteFile()/ReadFile() and passing in an
"OVERLAPPED" struct with the offset and sets errno on error, then set
up the macros so that Windows can use them as pread(), pwrite().  It
might also be necessary to open all files with FILE_FLAG_OVERLAPPED.
Does any Windows hacker have a bettter idea, and/or want to try to
write that patch?  Otherwise I'll eventually try to do some long
distance hacking on AppVeyor.


No objections, if you want to make the effort. But IMHO the lseek+read 
fallback is good enough on Windows. Unless you were thinking that we 
could then remove the !HAVE_PREAD fallback altogether. Are there any 
other platforms out there that don't have pread/pwrite that we care about?
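
For what it's worth, the wrapper idea might look roughly like this
(names and the errno mapping are assumptions, not a tested
implementation):

    #include <windows.h>
    #include <errno.h>
    #include <stddef.h>

    /*
     * Sketch: emulate pread() by handing ReadFile() the offset in an
     * OVERLAPPED struct.  Real code would map GetLastError() to errno
     * properly, treat ERROR_HANDLE_EOF as a zero-byte read, and may
     * need the handle opened with FILE_FLAG_OVERLAPPED.
     */
    static long
    pgwin32_pread(HANDLE h, void *buf, size_t count, __int64 offset)
    {
        OVERLAPPED  ov = {0};
        DWORD       nread;

        ov.Offset = (DWORD) (offset & 0xFFFFFFFF);
        ov.OffsetHigh = (DWORD) (offset >> 32);
        if (!ReadFile(h, buf, (DWORD) count, &nread, &ov))
        {
            errno = EIO;
            return -1;
        }
        return (long) nread;
    }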


- Heikki



Re: [HACKERS] logical decoding of two-phase transactions

2018-07-20 Thread Nikhil Sontakke
Hi Andres,


> So what if we, at the begin / end of cache miss handling, re-check if
> the to-be-decoded transaction is still in-progress (or has
> committed). And we throw an error if that happened. That error is then
> caught in reorderbuffer, the in-progress-xact aborted callback is
> called, and processing continues (there's a couple nontrivial details
> here, but it should be doable).
>
> The biggest issue is what constitutes a "cache miss". It's fairly
> trivial to do this for syscache / relcache, but that's not sufficient:
> there's plenty cases where catalogs are accessed without going through
> either. But as far as I can tell if we declared that all historic
> accesses have to go through systable_beginscan* - which'd imo not be a
> crazy restriction - we could put the checks at that layer.
>

Documenting that historic accesses go through systable_* APIs does
seem reasonable. In our earlier discussions, we felt asking plugin
writers to do anything along these lines was too onerous and
cumbersome to expect.

> That'd require that an index lookup can't crash if the corresponding
> heap entry doesn't exist (etc), but that's something we need to handle
> anyway.  The issue that multiple separate catalog lookups need to be
> coherent (say Robert's pg_class exists, but pg_attribute doesn't
> example) is solved by virtue of the pg_attribute lookups failing if
> the transaction aborted.
>
> Am I missing something here?
>

Are you suggesting we have a:

PG_TRY()
{
Catalog_Access();
}
PG_CATCH()
{
Abort_Handling();
}

here?

Regards,
Nikhils



Re: More consistency for some file-related error message

2018-07-20 Thread Michael Paquier
On Thu, Jul 19, 2018 at 12:33:30PM +0900, Michael Paquier wrote:
> On Wed, Jul 18, 2018 at 11:24:05PM -0400, Tom Lane wrote:
>> read() is required by spec to set errno when returning a negative result.
>> I think the previous coding paid attention to errno regardless of the sign
>> of the result, which would justify pre-zeroing it ... but the new coding
>> definitely doesn't.
> 
> Yes, my point is a bit different though...  Do you think that we need to
> bother about the case where errno is not 0 before calling read(), in the
> case where it returns a positive result?  This would mean that errno
> would still have a previous error set even though it returned a number of
> bytes read.  For the code paths discussed here that visibly does not
> matter, so you are right, we could remove them; still, patterns easily get
> copy-pasted around...

Okay, I just looked again at this point, and among the new messages only
what's in XLogFileCopy has bothered to set errno to 0 (see
811b6e3), so let's remove it in this case.

Thoughts about the previous patch set?
--
Michael


signature.asc
Description: PGP signature