Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Andres Freund  writes:
> On 2014-06-10 11:14:43 -0400, Tom Lane wrote:
>> Because it would convert the intended behavior (postmaster and only
>> postmaster is exempt from OOM kill) into a situation where possibly
>> all of the database processes are exempt from OOM kill, at the whim
>> of somebody who should not have the privilege to decide that.

> Meh^3. By that argument we need to forbid superusers to create any form
> of untrusted functions. Forbid anything that does malloc(), system(),
> fork(), whatever from a user's influence.

That's utter and complete nonsense.  We're discussing an operation that is
root-privileged (ie, lowering a process's OOM score), not random stuff
that unprivileged processes can do.

regards, tom lane




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Andres Freund
On 2014-06-10 11:35:23 -0400, Robert Haas wrote:
> > Because it would convert the intended behavior (postmaster and only
> > postmaster is exempt from OOM kill) into a situation where possibly
> > all of the database processes are exempt from OOM kill, at the whim
> > of somebody who should not have the privilege to decide that.
> 
> Gurjeet already refuted that argument.

Except that he was wrong:
~# echo -500 > /proc/$BASHPID/oom_score_adj
~# su andres
~$ echo -1000 > /proc/$BASHPID/oom_score_adj
bash: echo: write error: Permission denied # so I can't decrease the likelihood
~$ echo -400 > /proc/$BASHPID/oom_score_adj # but I can increase it
~$ echo -500 > /proc/$BASHPID/oom_score_adj # but also *reset* it

Since we have to do the adjustment after the fork - otherwise the
postmaster would be vulnerable for a time - normal backends can reset
their score.

> Apparently, the child
> processes can only increase their chance of being OOM-killed, not
> decrease it, so you have this exactly backwards: right now, an
> individual system administrator can exempt all of the database
> processes from OOM kill, but cannot exempt the postmaster alone.

Well, if you compile postgres with the #define, it'll reset the
score after the fork; is that the reset you mean? Or do you mean without
compiling postgres with the option?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 11:14 AM, Tom Lane  wrote:
> In my view, the root-owned startup script grants OOM exemption to
> the postmaster because it *knows* that the postmaster's children
> will drop the exemption.  If that trust can be violated because some
> clueless DBA decided to frob a GUC setting, I'd be a lot more hesitant
> about giving the postmaster the exemption in the first place.

Even if the clueless DBA tries to set the oom_score_adj below that of
the postmaster, the Linux kernel prevents it. I demonstrate that in the
session transcript below. I used the GUC as well as the plain command
line to try to set a child's badness (oom_score_adj) lower than the
postmaster's, and Linux disallows it unless I use root's powers.

So the argument that this GUC is a security concern can be ignored. The
root user (the one with control of the start script) still controls the
lowest badness setting of all Postgres processes. If done at
fork_process time, the child process simply inherits the parent's
badness setting.

Best regards,

# Set postmaster badness: -1000
$ sudo echo -1000 > /proc/$$/oom_score_adj ; cat /proc/self/oom_score_adj
-1000

# Set backend badness the same as postmaster: -1000
$ perl -pi -e 's/linux_oom_score_adj = .*/linux_oom_score_adj = -1000/' $PGDATA/postgresql.conf ; grep oom $PGDATA/postgresql.conf
linux_oom_score_adj = -1000

$ pgstop; pgstart
pg_ctl: server is running (PID: 65355)
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres "-D"
"/home/gurjeet/dev/pgdbuilds/oom_guc/db/data"
waiting for server to shut down done
server stopped
pg_ctl: no server running
waiting for server to start done
server started
Database cluster state:   in production

# Backends and the postmaster have the same badness; lower than default 0
$ for p in $(echo $(pgserverPIDList) | cut -d , --output-delimiter=' ' -f 1-); do echo $p $(cat /proc/$p/oom_score_adj) $(cat /proc/$p/cmdline); done
537 -1000 
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data
539 -1000 postgres: checkpointer process
540 -1000 postgres: writer process
541 -1000 postgres: wal writer process
542 -1000 postgres: autovacuum launcher process
543 -1000 postgres: stats collector process

# Set postmaster badness: -100
$ sudo echo -100 > /proc/$$/oom_score_adj ; cat /proc/self/oom_score_adj
-100

# Set backend badness lower than postmaster's: -500 vs. -100
$ perl -pi -e 's/linux_oom_score_adj = .*/linux_oom_score_adj = -500...
linux_oom_score_adj = -500

$ pgstop; pgstart
...

# Backends are capped to parent's badness.
$ for p in $(echo $(pgserverPIDList)| cut -d , ...
716 -100 
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data
718 -100 postgres: checkpointer process
719 -100 postgres: writer process
720 -100 postgres: wal writer process
721 -100 postgres: autovacuum launcher process
722 -100 postgres: stats collector process

# Set postmaster badness: -100
$ sudo echo -100 > /proc/$$/oom_score_adj ; cat /proc/self/oom_score_adj
-100

# Set backend badness higher than postmaster's: +500 vs. -100
$ perl -pi -e 's/linux_oom_score_adj = .*/linux_oom_score_adj = 500/' $PGDATA/postgresql.conf ; grep oom $PGDATA/postgresql.conf
linux_oom_score_adj = 500

$ pgstop; pgstart
...

# Backends have higher badness, hence higher likelihood of being killed
# than the postmaster.
$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1602 -100 
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1604 500 postgres: checkpointer process
1605 500 postgres: writer process
1606 500 postgres: wal writer process
1607 500 postgres: autovacuum launcher process
1608 500 postgres: stats collector process

# Set postmaster badness: -100
$ sudo echo -100 > /proc/$$/oom_score_adj ; cat /proc/self/oom_score_adj
-100

# Reset backend badness to default: 0
$ perl -pi -e 's/linux_oom_score_adj = .*/linux_oom_score_adj = 0/' $PGDATA/postgresql.conf ; grep oom $PGDATA/postgresql.conf
linux_oom_score_adj = 0

$ pgstop; pgstart
...

# Backends have higher badness, hence higher likelihood of being killed
# than the postmaster.
$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1835 -100 
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1837 0 postgres: checkpointer process
1838 0 postgres: writer process
1839 0 postgres: wal writer process
1840 0 postgres: autovacuum launcher process
1841 0 postgres: stats collector process

# Lower checkpointer's badness, but keep it higher than postmaster's
$ echo -1 > /proc/1837/oom_score_adj

$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1835 -100 
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1837 -1 postgres: checkpointer process
...

# Lower checkpointer's badness, but keep it same as postmaster's
$ echo -100 > /proc/1837/oom_score_adj

$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1835 -

Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> Well, clearly, somebody hasn't got it right, or there wouldn't be this
> complaint.  I'll grant you that "somebody" may be EnterpriseDB's own
> packaging in this instance, but I wouldn't like to bet that no one
> else has ever got this wrong nor ever will.  Peter asked upthread why
> this wasn't a GUC with the comment that "Why is this feature not a
> run-time configuration variable or at least a configure option?  It's
> awfully well hidden now.  I doubt a lot of people are using this even
> though they might wish to."  I think that's quite right, and note that
> Peter is in no way affiliated with EnterpriseDB and made that comment
> (rather presciently) long before Gurjeet's recent report.

I'd be okay with a configure option, if you think that would make this
issue more visible to packagers.  It's delegating the responsibility to
the DBA level that I'm unhappy about.

>> Because it would convert the intended behavior (postmaster and only
>> postmaster is exempt from OOM kill) into a situation where possibly
>> all of the database processes are exempt from OOM kill, at the whim
>> of somebody who should not have the privilege to decide that.

> Gurjeet already refuted that argument.

He can refute it all he likes, but that doesn't make his opinion correct.

> How about using an environment variable?  It seems to me that would
> address the concern about DBAs without shell access.  They might be
> able to frob GUCs, but presumably not the postmaster's starting
> environment.

Hmmm ... yeah, that might work.  It would solve the problem I'm worried
about, which is making sure that the startup script has control of what
happens.

regards, tom lane




Re: [HACKERS] Scaling shared buffer eviction

2014-06-10 Thread Robert Haas
On Sun, Jun 8, 2014 at 9:51 AM, Kevin Grittner  wrote:
> Amit Kapila  wrote:
>> I have improved the patch by making the following changes:
>> a. Improved the bgwriter logic to log xl_running_xacts info and
>>    removed the hibernate logic, as bgwriter will now work only when
>>    there is a scarcity of buffers in the free list. The basic idea is:
>>    when the number of buffers on the freelist drops below the low
>>    threshold, the allocating backend sets the latch, and the bgwriter
>>    wakes up and begins adding buffers to the freelist until it reaches
>>    the high threshold, and then goes back to sleep.
>
> The numbers from your benchmarks are very exciting, but the above
> concerns me.  My tuning of the bgwriter in production has generally
> *not* been aimed at keeping pages on the freelist,

Just to be clear, prior to this patch, the bgwriter has never been in
the business of putting pages on the freelist in the first place, so
it wouldn't have been possible for you to tune for that.

> Essentially I was able to tune the bgwriter so that a dirty page
> was always pushed out to the OS cache within three seconds, which led
> to a healthy balance of writes between the checkpoint process and
> the bgwriter. Backend processes related to user connections still
> performed about 30% of the writes, and this work shows promise
> toward bringing that down, which would be great; but please don't
> eliminate the ability to prevent write stalls in the process.

I think, as Amit says downthread, that the crucial design question
here is whether we need two processes, one to populate the freelist so
that regular backends don't need to run the clock sweep, and a second
to flush dirty buffers, or whether a single process can serve both
needs.  In favor of a single process, many people have commented that
the background writer doesn't seem to do much right now.  If the
process is mostly sitting around idle, then giving it more
responsibilities might be OK.  In favor of having a second process,
I'm a little concerned that if the background writer gets busy writing
a page, it might then be unavailable to populate the freelist until it
finishes, which might be a very long time relative to the buffer
allocation needs of other backends.  I'm not sure what the right
answer is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Andres Freund
On 2014-06-10 11:40:25 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2014-06-10 11:14:43 -0400, Tom Lane wrote:
> >> Because it would convert the intended behavior (postmaster and only
> >> postmaster is exempt from OOM kill) into a situation where possibly
> >> all of the database processes are exempt from OOM kill, at the whim
> >> of somebody who should not have the privilege to decide that.
> 
> > Meh^3. By that argument we need to forbid superusers to create any form
> > of untrusted functions. Forbid anything that does malloc(), system(),
> > fork(), whatever from a user's influence.
> 
> That's utter and complete nonsense.  We're discussing an operation that is
> root-privileged (ie, lowering a process's OOM score), not random stuff
> that unprivileged processes can do.

Oh, come on, Tom. You a) conveniently left off the part where I said that
the user can execute code from the postmaster; b) fork() can be used to
escape the OOM killer; c) lots of much worse things can be done to the
system with arbitrary system calls than adjusting oom_score_adj.

The postmaster can currently change oom_score_adj, and users can run code
as the postmaster user. Simple as that.

Besides, as demonstrated in
http://www.postgresql.org/message-id/20140610154536.gn8...@alap3.anarazel.de
postmaster children can already reset their score.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 11:46 AM, Tom Lane  wrote:
> I'd be okay with a configure option, if you think that would make this
> issue more visible to packagers.  It's delegating the responsibility to
> the DBA level that I'm unhappy about.
>
[...]
>
>> How about using an environment variable?  It seems to me that would
>> address the concern about DBAs without shell access.  They might be
>> able to frob GUCs, but presumably not the postmaster's starting
>> environment.
>
> Hmmm ... yeah, that might work.  It would solve the problem I'm worried
> about, which is making sure that the startup script has control of what
> happens.

A configure option wouldn't address the issue from my perspective; you
still have to recompile if you don't like what your packager did, and
your packager might still be stupid.  However, an environment variable
seems like it would be just fine.  It might even be more convenient
for packagers than having to compile the desired value into the
binary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
and...@anarazel.de (Andres Freund) writes:
> On 2014-06-10 11:20:28 -0400, Tom Lane wrote:
>> Maybe I'm mistaken, but I thought once the fork_process code has reset our
>> process's setting to zero it's not possible to lower it again (without
>> privileges we'd not have).

> No, doesn't look that way. It's possible to reset it to the value set at
> process start. So unless we introduce double forks for every backend
> start it can be reset by ordinary processes.

That's kind of annoying --- I wonder why they went to the trouble of doing
that?  But anyway, it's probably not worth the cost of double-forking to
prevent it.  I still say though that this is not a reason to make it as
easy as change-a-GUC to break the intended behavior.
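
For reference, a minimal sketch of the double-fork approach being weighed
(and judged not worth its cost) here. write_oom_score_adj() and
run_backend() are hypothetical stand-ins stubbed for illustration, not
PostgreSQL functions, and this assumes the fork-time floor semantics
Andres describes:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write a value to our own OOM score file. */
static void write_oom_score_adj(const char *v)
{
    FILE *fp = fopen("/proc/self/oom_score_adj", "w");
    if (fp) { fputs(v, fp); fclose(fp); }
}

/* Hypothetical stand-in for the backend main loop. */
static void run_backend(void) { }

static void
start_backend_double_forked(void)
{
    if (fork() == 0)
    {
        /* Intermediate child: raise the score before forking again, so
         * the grandchild's "value set at process start" is already 0
         * and it cannot lower itself back to the postmaster's value. */
        write_oom_score_adj("0");
        if (fork() == 0)
        {
            run_backend();
            _exit(0);
        }
        _exit(0);               /* intermediate child exits immediately */
    }
}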

Robert's idea of having the start script set an environment variable to
control the OOM adjustment reset seems like it would satisfy my concern.

regards, tom lane




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 11:24 AM, Andres Freund  wrote:
> On 2014-06-10 11:14:43 -0400, Tom Lane wrote:
>> Sure, but what's that have to do with this?  Any Red Hat or PGDG RPM will
>> come with this code already enabled in the build, so there is no need for
>> anyone to have a GUC to play around with the behavior.
>
> That's imo a fair point. Unless I misunderstand things Gurjeet picked
> the topic up again because he wants to increase the priority of the
> children. Is that correct Gurjeet?

Yes. A DBA would like to prevent the postmaster from being killed by
the OOM killer, but still let the child processes be subject to the OOM
killer's whim.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 10, 2014 at 11:29 AM, Tom Lane  wrote:
>> If you think your users might want to give the postmaster OOM-exemption,
>> why don't you just activate the existing code when you build?  Resetting
>> the OOM setting to zero is safe whether or not the startup script did
>> anything to the postmaster's setting.

> The whole scenario here is that the user *doesn't want to recompile*.

Yeah, I understood that.  The question is why EDB isn't satisfied to just
add "-DLINUX_OOM_ADJ=0" to their build options, but instead would like to
dump a bunch of uncertainty on other packagers who might not like the
implications of a GUC.

regards, tom lane




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Andres Freund
On 2014-06-10 11:52:17 -0400, Tom Lane wrote:
> and...@anarazel.de (Andres Freund) writes:
> > On 2014-06-10 11:20:28 -0400, Tom Lane wrote:
> >> Maybe I'm mistaken, but I thought once the fork_process code has reset our
> >> process's setting to zero it's not possible to lower it again (without
> >> privileges we'd not have).
> 
> > No, doesn't look that way. It's possible to reset it to the value set at
> > process start. So unless we introduce double forks for every backend
> > start it can be reset by ordinary processes.
> 
> That's kind of annoying --- I wonder why they went to the trouble of doing
> that?

My guess is that they want to allow a process to only temporarily reduce
the likelihood of getting killed while it's doing something
important.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 11:52 AM, Tom Lane  wrote:
> and...@anarazel.de (Andres Freund) writes:
>> On 2014-06-10 11:20:28 -0400, Tom Lane wrote:
>>> Maybe I'm mistaken, but I thought once the fork_process code has reset our
>>> process's setting to zero it's not possible to lower it again (without
>>> privileges we'd not have).
>
>> No, doesn't look that way. It's possible to reset it to the value set at
>> process start. So unless we introduce double forks for every backend
>> start it can be reset by ordinary processes.
>
> That's kind of annoying --- I wonder why they went to the trouble of doing
> that?  But anyway, it's probably not worth the cost of double-forking to
> prevent it.  I still say though that this is not a reason to make it as
> easy as change-a-GUC to break the intended behavior.
>
> Robert's idea of having the start script set an environment variable to
> control the OOM adjustment reset seems like it would satisfy my concern.

I'm fine with this solution. Should this be a constant 0, or be
configurable based on the environment variable's value?

-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> If we could allow NTUP_PER_BUCKET to drop when the hashtable is
> expected to fit in memory either way, perhaps with some safety margin
> (e.g. we expect to use less than 75% of work_mem), I bet that would
> make the people who have been complaining about this issue happy.  And
> probably a few people who haven't been complaining, too: I don't
> recall the precise performance numbers that were posted before, but
> ISTR the difference between 10 and 1 was pretty dramatic.

This is more-or-less what I had been hoping to find time to do.  As Tom
points out, more testing is needed, and there's still the trade-off
between "fits-in-L2" and "doesn't" which can make a pretty big
difference while building the hash table.  Of course, once it's built,
it's faster to use it when it's larger as there are fewer collisions.
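
As a rough sketch of the heuristic Robert describes (the names and the
sizing formula here are illustrative assumptions, not PostgreSQL's actual
code):

static int
choose_ntup_per_bucket(double ntuples, double tuple_bytes,
                       double work_mem_bytes)
{
    /* One bucket per tuple costs one pointer per bucket on top of
     * the space for the tuples themselves. */
    double  tuple_space = ntuples * tuple_bytes;
    double  dense_bucket_space = ntuples * sizeof(void *);

    /* Drop from 10 to 1 only when the denser table is still expected
     * to fit with a safety margin (75% of work_mem, as suggested). */
    if (tuple_space + dense_bucket_space < 0.75 * work_mem_bytes)
        return 1;               /* fewer collisions, same batch count */
    return 10;                  /* conservative: avoid doubling batches */
}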

I do plan to come back to this and will hopefully find time over the
next few months.

Thanks,

Stephen




Re: [HACKERS] Proposing pg_hibernate

2014-06-10 Thread Robert Haas
On Thu, Jun 5, 2014 at 8:32 AM, Gurjeet Singh  wrote:
> On Wed, Jun 4, 2014 at 2:50 PM, Robert Haas  wrote:
>> The thing I was concerned about is that the system might have been in
>> recovery for months.  What was hot at the time the base backup was
>> taken seems like a poor guide to what will be hot at the time of
>> promotion. Consider a history table, for example: the pages at the
>> end, which have just been written, are much more likely to be useful
>> than anything earlier.
>
> I think you are specifically talking about a warm-standby that runs
> recovery for months before being brought online. As described in my
> response to Amit, if the base backup used to create that standby was
> taken after the BlockReaders had restored the buffers (which should
> complete within a few minutes of startup, even for large databases),
> then there's no concern since the base backup wouldn't contain the
> save-files.
>
> If it's a hot-standby, the restore process would start as soon as the
> database starts accepting connections, finish soon after, and get
> completely out of the way of the normal recovery process. At which
> point the buffers populated by the recovery would compete only with
> the buffers being requested by backends, which is the normal
> behaviour.

I guess I don't see what warm-standby vs. hot-standby has to do with
it.  If recovery has been running for a long time, then restoring
buffers from some save file created before that is probably a bad
idea, regardless of whether the buffers already loaded were read in by
recovery itself or by queries running on the system.  But if you're
saying that doesn't happen, then there's no problem there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Gurjeet Singh  writes:
> On Tue, Jun 10, 2014 at 11:52 AM, Tom Lane  wrote:
>> Robert's idea of having the start script set an environment variable to
>> control the OOM adjustment reset seems like it would satisfy my concern.

> I'm fine with this solution. Should this be a constant 0, or be
> configurable based on env. variable's value?

If we're going to do this, I would say that we should also take the
opportunity to get out from under the question of which kernel API
we're dealing with.  So perhaps a design like this:

1. If the environment variable PG_OOM_ADJUST_FILE is defined, it's
the name of a file to write something into after forking.  The typical
value would be "/proc/self/oom_score_adj" or "/proc/self/oom_adj".
If not set, nothing happens.

2. If the environment variable PG_OOM_ADJUST_VALUE is defined, that's
the string we write, otherwise we write "0".
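
For concreteness, a minimal sketch in C of what the forked child would do
under this design; it assumes only the two variables described above and
is not actual PostgreSQL source:

#include <stdio.h>
#include <stdlib.h>

static void
reset_oom_score_from_env(void)
{
    const char *file = getenv("PG_OOM_ADJUST_FILE");
    const char *value = getenv("PG_OOM_ADJUST_VALUE");
    FILE       *fp;

    if (file == NULL)
        return;                 /* point 1: variable unset, do nothing */
    if (value == NULL)
        value = "0";            /* point 2: default to writing "0" */

    fp = fopen(file, "w");      /* e.g. "/proc/self/oom_score_adj" */
    if (fp != NULL)
    {
        fputs(value, fp);
        fclose(fp);
    }
}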

regards, tom lane




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 11:56 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Jun 10, 2014 at 11:29 AM, Tom Lane  wrote:
>>> If you think your users might want to give the postmaster OOM-exemption,
>>> why don't you just activate the existing code when you build?  Resetting
>>> the OOM setting to zero is safe whether or not the startup script did
>>> anything to the postmaster's setting.
>
>> The whole scenario here is that the user *doesn't want to recompile*.
>
> Yeah, I understood that.  The question is why EDB isn't satisfied to just
> add "-DLINUX_OOM_ADJ=0" to their build options, but instead would like to
> dump a bunch of uncertainty on other packagers who might not like the
> implications of a GUC.

Well, I think we should do that, too, and I'm going to propose it to
Dave & team.  If we're not already doing the right thing here (and I
don't know off-hand whether we are), then that is definitely our issue
to fix and there's no reason to change PostgreSQL on that account.

But I think the point is that this is a pretty subtle and
well-concealed thing which a PostgreSQL packager might fail to do
correctly in every instance - or where an individual user might want
to install a different policy than whatever the packager's default is.
Making that easy to do without recompiling seems to me to be a general
good.  Magnus once commented to me that every GUC which is
PGC_POSTMASTER is a bug - "perhaps not an easy bug to fix, but a bug
all the same" (his words), because requiring people to restart the
postmaster to change settings is often problematic.  I think this is
even more true of recompiling.  IMO, it should be an explicit goal of
the project to make reconfiguration of an already-installed system -
perhaps even already in production - as easy as possible. We will
doubtless fail to achieve perfection but that doesn't mean we
shouldn't try.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] branch for 9.5?

2014-06-10 Thread Robert Haas
As 9.5CF1 is due to start in about 5 days, it seems we'd better branch
pretty soon.  Anyone planning to take care of that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] branch for 9.5?

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> As 9.5CF1 is due to start in about 5 days, it seems we'd better branch
> pretty soon.  Anyone planning to take care of that?

I think I've done it the last couple times, and am happy to do so again.

Anyone want to wait longer?  As Robert says, we need to do it no later
than the CF start anyway.

regards, tom lane




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread David G Johnston
In short:

I can accept the idea that picking reasonable specific values is impossible,
so let's just ensure that children are always killed before the parent
(basically the default behavior). If you then say that any system that is
capable of implementing that rule should have it set, then leaving it
unexposed makes sense.  That line of thought does not require the abstract
cost of a GUC to factor into the equation.

However, in considering system configuration and concurrently running
processes


Tom Lane-2 wrote
> Extra GUCs do not have zero cost, especially not ones that are as
> complicated-to-explain as this would be.

The complicated-to-explain factor does little for me: given a reasonable
default, the GUC can be ignored until a problem manifests.  At that point,
not having a GUC means resolving the problem requires someone to stop using
packaging and instead compile their own source.  This results in a poor
user experience.

So, if someone were to provide a complete patch that introduced a GUC to
enable/disable as well as set priorities - including documentation and
reasonable postgresql.conf defaults - what specific ongoing GUC costs would
prevent applying said patch?

You mention a security hazard, but that is not a cost of GUCs generally,
only of this one specifically; and it is one that seems to have been deemed
no riskier than other DBA capabilities.

Help me understand the cost side of the equation, since the benefit side
seems clear-cut enough to me.  The OOM problem is real, and making PostgreSQL
fit within the overall memory management architecture of a given server is
the responsibility of the system admin in conjunction with the DBA - not us
or the packager.  I'll buy that, because this is a Linux issue, the
implementation could be packager-selectable; but given the dynamics of
Linux, centralizing a reasonable default implementation into core makes
sense.

David J.








Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Joshua D. Drake


On 06/10/2014 07:02 AM, Tom Lane wrote:
> Gurjeet Singh  writes:
>> Startup scripts are not solely in the domain of packagers. End users
>> can also be expected to develop/edit their own startup scripts.
>
>> Providing it as a GUC would have given end users both the pieces, but
>> with a compile-time option they have only one half of the solution;
>> except if they go compile their own binaries, which forces them into
>> being packagers.
>
> I don't find that this argument holds any water at all.  Anyone who's
> developing their own start script can be expected to manage recompiling
> Postgres.

This is certainly not true in any fashion. Production systems require a
degree of flexibility and configuration that does not involve maintaining
or compiling source code.


Sincerely,

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans."




Re: [HACKERS] NUMA packaging and patch

2014-06-10 Thread Josh Berkus
On 06/08/2014 03:45 PM, Kevin Grittner wrote:
> By default, the OS cache and buffers are allocated in the memory
> node with the shortest "distance" from the CPU a process is running
> on. 

Note that this will stop being the default in future Linux kernels.
However, we'll have to deal with the old ones for some time to come.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread David G Johnston
Gurjeet Singh-4 wrote
> Even if the clueless DBA tries to set the oom_score_adj below that of
> Postmaster, Linux kernel prevents that from being done. I demonstrate
> that in the below screen share. I used GUC as well as plain command
> line to try and set a child's badness (oom_score_adj) to be lower than
> that of Postmaster's, and Linux disallows doing that, unless I use
> root's powers.
> 
> So the argument that this GUC is a security concern, can be ignored.
> Root user (one with control of start script) still controls the lowest
> badness setting of all Postgres processes. If done at fork_process
> time, the child process simply inherits parent's badness setting.

The counterpoint here is that the postmaster can be set to "no kill", and
the >= condition allows children to achieve the same, while our explicit
intent is that the children be strictly > the parent.

To that end, should the adjustment value be provided as an offset to the
postmaster's instead of an absolute value - and disallow <= zero offset
values in the process?

I get and generally agree with the environment variable proposal and its
stated goal of restricting who can make changes.  But how much less cost
does an environment variable have than a GUC, if one argument against a
GUC is still its maintenance overhead?

David J.









Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Magnus Hagander
On Jun 10, 2014 7:05 PM, "Joshua D. Drake"  wrote:
>
>
> On 06/10/2014 07:02 AM, Tom Lane wrote:
>>
>> Gurjeet Singh  writes:
>>>
>>> Startup scripts are not solely in the domain of packagers. End users
>>> can also be expected to develop/edit their own startup scripts.
>>
>>
>>> Providing it as a GUC would have given end users both the pieces, but
>>> with a compile-time option they have only one half of the solution;
>>> except if they go compile their own binaries, which forces them into
>>> being packagers.
>>
>>
>> I don't find that this argument holds any water at all.  Anyone who's
>> developing their own start script can be expected to manage recompiling
>> Postgres.
>
>
> This is certainly not true in any fashion. Production systems require a
> degree of flexibility and configuration that does not involve maintaining
> or compiling source code.
>

As JD surely understands, it pains me a lot but I have to agree with him
here.

There are also a non-trivial number of organizations that just don't allow
compilers on production boxes, adding another hurdle for those. We may
think that's ridiculous, and it may be, but it's reality.

That said, I agree that a GUC is probably a bad idea. I like the idea of an
environment variable if we need it to be changeable. Or perhaps what we
need is just better documentation of what's there already. Perhaps the
documentation should have a section specifically aimed at packagers?

/Magnus


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 1:13 PM, David G Johnston
 wrote:
> Gurjeet Singh-4 wrote
>> So the argument that this GUC is a security concern, can be ignored.
>> Root user (one with control of start script) still controls the lowest
>> badness setting of all Postgres processes. If done at fork_process
>> time, the child process simply inherits parent's badness setting.
>
> The counter point here is that the postmaster can be set to "no kill" and

Only the root user can do that, since he/she has control over the
start script. All that the DBA could do with a GUC is to set backends'
badness worse than postmaster's, but not better.

> the >= condition allows for children to achieve the same while it is our
> explicit intent that the children be strictly > parent.

I don't think anyone argued for that behaviour.

> To that end, should the adjustment value be provided as an offset to the
> postmasters instead of an absolute value - and disallow <= zero offset
> values in the process?

Seems unnecessary, given current knowledge.

> I get and generally agree with the environment variable proposal and its
> stated goal of restricting who can make changes.  But how much less cost
> does an environment variable have than a GUC, if one argument against a
> GUC is still its maintenance overhead?

Having it as a GUC would have meant that two entities are required to
get the configuration right: one who controls start scripts, and the
other who controls GUC settings.

With the environment variable approach, the root user alone can control
the behaviour, like so in the start script:

echo -200 > /proc/self/oom_score_adj
export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
export PG_OOM_ADJUST_VALUE=-100

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] NUMA packaging and patch

2014-06-10 Thread Kevin Grittner
Josh Berkus  wrote:
> On 06/08/2014 03:45 PM, Kevin Grittner wrote:
>> By default, the OS cache and buffers are allocated in the memory
>> node with the shortest "distance" from the CPU a process is
>> running on.
>
> Note that this will stop being the default in future Linux kernels.
> However, we'll have to deal with the old ones for some time to come.

I was not aware of that.  Thanks.  Do you have a URL handy?

In any event, that is the part of the problem which I think falls
into the realm of packagers and/or sysadmins; a patch for that
doesn't seem sensible, given how cpusets are implemented.  I did
figure we would want to add some documentation around it, though. 
Do you agree that is worthwhile?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Getting “make: Entering an unknown directory” error while building postgresql on MIPS platform

2014-06-10 Thread shreesha21
I am trying to install postgresql on MIPS platform and I am getting the
following error while building it:

make[4]: Leaving directory
`/home/shreesha/platform/utils/postgresql-9.3.4/src/port'
/home/shreesha/platform/tools/bin/make -C timezone all
make[4]: Entering directory
`/home/shreesha/platform/utils/postgresql-9.3.4/src/timezone'
/home/shreesha/platform/tools/bin/make -C common all
make: Entering an unknown directory
make: *** common: No such file or directory.  Stop.
make: Leaving an unknown directory
make[4]: *** [common-recursive] Error 2
make[4]: Leaving directory
`/home/shreesha/platform/utils/postgresql-9.3.4/src/timezone'
make[3]: *** [all-timezone-recurse] Error 2
make[3]: Leaving directory
`/home/shreesha/platform/utils/postgresql-9.3.4/src'
make[2]: *** [all-src-recurse] Error 2
make[2]: Leaving directory `/home/shreesha/platform/utils/postgresql-9.3.4'

Do we need to set any specific make variables/environmental variables in
this case? Any help on debugging this is much appreciated!







Re: [HACKERS] Re: why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Jeff Janes
On Tue, Jun 10, 2014 at 5:17 AM, Robert Haas  wrote:



> The problem case is when you have 1 batch and the increased memory
> consumption causes you to switch to 2 batches.  That's expensive.  It
> seems clear based on previous testing that *on the average*
> NTUP_PER_BUCKET = 1 will be better, but in the case where it causes an
> increase in the number of batches it will be much worse - particularly
> because the only way we ever increase the number of batches is to
> double it, which is almost always going to be a huge loss.
>

Is there a reason we don't do hybrid hashing, where if 80% fits in memory
then we write out only the 20% that doesn't? And then when probing the
table with the other input, the 80% that land in in-memory buckets get
handled immediately, and only the 20% that land in the on-disk buckets get
written for the next step?

Obviously no one has implemented it yet, but is there a fundamental reason
for that, or is it just a round-tuit problem?
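
To make the control flow concrete, a toy sketch of the scheme described
above, with integer keys and printf standing in for real tuples and temp
files (this reflects nothing in PostgreSQL's nodeHash.c):

#include <stdio.h>
#include <stdbool.h>

#define NBUCKETS 10

/* Pretend ~80% of the build side's buckets fit in memory. */
static bool in_memory[NBUCKETS] = {
    true, true, true, true, true, true, true, true, false, false
};

static void
probe(int key)
{
    if (in_memory[key % NBUCKETS])
        printf("key %d: joined immediately against in-memory bucket\n", key);
    else
        printf("key %d: written out for the next batch\n", key);
}

int
main(void)
{
    for (int key = 0; key < 20; key++)
        probe(key);
    return 0;
}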

Cheers,

Jeff


Re: [HACKERS] API change advice: Passing plan invalidation info from the rewriter into the planner?

2014-06-10 Thread Adam Brightwell
Hi all,

This is my first post to the mailing list and I am looking forward to
working with everyone in the community.

With that said...

> I'll take a look at changing the cache key to include user ID and
> ripping out the plan invalidation logic from the current patch tomorrow
> but I seriously doubt I'll be able to get all of that done in the next
> day or two.  If anyone else is able to help out, it'd certainly be
> appreciated; I really think that's the main hurdle to address at this
> point with this patch- without the plan invalidation complexity, the
> the patch is really just building out the catalog, the SQL-level
> statements for managing it, and the bit of code required to add the
> conditional to statements involving RLS-enabled tables.
>

I have been collaborating with Stephen on addressing this particular item
with RLS.

As a basis, I have been working with Craig's 'rls-9.4-upd-sb-views' branch
rebased against master around 9.4beta1.

Through this effort, we have concluded that for RLS, invalidating a plan
is only necessary when switching between a superuser and a non-superuser.
Obviously, re-planning on every role change would be too costly, but this
approach should help minimize that cost.  Also, no cases outside of this
one were immediately apparent with respect to RLS that would require
re-planning on a per-userid basis.

I have tested this approach with the following patch.

https://github.com/abrightwell/postgres/commit/4c959e63f7a89b24ebbd46575a31a629d24efa75

Does this sound like a sane approach?  Thoughts?  Recommendations?

Thanks,
Adam


Re: [HACKERS] Getting “make: Entering an unknown directory” error while building postgresql on MIPS platform

2014-06-10 Thread Tom Lane
shreesha21  writes:
> I am trying to install postgresql on MIPS platform and I am getting the
> following error while building it:

> make[4]: Leaving directory
> `/home/shreesha/platform/utils/postgresql-9.3.4/src/port'
> /home/shreesha/platform/tools/bin/make -C timezone all
> make[4]: Entering directory
> `/home/shreesha/platform/utils/postgresql-9.3.4/src/timezone'
> /home/shreesha/platform/tools/bin/make -C common all
> make: Entering an unknown directory
> make: *** common: No such file or directory.  Stop.
> make: Leaving an unknown directory
> make[4]: *** [common-recursive] Error 2

What seems like the most likely bet here is that make has inherited the
setting for SUBDIRS in src/Makefile into the child run in src/timezone/,
and so it's looking for a subdirectory "common" which of course ain't
there.  Why this would not have happened within the previous two
subdirectories isn't real clear, but in any case that would be a bug
in "make".

What make version are you using?  Can you try another one?  If you're
trying a parallel build, maybe don't do that.

regards, tom lane




[HACKERS] make check For Extensions

2014-06-10 Thread David E. Wheeler
Hackers,

Andres said during the unconference last month that there was a way to get
`make check` to work with PGXS. The idea is that it would initialize a
temporary cluster, start it on an open port, install an extension, and run
the extension's test suite.  Via pg_regress --temp-install, maybe? I poked
through the PGXS makefiles, and it looks like there *might* be something
like this for in-core contrib extensions, but not for
externally-distributed extensions.

Is there something I could add to my extension Makefiles so that `make check` 
or `make test` will do a pre-install test on a temporary cluster?

Thanks,

David





Re: [HACKERS] Why is it "JSQuery"?

2014-06-10 Thread David E. Wheeler
On Jun 6, 2014, at 3:50 PM, Josh Berkus  wrote:

> Maybe we should call it "jsonesque"  ;-)

I propose JOQL: JSON Object Query Language.

Best,

David

PS: JAQL sounds better, but [already exists](http://code.google.com/p/jaql/).




[HACKERS] Re: Getting “make: Entering an unknown directory” error while building postgresql on MIPS platform

2014-06-10 Thread shreesha21
bash-4.1$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for x86_64-redhat-linux-gnu

Also, the build was successful on the x86 platform when I tried it on
CentOS.







Re: [HACKERS] Re: Getting “make: Entering an unknown directory” error while building postgresql on MIPS platform

2014-06-10 Thread Tom Lane
shreesha21  writes:
> bash-4.1$ make --version
> GNU Make 3.81
> Copyright (C) 2006  Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.
> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.

> This program built for x86_64-redhat-linux-gnu

Thought you said this was MIPS.

regards, tom lane




Re: [HACKERS] API change advice: Passing plan invalidation info from the rewriter into the planner?

2014-06-10 Thread Tom Lane
Adam Brightwell  writes:
> Through this effort, we have concluded that for RLS the case of
> invalidating a plan is only necessary when switching between a superuser
> and a non-superuser.  Obviously, re-planning on every role change would be
> too costly, but this approach should help minimize that cost.  As well,
> there were not any cases outside of this one that were immediately apparent
> with respect to RLS that would require re-planning on a per userid basis.

Hm ... I'm not following why we'd need a special case for superusers and
not anyone else?  Seems like any useful RLS scheme is going to require
more privilege levels than just superuser and not-superuser.

Could we put the "if superuser then ok" test into the RLS condition test
and thereby not need more than one plan at all?

regards, tom lane




[HACKERS] Re: Getting “make: Entering an unknown directory” error while building postgresql on MIPS platform

2014-06-10 Thread shreesha21
Sorry for the confusion. Let me clarify what I am trying to do here.

I am trying to configure and build postgresql for the MIPS platform, and I
am doing it on a Linux system (Linux dev-shreesharao.arubanetworks.com
2.6.32-431.11.2.el6.x86_64 GNU/Linux), so that the libraries generated from
this build can be ported to the MIPS platform for further use.

The way I am trying to accomplish this is by running the configure script
with some MIPS-specific configure options and exported variables, i.e.:
configure_args: '--target=mips-linux-gnu' '--host=mips-linux-gnu'
'--without-readline' '--without-zlib'
'--prefix=/home/shreesha/platform/tools/porfidio_obj'
'host_alias=mips-linux-gnu' 'target_alias=mips-linux-gnu'
'CC=/home/shreesha/platform/tools/mipscross-xlp/linux/bin/mips64-nlm-linux-gcc
-B/home/shreesha/platform/tools/mipscross-xlp/linux
-B/home/shreesha/platform/tools/mipscross-xlp/linux/lib ' 'LDFLAGS= -mabi=32
-march=xlp'
'CPP=/home/shreesha/platform/tools/mipscross-xlp/linux/bin/mips64-nlm-linux-cpp'








Re: [HACKERS] Re: Getting “make: Entering an unknown directory” error while building postgresql on MIPS platform

2014-06-10 Thread Tom Lane
Perhaps you've got SUBDIRS defined in your environment?

The following example seems to match the behavior you're reporting:

$ cat test1/Makefile
SUBDIRS = foo bar

all:
	make -C test2 all

$ cat test1/test2/Makefile
all:
	echo "SUBDIRS = $(SUBDIRS)"

$ make -C test1 all
make: Entering directory `/home/tgl/test1'
make -C test2 all
make[1]: Entering directory `/home/tgl/test1/test2'
echo "SUBDIRS = "
SUBDIRS = 
make[1]: Leaving directory `/home/tgl/test1/test2'
make: Leaving directory `/home/tgl/test1'

$ SUBDIRS=env make -C test1 all
make: Entering directory `/home/tgl/test1'
make -C test2 all
make[1]: Entering directory `/home/tgl/test1/test2'
echo "SUBDIRS = foo bar"
SUBDIRS = foo bar
make[1]: Leaving directory `/home/tgl/test1/test2'
make: Leaving directory `/home/tgl/test1'

I find this behavior a tad surprising: if the environment variable is
effective at all, it seems like it shouldn't be overridden by a simple
assignment in the parent makefile.  But I dunno if the gmake boys would
think it's a bug.

regards, tom lane




Re: [HACKERS] Why is it "JSQuery"?

2014-06-10 Thread Oleg Bartunov
People,

we have many other tasks than guessing the language name.
jsquery is just an extension, which we invented to test our indexing
stuff.  Eventually, it grew beyond that.  I think we'll think of a
better name if developers agree to have it in core.  For now, jsquery
is good enough for us.

The jsquery name doesn't need to be used at all, by the way.

Oleg

On Tue, Jun 10, 2014 at 10:04 PM, David E. Wheeler
 wrote:
> On Jun 6, 2014, at 3:50 PM, Josh Berkus  wrote:
>
>> Maybe we should call it "jsonesque"  ;-)
>
> I propose JOQL: JSON Object Query Language.
>
> Best,
>
> David
>
> PS: JAQL sounds better, but [already exists](http://code.google.com/p/jaql/).




Re: [HACKERS] Why is it "JSQuery"?

2014-06-10 Thread Oleg Bartunov
The closest problem we have is jsonb statistics (the lack of them,
actually), which prevents use of the full power of jsquery.  I hope
Jan Urbański can work on this.

Oleg

On Tue, Jun 10, 2014 at 11:06 PM, Oleg Bartunov  wrote:
> People,
>
> we have many other tasks than guessing the language name.
> jsquery is just an extension, which we invented to test our indexing
> stuff.  Eventually, it grew beyond that.  I think we'll think of a
> better name if developers agree to have it in core.  For now, jsquery
> is good enough for us.
>
> The jsquery name doesn't need to be used at all, by the way.
>
> Oleg
>
> On Tue, Jun 10, 2014 at 10:04 PM, David E. Wheeler
>  wrote:
>> On Jun 6, 2014, at 3:50 PM, Josh Berkus  wrote:
>>
>>> Maybe we should call it "jsonesque"  ;-)
>>
>> I propose JOQL: JSON Object Query Language.
>>
>> Best,
>>
>> David
>>
>> PS: JAQL sounds better, but [already exists](http://code.google.com/p/jaql/).




Re: [HACKERS] API change advice: Passing plan invalidation info from the rewriter into the planner?

2014-06-10 Thread Brightwell, Adam
Hey Tom,


> Hm ... I'm not following why we'd need a special case for superusers and
> not anyone else?  Seems like any useful RLS scheme is going to require
> more privilege levels than just superuser and not-superuser.
>

As it stands right now, superuser is the only case where RLS policies
should not be applied, i.e. are completely ignored.  I suppose it is
possible to create RLS policies that are related to other privilege
levels, but those would still need to be applied regardless of user id,
except for superusers.  I'll defer to Stephen or Craig on the usefulness
of this scheme.

> Could we put the "if superuser then ok" test into the RLS condition test
> and thereby not need more than one plan at all?
>

As I understand it, the application of RLS policies occurs in the rewriter.
Therefore, when switching back and forth between superuser and
not-superuser, the query must be rewritten, which would ultimately result
in the need for a new plan, correct?  If that is the case, then I am not
sure how one plan is possible.  However, again, I'll have to defer to
Stephen or Craig on this one.

Thanks,
Adam


Re: [HACKERS] Why is it "JSQuery"?

2014-06-10 Thread David E. Wheeler
On Jun 10, 2014, at 12:06 PM, Oleg Bartunov  wrote:

> we have many other tasks than guessing the language name.
> jsquery is just an extension, which we invent to test our indexing
> stuff.  Eventually, it grew out.  I think we'll think on better name
> if developers agree to have it in core. For now, jsquery is good
> enough to us.
> 
> jsquery name doesn't need to be used at all, by the way.

Yeah, I was more on about syntax than the name. We can change that any time 
before you release it.

David





Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-10 Thread Vik Fearing
On 06/09/2014 07:13 PM, Andres Freund wrote:
> On 2014-06-09 13:42:22 +0200, Vik Fearing wrote:
>> On 06/09/2014 09:06 AM, Gavin Flower wrote:
>>> From memory all unique keys can be considered 'candidate primary keys',
>>> but only one can be designated 'the PRIMARY KEY'.
>>
>> Almost.  Candidate keys are also NOT NULL.
> 
> The list actually is a bit longer. They also cannot be partial.

What?  AFAIK, that only applies to an index.  How can the data itself be
partial?

> There's generally also the restriction that for some contexts - like
> e.g. foreign keys - primary/candidate keys may not be deferrable..

Again, what is deferrable data?
-- 
Vik




Re: [HACKERS] API change advice: Passing plan invalidation info from the rewriter into the planner?

2014-06-10 Thread Craig Ringer
On 06/11/2014 02:19 AM, Tom Lane wrote:
> Hm ... I'm not following why we'd need a special case for superusers and
> not anyone else?  Seems like any useful RLS scheme is going to require
> more privilege levels than just superuser and not-superuser.

What it really needs is to invalidate plans when switching between
RLS-enabled and RLS-exempt users, yes. I'm sure we'll want an "RLS
exempt" right or mode sooner rather than later, so I'm against tying
this explicitly to superuser as such.

I wouldn't be surprised to see

SET ROW SECURITY ON|OFF

down the track, with a right controlling whether you can or not. Or at
least, a right that directly exempts a user from row security.

> Could we put the "if superuser then ok" test into the RLS condition test
> and thereby not need more than one plan at all?

Only if we put it in another level of security barrier subquery, because
otherwise the planner might execute the other quals (including possibly
user-defined functions) before the superuser test. Which was the whole
reason for the superuser test in the first place.
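
For concreteness, a minimal sketch of that hazard using the existing
security_barrier view machinery (all object names invented, and
rls_exempt() is hypothetical):

-- Tom's one-plan idea amounts to a qual of roughly this shape:
--   WHERE rls_exempt(current_user) OR owner = current_user
-- The danger is that a cheap user-supplied function gets evaluated first.

CREATE TABLE accounts (owner text, balance int);

CREATE FUNCTION leak(int) RETURNS boolean AS $$
BEGIN
    RAISE NOTICE 'leaked balance: %', $1;
    RETURN true;
END;
$$ LANGUAGE plpgsql COST 0.0000001;

-- A security_barrier view forces its own quals to be evaluated first:
CREATE VIEW my_accounts WITH (security_barrier = true) AS
    SELECT owner, balance FROM accounts
    WHERE owner = current_user::text;

-- leak() now only ever sees rows that already passed the security qual:
SELECT * FROM my_accounts WHERE leak(balance);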



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [GENERAL] Question about partial functional indexes and the query planner

2014-06-10 Thread Tom Lane
Brian Dunavant  writes:
> I am using a partial functional index on a table where F(a) = a.  Querying
> where F(a) = a hits the index as expected.  However the reverse statement a
> = F(a) does not.  I have verified this in 9.3.4.
> Is this a deficiency with the query planner, or are these not actually
> equivalent?  I've been stumped on it.

Interesting.  The reason this doesn't work is that predicate_implied_by()
fails to prove "a = b" given "b = a", at least in the general case.
It will figure that out if either a or b is a constant, but not if
neither one is.  The fact that it works with constants might explain
the lack of previous complaints, but this is still pretty surprising
given the amount of knowledge about equality that the system evinces
otherwise.  (I'm also wondering whether the EquivalenceClass logic
might not sometimes rewrite "a = b" into "b = a", causing a failure
of this sort even when the user *had* phrased his query consistently.)
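
For concreteness, a minimal sketch of the failing case, assuming a
lower()-style partial index (all names invented):

CREATE TABLE t (a text);
CREATE INDEX t_lower_idx ON t (a) WHERE lower(a) = a;

-- Written the same way round as the index predicate: the implication
-- is proven, so the partial index is a candidate.
EXPLAIN SELECT * FROM t WHERE lower(a) = a;

-- Operands commuted: predicate_implied_by() cannot prove "a = b"
-- from "b = a", so the partial index is never considered.
EXPLAIN SELECT * FROM t WHERE a = lower(a);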

It would be fairly straightforward to add a proof rule along the lines of
"if both expressions are binary opclauses, and the left input expression
of each one is equal() to the right input of the other, and the operators
are each other's commutators, then the implication holds".

An objection might be made that this would add noticeably to the cost of
failing proofs, but I think it wouldn't be that bad, especially if we did
the syscache lookup for the commutator check last.  In most scenarios the
equal() checks would fail pretty quickly when the proof rule doesn't
apply.  Also, I believe that in the case where a or b is a constant,
even though we can make the proof already, this approach would succeed
noticeably more quickly than btree_predicate_proof() can.  (Worth noting
also is that predicate_implied_by() isn't even used unless you have
things like partial indexes involved, so that its cost is not especially
relevant to "simple" queries.)

I'd be inclined to add some similar proof rules to predicate_refuted_by,
along the lines of "a op1 b refutes a op2 b if op1 is op2's negator" and
"a op1 b refutes b op2 a if op1 is the negator of op2's commutator".
Again, the code is currently unable to make such deductions unless a or b
is a constant.

Given the lack of previous complaints, I'm not sure this amounts to
a back-patchable bug, but it does seem like something worth fixing
going forward.

regards, tom lane




Re: [HACKERS] API change advice: Passing plan invalidation info from the rewriter into the planner?

2014-06-10 Thread Tom Lane
Craig Ringer  writes:
> On 06/11/2014 02:19 AM, Tom Lane wrote:
>> Could we put the "if superuser then ok" test into the RLS condition test
>> and thereby not need more than one plan at all?

> Only if we put it in another level of security barrier subquery, because
> otherwise the planner might execute the other quals (including possibly
> user-defined functions) before the superuser test. Which was the whole
> reason for the superuser test in the first place.

Is the point of that that the table owner might have put trojan-horse
functions into the RLS qual?  If so, why are we only concerned about
defending the superuser and not other users?  Seems like the right fix
would be to insist that functions in the RLS qual run as the table owner.
Granted, that might be painful to do.  But it still seems like "we only
need to do this for superusers" is designing with blinkers on.

regards, tom lane




Re: [HACKERS] API change advice: Passing plan invalidation info from the rewriter into the planner?

2014-06-10 Thread Craig Ringer
On 06/11/2014 07:24 AM, Tom Lane wrote:
> Is the point of that that the table owner might have put trojan-horse
> functions into the RLS qual?  If so, why are we only concerned about
> defending the superuser and not other users?  Seems like the right fix
> would be to insist that functions in the RLS qual run as the table owner.
> Granted, that might be painful to do.  But it still seems like "we only
> need to do this for superusers" is designing with blinkers on.

I agree, and now that the urgency of trying to deliver this for 9.4 is
over, it's worth seeing if we can just run as the table owner.

Failing that, we could take the approach a certain other RDBMS does and
make the ability to define row security quals a GRANTable right
initially held only by the superuser.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] API change advice: Passing plan invalidation info from the rewriter into the planner?

2014-06-10 Thread Tom Lane
Craig Ringer  writes:
> On 06/11/2014 07:24 AM, Tom Lane wrote:
>> Is the point of that that the table owner might have put trojan-horse
>> functions into the RLS qual?  If so, why are we only concerned about
>> defending the superuser and not other users?  Seems like the right fix
>> would be to insist that functions in the RLS qual run as the table owner.
>> Granted, that might be painful to do.  But it still seems like "we only
>> need to do this for superusers" is designing with blinkers on.

> I agree, and now that the urgency of trying to deliver this for 9.4 is
> over, it's worth seeing if we can just run as the table owner.

> Failing that, we could take the approach a certain other RDBMS does and
> make the ability to define row security quals a GRANTable right
> initially held only by the superuser.

Hmm ... that might be a workable compromise.  I think the main issue here
is whether we expect that RLS quals will be something that the planner
could optimize to any meaningful extent.  If they're always (in effect)
wrapped in SECURITY DEFINER functions, I think that largely blocks any
optimizations; but maybe that wouldn't matter in practice.

regards, tom lane




Re: [HACKERS] Re: why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 1:43 PM, Jeff Janes  wrote:
> On Tue, Jun 10, 2014 at 5:17 AM, Robert Haas  wrote:
>> The problem case is when you have 1 batch and the increased memory
>> consumption causes you to switch to 2 batches.  That's expensive.  It
>> seems clear based on previous testing that *on the average*
>> NTUP_PER_BUCKET = 1 will be better, but in the case where it causes an
>> increase in the number of batches it will be much worse - particularly
>> because the only way we ever increase the number of batches is to
>> double it, which is almost always going to be a huge loss.
>
> Is there a reason we don't do hybrid hashing, where if 80% fits in memory
> then we write out only the 20% that doesn't? And then when probing the table
> with the other input, the 80% that land in in-memory buckets get handled
> immediately, and only the 20% that land in the on-disk buckets get written
> for the next step?

We have an optimization that is a little bit like that.  The "skew"
hash join stuff tries to (essentially) ensure that the MCVs are in the
first batch.

But more could probably be done.  For example, suppose we have 256
buckets.  If the hash table overflows work_mem, we could write the
contents of *one bucket* out to disk, rather than (as we currently do)
half of the table.  If we overflow again, we write another bucket.
When the number of buckets written reaches half the total, we split
all of the remaining buckets so that all 256 slots are once again
active.  Repeat as needed.

If something like that worked out, it would drastically reduce the
penalty for slightly overrunning work_mem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-10 Thread Andres Freund
On 2014-06-11 00:21:58 +0200, Vik Fearing wrote:
> On 06/09/2014 07:13 PM, Andres Freund wrote:
> > On 2014-06-09 13:42:22 +0200, Vik Fearing wrote:
> >> On 06/09/2014 09:06 AM, Gavin Flower wrote:
> >>> From memory all unique keys can be considered 'candidate primary keys',
> >>> but only one can be designated 'the PRIMARY KEY'.
> >>
> >> Almost.  Candidate keys are also NOT NULL.
> > 
> > The list actually is a bit longer. They also cannot be partial.
> 
> What?  AFAIK, that only applies to an index.  How can the data itself be
> partial?

I don't follow? Gavin above talked about unique keys - which in postgres
you can create using CREATE UNIQUE INDEX. And if you make those partial
they can't be used for this purpose.

> > There's generally also the restriction that for some contexts - like
> > e.g. foreign keys - primary/candidate keys may not be deferrable..
> 
> Again, what is deferrable data?

You can define primary/unique constraints to be deferrable. c.f. CREATE
TABLE docs.
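
A minimal sketch of both restrictions (table names invented):

-- Unique, but partial: rows with cancelled = true are not covered, so
-- this index cannot serve as a candidate key.
CREATE TABLE orders (ref text, cancelled boolean);
CREATE UNIQUE INDEX orders_ref_uq ON orders (ref) WHERE NOT cancelled;

-- Deferrable unique constraints cannot back a foreign key:
CREATE TABLE invoices (id int UNIQUE DEFERRABLE INITIALLY DEFERRED);

-- fails with "cannot use a deferrable unique constraint for
-- referenced table":
CREATE TABLE payments (invoice_id int REFERENCES invoices (id));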

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-10 Thread Tom Lane
Andres Freund  writes:
> On 2014-06-11 00:21:58 +0200, Vik Fearing wrote:
>> What?  AFAIK, that only applies to an index.  How can the data itself be
>> partial?

> I don't follow? Gavin above talked about unique keys - which in postgres
> you can create using CREATE UNIQUE INDEX. And if you make those partial
> they can't be used for this purpose.

I have a feeling this conversation is going in the wrong direction.
ISTM that to be useful at all, the set of columns that would be returned
by a clause like this has to be *extremely* predictable; otherwise the
application won't know what to do with the results.  If the app has to
examine the table's metadata to identify what it's getting, what's the
point of the feature at all as opposed to just listing the columns you
want explicitly?  So I doubt that the use-case for anything more
complicated than returning the primary key, full stop, is really there.

I'm not even 100% sold that automatically returning the primary key
is going to save any application logic.  Could somebody point out
*exactly* where an app is going to save effort with this type of
syntax, compared to requesting the columns it wants by name?
Is it going to save enough to justify depending on a syntax that won't
be universal for a long time to come?

regards, tom lane




Re: [HACKERS] Compression of full-page-writes

2014-06-10 Thread Michael Paquier
On Tue, Jun 10, 2014 at 11:49 PM, Rahila Syed  wrote:
> Hello ,
>
>
> In order to facilitate changing of compression algorithms, and to be able to
> recover using WAL records compressed with different compression algorithms,
> information about the compression algorithm can be stored in the WAL record.
>
> The XLOG record header has 2 to 4 padding bytes in order to align the WAL
> record. This space can be used for a new flag in order to store information
> about the compression algorithm used. Like the xl_info field of the
> XLogRecord struct, an 8-bit flag can be constructed, with the lower 4 bits
> used to indicate which of backup blocks 0,1,2,3 is compressed. The higher
> four bits can be used to indicate the state of compression, i.e.
> off, lz4, snappy, or pglz.
>
> The flag can be extended to incorporate more compression algorithms added
> in the future, if any.
>
> What is your opinion on this?
-1 for any additional bytes in WAL records to control such things;
having one single compression algorithm that we know performs well, and
relying on it, makes life easier for both users and developers.
-- 
Michael




Re: [HACKERS] branch for 9.5?

2014-06-10 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> As 9.5CF1 is due to start in about 5 days, it seems we'd better branch
>> pretty soon.  Anyone planning to take care of that?

> I think I've done it the last couple times, and am happy to do so again.

> Anyone want to wait longer?  As Robert says, we need to do it no later
> than the CF start anyway.

Hearing no objections, done.

regards, tom lane




Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-10 Thread Tom Dunstan
On 11 June 2014 10:09, Tom Lane  wrote:

> I'm not even 100% sold that automatically returning the primary key
> is going to save any application logic.  Could somebody point out
> *exactly* where an app is going to save effort with this type of
> syntax, compared to requesting the columns it wants by name?
> Is it going to save enough to justify depending on a syntax that won't
> be universal for a long time to come?
>

Well, in e.g. Hibernate there's a piece of code which calls
getGeneratedKeys() to fetch the inserted primary key (it only supports
fetching a single generated key column in this way) if the underlying
database supports that. The postgresql dialect specifies that it does
support that code path, so at the moment any hibernate users who aren't
explicitly specifying the "sequence" type for their id generation will be
calling that, and the JDBC driver will be appending "RETURNING *" under the
hood for all inserts.
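
In SQL terms, the three shapes under discussion look like this (table
and column names invented; RETURNING PRIMARY KEY is the proposed syntax,
not something that exists today):

-- What the driver effectively issues for getGeneratedKeys() today:
INSERT INTO widgets (name) VALUES ('gizmo') RETURNING *;

-- What it could issue if handed an explicit column list:
INSERT INTO widgets (name) VALUES ('gizmo') RETURNING id;

-- The proposed extension, where the server works out the key columns:
INSERT INTO widgets (name) VALUES ('gizmo') RETURNING PRIMARY KEY;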

Looking at the oracle hibernate dialect is instructive as to the state of
support for the explicit-column-list variation:

// Oracle driver reports to support getGeneratedKeys(), but they only
// support the version taking an array of the names of the columns to
// be returned (via its RETURNING clause).  No other driver seems to
// support this overloaded version.


And so hibernate doesn't support the explicit-column-list version at all
since apparently no-one else supports it, and just marks that code path as
unsupported for oracle. I presume that the situation is similar in other
java-based ORMs.

Looking at some other drivers that I would expect to support
getGeneratedKeys() in a sane way given their identity/auto-increment
semantics reveals:

 - JTDS driver for MSSQL/Sybase piggybacks a second query to do "SELECT
SCOPE_IDENTITY() AS blah" / "SELECT @@IDENTITY AS blah" to fetch the key if
that was requested. It looks like this driver does support specifying the
column name, but it only allows a single column to be given, and it
promptly ignores the passed in value and calls the non-specified version.

 - MySQL driver internally returns a single ID with the query result, and
the driver then appears to add an auto-increment amount to calculate the
rest of the values. I guess MySQL must allocate the ids in
guaranteed-sequential chunks. MySQL only supports a single auto-increment
key. If called with the explicit column version, the passed-in column names
are ignored.

So looks like other JDBC driver/server combos only support this for
single-column primary keys. But for those cases things pretty much work as
expected. It would be nice to be able to support at least primary keys with
this feature.

We could try to teach every ORM out there to call the explicit column-list
version, but given other lack of support for it I doubt they'll be
interested, especially if the reason is because we don't want to add enough
support to make getGeneratedKeys() work efficiently.

FWIW I reckon for most users of ORMs at least it will be enough to support
this for direct inserts to tables - views is a nice-to-have but I'd take
tables-only over not at all.

Cheers

Tom


Re: [HACKERS] Proposing pg_hibernate

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 12:02 PM, Robert Haas  wrote:
> If recovery has been running for a long time, then restoring
> buffers from some save file created before that is probably a bad
> idea, regardless of whether the buffers already loaded were read in by
> recovery itself or by queries running on the system.  But if you're
> saying that doesn't happen, then there's no problem there.

Normally, it won't happen. There's one case I can think of, which has
to coincide with a small window of time for such a thing to happen.

Consider this:
.) A database is shut down, which creates the save-files in
$PGDATA/pg_hibernator/.
.) The database is restarted.
.) BlockReaders begin to read and restore the disk blocks into buffers.
.) Before the BlockReaders could finish*, a copy of the database is
taken (rsync/cp/FS-snapshot/etc.)
This causes the save-files to be present in the copy, because
the BlockReaders haven't deleted them yet.
* (The BlockReaders ideally finish their task in the first few minutes
after the first of them is started.)
.) The copy of the database is used to restore and erect a warm-standby.
.) The warm-standby starts replaying logs from WAL archive/stream.
.) Some time (hours/weeks/months) later, the warm-standby is promoted
to be a master.
.) It starts the Postgres Hibernator, which sees save-files in
$PGDATA/pg_hibernator/ and launches BlockReaders.

At this point, the BlockReaders will restore the blocks that were
present in the original DB's shared-buffers at the time of shutdown. So,
this would fetch blocks into shared-buffers that may be completely
unrelated to the blocks recently operated on by the recovery process.

And it's probably accepted by now that such behaviour is not
catastrophic, merely inconvenient.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-10 Thread Tom Dunstan
>
> Is it going to save enough to justify depending on a syntax that won't
> be universal for a long time to come?
>

Oh, and on the won't-be-universal-for-a-while point - the status quo works
fine, it's just less efficient than it should be. Once someone upgrades to
9.5 or whatever, and upgrades to the matching JDBC driver version, they'll
get the newer efficient call for free.

In the python world PEP249 has a lastrowid property that drivers can
implement, but I don't know how much it's used or our support for it. Any
python devs out there want to chime in? I don't know about other drivers.

Obviously anyone hand-crafting their queries won't be able to do that until
they know it's safe. But that's always the case with new syntax.

Cheers

Tom


Re: [HACKERS] Proposing pg_hibernate

2014-06-10 Thread Gurjeet Singh
On Sun, Jun 8, 2014 at 3:24 AM, Amit Kapila  wrote:
> On Fri, Jun 6, 2014 at 5:31 PM, Gurjeet Singh  wrote:
>> On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila 
>> wrote:
>
>> > Buffer saver process itself can crash while saving or restoring
>> > buffers.
>>
>> True. That may lead to a partial list of buffers being saved. And the
>> code in the Reader process tries hard to read only valid data, and punts
>> at the first sight of data that doesn't make sense, or on an ERROR raised
>> from a Postgres API call.
>
> In spite of the Reader Process trying hard, I think we should ensure by
> some other means that the file saved by the buffer saver is valid (maybe
> first write to a tmp file and then rename it, or something else).

I see no harm in the current approach, since even if the file is partially
written on shutdown, or corrupted due to hardware failure, the worst that
can happen is that the BlockReaders will try to restore, and possibly
succeed in restoring, a wrong block into shared-buffers.

I am okay with your approach of first writing to a temp file, if
others see an advantage in doing this and insist on it.

>> > IIUC on shutdown request, postmaster will send signal to BG Saver
>> > and BG Saver will save the buffers and then postmaster will send
>> > signal to checkpointer to shutdown.  So before writing Checkpoint
>> > record, BG Saver can crash (it might have saved half the buffers)
>>
>> Case handled as described above.
>>
>> > or may BG saver saves buffers, but checkpointer crashes (due to
>> > power outage or any such thing).
>>
>> Checkpointer process' crash seems to be irrelevant to Postgres
>> Hibernator's  workings.
>
> Yeap, but if it crashes before writing checkpoint record, it will lead to
> recovery which is what we are considering.

Good point.

In case of such recovery, the recovery process will read in the blocks
that were recently modified, and were possibly still in shared-buffers
when Checkpointer crashed. So after recovery finishes, the
BlockReaders will be invoked (because save-files were successfully
written before the crash), and they would request the same blocks to
be restored. Most likely, those blocks would already be in
shared-buffers, hence no cause of concern regarding BlockReaders
evicting buffers populated by recovery.

>> I think you are trying to argue the wording in my claim "save-files
>> are created only on clean shutdowons; not on a crash or immediate
>> shutdown", by implying that a crash may occur at any time during and
>> after the BufferSaver processing. I agree the wording can be improved.
>
> Not only the wording, but in your above mail Cases 2 and 1b would need to
> load buffers and perform recovery as well, so we need to decide which
> one to give preference to.

In the cases you mention, 1b and 2, ideally there will be no
save-files because the server either (1b) was still running, or (2)
crashed.

If there were any save-files present during the previous startup (the
one that happened before (1b) hot-backup or (2) crash) of the server,
they would have been removed by the BlockReaders soon after the
startup.

> So If you agree that we should have consideration for recovery data
> along with saved files data, then I think we have below options to
> consider:

I don't think any of the options you mention need any consideration,
because the recovery and buffer-restore processes don't seem to be in
conflict with each other; not enough to be a concern, IMHO.

Thanks and best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] Proposing pg_hibernate

2014-06-10 Thread Amit Kapila
On Wed, Jun 11, 2014 at 7:59 AM, Gurjeet Singh  wrote:
> On Sun, Jun 8, 2014 at 3:24 AM, Amit Kapila wrote:
> >> > IIUC on shutdown request, postmaster will send signal to BG Saver
> >> > and BG Saver will save the buffers and then postmaster will send
> >> > signal to checkpointer to shutdown.  So before writing Checkpoint
> >> > record, BG Saver can crash (it might have saved half the buffers)
> >>
> >> Case handled as described above.
> >>
> >> > or may BG saver saves buffers, but checkpointer crashes (due to
> >> > power outage or any such thing).
> >>
> >> Checkpointer process' crash seems to be irrelevant to Postgres
> >> Hibernator's  workings.
> >
> > Yeap, but if it crashes before writing checkpoint record, it will lead
> > to recovery which is what we are considering.
>
> Good point.
>
> In case of such recovery, the recovery process will read in the blocks
> that were recently modified, and were possibly still in shared-buffers
> when Checkpointer crashed. So after recovery finishes, the
> BlockReaders will be invoked (because save-files were successfully
> written before the crash), and they would request the same blocks to
> be restored. Most likely, those blocks would already be in
> shared-buffers, hence no cause of concern regarding BlockReaders
> evicting buffers populated by recovery.

Not necessarily, because after a crash, recovery has to start from the
previous checkpoint, so it might not perform operations on the same
pages as were saved by the buffer saver.  Also, as the file saved by the
buffer saver can contain only a partial list of the buffers that were
in shared buffers, it becomes more likely that in such cases it can
overwrite the buffers populated by recovery.  And as pg_hibernator
doesn't give any preference to usage_count while saving buffers, it can
also evict the buffers populated by recovery in favour of less-used
pages from the previous run.

> >> I think you are trying to argue the wording in my claim "save-files
> >> are created only on clean shutdowons; not on a crash or immediate
> >> shutdown", by implying that a crash may occur at any time during and
> >> after the BufferSaver processing. I agree the wording can be improved.
> >
> > Not only the wording, but in your above mail Cases 2 and 1b would need to
> > load buffers and perform recovery as well, so we need to decide which
> > one to give preference to.
>
> In the cases you mention, 1b and 2, ideally there will be no
> save-files because the server either (1b) was still running, or (2)
> crashed.
>
> If there were any save-files present during the previous startup (the
> one that happened before (1b) hot-backup or (2) crash) of the server,
> they would have been removed by the BlockReaders soon after the
> startup.

I think the Block Readers will remove the file only after reading it and
populating buffers from it, and that's the reason I mentioned that it can
lead to doing both recovery and loading buffers based on the file saved
by the buffer saver.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] 9.4 release notes

2014-06-10 Thread Noah Misch
I propose the attached, miscellaneous edits.

The limit on tuplesort.c internal sort size is a limit on the number of
tuples, not on memory usage.  Specifically, the cap increased from 44739242
tuples to 2147483647 tuples.  I didn't include those numbers, though.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/doc/src/sgml/release-9.4.sgml b/doc/src/sgml/release-9.4.sgml
index 3f30c63..d9f9828 100644
--- a/doc/src/sgml/release-9.4.sgml
+++ b/doc/src/sgml/release-9.4.sgml
@@ -573,8 +573,9 @@
 
   

-Allow sorting and B-tree index
-builds to use over four gigabytes of memory (Noah Misch)
+Raise hard limit on the number of tuples held in memory during sorting
+and B-tree index builds (Noah
+Misch)

   
 
@@ -615,7 +616,7 @@
   

 Expose the estimation of number of changed tuples since last analyze (Mark Kirkwood)
+linkend="vacuum-for-statistics">ANALYZE (Mark Kirkwood)

 

@@ -838,14 +839,14 @@
 
   

-Have Windows ASCII-encoded databases and server process
-(e.g.  postmaster) emit messages
-in the LC_CTYPE-defined language (Alexander Law,
-Noah Misch)
+Have Windows SQL_ASCII-encoded databases and server
+process (e.g.  postmaster) emit
+messages in the character encoding of the server's Windows user locale
+(Alexander Law, Noah Misch)

 

-Previously these messages were output using the Windows
+Previously these messages were output in the Windows
 ANSI code page.

   
@@ -985,7 +986,7 @@
 
  Allow pg_recvlogical
- to receive data logical decoding data (Andres Freund)
+ to receive logical decoding data (Andres Freund)
 

 
@@ -1445,9 +1446,8 @@
 
   

-Add SQL functions to allow large object reads/writes at
-arbitrary offsets (Pavel Stehule)
+Add SQL functions to allow large
+object reads/writes at arbitrary offsets (Pavel Stehule)

   
 
@@ -2078,7 +2078,7 @@
 
   

-Remove SnapshotNow() and
+Remove SnapshotNow and
 HeapTupleSatisfiesNow() (Robert Haas)

 
@@ -2090,8 +2090,8 @@
 
   

-Add API for memory allocations over four gigabytes
-(Noah Misch)
+Add API for memory allocations over one gigabyte (Noah
+Misch)

   
 
@@ -2229,7 +2229,7 @@
 
   

-Improve valgrind error reporting (Noah Misch)
+Improve Valgrind error reporting (Noah Misch)

   
 



Re: [HACKERS] [bug fix] Memory leak in dblink

2014-06-10 Thread Amit Kapila
On Tue, Jun 10, 2014 at 7:30 PM, Robert Haas  wrote:
>
> On Tue, Jun 10, 2014 at 12:27 AM, Amit Kapila wrote:
> > Is there a need to free memory context in PG_CATCH()?
> > In error path the memory due to this temporary context will get
> > freed as it will delete the transaction context and this
> > temporary context will definitely be in the hierarchy of it, so
> > it should also get deleted.  I think such handling can be
> > useful in case we use PG_CATCH() to suppress the error.
>
> Using PG_CATCH() to suppress an error is pretty much categorically unsafe.

In some cases, like handling exceptions in plpgsql, PG_CATCH()
is used to handle the exception and then take an appropriate action
based on it, so in some such cases it might be right to free
the context memory, depending on the situation.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [bug fix] Memory leak in dblink

2014-06-10 Thread Tom Lane
Amit Kapila  writes:
> On Tue, Jun 10, 2014 at 7:30 PM, Robert Haas  wrote:
>> 
>> On Tue, Jun 10, 2014 at 12:27 AM, Amit Kapila wrote:
>>> Is there a need to free memory context in PG_CATCH()?
>>> In error path the memory due to this temporary context will get
>>> freed as it will delete the transaction context and this
>>> temporary context will definitely be in the hierarchy of it, so
>>> it should also get deleted.  I think such handling can be
>>> useful in case we use PG_CATCH() to suppress the error.

>> Using PG_CATCH() to suppress an error is pretty much categorically unsafe.

> In some cases, like handling exceptions in plpgsql, PG_CATCH()
> is used to handle the exception and then take an appropriate action
> based on it, so in some such cases it might be right to free
> the context memory, depending on the situation.

Robert's point is that the only safe way to suppress an error is to
do a (sub)transaction rollback.  That will take care of cleaning up
appropriate memory contexts, along with much else.  I don't see the
value of adding any single-purpose cleanups when they'd just be
subsumed by the transaction rollback anyhow.
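
At the SQL level, the safe pattern is a plpgsql EXCEPTION clause, which
runs the protected statements inside an implicit subtransaction; a
minimal sketch:

DO $$
BEGIN
    PERFORM 1/0;    -- any statement that throws
EXCEPTION
    WHEN others THEN
        -- by the time control arrives here, the implicit
        -- subtransaction has been rolled back, cleaning up memory
        -- contexts, locks, and the like
        RAISE NOTICE 'suppressed: %', SQLERRM;
END;
$$;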

The reason there's a PG_CATCH block here at all is to clean up libpq
PGresults that the transaction rollback logic won't be aware of.
We could instead have built infrastructure to allow rollback to clean
those up, but it'd be a lot more code, for not a lot of benefit given
the current complexity of dblink.c.  Maybe sometime in the future
it'll be worth doing it that way.

regards, tom lane




Re: [HACKERS] [bug fix] Memory leak in dblink

2014-06-10 Thread Amit Kapila
On Wed, Jun 11, 2014 at 10:46 AM, Tom Lane  wrote:
> Amit Kapila  writes:
> > In some cases like for handling exceptions in plpgsql, PG_CATCH()
> > is used to handle the exception and then take an appropriate action
> > based on exception, so in some such cases it might be right to free
> > the context memory depending on situation.
>
> Robert's point is that the only safe way to suppress an error is to
> do a (sub)transaction rollback.  That will take care of cleaning up
> appropriate memory contexts, along with much else.  I don't see the
> value of adding any single-purpose cleanups when they'd just be
> subsumed by the transaction rollback anyhow.

Agreed, in general there won't be a need for any such special-purpose
cleanups, and that's the main reason I mentioned upthread to remove the
context cleanup from PG_CATCH().  However, for certain special cases
where something wants to allocate memory in a context above the
transaction level, to retain data across a transaction boundary, it
might be needed.  This is just a hypothetical scenario that came to my
mind, and I am not sure if there will be any actual need for such a
thing.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] pg_receivexlog add synchronous mode

2014-06-10 Thread furuyao
> No. IIUC walreceiver does flush *less* frequently than what you
> implemented in pg_receivexlog. Your version of pg_receivexlog tries to
> flush every time it receives one WAL chunk. OTOH, walreceiver
> flushes only when there is no extra WAL chunk in the receive buffer. IOW,
> after writing a WAL chunk, if there is another WAL chunk that walreceiver
> can receive immediately, it postpones the flush until later.
> 
> > However, it seems difficult to apply as same way.
> 
> Why? ISTM that's not so difficult.

I had not understood the behavior of walreceiver well.
While walreceiver writes data, it calls PQconsumeInput(), omitting the select().
It flushes when PQgetCopyData() keeps returning zero.
I have fixed the patch to follow the same process, using a flag.

Regards,

-- 
Furuya Osamu



pg_receivexlog-add-synchronous-mode-v3.patch



Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-10 Thread Hannu Krosing
On 06/10/2014 03:19 AM, Tom Dunstan wrote:
> A definite +1 on this feature. A while ago I got partway through
> hacking the hibernate postgres dialect to make it issue a RETURNING
> clause to spit out the primary key before I realised that the driver
> was already doing a RETURNING * internally.
>
> On 10 June 2014 05:53, Jim Nasby wrote:
>
> > I was wondering that myself. I think it's certainly reasonable to expect
> > someone would want RETURNING SEQUENCE VALUES, which would return the
> > value of every column that owned a sequence (ie: ALTER SEQUENCE ...
> > OWNED BY). ISTM that would certainly handle the performance aspect of
> > this, and it sounds more in line with what I'd expect
> > getGeneratedKeys() to do.
>
> Keep in mind that not all generated keys come from sequences. Many
> people have custom key generator functions, including UUIDs and other
> exotic things like Instagram's setup [1].
>
> RETURNING GENERATED KEYS perhaps, but then how do we determine that?
What about RETURNING CHANGED FIELDS?

Might be quite complicated technically, but this is what is probably wanted.
> Any column that was filled with a default value? But that's
> potentially returning far more values than the user will want - I bet
> 99% of users just want their generated primary key.

Probably not true - you would want your ORM model to be in sync with
what is in the database after you save it, if you plan to do any further
processing using it.

At least I would :)
>
> The spec is a bit vague [2]: 
>
> Retrieves any auto-generated keys created as a result of executing
> this Statement object. If this Statement object did not generate any
> keys, an empty ResultSet object is returned.
>
> Note: If the columns which represent the auto-generated keys were
> not specified, the JDBC driver implementation will determine the
> columns which best represent the auto-generated keys.
>
> The second paragraph refers to [3] and [4] where the application can
> specify which columns it's after. Given that there's a mechanism to
> specify which keys the application wants returned in the driver, and
> the driver in that case can just issue a RETURNING clause with a
> column list, my gut feel would be to just support returning primary
> keys as that will handle most cases of e.g. middleware like ORMs
> fetching that without needing to know the specific column names.
Why not then just leave the whole thing as it is on the server side, and let
the ORM specify which "generated keys" it wants?
>
> Cheers
>
> Tom
>
>
> [1]
> http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
> [2]
> http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#getGeneratedKeys()
> 
> [3] 
> http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#execute(java.lang.String,%20int[])
> 
> [4] 
> http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#execute(java.lang.String,%20java.lang.String[])
> 


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-10 Thread Tom Dunstan
On 10 June 2014 17:49, Hannu Krosing  wrote:

> RETURNING GENERATED KEYS perhaps, but then how do we determine that?
>
>  What about RETURNING CHANGED FIELDS ?
>
> Might be quite complicated technically, but this is what is probably
> wanted.
>

Seems to be getting further away from something that describes the main use
case - changed fields sounds like something that would apply to an update
statement.

>  Any column that was filled with a default value? But that's potentially
> returning far more values than the user will want - I bet 99% of users just
> want their generated primary key.
>
>
> Probably not true - you would want your ORM model to be in sync with what
> is database after you save it if you plan to do any further processing
> using it.
>

Well, yes, but since RETURNING is non-standard most ORMs are unlikely to
support fetching other generated values that way anyway. The ones that I've
dealt with will do an insert, then a select to get the extra fields. I
don't know if other JDBC drivers allow applications to just specify any old
list of non-key columns to the execute method, but I suspect not, given
that the way they fetch those columns is rather less general-purpose than
our RETURNING syntax.


>
> The second paragraph refers to [3] and [4] where the application can
> specify which columns it's after. Given that there's a mechanism to specify
> which keys the application wants returned in the driver, and the driver in
> that case can just issue a RETURNING clause with a column list, my gut feel
> would be to just support returning primary keys as that will handle most
> cases of e.g. middleware like ORMs fetching that without needing to know
> the specific column names.
>
> Why not then just leave the whole thing as it is on server side, and let
> the ORM specify which "generated keys" it wants ?
>

Because java-based ORMs (at least) mostly don't have to - other
server/driver combos manage to implement getGeneratedKeys() without being
explicitly given a column list; they just do the sane thing and return the
appropriate identity column or whatever for the inserted row.

I agree that in hand-crafted JDBC there's no particular problem in making a
user specify a column list (although I don't think I've EVER seen anyone
actually do that in the wild), but most middleware will expect
getGeneratedKeys() to just work, and we should try to do something about
making that case work a bit more efficiently than it does now.

Cheers

Tom


Re: [HACKERS] pg_xlogdump --stats

2014-06-10 Thread furuyao
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Abhijit
> Menon-Sen
> Sent: Wednesday, June 04, 2014 7:47 PM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] pg_xlogdump --stats
> 
> Hi.
> 
> Here's a patch to make pg_xlogdump print summary statistics instead of
> individual records.
> 
The function works fine, and it is good for learning about PostgreSQL.
But after applying the patch, warnings come out during make.
Although I think they're no particular problem, I think it would be
better to fix them.


$ make
...
pg_xlogdump.c: In function ‘XLogDumpStatsRow’:
pg_xlogdump.c:851: warning: format ‘%20llu’ expects type ‘long long unsigned 
int’, but argument 3 has type ‘uint64’
pg_xlogdump.c:851: warning: format ‘%20llu’ expects type ‘long long unsigned 
int’, but argument 5 has type ‘uint64’
pg_xlogdump.c:851: warning: format ‘%20llu’ expects type ‘long long unsigned 
int’, but argument 7 has type ‘uint64’
pg_xlogdump.c:851: warning: format ‘%20llu’ expects type ‘long long unsigned 
int’, but argument 9 has type ‘uint64’
pg_xlogdump.c: In function ‘XLogDumpDisplayStats’:
pg_xlogdump.c:948: warning: format ‘%20llu’ expects type ‘long long unsigned 
int’, but argument 3 has type ‘uint64’
pg_xlogdump.c:948: warning: format ‘%20llu’ expects type ‘long long unsigned 
int’, but argument 5 has type ‘uint64’
pg_xlogdump.c:948: warning: format ‘%20llu’ expects type ‘long long unsigned 
int’, but argument 7 has type ‘uint64’
pg_xlogdump.c:948: warning: format ‘%20llu’ expects type ‘long long unsigned 
int’, but argument 9 has type ‘uint64’
...

My environment:
RHEL 6.4
gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)

Regards,

-- 
Furuya Osamu




Re: [HACKERS] [bug fix] Memory leak in dblink

2014-06-10 Thread MauMau

From: "Amit Kapila" 

Is there a need to free memory context in PG_CATCH()?
In error path the memory due to this temporary context will get
freed as it will delete the transaction context and this
temporary context will definitely be in the hierarchy of it, so
it should also get deleted.  I think such handling can be
useful in case we use PG_CATCH() to suppress the error.


I thought the same, but I also felt that I should make an effort to release
resources as soon as possible, considering memory context auto-deletion
as a last resort.  However, looking at other places where PG_CATCH() is
used, the memory context is not deleted.  So, I removed the modification
from the PG_CATCH() block.  Thanks.


Regards
MauMau



dblink_memleak_v2.patch



Re: [HACKERS] "RETURNING PRIMARY KEY" syntax extension

2014-06-10 Thread Hannu Krosing
On 06/10/2014 11:02 AM, Tom Dunstan wrote:
> On 10 June 2014 17:49, Hannu Krosing wrote:
>
> RETURNING GENERATED KEYS perhaps, but then how do we determine that?
> What about RETURNING CHANGED FIELDS ?
>
> Might be quite complicated technically, but this is what is
> probably wanted.
>
>
> Seems to be getting further away from something that describes the
> main use
> case - changed fields sounds like something that would apply to an
> update statement.
Not really - it applies to both INSERT and UPDATE if there are any
triggers and/or default values.

The use-case is an extended version of getting the key, with the main
aim of making sure that your ORM model is the same as what is saved in
the database.
>
>> Any column that was filled with a default value? But that's
>> potentially returning far more values than the user will want - I
>> bet 99% of users just want their generated primary key.
>
> Probably not true - you would want your ORM model to be in sync
> with what is database after you save it if you plan to do any
> further processing using it.
>
>
> Well, yes, but since RETURNING is non-standard most ORMs are unlikely
> to support fetching other generated values that way anyway. The ones
> that I've dealt with will do an insert, then a select to get the extra
> fields. I don't know if other JDBC drivers allow applications to just
> specify any old list of non-key columns to the execute method, but I
> suspect not, given that the way they fetch those columns is rather
> less general-purpose than our RETURNING syntax.
>  
>
>
>> The second paragraph refers to [3] and [4] where the application
>> can specify which columns it's after. Given that there's a
>> mechanism to specify which keys the application wants returned in
>> the driver, and the driver in that case can just issue a
>> RETURNING clause with a column list, my gut feel would be to just
>> support returning primary keys as that will handle most cases of
>> e.g. middleware like ORMs fetching that without needing to know
>> the specific column names.
> Why not then just leave the whole thing as it is on server side,
> and let the ORM specify which "generated keys" it wants ?
>
>
> Because java-based ORMs (at least) mostly don't have to - other
> server/driver combos manage to implement getGeneratedKeys() without
> being explicitly given a column list, they just do the sane thing and
> return the appropriate identity column or whatever for the inserted row.
>  
> I agree that in hand-crafted JDBC there's no particular problem in
> making a user specify a column list, (although I don't think I've EVER
> seen anyone actually do that in the wild), but most middleware will
> expect getGeneratedKeys() to just work and we should try to do
> something about making that case work a bit more efficiently than it
> does now.
But doesn't the ORM already "know" the names of the auto-generated keys,
and thus couldn't it easily substitute them for * in RETURNING?
>
> Cheers
>
> Tom


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Mon, Sep 19, 2011 at 10:18 AM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> On sön, 2011-09-18 at 12:21 -0400, Tom Lane wrote:
>>> But having said that, it wouldn't be very hard to arrange things so that
>>> if you did have both symbols defined, the code would only attempt to
>>> write oom_adj if it had failed to write oom_score_adj; which is about as
>>> close as you're likely to get to a kernel version test for this.
>
>> Why is this feature not a run-time configuration variable or at least a
>> configure option?  It's awfully well hidden now.  I doubt a lot of
>> people are using this even though they might wish to.
>
> See the thread in which the feature was designed originally:
> http://archives.postgresql.org/pgsql-hackers/2010-01/msg00170.php
>
> The key point is that to get useful behavior, you need cooperation
> between both a root-privileged startup script and the PG executable.
> That tends to throw the problem into the domain of packagers, more
> than end users, and definitely puts a big crimp in the idea that
> run-time configuration of just half of the behavior could be helpful.
> So far, no Linux packagers have complained that the design is inadequate
> (a position that I also hold when wearing my red fedora) so I do not
> feel a need to complicate it further.

Startup scripts are not solely in the domain of packagers. End users
can also be expected to develop/edit their own startup scripts.

Providing it as a GUC would have given end users both of the pieces, but
with a compile-time option they have only one half of the solution,
unless they compile their own binaries, which forces them into
being packagers.

I am not alone in feeling that if Postgres wishes to provide control
over the child backends' oom_score_adj, it should be a GUC parameter
rather than a compile-time option. Yesterday a customer wanted to
leverage this and couldn't, because they refuse to maintain their own
fork of the Postgres code.

Please find attached a patch to turn it into a GUC, #ifdef'd by __linux__ macro.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com
diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c
index f6df2de..7b9bc38 100644
--- a/src/backend/postmaster/fork_process.c
+++ b/src/backend/postmaster/fork_process.c
@@ -22,6 +22,11 @@
 #endif
 
 #ifndef WIN32
+
+#ifdef __linux__
+int linux_oom_score_adj = 0;
+#endif
+
 /*
  * Wrapper for fork(). Return values are the same as those for fork():
  * -1 if the fork failed, 0 in the child process, and the PID of the
@@ -78,7 +83,7 @@ fork_process(void)
 		 * LINUX_OOM_SCORE_ADJ #defined to 0, or to some other value that you
 		 * want child processes to adopt here.
 		 */
-#ifdef LINUX_OOM_SCORE_ADJ
+#ifdef __linux__
 		{
 			/*
 			 * Use open() not stdio, to ensure we control the open flags. Some
@@ -92,13 +97,12 @@ fork_process(void)
 char		buf[16];
 int			rc;
 
-snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_SCORE_ADJ);
+snprintf(buf, sizeof(buf), "%d\n", linux_oom_score_adj);
 rc = write(fd, buf, strlen(buf));
 (void) rc;
 close(fd);
 			}
 		}
-#endif   /* LINUX_OOM_SCORE_ADJ */
 
 		/*
 		 * Older Linux kernels have oom_adj not oom_score_adj.  This works
@@ -106,7 +110,6 @@ fork_process(void)
 		 * it's necessary to build Postgres to work with either API, you can
 		 * define both LINUX_OOM_SCORE_ADJ and LINUX_OOM_ADJ.
 		 */
-#ifdef LINUX_OOM_ADJ
 		{
 			/*
 			 * Use open() not stdio, to ensure we control the open flags. Some
@@ -120,13 +123,13 @@ fork_process(void)
 char		buf[16];
 int			rc;
 
-snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
+snprintf(buf, sizeof(buf), "%d\n", linux_oom_score_adj);
 rc = write(fd, buf, strlen(buf));
 (void) rc;
 close(fd);
 			}
 		}
-#endif   /* LINUX_OOM_ADJ */
+#endif   /* __linux__ */
 
 		/*
 		 * Make sure processes do not share OpenSSL randomness state.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1d094f0..8713abb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -128,6 +128,7 @@ extern bool synchronize_seqscans;
 extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
 extern bool SSLPreferServerCiphers;
+extern int	linux_oom_score_adj;
 
 #ifdef TRACE_SORT
 extern bool trace_sort;
@@ -2554,6 +2555,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+#ifdef __linux__
+	{
+		{"linux_oom_score_adj", PGC_POSTMASTER, RESOURCES_KERNEL,
+			gettext_noop("Sets the oom_score_adj parameter of postmaster's child processes."),
+			NULL,
+		},
+		&linux_oom_score_adj,
+		0, -1000, 1000,
+		NULL, NULL, NULL
+	},
+#endif /* __linux__ */
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL



Re: [HACKERS] Re: why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 1:13 AM, b8flowerfire  wrote:
> Thanks for the explanation. But I don't think it is very convincing.
> Simply reducing the value of NTUP_PER_BUCKET will enlarge the pointer array
> and reduce the tuples in one batch. But is that effect significant to the
> performance?
> The utilization of work_mem, I think, is determined by the ratio of the size
> of the pointer to the size of the tuple.
> Let's assume the size of a tuple is 28 bytes, which is very reasonable
> because it's the sum of the size of HJTUPLE_OVERHEAD (at least 8 bytes),
> the size of MinimalTupleData (at least 10 bytes) and the content of a tuple
> (assume 10 bytes). And the size of a pointer is 4 bytes.

The size of a pointer is 8 bytes on most platforms these days.  On the
flip side, we shouldn't forget that each tuple has a 2-pointer, thus
16-byte, overhead due to the way AllocSetAlloc works, and that before
adding that we will round up to the nearest power of two when
allocating.  So in fact, in your example, each tuple will require 48
bytes on a 64-bit platform, and each pointer will require 8.  So if
I'm calculating correctly, the memory allocation for the pointers
would be about 1.6% of the total with NTUP_PER_BUCKET = 10 and
about 14.3% of the total with NTUP_PER_BUCKET = 1.
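
(Those figures are just 8/(10*48+8) and 8/(1*48+8); a throwaway query to
check the arithmetic:)

SELECT round(100 * 8.0 / (10 * 48 + 8), 1),   -- 1.6
       round(100 * 8.0 / (1 * 48 + 8), 1);    -- 14.3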

> As a result, changing the value of NTUP_PER_BUCKET to 1 may increase the
> number of batches by only about 10%. So is that enough to affect the
> performance? Or maybe I cannot do the calculation simply in this way.

The problem case is when you have 1 batch and the increased memory
consumption causes you to switch to 2 batches.  That's expensive.  It
seems clear based on previous testing that *on the average*
NTUP_PER_BUCKET = 1 will be better, but in the case where it causes an
increase in the number of batches it will be much worse - particularly
because the only way we ever increase the number of batches is to
double it, which is almost always going to be a huge loss.

> Besides, we have larger main memory now. If we set work_mem larger, the
> more-batches effect introduced by the smaller NTUP_PER_BUCKET value may be
> reduced, couldn't it?

If work_mem is large enough that we're going to do a single batch
either way, or the same number of batches either way, then we can
reduce NTUP_PER_BUCKET and it should be a clear win.

> I have read the discussion about NTUP_PER_BUCKET before. It seems that
> if we change NTUP_PER_BUCKET to 50 or even larger, the performance wouldn't
> be much worse, because every tuple in the chain of a bucket has a hash
> value, so having more tuples in a bucket simply adds some comparisons of
> two integers. So is it the same if we make it smaller, in that we could not
> get much better? Is that one of the reasons we define it as 10?

I'm not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] updated emacs configuration

2014-06-10 Thread Noah Misch
On Mon, Jun 09, 2014 at 09:04:02PM -0400, Peter Eisentraut wrote:
> On Sun, 2014-06-08 at 21:55 -0400, Noah Misch wrote:
> > After upgrading to GNU Emacs 23.4.1 from a version predating directory-local
> > variables, I saw switch/case indentation go on the fritz.  My hooks were
> > issuing (c-set-style "postgresql"), but ".dir-locals.el" set it back to 
> > plain
> > "bsd" style.
> 
> I'd consider just getting rid of the
> 
> (c-file-style . "bsd")
> 
> setting in .dir-locals.el and put the actual interesting settings from
> the style in there.
> 
> Another alternative is to change emacs.samples to use
> hack-local-variables-hook instead, as described here:
> http://www.emacswiki.org/emacs/LocalVariables

Those are reasonable alternatives.  The ultimate effect looks similar for all
three options, to the extent that I don't see a basis for forming a strong
preference.  Do you have a recommendation?

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Robert Haas
On Mon, Jun 9, 2014 at 11:09 AM, Tom Lane  wrote:
> Keep in mind that that standard advice is meant for all-in-memory cases,
> not for cases where the alternative to running with longer hash chains
> is dumping tuples out to disk and reading them back.

Sure, but that doesn't help someone who sets work_mem to some very
large value precisely to ensure that the hash join will be done in
memory.  They still don't get the benefit of a smaller NTUP_PER_BUCKET
setting.

> I'm quite prepared to believe that we should change NTUP_PER_BUCKET ...
> but appealing to standard advice isn't a good basis for arguing that.
> Actual performance measurements (in both batched and unbatched cases)
> would be a suitable basis for proposing a change.

Well, it's all in what scenario you test, right?  If you test the case
where something overflows work_mem as a result of the increased size
of the bucket array, it's always going to suck.  And if you test the
case where that doesn't happen, it's likely to win.  I think Stephen
Frost has already done quite a bit of testing in this area, on
previous threads.  But there's no one-size-fits-all solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [bug fix] Memory leak in dblink

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 12:27 AM, Amit Kapila  wrote:
> Is there a need to free memory context in PG_CATCH()?
> In error path the memory due to this temporary context will get
> freed as it will delete the transaction context and this
> temporary context will definitely be in the hierarchy of it, so
> it should also get deleted.  I think such handling can be
> useful in case we use PG_CATCH() to suppress the error.

Using PG_CATCH() to suppress an error is pretty much categorically unsafe.
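
To illustrate (a hedged sketch of the safe pattern, modeled on backend
conventions rather than the actual dblink patch): clean up in PG_CATCH()
and then re-throw, never swallow:

    MemoryContext tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
                                                 "dblink temp context",
                                                 ALLOCSET_DEFAULT_MINSIZE,
                                                 ALLOCSET_DEFAULT_INITSIZE,
                                                 ALLOCSET_DEFAULT_MAXSIZE);

    PG_TRY();
    {
        /* ... work that allocates in tmpcxt ... */
    }
    PG_CATCH();
    {
        /* freeing here is optional: transaction abort would clean it up */
        MemoryContextDelete(tmpcxt);
        PG_RE_THROW();          /* the error must still propagate */
    }
    PG_END_TRY();
    MemoryContextDelete(tmpcxt);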

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Gurjeet Singh  writes:
> Startup scripts are not solely in the domain of packagers. End users
> can also be expected to develop/edit their own startup scripts.

> Providing it as a GUC would have given end users both the pieces, but
> with a compile-time option they have only one half of the solution;
> except if they go compile their own binaries, which forces them into
> being packagers.

I don't find that this argument holds any water at all.  Anyone who's
developing their own start script can be expected to manage recompiling
Postgres.

Extra GUCs do not have zero cost, especially not ones that are as
complicated-to-explain as this would be.

I would also argue that there's a security issue with making it a GUC.
A non-root DBA should not have the authority to decide whether or not
postmaster child processes run with nonzero OOM adjustment; that decision
properly belongs to whoever has authorized the root-owned startup script
to change the adjustment in the first place.  So seeing this as two
independent pieces is not only wrong but dangerous.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "cancelling statement due to user request error" occurs but the transaction has committed.

2014-06-10 Thread Robert Haas
On Sun, Jun 8, 2014 at 2:52 AM, Amit Kapila  wrote:
>> I think this cancellation request must not interrupt the internally
>> committed transaction.
>>
>> This is because clients may misunderstand "the transaction has
>> been rolled back".
>
> There can be a similar observation if the server goes down (power
> outage or the like) after committing a transaction: the client will
> see the connection broken, so it can misunderstand that as well.
> I think for such corner cases, the client needs to reconfirm its
> action's results with the database before concluding anything.

I don't agree with this analysis.  If the connection is closed after
the client sends a COMMIT and before it gets a response, then the
client must indeed be smart enough to figure out whether or not the
commit happened.  But if the server sends a response, the client
should be able to rely on that response being correct.  In this case,
an ERROR is getting sent but the transaction is getting committed;
yuck.  I'm not sure whether the fix is right, but this definitely
seems like a bug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "cancelling statement due to user request error" occurs but the transaction has committed.

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> I don't agree with this analysis.  If the connection is closed after
> the client sends a COMMIT and before it gets a response, then the
> client must indeed be smart enough to figure out whether or not the
> commit happened.  But if the server sends a response, the client
> should be able to rely on that response being correct.  In this case,
> an ERROR is getting sent but the transaction is getting committed;
> yuck.  I'm not sure whether the fix is right, but this definitely
> seems like a bug.

In general, the only way to avoid that sort of behavior for a post-commit
error would be to PANIC ... and even then, the transaction got committed,
which might not be the expectation of a client that got an error message,
even if it said PANIC.  So this whole area is a minefield, and the only
attractive thing we can do is to try to reduce the number of errors that
can get thrown post-commit.  We already, for example, do not treat
post-commit file unlink failures as ERROR, though we surely would prefer
to do that.

So from this standpoint, redefining SIGINT as not throwing an error when
we're in post-commit seems like a good idea.  I'm not endorsing any
details of the patch here, but the 2-foot view seems generally sound.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "cancelling statement due to user request error" occurs but the transaction has committed.

2014-06-10 Thread Kevin Grittner
Robert Haas  wrote:

> If the connection is closed after the client sends a COMMIT and
> before it gets a response, then the client must indeed be smart
> enough to figure out whether or not the commit happened.  But if
> the server sends a response, the client should be able to rely on
> that response being correct.  In this case, an ERROR is getting
> sent but the transaction is getting committed; yuck.  I'm not
> sure whether the fix is right, but this definitely seems like a
> bug.

+1

It is one thing to send a request and experience a crash or loss of
connection before a response is delivered.  You have to consider
that the state of the transaction is indeterminate and needs to be
checked.  But if the client receives a response saying that the
commit was successful, or that the transaction was rolled back,
that had better reflect reality; otherwise it is a clear bug.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane  wrote:
> Gurjeet Singh  writes:
>> Startup scripts are not solely in the domain of packagers. End users
>> can also be expected to develop/edit their own startup scripts.
>
>> Providing it as a GUC would have given end users both the pieces, but
>> with a compile-time option they have only one half of the solution;
>> except if they go compile their own binaries, which forces them into
>> being packagers.
>
> I don't find that this argument holds any water at all.  Anyone who's
> developing their own start script can be expected to manage recompiling
> Postgres.

I respectfully disagree. Writing and managing init/start scripts
requires a much different set of expertise than compiling and managing
builds of critical software like a database product.

I would consider adding `echo -1000 > /proc/self/oom_score_adj` to a
start script as a deployment specific tweak, and not 'developing own
start script'.

> Extra GUCs do not have zero cost, especially not ones that are as
> complicated-to-explain as this would be.
>
> I would also argue that there's a security issue with making it a GUC.
> A non-root DBA should not have the authority to decide whether or not
> postmaster child processes run with nonzero OOM adjustment; that decision
> properly belongs to whoever has authorized the root-owned startup script
> to change the adjustment in the first place.  So seeing this as two
> independent pieces is not only wrong but dangerous.

From experiments last night, I see that a child process's oom_score_adj
is capped by the setting of the parent process that forks it. So I don't
think it's a security risk from that perspective.

I'll retest and provide the exact findings.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane  wrote:
> Gurjeet Singh  writes:
>> Startup scripts are not solely in the domain of packagers. End users
>> can also be expected to develop/edit their own startup scripts.
>
>> Providing it as a GUC would have given end users both the pieces, but
>> with a compile-time option they have only one half of the solution;
>> except if they go compile their own binaries, which forces them into
>> being packagers.
>
> I don't find that this argument holds any water at all.  Anyone who's
> developing their own start script can be expected to manage recompiling
> Postgres.

Huh?  Lots of people install PostgreSQL via, say, RPMs, but may still
want to change their startup script locally.  I don't think those
users should have to give up the benefits of RPM packaging because
they want a different oom_adj value.  Then they have to be responsible
for updating the packages every time there's a new minor release,
instead of just typing 'yum update'.  That's a LOT of extra work.

> Extra GUCs do not have zero cost, especially not ones that are as
> complicated-to-explain as this would be.

NOT having them isn't free either.

> I would also argue that there's a security issue with making it a GUC.
> A non-root DBA should not have the authority to decide whether or not
> postmaster child processes run with nonzero OOM adjustment; that decision
> properly belongs to whoever has authorized the root-owned startup script
> to change the adjustment in the first place.  So seeing this as two
> independent pieces is not only wrong but dangerous.

I think the only possible issue is if the DBA doesn't even have shell
access.  If he doesn't have root but does have shell access, he could
have recompiled anyway - it's just more work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "cancelling statement due to user request error" occurs but the transaction has committed.

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 10:18 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> I don't agree with this analysis.  If the connection is closed after
>> the client sends a COMMIT and before it gets a response, then the
>> client must indeed be smart enough to figure out whether or not the
>> commit happened.  But if the server sends a response, the client
>> should be able to rely on that response being correct.  In this case,
>> an ERROR is getting sent but the transaction is getting committed;
>> yuck.  I'm not sure whether the fix is right, but this definitely
>> seems like a bug.
>
> In general, the only way to avoid that sort of behavior for a post-commit
> error would be to PANIC ... and even then, the transaction got committed,
> which might not be the expectation of a client that got an error message,
> even if it said PANIC.  So this whole area is a minefield, and the only
> attractive thing we can do is to try to reduce the number of errors that
> can get thrown post-commit.  We already, for example, do not treat
> post-commit file unlink failures as ERROR, though we surely would prefer
> to do that.

We could treat it as a lost-communication scenario.  The appropriate
recovery actions from the client's point of view are identical.

> So from this standpoint, redefining SIGINT as not throwing an error when
> we're in post-commit seems like a good idea.  I'm not endorsing any
> details of the patch here, but the 2-foot view seems generally sound.

Cool, that makes sense to me also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Andres Freund
On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote:
> Providing it as a GUC would have given end users both the pieces, but
> with a compile-time option they have only one half of the solution;
> except if they go compile their own binaries, which forces them into
> being packagers.
> 
> I am not alone in feeling that if Postgres wishes to provide a control
> over child backend's oom_score_adj, it should be a GUC parameter
> rather than a compile-time option. Yesterday a customer wanted to
> leverage this and couldn't because they refuse to maintain their own
> fork of Postgres code.

Independent of the rest of the discussion, I think there's one more
point: Trying to keep your system stable by *increasing* the priority of
normal backends is a bad idea. If your system gets into OOM land you need
to fix that, not whack around who gets killed.
The reason it makes sense to increase the priority of the postmaster is
that that *does* increase the stability by cleaning up resources and
restarting everything.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] updated emacs configuration

2014-06-10 Thread Tom Lane
Noah Misch  writes:
> On Mon, Jun 09, 2014 at 09:04:02PM -0400, Peter Eisentraut wrote:
>> On Sun, 2014-06-08 at 21:55 -0400, Noah Misch wrote:
>>> After upgrading to GNU Emacs 23.4.1 from a version predating directory-local
>>> variables, I saw switch/case indentation go on the fritz.  My hooks were
>>> issuing (c-set-style "postgresql"), but ".dir-locals.el" set it back to
>>> plain "bsd" style.

>> I'd consider just getting rid of the
>> 
>> (c-file-style . "bsd")
>> 
>> setting in .dir-locals.el and put the actual interesting settings from
>> the style in there.
>> 
>> Another alternative is to change emacs.samples to use
>> hack-local-variables-hook instead, as described here:
>> http://www.emacswiki.org/emacs/LocalVariables

> Those are reasonable alternatives.  The ultimate effect looks similar for all
> three options, to the extent that I don't see a basis for forming a strong
> preference.  Do you have a recommendation?

The more Ludd^H^H^Hcautious among us run with
(setq enable-local-variables nil)
and would rather not see the configuration recommendations overcomplicated
due to an attempt to play nice with directory-local settings we're ignoring
anyway.  So I'd vote for Peter's first option, ie, kluge up .dir-locals.el
not the configuration code.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 9, 2014 at 11:09 AM, Tom Lane  wrote:
>> I'm quite prepared to believe that we should change NTUP_PER_BUCKET ...
>> but appealing to standard advice isn't a good basis for arguing that.
>> Actual performance measurements (in both batched and unbatched cases)
>> would be a suitable basis for proposing a change.

> Well, it's all in what scenario you test, right?  If you test the case
> where something overflows work_mem as a result of the increased size
> of the bucket array, it's always going to suck.  And if you test the
> case where that doesn't happen, it's likely to win.  I think Stephen
> Frost has already done quite a bit of testing in this area, on
> previous threads.  But there's no one-size-fits-all solution.

I don't really recall any hard numbers being provided.  I think if we
looked at some results that said "here's the average gain, and here's
the worst-case loss, and here's an estimate of how often you'd hit
the worst case", then we could make a decision.

However, I notice that it's already the case that we make a
to-batch-or-not-to-batch decision on the strength of some very crude
numbers during ExecChooseHashTableSize, and we explicitly don't consider
palloc overhead there.  It would certainly be easy enough to use two
different NTUP_PER_BUCKET target load factors depending on which path
is being taken in ExecChooseHashTableSize.  So maybe part of the answer is
to not require those numbers to be the same.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 10:27 AM, Tom Lane  wrote:
>> Well, it's all in what scenario you test, right?  If you test the case
>> where something overflows work_mem as a result of the increased size
>> of the bucket array, it's always going to suck.  And if you test the
>> case where that doesn't happen, it's likely to win.  I think Stephen
>> Frost has already done quite a bit of testing in this area, on
>> previous threads.  But there's no one-size-fits-all solution.
>
> I don't really recall any hard numbers being provided.  I think if we
> looked at some results that said "here's the average gain, and here's
> the worst-case loss, and here's an estimate of how often you'd hit
> the worst case", then we could make a decision.

The worst case loss is that you have to rescan the entire inner
relation, so it's pretty darned bad.  I'm not sure how to construct an
optimal worst case for that being monumentally expensive, but making
the inner relation gigantic is probably a good start.

> However, I notice that it's already the case that we make a
> to-batch-or-not-to-batch decision on the strength of some very crude
> numbers during ExecChooseHashTableSize, and we explicitly don't consider
> palloc overhead there.  It would certainly be easy enough to use two
> different NTUP_PER_BUCKET target load factors depending on which path
> is being taken in ExecChooseHashTableSize.  So maybe part of the answer is
> to not require those numbers to be the same.

If we could allow NTUP_PER_BUCKET to drop when the hashtable is
expected to fit in memory either way, perhaps with some safety margin
(e.g. we expect to use less than 75% of work_mem), I bet that would
make the people who have been complaining about this issue happy.  And
probably a few people who haven't been complaining, too: I don't
recall the precise performance numbers that were posted before, but
ISTR the difference between 10 and 1 was pretty dramatic.
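
Something along these lines, perhaps (a hypothetical sketch of that
heuristic, not actual ExecChooseHashTableSize code; the 75% margin and
all names here are made up):

    /*
     * Pick a denser hash table only when the single-batch table still
     * fits comfortably under work_mem at the denser setting.
     */
    static int
    choose_ntup_per_bucket(double inner_rel_bytes, double ntuples,
                           long work_mem_bytes)
    {
        const int   ptr_size = 8;   /* bucket pointer on 64-bit */
        double      dense_bytes;

        /* estimated total size with one bucket pointer per tuple */
        dense_bytes = inner_rel_bytes + ntuples * ptr_size;

        if (dense_bytes < 0.75 * work_mem_bytes)
            return 1;       /* plenty of room: optimize for probe speed */
        return 10;          /* memory is tight: avoid extra batches */
    }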

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "cancelling statement due to user request error" occurs but the transaction has committed.

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 10, 2014 at 10:18 AM, Tom Lane  wrote:
>> ...  So this whole area is a minefield, and the only
>> attractive thing we can do is to try to reduce the number of errors that
>> can get thrown post-commit.  We already, for example, do not treat
>> post-commit file unlink failures as ERROR, though we surely would prefer
>> to do that.

> We could treat it as a lost-communication scenario.  The appropriate
> recovery actions from the client's point of view are identical.

I'd hardly rate that as an attractive option.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 10:32 AM, Andres Freund  wrote:
> On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote:
>> Providing it as a GUC would have given end users both the pieces, but
>> with a compile-time option they have only one half of the solution;
>> except if they go compile their own binaries, which forces them into
>> being packagers.
>>
>> I am not alone in feeling that if Postgres wishes to provide a control
>> over child backend's oom_score_adj, it should be a GUC parameter
>> rather than a compile-time option. Yesterday a customer wanted to
>> leverage this and couldn't because they refuse to maintain their own
>> fork of Postgres code.
>
> Independent of the rest of the discussion, I think there's one more
> point: Trying to keep your system stable by *increasing* the priority of
> normal backends is a bad idea. If your system gets into OOM land you need
> to fix that, not whack around who gets killed.
> The reason it makes sense to increase the priority of the postmaster is
> that that *does* increase the stability by cleaning up resources and
> restarting everything.

But the priority is inherited by child processes, so to decrease the
priority of JUST the postmaster you need to decrease its priority and
then set the priority of each child back to the original value.
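
For reference, a minimal sketch of that post-fork reset (patterned on the
compile-time code in fork_process.c; the exact value and error handling
here are illustrative only):

    #include <stdio.h>

    /*
     * Run in the child right after fork(): the postmaster keeps the
     * exemption granted by the root-owned start script, while the
     * child is put back at the given OOM score.
     */
    static void
    reset_oom_score_adj(int value)
    {
        FILE   *fp = fopen("/proc/self/oom_score_adj", "w");

        if (fp != NULL)
        {
            fprintf(fp, "%d\n", value);
            fclose(fp);
        }
        /* pre-2.6.36 kernels only have the deprecated oom_adj file */
    }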

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "cancelling statement due to user request error" occurs but the transaction has committed.

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 10:42 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Jun 10, 2014 at 10:18 AM, Tom Lane  wrote:
>>> ...  So this whole area is a minefield, and the only
>>> attractive thing we can do is to try to reduce the number of errors that
>>> can get thrown post-commit.  We already, for example, do not treat
>>> post-commit file unlink failures as ERROR, though we surely would prefer
>>> to do that.
>
>> We could treat it as a lost-communication scenario.  The appropriate
>> recovery actions from the client's point of view are identical.
>
> I'd hardly rate that as an attractive option.

Well, the only other principled fix I can see is to add a new response
along the lines of ERRORBUTITCOMMITTED, which does not seem attractive
either, since all clients will have to be taught to understand it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 10, 2014 at 10:27 AM, Tom Lane  wrote:
>> I don't really recall any hard numbers being provided.  I think if we
>> looked at some results that said "here's the average gain, and here's
>> the worst-case loss, and here's an estimate of how often you'd hit
>> the worst case", then we could make a decision.

> The worst case loss is that you have to rescan the entire inner
> relation, so it's pretty darned bad.  I'm not sure how to construct an
> optimal worst case for that being monumentally expensive, but making
> the inner relation gigantic is probably a good start.

"Rescan"?  I'm pretty sure there are no cases where nodeHash reads the
inner relation more than once.  If you mean dumping to disk vs not dumping
to disk, yeah, that's a big increment in the cost.

> If we could allow NTUP_PER_BUCKET to drop when the hashtable is
> expected to fit in memory either way, perhaps with some safety margin
> (e.g. we expect to use less than 75% of work_mem), I bet that would
> make the people who have been complaining about this issue happy.

Could be a good plan.  We still need some test results though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 10:32 AM, Andres Freund  wrote:
> On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote:
>> Providing it as a GUC would have given end users both the pieces, but
>> with a compile-time option they have only one half of the solution;
>> except if they go compile their own binaries, which forces them into
>> being packagers.
>>
>> I am not alone in feeling that if Postgres wishes to provide a control
>> over child backend's oom_score_adj, it should be a GUC parameter
>> rather than a compile-time option. Yesterday a customer wanted to
>> leverage this and couldn't because they refuse to maintain their own
>> fork of Postgres code.
>
> Independent of the rest of the discussion, I think there's one more
> point: Trying to keep your system stable by *increasing* the priority of
> normal backends is a bad idea. If your system gets into OOM land you need
> to fix that, not whack around who gets killed.
> The reason it makes sense to increase the priority of the postmaster is
> that that *does* increase the stability by cleaning up resources and
> restarting everything.

As it stands right now, a user can decrease the likelihood of the
Postmaster being killed by adjusting the start script, but that
decreases the likelihood for all the child processes, too, making the
Postmaster just as likely a kill candidate as any backend, maybe even
more likely because it's the parent.

This patch gives the user a control to let the backends' likelihood of
being killed be different from/higher than that of the postmaster.

The users were already capable of doing that, but were required to
custom-compile Postgres to get the benefits. This patch just removes
that requirement.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Andres Freund
On 2014-06-10 10:42:41 -0400, Robert Haas wrote:
> On Tue, Jun 10, 2014 at 10:32 AM, Andres Freund wrote:
> > On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote:
> >> Providing it as a GUC would have given end users both the pieces, but
> >> with a compile-time option they have only one half of the solution;
> >> except if they go compile their own binaries, which forces them into
> >> being packagers.
> >>
> >> I am not alone in feeling that if Postgres wishes to provide a control
> >> over child backend's oom_score_adj, it should be a GUC parameter
> >> rather than a compile-time option. Yesterday a customer wanted to
> >> leverage this and couldn't because they refuse to maintain their own
> >> fork of Postgres code.
> >
> > Independent of the rest of the discussion, I think there's one more
> > point: Trying to keep your system stable by *increasing* the priority of
> > normal backends is a bad idea. If your system gets into OOM land you need
> > to fix that, not whack around who gets killed.
> > The reason it makes sense to increase the priority of the postmaster is
> > that that *does* increase the stability by cleaning up resources and
> > restarting everything.
> 
> But the priority is inherited by child processes, so to decrease the
> priority of JUST the postmaster you need to decrease its priority and
> then set the priority of each child back to the original value.

Gurjeet argues for making the absolute value of the child processes' priority
configurable though? My point is that attacking the problem from that
angle is wrong.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2014-06-10 Thread Rahila Syed
Hello,


In order to facilitate changing compression algorithms, and to be able
to recover using WAL records compressed with different compression
algorithms, information about the compression algorithm can be stored in
the WAL record.

The XLOG record header has 2 to 4 padding bytes in order to align the WAL
record. This space can be used for a new flag to store information about
the compression algorithm used. Like the xl_info field of the XLogRecord
struct, an 8-bit flag can be constructed, with the lower 4 bits of the
flag used to indicate which of backup blocks 0, 1, 2, 3 is compressed.
The higher four bits can be used to indicate the state of compression,
i.e. off, lz4, snappy, or pglz.

The flag can be extended to incorporate more compression algorithms, if
any are added in future.
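
For concreteness, the layout could look something like this (a sketch
with invented names, not committed code):

    /* low 4 bits: which of backup blocks 0..3 are compressed */
    #define XLR_BKP_COMPRESSED(n)   (1 << (n))      /* n in 0..3 */

    /* high 4 bits: which algorithm was used */
    #define XLR_COMPRESS_NONE       (0 << 4)
    #define XLR_COMPRESS_PGLZ       (1 << 4)
    #define XLR_COMPRESS_SNAPPY     (2 << 4)
    #define XLR_COMPRESS_LZ4        (3 << 4)

    /* e.g. backup blocks 0 and 2 compressed with lz4: */
    /* flag = XLR_BKP_COMPRESSED(0) | XLR_BKP_COMPRESSED(2) | XLR_COMPRESS_LZ4; */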

What is your opinion on this?


Thank you,

Rahila Syed


On Tue, May 27, 2014 at 9:27 AM, Rahila Syed wrote:

> Hello All,
>
> 0001-CompressBackupBlock_snappy_lz4_pglz extends the patch on compression of
> full-page writes to include LZ4 and Snappy. Changes include making the
> "compress_backup_block" GUC an enum instead of a boolean. The value of the
> GUC can be OFF, pglz, snappy or lz4, which can be used to turn off
> compression or set the desired compression algorithm.
>
> 0002-Support_snappy_lz4 adds support for LZ4 and Snappy in PostgreSQL. It
> uses Andres's patch for getting the Makefiles working and has a few wrappers
> to make the function calls to the LZ4 and Snappy compression functions and
> handle varlena datatypes.
> Patch courtesy: Pavan Deolasee
>
> These patches serve as a way to test various compression algorithms. They
> are still WIP. They don't support changing compression algorithms on
> standby.
> Also, the compress_backup_block GUC needs to be merged with full_page_writes.
> The patch uses the LZ4 high-compression (HC) variant.
> I have conducted initial tests which I would like to share and solicit
> feedback.
>
> Tests use the JDBC runner TPC-C benchmark to measure the amount of WAL
> compression, tps, and response time in each of the scenarios, viz.
> compression = OFF, pglz, LZ4, snappy, and FPW = off.
>
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
>
> Benchmark:
> Scale : 100
> Command  : java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 600,350,300,250,250
> Warmup time  : 1 sec
> Measurement time : 900 sec
> Number of tx types   : 5
> Number of agents : 16
> Connection pool size : 16
> Statement cache size : 40
> Auto commit  : false
> Sleep time   : 600,350,300,250,250 msec
>
> Checkpoint segments:1024
> Checkpoint timeout:5 mins
>
>
> Scenario     WAL generated (bytes)   Compression (bytes)   TPS (tx1,tx2,tx3,tx4,tx5)
> No_compress  2220787088 (~2221MB)    NULL                  13.3,13.3,1.3,1.3,1.3 tps
> Pglz         1796213760 (~1796MB)    424573328 (19.11%)    13.1,13.1,1.3,1.3,1.3 tps
> Snappy       1724171112 (~1724MB)    496615976 (22.36%)    13.2,13.2,1.3,1.3,1.3 tps
> LZ4(HC)      1658941328 (~1659MB)    561845760 (25.29%)    13.2,13.2,1.3,1.3,1.3 tps
> FPW(off)     139384320 (~139MB)      NULL                  13.3,13.3,1.3,1.3,1.3 tps
>
> As per the measurement results, WAL reduction using LZ4 is close to 25%,
> which is a 6-percentage-point increase in WAL reduction compared to pglz.
> WAL reduction with snappy is close to 22%.
> The numbers for compression using LZ4 and Snappy don't seem to be very
> high compared to pglz for the given workload. This can be due to the
> incompressible nature of the TPC-C data, which contains random strings.
>
> Compression does not have a bad impact on response time. In fact, response
> times for Snappy and LZ4 are much better than without compression, at almost
> 1/2 to 1/3 of the response times of no-compression (FPW=on) and FPW=off.
> The response-time ordering by type of compression is
> Pglz > Snappy > LZ4
>
> Scenario     Response time (tx1,tx2,tx3,tx4,tx5)
> no_compress  1848,4221,6791,5747 msec
> pglz         4275,2659,1828,4025,3326 msec
> Snappy       3790,2828,2186,1284,1120 msec
> LZ4(HC)      2519,2449,1158,2066,2065 msec
> FPW(off)     6234,2430,3017,5417,5885 msec
>
> LZ4 and Snappy are almost at par with each other in terms of response time
> as average response times of five types of transactions remains almost same
> for both.
> 0001-CompressBackupBlock_snappy_lz4_pglz.patch
> <http://postgresql.1045698.n5.nabble.com/file/n5805044/0001-CompressBackupBlock_snappy_lz4_pglz.patch>
> 0002-Support_snappy_lz4.patch
> <http://postgresql.1045698.n5.nabble.com/file/n5805044/0002-Support_snappy_lz4.patch>

Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane  wrote:
>> I don't find that this argument holds any water at all.  Anyone who's
>> developing their own start script can be expected to manage recompiling
>> Postgres.

> Huh?  Lots of people install PostgreSQL via, say, RPMs, but may still
> want to change their startup script locally.

So?  The RPM packager could probably be expected to have compiled with the
oom-adjust-reset option enabled.  If not, why aren't these people lobbying
the packager to meet their expectation?

I remain of the opinion that allowing nonprivileged people to decide
whether that code is active or not is unsafe from a system level.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NUMA packaging and patch

2014-06-10 Thread Robert Haas
On Mon, Jun 9, 2014 at 1:00 PM, Kevin Grittner  wrote:
> Andres Freund  wrote:
>> On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:
>>>> *) There is a lot of advice floating around (for example here:
>>>> http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
>>>> to instruct operators to disable zone_reclaim.  Will your changes
>>>> invalidate any of that advice?
>>>
>>> I expect that it will make the need for that far less acute,
>>> although it is probably still best to disable zone_reclaim (based
>>> on the documented conditions under which disabling it makes sense).
>>
>> I think it'll still be important unless you're running an OLTP workload
>> (i.e. minimal per backend allocations) and your entire workload fits
>> into shared buffers. What zone_reclaim > 0 essentially does is to never
>> allocate memory from remote nodes. I.e. it will throw away all numa node
>> local OS cache to satisfy a memory allocation (including
>> pagefaults).
>
> I don't think that cpuset spreading of OS buffers and cache, and
> the patch to spread shared memory, will make too much difference
> unless the working set is fully cached.  Where I have seen the
> biggest problems is when the active set > one memory node and <
> total machine RAM.

But that's precisely the scenario where vm.zone_reclaim_mode != 0 is a
disaster.  You'll end up throwing away the cached pages and rereading
the data from disk, even though the memory *could* have been kept all
in cache.

> I would agree that unless this patch is
> providing benefit for such a fully-cached load, it won't make any
> difference regarding the need for zone_reclaim_mode.  Where the
> data is heavily cached, zone_reclaim > 0 might discard some cached
> pages to allow, say, a RAM sort to be done in faster memory (for
> the current process's core), so it might be a wash or even make
> zone_reclaim > 0 a win.

I will believe that when, and only when, I see benchmarks convincingly
demonstrating it.  Setting zone_reclaim_mode can only be a win if the
performance benefit from using faster memory is greater than the
performance cost of any rereading-from-disk that happens.  IME, that's
a highly unusual situation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 10:51 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane  wrote:
>>> I don't find that this argument holds any water at all.  Anyone who's
>>> developing their own start script can be expected to manage recompiling
>>> Postgres.
>
>> Huh?  Lots of people install PostgreSQL via, say, RPMs, but may still
>> want to change their startup script locally.
>
> So?  The RPM packager could probably be expected to have compiled with the
> oom-adjust-reset option enabled.  If not, why aren't these people lobbying
> the packager to meet their expectation?

Because that might take years to happen, or the packager might never
do it at all on the theory that what is good for customers in general
is different than what's good for one particular customer, or on the
theory that it's just not a high enough priority for that packager.

Are you seriously saying that you've never needed to customize a
startup script on a RHEL box somewhere?

> I remain of the opinion that allowing nonprivileged people to decide
> whether that code is active or not is unsafe from a system level.

On what factual basis?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-10 Thread Robert Haas
On Tue, Jun 10, 2014 at 10:46 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Jun 10, 2014 at 10:27 AM, Tom Lane  wrote:
>>> I don't really recall any hard numbers being provided.  I think if we
>>> looked at some results that said "here's the average gain, and here's
>>> the worst-case loss, and here's an estimate of how often you'd hit
>>> the worst case", then we could make a decision.
>
>> The worst case loss is that you have to rescan the entire inner
>> relation, so it's pretty darned bad.  I'm not sure how to construct an
>> optimal worst case for that being monumentally expensive, but making
>> the inner relation gigantic is probably a good start.
>
> "Rescan"?  I'm pretty sure there are no cases where nodeHash reads the
> inner relation more than once.  If you mean dumping to disk vs not dumping
> to disk, yeah, that's a big increment in the cost.

Sorry, that's what I meant.

>> If we could allow NTUP_PER_BUCKET to drop when the hashtable is
>> expected to fit in memory either way, perhaps with some safety margin
>> (e.g. we expect to use less than 75% of work_mem), I bet that would
>> make the people who have been complaining about this issue happy.
>
> Could be a good plan.  We still need some test results though.

Sounds reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Andres Freund  writes:
> Independent of the rest of the discussion, I think there's one more
> point: Trying to keep your system stable by *increasing* the priority of
> normal backends is a bad idea. If your system gets into OOM land you need
> to fix that, not whack around who gets killed.
> The reason it makes sense to increase the priority of the postmaster is
> that that *does* increase the stability by cleaning up resources and
> restarting everything.

That's half of the reason.  The other half is that, at least back when
we added this code, the Linux kernel's victim-selection code
disproportionately chose to kill the postmaster rather than any child
backend, an outcome definitely not to be preferred.  IIRC this was because
it blamed the postmaster for the sum of the children's memory sizes *including
shared memory*, counting the shared memory over again for each child.

I don't know whether our switch away from SysV shmem has helped that, or
if recent kernel versions have less brain-dead OOM victim selection.
I'm not terribly hopeful though.

But anyway, yeah, the point of this feature is that the OOM priority
of the postmaster, and *only* the postmaster, should be raised.  Allowing
unprivileged people to break that is not attractive on any level.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 10, 2014 at 10:51 AM, Tom Lane  wrote:
>> So?  The RPM packager could probably be expected to have compiled with the
>> oom-adjust-reset option enabled.  If not, why aren't these people lobbying
>> the packager to meet their expectation?

> Because that might take years to happen,

... unlike adding a GUC?

> or the packager might never
> do it at all on the theory that what is good for customers in general
> is different than what's good for one particular customer, or on the
> theory that it's just not a high enough priority for that packager.

> Are you seriously saying that you've never needed to customize a
> startup script on a RHEL box somewhere?

Sure, but what's that have to do with this?  Any Red Hat or PGDG RPM will
come with this code already enabled in the build, so there is no need for
anyone to have a GUC to play around with the behavior.

>> I remain of the opinion that allowing nonprivileged people to decide
>> whether that code is active or not is unsafe from a system level.

> On what factual basis?

Because it would convert the intended behavior (postmaster and only
postmaster is exempt from OOM kill) into a situation where possibly
all of the database processes are exempt from OOM kill, at the whim
of somebody who should not have the privilege to decide that.

In my view, the root-owned startup script grants OOM exemption to
the postmaster because it *knows* that the postmaster's children
will drop the exemption.  If that trust can be violated because some
clueless DBA decided to frob a GUC setting, I'd be a lot more hesitant
about giving the postmaster the exemption in the first place.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Andres Freund
> Andres Freund  writes:
> > Independent of the rest of the discussion, I think there's one more
> > point: Trying to keep your system stable by *increasing* the priority of
> > normal backends is a bad idea. If you system gets into OOM land you need
> > to fix that, not whack who gets killed around.
> > The reason it makes sense to increase the priority of the postmaster is
> > that that *does* increase the stability by cleaning up resources and
> > restarting everything.
> 
> That's half of the reason.  The other half is that, at least back when
> we added this code, the Linux kernel's victim-selection code
> disproportionately chose to kill the postmaster rather than any child
> backend, an outcome definitely not to be preferred.  IIRC this was because
> it blamed the postmaster for the sum of childrens' memory sizes *including
> shared memory*, counting the shared memory over again for each child.

That's essentially the same thing. We want the postmaster to survive, not
the children...

> I don't know whether our switch away from SysV shmem has helped that, or
> if recent kernel versions have less brain-dead OOM victim selection.
> I'm not terribly hopeful though.

It has gotten much better in the last few versions. But it'll be a
fair while until those are widely used. And even then, from the
kernel's view, there's just no reason to strongly prefer not killing the
postmaster over other processes without the user providing that information.

On 2014-06-10 11:04:52 -0400, Tom Lane wrote:
> But anyway, yeah, the point of this feature is that the OOM priority
> of the postmaster, and *only* the postmaster, should be raised.  Allowing
> unprivileged people to break that is not attractive on any level.

A superuser could already write a function that echoes into /proc? I
fail to see how a GUC changes the landscape significantly here.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Andres Freund  writes:
> On 2014-06-10 11:04:52 -0400, Tom Lane wrote:
>> But anyway, yeah, the point of this feature is that the OOM priority
>> of the postmaster, and *only* the postmaster, should be raised.  Allowing
>> unprivileged people to break that is not attractive on any level.

> A superuser could already write a function that echoes into /proc?

Maybe I'm mistaken, but I thought once the fork_process code has reset our
process's setting to zero it's not possible to lower it again (without
privileges we'd not have).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Andres Freund
On 2014-06-10 11:14:43 -0400, Tom Lane wrote:
> Sure, but what's that have to do with this?  Any Red Hat or PGDG RPM will
> come with this code already enabled in the build, so there is no need for
> anyone to have a GUC to play around with the behavior.

That's imo a fair point. Unless I misunderstand things Gurjeet picked
the topic up again because he wants to increase the priority of the
children. Is that correct Gurjeet?

I do think a GUC might be a nicer interface for this than a
#define. Possibly even with an absolute reset value for the children. But
only because there's no way, afaics, to explicitly reset to the default
value.

> >> I remain of the opinion that allowing nonprivileged people to decide
> >> whether that code is active or not is unsafe from a system level.
> 
> > On what factual basis?
> 
> Because it would convert the intended behavior (postmaster and only
> postmaster is exempt from OOM kill) into a situation where possibly
> all of the database processes are exempt from OOM kill, at the whim
> of somebody who should not have the privilege to decide that.

Meh^3. By that argument we need to forbid superusers to create any form
of untrusted functions. Forbid anything that does malloc(), system(),
fork(), whatever from a user's influence.
Superusers can execute code in the postmaster using
shared_preload_libraries. Feigning that that's not possible isn't buying
us anything.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Tom Lane
Gurjeet Singh  writes:
> As it stands right now, a user can decrease the likelihood of the
> Postmaster being killed by adjusting the start script, but that
> decreases the likelihood for all the child processes, too, making the
> Postmaster just as likely a kill candidate as any backend, maybe even
> more likely because it's the parent.

Exactly.

> This patch gives the user a control to let the backends' likelihood of
> being killed be different from/higher than that of the postmaster.

If you think your users might want to give the postmaster OOM-exemption,
why don't you just activate the existing code when you build?  Resetting
the OOM setting to zero is safe whether or not the startup script did
anything to the postmaster's setting.

In short, I don't see what a GUC adds here, except uncertainty, which
is not a good thing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Andres Freund
On 2014-06-10 11:20:28 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2014-06-10 11:04:52 -0400, Tom Lane wrote:
> >> But anyway, yeah, the point of this feature is that the OOM priority
> >> of the postmaster, and *only* the postmaster, should be raised.  Allowing
> >> unprivileged people to break that is not attractive on any level.
> 
> > A superuser could already write a function that echoes into /proc?
> 
> Maybe I'm mistaken, but I thought once the fork_process code has reset our
> process's setting to zero it's not possible to lower it again (without
> privileges we'd not have).

No, doesn't look that way. It's possible to reset it to the value set at
process start. So unless we introduce double forks for every backend
start it can be reset by ordinary processes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

