Re: [HACKERS] WIP patch (v2) for updatable security barrier views
On 01/09/2014 11:19 PM, Tom Lane wrote:
> Dean Rasheed dean.a.rash...@gmail.com writes:
>> My first thought was that it should just preprocess any security barrier quals in
>> subquery_planner() in the same way as other quals are preprocessed. But thinking about it
>> further, those quals are destined to become the quals of subqueries in the range table, so
>> we don't actually want to preprocess them at that stage --- that will happen later when the
>> new subquery is planned by recursion back into subquery_planner(). So I think the right
>> answer is to make adjust_appendrel_attrs() handle recursion into sublink subqueries.
>
> TBH, this sounds like doubling down on a wrong design choice. I see no good reason that
> updatable security views should require any fundamental rearrangements of the order of
> operations in the planner; and I doubt that this is the last bug you'll have if you insist
> on doing that.

I'd be quite happy to do this entirely within the rewriter. I've found two persistent obstacles to that, and frankly I'm stuck. I'm going to be reworking the RLS patches on top of Dean's functional patch unless I can find some way to progress with a rewriter-based approach.

The key problems are:

1. preprocess_targetlist in the planner assumes that the resultRelation is the correct RTE to set as the varno in a new Var it adds to fetch the row ctid (with label ctid1) as a resjunk attr for row-marking. This causes the tlist to have entries pointing to a different RTE from the one being scanned by the eventual seqscan / indexscan, though the underlying Relation is the same. tlist validation checks don't like that. There may be other places that need to add tlist entries pointing to the relation we're reading rows from. They'll also need to be able to deal with the fact that this is no longer the resultRelation.

2. Despite bashing my head against it for ages, I haven't figured out how to inject references to the base rel's ctid, oid (if WITH OIDS), and any tlist entries not specified in the DML statement into the subquery tree. These are only accessible at the deepest level of rewriting, when the final view is expanded into a subquery and processed with rewriteTargetListUD(..). At this point we don't have breadcrumbs to use to walk back up the nested subqueries adding the required tlist entries. I keep exploring ideas for this one, and hit a dead end with every one.

Without a way to move on these, I don't have much hope of adding updatable security barrier view support using work done in the rewriter.

It seems inevitable that we'll have to add the separate concepts of source relation (tuples to feed into HeapModifyTable for ctid, and for heap_modify_table after junk filtering) and result relation (target Relation of heap_modify_table to actually write tuples to, target of row-level locking operations).

There's also going to need to be some kind of breadcrumb chain to allow us to walk from the innermost expanded view's RTE_RELATION back up the expanded view subquery tlists, adding next-innermost refs to resjunk ctid and (if needed) oid, injecting defaults, and expanding the target list with Vars to match non-referenced attributes of the innermost RTE_RELATION. So far I haven't come up with a sensible form for that breadcrumb trail.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
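[Editorial note: for readers joining the thread, a minimal sketch of the feature under discussion, using the existing security_barrier reloption; table and column names are illustrative only.]

    CREATE TABLE accounts (id int PRIMARY KEY, owner text, balance numeric);

    -- A security barrier view: its WHERE clause must be enforced before
    -- any user-supplied quals run, to block leaky-function attacks.
    CREATE VIEW my_accounts WITH (security_barrier = true) AS
        SELECT id, owner, balance
        FROM accounts
        WHERE owner = current_user;

    -- The goal of this patch series: make such views updatable while
    -- still enforcing the barrier qual in a nested subquery.
    UPDATE my_accounts SET balance = balance - 100 WHERE id = 1;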
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/11/2014 12:40 AM, Peter Geoghegan wrote:
> My problem is that in general I'm not sold on the actual utility of making this kind of row
> locking work with exclusion constraints. I'm sincerely having a hard time thinking of a
> practical use-case (although, as I've said, I want to make it work with IGNORE). Even if you
> work all this row locking stuff out, and the spill-to-disk aspect out, the interface is
> still wrong, because you need to figure out a way to project more than one reject per slot.
> Maybe I lack imagination around how to make that work, but there are a lot of ifs and buts
> either way.

Exclusion constraints can be used to implement uniqueness checks with SP-GiST or GiST indexes. For example, if you want to enforce that there are no two tuples with the same x and y coordinates, ie. use a point as the key. You could add a b-tree index just to enforce the constraint, but it's better if you don't have to. In general, it's just always better if features don't have implementation-specific limitations like this.

- Heikki
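[Editorial note: a minimal sketch of the kind of constraint Heikki describes; names are illustrative, and ~= is the built-in "same as" operator for points, supported by the GiST point opclass.]

    CREATE TABLE places (
        id  serial PRIMARY KEY,
        pos point NOT NULL,
        -- uniqueness on a point, enforced via GiST instead of b-tree
        EXCLUDE USING gist (pos WITH ~=)
    );

    INSERT INTO places (pos) VALUES (point(1, 2));  -- succeeds
    INSERT INTO places (pos) VALUES (point(1, 2));  -- exclusion violation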
Re: [HACKERS] [PATCH] Filter error log statements by sqlstate
Hi Oskari,

I had a quick look over the patch (not compiled, though). Here are a few comments on the changes:

1. Documentation is missing, which makes it difficult to understand what exactly you are trying to do. In other words, users will be uncertain about how to use it efficiently.

2. Some more comments are required, at each new function and specifically at get_sqlstate_error_level().

3. Please add test cases if possible.

4. Some code does not comply with PostgreSQL indentation style. (This can be ignored as it will pass through pgindent, but better to fix it.)

5. You have used the "XX000:warning," string to get the maximum possible length of a valid sqlstate:level identifier. That's fine, but a small explanation of it there would be good. Also, if in the future we add another error level which exceeds this, we will need changes here too. Right?

I will look into this further, but please give your attention to the above points.

Thanks

On Fri, Jan 10, 2014 at 12:56 AM, Oskari Saarenmaa o...@ohmu.fi wrote:
> Allow the default log_min_error_statement to be overridden per sqlstate to make it possible
> to filter out some error types while maintaining a low log_min_error_statement, or enable
> logging for some error types when the default is to not log anything.
>
> I've tried to do something like this using rsyslog filters, but that's pretty awkward and
> doesn't work at all when the statement is split to multiple syslog messages.
>
> https://github.com/saaros/postgres/compare/log-by-sqlstate
>
>  src/backend/utils/error/elog.c | 183 +++++++++++++++++++++++++++++++++-
>  src/backend/utils/misc/guc.c   |  14 +++-
>  src/include/utils/guc.h        |   4 +
>  src/include/utils/guc_tables.h |   1 +
>  4 files changed, 199 insertions(+), 3 deletions(-)
>
> / Oskari

-- 
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/11/2014 12:39 PM, Peter Geoghegan wrote:
> In any case, my patch is bound to win decisively for the other extreme, the insert-only
> case, because the overhead of doing an index scan first is always wasted there with your
> approach, and the overhead of extended btree leaf page locking has been shown to be quite
> low.

Quite possibly. Run the benchmark, and we'll see how big a difference we're talking about.

> In the past you've spoken of avoiding that overhead through an adaptive strategy based on
> statistics, but I think you'll have a hard time beating a strategy where the decision comes
> as late as possible, and is informed by highly localized page-level metadata already
> available. My implementation can abort an attempt to just read an existing would-be
> duplicate very inexpensively (with no strong locks), going back to just after the
> _bt_search() to get a heavyweight lock if just reading doesn't work out (if there is no
> duplicate found), so as to not waste all of its prior work.

Doing one of the two extremes of insert-mostly or update-only well is relatively easy; dynamically adapting to one or the other is much harder. Especially if it's a consistent mix of inserts and updates, where general observations aren't terribly useful.

Another way to optimize it is to keep the b-tree page pinned after doing the pre-check. Then you don't need to descend the tree again when doing the insert. That would require small indexam API changes, but wouldn't be too invasive, I think.

> All other concerns of mine still remain, including the concern over the extra locking of
> the proc array - I'm concerned about the performance impact of that on other parts of the
> system not exercised by this test.

Yeah, I'm not thrilled about that part either. Fortunately there are other ways to implement that. In fact, I think you could just not bother taking the ProcArrayLock when setting the fields. The danger is that another backend sees a mixed state of the fields, but that's OK. The worst that can happen is that it will do an unnecessary lock/release on the heavy-weight lock. And to reduce the overhead when reading the fields, you could merge the SpeculativeInsertionIsInProgress() check into TransactionIdIsInProgress(). The call site in tqual.c always calls it together with TransactionIdIsInProgress(), which scans the proc array anyway, while holding the lock.

- Heikki
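[Editorial note: a sketch of how the syntax proposed in this thread was intended to be used. The exact spelling, including RETURNING REJECTS, was still under discussion at this point; treat the details below as assumptions, not a committed feature.]

    -- Sketch only: lock the conflicting row on duplicate key, project
    -- the rejected tuples, and update them in a wCTE.
    WITH rejects AS (
        INSERT INTO upsert_target (key, val)
        VALUES (1, 'foo')
        ON DUPLICATE KEY LOCK FOR UPDATE
        RETURNING REJECTS *
    )
    UPDATE upsert_target t
    SET    val = r.val
    FROM   rejects r
    WHERE  t.key = r.key;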
Re: [HACKERS] ECPG regression tests generating warnings
On Sun, Jan 12, 2014 at 08:28:57AM -0800, Kevin Grittner wrote:
> desc.pgc:55: WARNING: descriptor outdesc does not exist
> desc.pgc:86: WARNING: descriptor outdesc does not exist

Thanks, I didn't notice; fixed.

Michael

-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
Re: [HACKERS] WIP patch (v2) for updatable security barrier views
On 12 January 2014 10:12, Craig Ringer cr...@2ndquadrant.com wrote:
> On 01/09/2014 06:48 PM, Dean Rasheed wrote:
>> On 8 January 2014 10:19, Dean Rasheed dean.a.rash...@gmail.com wrote:
>>> The assertion failure with inheritance and sublinks is a separate issue ---
>>> adjust_appendrel_attrs() is not expecting to find any unplanned sublinks in the query
>>> tree when it is invoked, since they would normally have all been planned by that point.
>>> However, the addition of the new security barrier subqueries after inheritance expansion
>>> can now insert new sublinks which need to be planned. I'll look into how best to make
>>> that happen.
>>
>> The attached patch does that, which fixes the case you reported.
>
> Dean, any objections to adding this to the current CF, or to my doing so?

OK, I'll do that.

I've added a page to the wiki with a more in-depth description of how the patch works, and the test cases I've tried so far:

https://wiki.postgresql.org/wiki/Making_security_barrier_views_automatically_updatable

There's obviously still a lot more testing to do, but the early signs are encouraging.

Regards,
Dean
Re: [HACKERS] [PATCH] Filter error log statements by sqlstate
Hi,

On 13/01/14 10:26, Jeevan Chalke wrote:
> 1. Documentation is missing and thus becomes difficult to understand what exactly you are
> trying to do. Or in other words, user will be uncertain about using it more efficiently.

I figured I'd write documentation for this if it looks like a useful feature which would be accepted for 9.4, but I guess it would've helped to have a bit better description of this for the initial submission as well.

> 2. Some more comments required. At each new function and specifically at
> get_sqlstate_error_level().

Just after I submitted the patch I noticed that I had a placeholder for a comment about that function but never wrote the actual comment, sorry about that.

> 3. Please add test-case if possible.

Sure.

> 4. Some code part does not comply with PostgreSQL indentation style. (Can be ignored as it
> will pass through pg_indent, but better fix it).

I'll try to fix this for v2.

> 5. You have used "XX000:warning," string to get maximum possible length of the valid
> sqlstate:level identifier. It's perfect, but small explanation about that will be good
> there. Also in future if we have any other error level which exceeds this, we need changes
> here too. Right?

Good point, I'll address this in v2.

> I will look into this further. But please have your attention on above points.

Thanks for the review!

/ Oskari

On Fri, Jan 10, 2014 at 12:56 AM, Oskari Saarenmaa o...@ohmu.fi wrote:
> Allow the default log_min_error_statement to be overridden per sqlstate to make it possible
> to filter out some error types while maintaining a low log_min_error_statement or enable
> logging for some error types when the default is to not log anything. I've tried to do
> something like this using rsyslog filters, but that's pretty awkward and doesn't work at
> all when the statement is split to multiple syslog messages.
>
> https://github.com/saaros/postgres/compare/log-by-sqlstate
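[Editorial note: to make the proposal concrete, a configuration under this patch might look something like the following. The setting name and exact syntax are assumptions inferred from the "XX000:warning" format mentioned in the review, not the patch's final spelling.]

    # postgresql.conf sketch -- setting name and syntax are assumptions
    log_min_error_statement = error
    # Hypothetical per-sqlstate overrides: effectively silence statement
    # logging for unique violations (23505), but log statements for
    # serialization failures (40001) even at lower severity.
    log_sqlstate_error_statement = '23505:panic, 40001:warning'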
Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.
On 01/09/2014 10:55 PM, Josh Berkus wrote:
> On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:
>> Actually, why is the partially-filled 000000010000000000000002 file archived in the first
>> place? Looking at the code, it's been like that forever, but it seems like a bad idea. If
>> the original server is still up and running, and writing more data to that file, what will
>> happen is that when the original server later tries to archive it, it will fail because
>> the partial version of the file is already in the archive. Or worse, the partial version
>> overwrites a previously archived more complete version.
>
> Oh! This explains some transient errors I've seen.
>
>> Wouldn't it be better to not archive the old segment, and instead switch to a new segment
>> after writing the end-of-recovery checkpoint, so that the segment on the new timeline is
>> archived sooner?
>
> It would be better to zero-fill and switch segments, yes. We should NEVER be in a position
> of archiving two different versions of the same segment.

Ok, I think we're in agreement that that's the way to go for master. Now, what to do about back-branches? On one hand, I'd like to apply the same fix to all stable branches, as the current behavior is silly and always has been. On the other hand, we haven't heard any complaints about it, so we probably shouldn't fix what ain't broken. Perhaps we should apply it to 9.3, as that's where we have the acute problem the OP reported. Thoughts?

In summary, I propose that we change master and REL9_3_STABLE to not archive the partial segment from the previous timeline. Older branches will keep the current behavior.

- Heikki
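[Editorial note: independently of the fix proposed here, an archive_command that refuses to overwrite an existing file guards against a partial copy clobbering a more complete one. This is the example command from the PostgreSQL documentation; the archive path is illustrative.]

    # postgresql.conf: never overwrite an already-archived segment
    archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'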
Re: [HACKERS] generic pseudotype IO functions?
On Mon, 2014-01-06 at 17:36 +0100, Andres Freund wrote:
> FWIW, I am perfectly fine with duplicating the functions for now - I just thought that that
> might not be the best way but I didn't (and still don't) have a strong opinion.

Could we just put 0 in for the functions' OID and have code elsewhere that errors "there is no input function for this type"?
Re: [HACKERS] Standalone synchronous master
On Sun, Jan 12, Amit Kapila wrote:
> How would that work? Would it be a tool in contrib? There already is a timeout, so if a
> tool checked more frequently than the timeout, it should work. The durable notification of
> the admin would happen in the tool, right?
>
> Well, you know what tool *I'm* planning to use. Thing is, when we talk about auto-degrade,
> we need to determine things like "Is the replica down or is this just a network blip?" and
> take action according to the user's desired configuration. This is not something,
> realistically, that we can do on a single request. Whereas it would be fairly simple for an
> external monitoring utility to do:
>
> 1. decide replica is offline for the duration (several poll attempts have failed)
> 2. Send ALTER SYSTEM SET to the master and change/disable the synch_replicas.

Will it be possible in the current mechanism, given that presently the master will not accept any new command when the sync replica is not available? Or is there something else which needs to be done along with the above 2 points to make it possible?

Since no WAL is written for the ALTER SYSTEM SET command, it should be able to handle this command even though the sync replica is not available.

Thanks and Regards,
Kumar Rajeev Rastogi
Re: [HACKERS] ISN extension bug? (with patch)
On 01/03/2014 07:53 PM, Fabien COELHO wrote:
>>> If so, there is only the one-liner patch to consider.
>>
>> This patch doesn't apply anymore. Please submit an updated patch for the commit fest.
>
> In src/include/utils/elog.h there is an include for utils/errcodes.h which is generated
> somehow when compiling postgresql but not present by default. So you have to compile
> postgresql and then the contrib, or use PGXS with an already installed version. With this
> caveat, the one-liner patch (4 characters removed) reattached does compile for me:

Thanks, applied.

- Heikki
Re: [HACKERS] WIP patch (v2) for updatable security barrier views
On 01/09/2014 11:19 PM, Tom Lane wrote:
> TBH, this sounds like doubling down on a wrong design choice. I see no good reason that
> updatable security views should require any fundamental rearrangements of the order of
> operations in the planner.

In that case, would you mind offering a quick sanity check on the following alternative idea:

- Add sourceRelation to Query. This refers to the RTE that supplies tuple projections to feed into ExecModifyTable, with appropriate resjunk ctid and (if req'd) oid cols present.

- When expanding a target view into a subquery, set sourceRelation on the outer view to the index of the RTE of the newly expanded subquery.

- In rewriteTargetView, as now, reassign resultRelation to the target view's base rel. This is required so that any RETURNING and WITH CHECK OPTION fixups required to adjust the RETURNING list to the new result relation act on the final tuple after any BEFORE triggers act. Do not flatten the view subquery and merge the quals (as currently happens); allow it to be expanded as a subquery by the rewriter instead. Don't mess with the view tlist at this point except by removing the whole-row Var added by rewriteTargetListUD.

- When doing tlist expansion in preprocess_targetlist, when we process the outer Query (the only one for which the query type is not SELECT, and the only one that has a non-zero resultRelation), if resultRelation != sourceRelation, recursively follow the chain of sourceRelations to the bottom one with type RTE_RELATION. Do tlist expansion on that innermost Query first, using sourceRelation to supply the varno for injected TLEs, including injecting ctid, oid if req'd, etc. During call stack unwind, have each intermediate layer do regular tlist expansion, adding a Var pointing to each tlist entry of the inner subquery. At the outer level of preprocess_targetlist, sort the tlist, now expanded to include all required vars, into attribute order for the resultRelation (this level is the only one that has resultRelation set). Avoid invoking preprocess_targetlist on the inner Query again when it's processed in turn, or just bail out when we see sourceRelation set, since we know it's already been done. (Alternately, it might be possible to run preprocess_targetlist depth-first instead of the current outermost-first, but I haven't looked at that.)

The optimizer can still flatten non-security-barrier updatable views, following the chain of Vars as it collapses each layer. That's effectively what the current rewriteTargetView code is doing manually at each pass right now.

I'm sure there are some holes in this outline, but it's struck me as possibly workable. The key is to set sourceRelation on every inner subquery in the target query chain, not just the outer one, so it can be easily followed from the outer query through the subqueries into the innermost query with RTE_RELATION type.
The only alternative I've looked at is looking clumsier the longer I examine it: adding a back-reference in each subquery's Query struct, pointing to the Query containing it and the RTI of the subquery within the outer Query. That way, once rewriting hits the innermost rel with RTE_RELATION type, the rewriter can walk back up the Query tree doing tlist rewriting. I'm not sure if this is workable yet, and it creates ugly pointer-based backrefs *up* the Query chain, making what was previously a tree of Query* into a graph. That could get exciting, though there'd never be any need for mutators to follow the parent query pointer, so it wouldn't make tree rewrites harder.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] nested hstore patch
Thank you, Erik!

Oleg

On Mon, Jan 13, 2014 at 12:25 PM, Erik Rijkers e...@xs4all.nl wrote:
> On Mon, January 13, 2014 00:24, Erik Rijkers wrote:
>> On Sat, January 11, 2014 22:47, Andrew Dunstan wrote:
>>> On 01/11/2014 03:03 PM, Erik Rijkers wrote:
>>>> On Sat, January 11, 2014 20:30, Peter Eisentraut wrote:
>>>>> The documentation doesn't build.
>>>>
>>>> corrective patch is here:
>>>> http://www.postgresql.org/message-id/37b9f104d5a838eec9b75f3668517aa5.squir...@webmail.xs4all.nl
>>>
>>> It will be in the next version of the patch posted.
>>
>> Attached is another handful of doc-fixes...
>
> There are errors in the example expressions in Table F-6, "hstore Operators". Attached is a
> cumulative doc-patch (which includes the changes I sent earlier) which fixes these.
>
> I also attach a test perl program that shows the (small) differences in output between
> what's in that doc table and what one actually gets. (I found these too insignificant to
> change but perhaps you have a different opinion.)
>
> thanks,
>
> Erik Rijkers
Re: [HACKERS] Standalone synchronous master
On Jan12, 2014, at 04:18 , Josh Berkus j...@agliodbs.com wrote:
> Thing is, when we talk about auto-degrade, we need to determine things like "Is the replica
> down or is this just a network blip?" and take action according to the user's desired
> configuration. This is not something, realistically, that we can do on a single request.
> Whereas it would be fairly simple for an external monitoring utility to do:
>
> 1. decide replica is offline for the duration (several poll attempts have failed)
> 2. Send ALTER SYSTEM SET to the master and change/disable the synch_replicas.
>
> In other words, if we're going to have auto-degrade, the most intelligent place for it is
> in RepMgr/HandyRep/OmniPITR/pgPoolII/whatever. It's also the *easiest* place. Anything we
> do *inside* Postgres is going to have a really, really hard time determining when to
> degrade.

+1

This is also how 2PC works, btw - the database provides the building blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager to deal with issues that require a whole-cluster perspective.

best regards,
Florian Pflug
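[Editorial note: a sketch of what step 2 above might look like once ALTER SYSTEM (then a pending 9.4 feature) is available; the monitoring tool's decision logic is assumed to live elsewhere.]

    -- Run by an external monitor on the master after it decides the
    -- sync standby is gone for good: degrade to async replication.
    ALTER SYSTEM SET synchronous_standby_names = '';
    SELECT pg_reload_conf();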
[HACKERS] Where do we stand on 9.3 bugs?
What remaining issues are there blocking a 9.3.3 release? I know that there were unresolved multixact issues when we put out 9.3.2 --- are those all dealt with now? What else do people see as release-blockers?

regards, tom lane
Re: [HACKERS] generic pseudotype IO functions?
Peter Eisentraut pete...@gmx.net writes:
> On Mon, 2014-01-06 at 17:36 +0100, Andres Freund wrote:
>> FWIW, I am perfectly fine with duplicating the functions for now - I just thought that
>> that might not be the best way but I didn't (and still don't) have a strong opinion.
>
> Could we just put 0 in for the functions' OID and have code elsewhere that errors "there is
> no input function for this type"?

That doesn't seem like much of an improvement to me: that would be taking a catalog corruption condition and blessing it as a legitimate state of affairs, thereby reducing our ability to detect problems. One instance where it would create issues is that I'm pretty sure pg_dump would get confused by such a type. Admittedly, pg_dump will never try to dump the built-in pseudotypes, but do we really want them handled so differently from user-definable types?

regards, tom lane
Re: [HACKERS] Where do we stand on 9.3 bugs?
Tom Lane wrote:
> What remaining issues are there blocking a 9.3.3 release? I know that there were unresolved
> multixact issues when we put out 9.3.2 --- are those all dealt with now? What else do
> people see as release-blockers?

The only thing I'm aware of still outstanding in multixact land is the creation of a separate multixact freeze age GUC variable. All other issues were resolved. Bug #8470 is still unresolved, but that's a performance issue.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Where do we stand on 9.3 bugs?
On 2014-01-13 12:26:45 -0300, Alvaro Herrera wrote:
> The only thing I'm aware of still outstanding in multixact land is the creation of a
> separate multixact freeze age GUC variable. All other issues were resolved. Bug #8470 is
> still unresolved, but that's a performance issue.

It's not strictly a 9.3 issue, but there's still the issue of pg_multixact/members wrapping around into valid data, right?

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Where do we stand on 9.3 bugs?
Andres Freund wrote:
> It's not strictly a 9.3 issue, but there's still the issue of pg_multixact/members wrapping
> around into valid data, right?

Yes, but TBH I don't have much of an intention to fix that other than having the freeze limit. At least not for 9.3.3.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] nested hstore patch
Andrew Dunstan wrote:
> On 01/13/2014 03:25 AM, Erik Rijkers wrote:
>> There are errors in the example expressions in Table F-6, "hstore Operators". Attached is
>> a cumulative doc-patch (which includes the changes I sent earlier) which fixes these. I
>> also attach a test perl program that shows the (small) differences in output between
>> what's in that doc table and what one actually gets. (I found these too insignificant to
>> change but perhaps you have a different opinion.)
>
> A new version of the patch is attached. It includes all of Erik's docs fixes and a small
> fix by Alexander Korotkov for hstore hash ops.

Interestingly, this also includes transaction_commit event triggers. There are also a few PANIC elogs, probably not what's intended.

(I was just giving this a quick skim to see if there's support to build JSON objects incrementally from C source, i.e. not have to call functions using the fmgr interface. Apparently that's not the case, but if I'm wrong please let me know.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] nested hstore patch
Andrew, did you run the perl script? Actually, I found that the operator table needs to be fixed.

Oleg

On Mon, Jan 13, 2014 at 7:36 PM, Andrew Dunstan and...@dunslane.net wrote:
> On 01/13/2014 03:25 AM, Erik Rijkers wrote:
>> There are errors in the example expressions in Table F-6, "hstore Operators". [...]
>
> A new version of the patch is attached. It includes all of Erik's docs fixes and a small
> fix by Alexander Korotkov for hstore hash ops.
>
> cheers
>
> andrew
Re: [HACKERS] Add CREATE support to event triggers
Alvaro Herrera wrote:
> In an event trigger, the function pg_event_trigger_get_creation_commands() returns the
> following JSON blob:

After playing with this for a while, I realized something that must have seemed quite obvious to those paying attention: what this function is, is just a glorified sprintf() for JSON. So I propose we take our existing format(text) and use it to model a new format(json) function, which will be useful to the project at hand and be of more general applicability.

To make it a better fit, I have changed the spec slightly. The format string is now the "fmt" element in the topmost JSON. This format string can contain % escapes, which consist of:

* the literal % itself
* an element name, enclosed in braces { }. The name can optionally be followed by a colon and a possibly-empty array separator.
* a format specifier, which can be I (identifier), D (dotted name), or s (string)
* Alternatively, %% expands to a literal %, as usual.

For each such escape, the JSON object is searched using the element name as key. For identifiers, the element is expected to be a string, and will be quoted per identifier quoting rules. Dotted-names are used to format possibly-qualified relation names and such; the element must be an object with one, two or three string elements, each of which is quoted per identifier rules, and output separated by periods. Finally, for arrays we expand each element in the JSON array element, and separate them with the separator specified in the {} part of the format specifier.

For instance,

alvherre=# select format(json '{"fmt":"hello, %{who}s! This is %{name}I",
                                "who":"world", "name":"a function"}');
              format
------------------------------------
 hello, world! This is "a function"

Elements can be objects, in which case they are expanded recursively: a "fmt" element is looked up and expanded as described above.

I don't yet see a need for %L escapes (that is, literals that can expand to a single-quoted value or to NULL), but if I see it I will add that too.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
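[Editorial note: a further sketch of the array and dotted-name escapes described above. This is an illustration following the spec, not an example from the patch; in particular the object keys used for the dotted name are assumptions.]

    -- Hypothetical use of %{...}D and an array with a ", " separator
    select format(json '{"fmt":"CREATE TABLE %{name}D (%{columns:, }s)",
                         "name":{"schema":"public","relname":"t"},
                         "columns":["a integer","b text"]}');
    -- expected result: CREATE TABLE public.t (a integer, b text)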
Re: [HACKERS] nested hstore patch
On 01/13/2014 11:03 AM, Alvaro Herrera wrote:
> Interestingly, this also includes transaction_commit event triggers.

Oh, wow, really? git really did something horrible, or I did something inadvertently. This is what comes from using the same directory for multiple development lines :-( Will fix.

> There are also a few PANIC elogs, probably not what's intended.

Oleg, Teodor, please address.

> (I was just giving this a quick skim to see if there's support to build JSON objects
> incrementally from C source, i.e. not have to call functions using the fmgr interface.
> Apparently that's not the case, but if I'm wrong please let me know.)

Erm, maybe you need the other json patch:

http://www.postgresql.org/message-id/52c76b33.1050...@dunslane.net

If we need to adjust some of that a bit to make it more friendly for internal use I'm happy to try to do that. Unfortunately, I don't think that's terribly easy for VARIADIC "any" functions like these.

cheers

andrew
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer cr...@2ndquadrant.com wrote:
> Implicit casts to text, anybody?

This backward compatibility break orphaned the company I work for on 8.1 until last year and very nearly caused postgres to be summarily extirpated (only rescued at the last minute by my arrival). It cost hundreds of thousands of dollars to qualify a sprawling java code base so that it could be moved back onto a supported version. Breaking compatibility sucks -- it hurts your users and costs people money. Hacking type casts may not have been a mistake, but the arbitrary introduction of the breakage certainly was.

This project has no deprecation policy, and I'd argue we'd need one before considering breaking changes. For example, maybe we could pull out an occasional release for longer-term support to help users that get caught out. But really, the better way to go IMNSHO is to take a hard line on compatibility issues pretty much always -- consider the case of libc and the win32 api. There are certain limited exceptions to this rule -- for example, security problems or gross violations of the standard (bringing row-wise comparison to spec comes to mind as an example of that).

merlin
[HACKERS] Linux kernel impact on PostgreSQL performance
Hi,

I'm the chair for Linux Storage, Filesystem and Memory Management Summit 2014 (LSF/MM). A CFP was sent out last month (https://lwn.net/Articles/575681/) that you may have seen already. In recent years we have had at least one topic that was shared between all three tracks that was led by a person outside of the usual kernel development community. I am checking if the PostgreSQL community would be willing to volunteer someone to lead a topic discussing PostgreSQL performance with recent kernels, or to highlight regressions or future developments you feel are potentially a problem. With luck, someone suitable is already travelling to the Collaboration Summit (http://events.linuxfoundation.org/events/collaboration-summit) and it would not be too inconvenient to drop in for LSF/MM as well.

There are two reasons why I'm suggesting this. First, PostgreSQL was the basis of a test used to highlight a scheduler problem around kernel 3.6, but otherwise in my experience it is rare that PostgreSQL is part of a bug report. I am skeptical that this particular bug report was a typical use case for PostgreSQL (pgbench, read-only, many threads, very small in-memory database). I wonder why reports related to PostgreSQL are not more common. One assumption would be that PostgreSQL is perfectly happy with the current kernel behaviour, in which case our discussion here is done.

This brings me to the second reason -- there is evidence that the PostgreSQL community is not happy with the current direction of kernel development. The most obvious example is this thread
http://postgresql.1045698.n5.nabble.com/Why-we-are-going-to-have-to-go-DirectIO-td5781471.html
but I suspect there are others. The thread alleges that the kernel community is in the business of pushing hackish changes into the IO stack without much thought or testing, although the linked article describes a VM and not a storage problem. I'm not here to debate the kernel's regression testing or development methodology, but LSF/MM is one place where a large number of people involved with the IO layers will be attending. If you have a concrete complaint, then here is a soap box.

Does the PostgreSQL community have a problem with recent kernels, particularly with respect to the storage, filesystem or memory management layers? If yes, do you have some data that can highlight this, and can you volunteer someone to represent your interests to the kernel community? Are current developments in the IO layer counter to the PostgreSQL requirements? If so, what developments, why are they a problem, and do you have a suggested alternative or some idea of what we should watch out for?

The track topic would be up to you, but just as a hint, we'd need something a lot more concrete than "you should test more".

-- 
Mel Gorman
SUSE Labs
Re: [HACKERS] GIN improvements part 1: additional information
On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra t...@fuzzy.cz wrote:
> On 8.1.2014 22:58, Alexander Korotkov wrote:
>> Thanks for reporting. Fixed version is attached.
>
> I've tried to rerun the 'archie' benchmark with the current patch, and once again I got
>
>   PANIC: could not split GIN page, didn't fit
>
> I reran it with '--enable-cassert' and with that I got
>
>   TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0),
>   File: "gindatapage.c", Line: 149)
>   LOG: server process (PID 5364) was terminated by signal 6: Aborted
>   DETAIL: Failed process was running: INSERT INTO messages ...
>
> so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily
> reproduce it, but my knowledge in this area is rather limited so I'm not entirely sure what
> to look for.

I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows.

Operations time:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:47.53915
 index_build_recovery  | 00:00:04
 index_update          | 00:05:24.388163
 index_update_recovery | 00:00:53
 search_new            | 00:24:02.289384
 search_updated        | 00:27:09.193343
(6 rows)

Index sizes:

     label     |   size
---------------+-----------
 new           | 384761856
 after_updates | 667942912
(2 rows)

Also, I made the following changes in the algorithms:

- Now there is a limit on the number of uncompressed TIDs on a page. After reaching this limit, they are encoded regardless of whether they fit on the page. That seems to me more desirable behaviour, and it somewhat accelerates search speed. Before this change the times were the following:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:51.467888
 index_build_recovery  | 00:00:04
 index_update          | 00:05:03.315155
 index_update_recovery | 00:00:51
 search_new            | 00:24:43.194882
 search_updated        | 00:28:36.316784
(6 rows)

- Pages are not fully re-encoded if it's enough to re-encode just the last segment.

README is updated.

-- 
With best regards, Alexander Korotkov.

Attachment: gin-packed-postinglists-varbyte5.patch.gz (GNU Zip compressed data)
[HACKERS] KNN-GiST with recheck
Hackers!

This patch was split from thread:
http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=gon-...@mail.gmail.com

I've split it to a separate thread, because it's related to partial sort only conceptually, not technically. Also I renamed it to knn-gist-recheck from partial-knn as a more appropriate name. In the attached version the docs are updated. A possible weak point of this patch's design is that it fetches heap tuples from the GiST scan. However, I didn't receive any notes about its design, so I'm going to put it to the commitfest. Here goes a description of this patch, same as in the original thread.

KNN-GiST provides the ability to get ordered results from an index, but this order is based only on index information. For instance, a GiST index contains bounding rectangles for polygons, and we can't get the exact distance to a polygon from the index (a similar situation exists in PostGIS). In the attached patch, the GiST distance method can set a recheck flag (similar to the consistent method). This flag means that the distance method returned a lower bound of the distance and we should recheck it from the heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
    circle(point(random(), random()), 0.0003 + random()*0.001)) as p
    from generate_series(1,1000000) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test
           order by p <-> point(0.5,0.5) limit 10;

   id   |       ?column?
--------+----------------------
 755611 | 0.000405855808916853
 807562 | 0.000464123777564343
 437778 | 0.000738524708741959
 947860 |  0.00076250998760724
 389843 | 0.000886362723569568
  17586 | 0.000981960100555216
 411329 |  0.00145338112316853
 894191 |  0.00149399559703506
 391907 |   0.0016647896049741
 235381 |  0.00167554614889509
(10 rows)

It's fast using just an index scan.

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.29..1.86 rows=10 width=36)
        (actual time=0.180..0.230 rows=10 loops=1)
   ->  Index Scan using test_idx on test
        (cost=0.29..157672.29 rows=1000000 width=36)
        (actual time=0.179..0.228 rows=10 loops=1)
         Order By: (p <-> '(0.5,0.5)'::point)
 Total runtime: 0.305 ms
(4 rows)

-- 
With best regards, Alexander Korotkov.

Attachment: knn-gist-recheck-1.patch.gz (GNU Zip compressed data)
Re: [HACKERS] [PATCH] Add transforms feature
On Fri, Jan 10, 2014 at 10:40 PM, Peter Eisentraut pete...@gmx.net wrote:
> On Wed, 2013-12-11 at 11:07 -0500, Tom Lane wrote:
>> We should have learned by now that those are usually a bad idea. In this case, we've got
>> changes in the behavior of function calling, which seems like not only a nightmare for
>> debugging but a fertile source of security issues.
>
> I note that this is the same mechanism that we have elaborately designed for *avoiding*
> security issues from search_path.

And it works like crap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] nested hstore patch
On 01/13/2014 11:16 AM, Oleg Bartunov wrote:
> Andrew, did you run the perl script? Actually, I found that the operator table needs to be
> fixed.

No. My build machine doesn't actually have DBD::Pg installed. Can you send me a patch if you don't want to push it yourself, or maybe Erik can send a patch to adjust the table.

cheers

andrew
Re: [HACKERS] PoC: Partial sort
On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson andr...@proxel.se wrote:
> On 12/29/2013 08:24 AM, David Rowley wrote:
>> If it was possible to devise some way to reuse any previous tuplesortstate, perhaps just
>> inventing a reset method which clears out tuples, then we could see performance exceed the
>> standard seqscan -> sort. The code the way it is seems to look up the sort functions from
>> the syscache for each group then allocate some sort space, so quite a bit of time is also
>> spent in palloc0() and pfree().
>>
>> If it was not possible to do this then maybe adding a cost to the number of sort groups
>> would be better, so that the optimization is skipped if there are too many sort groups.
>
> It should be possible. I have hacked a quick proof of concept for reusing the tuplesort
> state. Can you try it and see if the performance regression is fixed by this?
>
> One thing which has to be fixed with my patch is that we probably want to close the
> tuplesort once we have returned the last tuple from ExecSort().
>
> I have attached my patch and the incremental patch on Alexander's patch.

Thanks. It's included in the attached version of the patch, as well as estimation improvements, more comments, and a regression tests fix.

-- 
With best regards, Alexander Korotkov.

Attachment: partial-sort-5.patch.gz (GNU Zip compressed data)
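[Editorial note: for context, the shape of query the patch targets — a sketch assuming an index on the leading ORDER BY column only; names are illustrative.]

    CREATE TABLE t (a int, b int);
    CREATE INDEX t_a_idx ON t (a);

    -- Rows come back from the index already ordered by a, so a partial
    -- sort only needs to sort ties on a by b, group by group, instead
    -- of sorting the whole result set.
    SELECT * FROM t ORDER BY a, b LIMIT 10;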
Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication
On 12/17/2013 04:58 PM, Christian Kruse wrote:
> attached you will find a patch for showing the current transaction id (xid) and the xmin of
> a backend in pg_stat_activity and the xmin in pg_stat_replication.

Docs.

When an admin is looking for a long-running transaction that's blocking vacuum, he will currently rely on the timestamp fields, xact_start and query_start. I'm not sure how much extra value this adds over those timestamps in pg_stat_activity, but there are no such fields in pg_stat_replication, so that part is definitely useful. And if we're going to add xmin to pg_stat_replication, it makes sense to add it to pg_stat_activity too.

Unless someone can come up with something better to display for walsenders. The timestamp of the last commit record that's been replayed, perhaps? What else would a user want to do with these?

This definitely sounds useful to me as a developer, though. So I'm thinking we should add these for that reason, in any case.

- Heikki
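[Editorial note: a sketch of the admin use-case, assuming the patch exposes the snapshot horizon as a backend_xmin column; the column name here is an assumption, not necessarily what the patch uses.]

    -- Find the backends whose snapshots are most likely holding back
    -- vacuum, oldest xmin first.
    SELECT pid, usename, xact_start, age(backend_xmin) AS xmin_age
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY age(backend_xmin) DESC
    LIMIT 5;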
Re: [HACKERS] Standalone synchronous master
On 01/13/2014 04:12 PM, Florian Pflug wrote:
> On Jan12, 2014, at 04:18 , Josh Berkus j...@agliodbs.com wrote:
>> In other words, if we're going to have auto-degrade, the most intelligent place for it is
>> in RepMgr/HandyRep/OmniPITR/pgPoolII/whatever. It's also the *easiest* place. Anything we
>> do *inside* Postgres is going to have a really, really hard time determining when to
>> degrade.
>
> +1
>
> This is also how 2PC works, btw - the database provides the building blocks, i.e. PREPARE
> and COMMIT, and leaves it to a transaction manager to deal with issues that require a
> whole-cluster perspective.

++1

I like Simon's idea to have a pg_xxx function for switching between replication modes, which should be enough to support a monitor daemon doing the switching.

Maybe we could have a 'syncrep_taking_too_long_command' GUC which could be used to alert such a monitoring daemon, so it can immediately check whether to

a) switch the master to async rep or standalone mode (in the case of the sync slave becoming unavailable), or

b) fail over to the slave (in the almost equally likely case that it was the master which became disconnected from the world and the slave is available), or

c) do something else depending on circumstances/policy :)

NB! Note that in case b) 'syncrep_taking_too_long_command' will very likely also not reach the monitor daemon, so it cannot rely on this as the main trigger!

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Re: [HACKERS] Standalone synchronous master
On 01/13/2014 10:12 AM, Hannu Krosing wrote:
>> In other words, if we're going to have auto-degrade, the most intelligent place for it is
>> in RepMgr/HandyRep/OmniPITR/pgPoolII/whatever. It's also the *easiest* place. Anything we
>> do *inside* Postgres is going to have a really, really hard time determining when to
>> degrade.
>
> ++1

+1

> I like Simon's idea to have a pg_xxx function for switching between replication modes,
> which should be enough to support a monitor daemon doing the switching.
>
> Maybe we could have a 'syncrep_taking_too_long_command' GUC which could be used to alert
> such a monitoring daemon, so it can immediately check whether to

I would think that would be a column in pg_stat_replication. Basically last_ack or something like that.

> a) switch the master to async rep or standalone mode (in the case of the sync slave
> becoming unavailable)

Yep.

> b) fail over to the slave (in the almost equally likely case that it was the master which
> became disconnected from the world and the slave is available)

I think this should be left to external tools.

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
"In a time of universal deceit - telling the truth is a revolutionary act.", George Orwell
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
Mel,

> I'm the chair for Linux Storage, Filesystem and Memory Management Summit 2014 (LSF/MM).
> [...] I am checking if the PostgreSQL community would be willing to volunteer someone to
> lead a topic discussing PostgreSQL performance with recent kernels or to highlight
> regressions or future developments you feel are potentially a problem. With luck someone
> suitable is already travelling to the collaboration summit
> (http://events.linuxfoundation.org/events/collaboration-summit) and it would not be too
> inconvenient to drop in for LSF/MM as well.

We can definitely get someone there. I'll certainly be there; I'm hoping to get someone who has closer involvement with our kernel interaction as well.

> [...] I wonder why reports related to PostgreSQL are not more common. One assumption would
> be that PostgreSQL is perfectly happy with the current kernel behaviour in which case our
> discussion here is done.

To be frank, it's because most people are still running on 2.6.19, and as a result are completely unaware of recent developments. Second, because there's no obvious place to complain to ... lkml doesn't welcome bug reports, and where else do you go?

> Does the PostgreSQL community have a problem with recent kernels, particularly with respect
> to the storage, filesystem or memory management layers? If yes, do you have some data that
> can highlight this and can you volunteer someone to represent your interests to the kernel
> community?

Yes, and yes.

> Are current developments in the IO layer counter to the PostgreSQL requirements? If so,
> what developments, why are they a problem, do you have a suggested alternative or some idea
> of what we should watch out for?

Mostly the issue is changes to the IO scheduler which improve one use case at the expense of others, or set defaults which emphasize desktop hardware over server hardware. What also came up with the recent change to LRU is that the Postgres community apparently has more experience than the Linux community with buffer-clearing algorithms, and we ought to share that.

> The track topic would be up to you but just as a hint, we'd need something a lot more
> concrete than "you should test more".

How about "don't add major IO behavior changes with no backwards-compatibility switches"? ;-)

Seriously, one thing I'd like to get out of Collab would be a reasonable regimen for testing database performance on Linux kernels.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
Josh Berkus j...@agliodbs.com wrote:
>> Does the PostgreSQL community have a problem with recent kernels, particularly with
>> respect to the storage, filesystem or memory management layers?
>
> How about "don't add major IO behavior changes with no backwards-compatibility switches"?
> ;-)

I notice, Josh, that you didn't mention the problems many people have run into with Transparent Huge Page defrag and with NUMA access. Is that because there *are* configuration options that allow people to get decent performance once the issue is diagnosed? It seems like maybe there could be a better way to give a heads-up on hazards in a new kernel to the database world, but I don't know quite what that would be. For all I know, it is already available if you know where to look.

> Seriously, one thing I'd like to get out of Collab would be a reasonable regimen for
> testing database performance on Linux kernels.

... or perhaps you figure this is what would bring such issues to the community's attention before people are bitten in production environments?

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Mon, Jan 13, 2014 at 12:23 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Exclusion constraints can be used to implement uniqueness checks with SP-GiST or GiST indexes. For example, if you want to enforce that there are no two tuples with the same x and y coordinates, ie. use a point as the key. You could add a b-tree index just to enforce the constraint, but it's better if you don't have to. In general, it's just always better if features don't have implementation-specific limitations like this. That seems rather narrow. Among other things, I worry about the baggage for users in documenting support for SP-GiST/GiST. We support it, but it only really works for the case where you're using exclusion constraints as unique constraints, something that might make sense in certain narrow contexts, contrary to our earlier general statement that a unique index should be preferred there. We catalog amcanunique methods as the way that we support unique indexes. I really do feel that that's the appropriate level to support the feature at, and I have not precluded other amcanunique implementations from doing the same, having documented the intended value locking interface/contract for the benefit of any future amcanunique AM author. It's ON DUPLICATE KEY, not ON OVERLAPPING KEY, or any other syntax suggestive of exclusion constraints and their arbitrary commutative operators. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/13/2014 10:51 AM, Kevin Grittner wrote: How about don't add major IO behavior changes with no backwards-compatibility switches? ;-) I notice, Josh, that you didn't mention the problems many people have run into with Transparent Huge Page defrag and with NUMA access. Is that because there *are* configuration options that allow people to get decent performance once the issue is diagnosed? It seems like maybe there could be a better way to give a heads-up on hazards in a new kernel to the database world, but I don't know quite what that would be. For all I know, it is already available if you know where to look. Well, it was the lack of sysctl options which took the 2Q change from annoyance to potential disaster. We can't ever get away from the possibility that the Postgres use-case might be the minority use-case, and we might have to use non-default options. It's when those options aren't present *at all* that we're stuck. However, I agree that a worthwhile thing to talk about is having some better channel to notify the Postgres (and other DB) communities about major changes to IO and Memory management. Wanna go to Collab? Seriously, one thing I'd like to get out of Collab would be a reasonable regimen for testing database performance on Linux kernels. ... or perhaps you figure this is what would bring such issues to the community's attention before people are bitten in production environments? That, too. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Show lossy heap block info in EXPLAIN ANALYZE for bitmap heap scan
On Thu, Jan 9, 2014 at 10:57 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I wrote: Robert Haas wrote: Hmm, fair point. But I'm still not convinced that we really need to add extra accounting for this. What's wrong with just reporting the number of exact and lossy pages? No. I intended to show the desired memory space for a TIDBitmap rather than the peak memory usage for that TIDBitmap. And I thought it'd be better for the latter to be displayed as additional information. However, I've removed the functionality for showing the desired memory space due to technical problems. Now I should probably remove the functionality for showing the peak memory usage too. Yes, as Andres mentioned, showing the peak memory usage is not a bad idea, I think. But I start to think it's not necessarily worth complicating the code ... If there are no objections of others, I'll remove extra accounting for showing the peak memory usage. Done. Please find attached a patch. Looks good to me, so committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ALTER SYSTEM SET command to change postgresql.conf parameters
On Thu, Jan 9, 2014 at 11:06 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jan 9, 2014 at 12:26 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Jan 6, 2014 at 11:37 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 7, 2014 at 12:52 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Jan 6, 2014 at 9:48 AM, Amit Kapila amit.kapil...@gmail.com wrote: Couldn't we also handle this by postponing FreeConfigVariables until after the if (error) block? Wouldn't doing it that way lead to a bigger memory leak if the error level is ERROR? Though the current fix can also leak memory, it will be just for ErrorConfFile_save. I think a similar case can happen for 'pre_value' in the current code as well; that's why I have fixed it in a similar way in the patch. I was assuming that error-recovery would reset the containing memory context, but I'm not sure what memory context we're executing in at this point. In the current code, the only time it can go to the error path with elevel as ERROR is during Postmaster startup (context == PGC_POSTMASTER), at which point it will anyway upgrade ERROR to FATAL, so it should not be a problem to move the FreeConfigVariables() call after the error block check. However, if in future someone adds any more ERRORs (the chances of which seem quite low), it could cause a leak; maybe that's why the original code was written that way. If you think it's better to fix this by moving FreeConfigVariables() after the error block check, then I can update the patch by doing so and incorporate the other change (directly use PG_AUTOCONF_FILENAME) suggested by you as well? Yeah, let's do it that way. Okay, done. The attached patch fixes both the display of the wrong file name and the usage of PG_AUTOCONF_FILENAME. Committed with a comment change. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
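For readers following along, the control flow agreed on above is roughly this (a simplified sketch, not the actual ProcessConfigFile() code in guc-file.l; the function shape and parameters are illustrative assumptions):

/*
 * Simplified sketch of the ordering fix discussed above; not the real
 * ProcessConfigFile() code.  The function shape and parameters are
 * illustrative assumptions only.
 */
static void
config_reload_sketch(ConfigVariable *head, bool error,
                     int elevel, const char *conffile)
{
    if (error)
    {
        /*
         * Report which file contained the error.  When elevel >= ERROR
         * this longjmps away; today that only happens at postmaster
         * startup, where ERROR is promoted to FATAL anyway, so the
         * not-yet-freed list cannot leak repeatedly.
         */
        ereport(elevel,
                (errmsg("configuration file \"%s\" contains errors",
                        conffile)));
    }

    /*
     * Moved after the error block: reached on the normal path and,
     * when elevel is below ERROR, on the error path too, so both
     * paths free the list.
     */
    FreeConfigVariables(head);
}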
Re: [HACKERS] Planning time in explain/explain analyze
On Thu, Jan 9, 2014 at 11:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Stark st...@mit.edu writes: On Thu, Jan 9, 2014 at 9:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: In short then, I think we should just add this to EXPLAIN and be done. -1 for sticking the info into PlannedStmt or anything like that. I'm confused. I thought I was arguing to support your suggestion that the initial planning store the time in the cached plan and explain should output the time the original planning took. Uh, no, wasn't my suggestion. Doesn't that design imply measuring *every* planning cycle, explain or no? I was thinking more of just putting the timing calls into explain.c. Currently the patch includes changes to prepare.c which is what seems odd to me. I think it'd be fine to say, hey, I can't give you the planning time in this EXPLAIN ANALYZE because I just used a cached plan and did not re-plan. But saying, hey, the planning time is $TINYVALUE, when what we really mean is that looking up the previously-cached plan took only that long, seems actively misleading to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
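Concretely, putting the timing calls into explain.c rather than prepare.c could look roughly like the sketch below. The INSTR_TIME macros, pg_plan_query(), and ExplainPropertyFloat() are real interfaces, but the placement and variable names are illustrative guesses, not the submitted patch; query, cursorOptions, params, and es are assumed to be in scope as in ExplainOneQuery()/ExplainOnePlan().

/*
 * Sketch: time only the planner call made on EXPLAIN's behalf, instead
 * of instrumenting every planning cycle.  Fragment only; placement is
 * illustrative and not the patch under discussion.
 */
instr_time   planstart, planduration;
PlannedStmt *plan;

INSTR_TIME_SET_CURRENT(planstart);
plan = pg_plan_query(query, cursorOptions, params);
INSTR_TIME_SET_CURRENT(planduration);
INSTR_TIME_SUBTRACT(planduration, planstart);

/* ... and when emitting the plan: */
ExplainPropertyFloat("Planning Time",
                     1000.0 * INSTR_TIME_GET_DOUBLE(planduration),
                     3, es);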
Re: [HACKERS] Planning time in explain/explain analyze
* Robert Haas (robertmh...@gmail.com) wrote: Currently the patch includes changes to prepare.c which is what seems odd to me. I think it'd be fine to say, hey, I can't give you the planning time in this EXPLAIN ANALYZE because I just used a cached plan and did not re-plan. But saying, hey, the planning time is $TINYVALUE, when what we really mean is that looking up the previously-cached plan took only that long, seems actively misleading to me. My thought, at least, was to always grab the planning time and then provide it for explain and/or explain analyze, and then for re-plan cases, indicate if a cached plan was returned, if a replan happened, and if a replan happened, what the old plan time and the new plan time were. I don't think it makes any sense to report on the time returned from pulling a previously-cached plan. I understand that it's not completely free to track the plan time for every query but I'm in the camp that says we need better metrics and information for 99% of what we do and I'd like to see us eventually able to track average plan time (maybe on a per-query basis...), average run-time, how many times we do a hashjoin, mergejoin, the number of records in/out of each, memory usage, etc., etc. I don't think we need per-tuple timing information. I certainly wouldn't want to try and collect all of this through shared memory or our existing stats collector. Thanks, Stephen
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner kgri...@ymail.com wrote: I notice, Josh, that you didn't mention the problems many people have run into with Transparent Huge Page defrag and with NUMA access. Amen to that. Actually, I think NUMA can be (mostly?) fixed by setting zone_reclaim_mode; is there some other problem besides that? The other thing that comes to mind is the kernel's caching behavior. We've talked a lot over the years about the difficulties of getting the kernel to write data out when we want it to and to not write data out when we don't want it to. When it writes data back to disk too aggressively, we get lousy throughput because the same page can get written more than once when caching it for longer would have allowed write-combining. When it doesn't write data to disk aggressively enough, we get huge latency spikes at checkpoint time when we call fsync() and the kernel says uh, what? you wanted that data *on the disk*? sorry boss! and then proceeds to destroy the world by starving the rest of the system for I/O for many seconds or minutes at a time. We've made some desultory attempts to use sync_file_range() to improve things here, but I'm not sure that's really the right tool, and if it is we don't know how to use it well enough to obtain consistent positive results. On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
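For concreteness, the sync_file_range() approach alluded to above looks roughly like this sketch of the general pattern (not code from the PostgreSQL tree; the 1MB chunk size and the error handling are arbitrary simplifications):

/*
 * Sketch of using sync_file_range() to trickle dirty data to disk ahead
 * of the checkpoint-time fsync(), so the final flush has less to do.
 * General pattern only, not PostgreSQL source.
 */
#define _GNU_SOURCE
#include <fcntl.h>

static int
trickle_writes(int fd, off_t file_size)
{
    const off_t chunk = 1024 * 1024;    /* 1MB at a time, arbitrary */
    off_t       off;

    for (off = 0; off < file_size; off += chunk)
    {
        /* start asynchronous writeback of this range */
        if (sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE) < 0)
            return -1;
    }

    /*
     * Later, at the durability point, wait for writeback to finish.
     * Note this still does not flush the disk's own write cache, one
     * reason it may not be "the right tool" as noted above.
     */
    return sync_file_range(fd, 0, file_size,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}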
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Mon, Jan 13, 2014 at 1:53 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Jan 13, 2014 at 12:23 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Exclusion constraints can be used to implement uniqueness checks with SP-GiST or GiST indexes. For example, if you want to enforce that there are no two tuples with the same x and y coordinates, ie. use a point as the key. You could add a b-tree index just to enforce the constraint, but it's better if you don't have to. In general, it's just always better if features don't have implementation-specific limitations like this. That seems rather narrow. Among other things, I worry about the baggage for users in documenting support for SP-GiST/GiST. We support it, but it only really works for the case where you're using exclusion constraints as unique constraints, something that might make sense in certain narrow contexts, contrary to our earlier general statement that a unique index should be preferred there. We catalog amcanunique methods as the way that we support unique indexes. I really do feel that that's the appropriate level to support the feature at, and I have not precluded other amcanunique implementations from doing the same, having documented the intended value locking interface/contract for the benefit of any future amcanunique AM author. It's ON DUPLICATE KEY, not ON OVERLAPPING KEY, or any other syntax suggestive of exclusion constraints and their arbitrary commutative operators. For what it's worth, I agree with Heikki. There's probably nothing sensible an upsert can do if it conflicts with more than one tuple, but if it conflicts with just exactly one, it oughta be OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com wrote: On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. There you have a much harder algorithmic problem. You can basically control duplication with fadvise and WONTNEED. The problem here is not the kernel and whether or not it allows postgres to be smart about it. The problem is... what kind of smarts (algorithm) to use. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
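The flag Claudio is referring to is posix_fadvise()'s POSIX_FADV_DONTNEED (there is no WONTNEED flag). A minimal sketch of the interaction, assuming plain file descriptors and 8kb blocks; PostgreSQL does not currently do this on every read:

/*
 * Sketch: drop the kernel's copy of a block after reading it into
 * shared_buffers, to limit double-buffering.  Illustrative only; as
 * noted downthread, re-populating the kernel cache later requires
 * real I/O, which is the policy problem.
 */
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

static ssize_t
read_and_drop(int fd, off_t offset, char *buf)
{
    ssize_t n = pread(fd, buf, BLCKSZ, offset);

    if (n == BLCKSZ)
        (void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_DONTNEED);

    return n;
}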
Re: [HACKERS] Planning time in explain/explain analyze
On 01/13/2014 09:06 PM, Stephen Frost wrote: My thought, at least, was to always grab the planning time and then provide it for explain and/or explain analyze, and then for re-plan cases, indicate if a cached plan was returned, if a replan happened, and if a replan happened, what the old plan time and the new plan time were. I like this solution due to its correctness and because all the information is included. But I also think my original patch was fine in how it, in its simplicity, solved the problem without adding any overhead in the non-EXPLAIN case. Either solution would be fine by me. If the consensus is that we want to always measure it I will look at implementing that instead. -- Andreas Karlsson -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Planning time in explain/explain analyze
Robert Haas robertmh...@gmail.com writes: On Thu, Jan 9, 2014 at 11:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Uh, no, wasn't my suggestion. Doesn't that design imply measuring *every* planning cycle, explain or no? I was thinking more of just putting the timing calls into explain.c. Currently the patch includes changes to prepare.c which is what seems odd to me. I think it'd be fine to say, hey, I can't give you the planning time in this EXPLAIN ANALYZE because I just used a cached plan and did not re-plan. But saying, hey, the planning time is $TINYVALUE, when what we really mean is that looking up the previously-cached plan took only that long, seems actively misleading to me. Meh. Why? This would only come into play for EXPLAIN EXECUTE stmtname. I don't think users would be surprised to see a report of minimal planning time for that. In fact, it might be a good thing, as it would make it easier to tell the difference between whether you were seeing a generic plan or a custom plan for the prepared statement. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/13/14, 2:19 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com wrote: On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. There you have a much harder algorithmic problem. You can basically control duplication with fadvise and WONTNEED. The problem here is not the kernel and whether or not it allows postgres to be smart about it. The problem is... what kind of smarts (algorithm) to use. Isn't this a fairly simple matter of when we read a page into shared buffers tell the kernel to forget that page? And a corollary to that for when we dump a page out of shared_buffers (here kernel, please put this back into your cache). -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, Jan 10, 2014 at 2:07 PM, Tom Lane t...@sss.pgh.pa.us wrote: Florian Pflug f...@phlo.org writes: On Jan10, 2014, at 19:08 , Tom Lane t...@sss.pgh.pa.us wrote: Although, having said that ... maybe build your own aggregate would be a reasonable suggestion for people who need this? I grant that it's going to be a minority requirement, maybe even a small minority requirement. People who have the chops to get this sort of thing right can probably manage a custom aggregate definition. So we'd put a footgun into the hands of people who don't know what they're doing, to be fired for performance's sake, and leave it to the people who know what they are doing to put the safety on? If I may put words in Kevin's mouth, I think his point is that having float8 sum() at all is a foot-gun, and that's hard to deny. You need to know how to use it safely. Yeah, but Florian's point is that not all foot-guns are created equal. The fact that we're walking around with a loaded BB-gun in our hip pocket is not a good reason to replace it with a howitzer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby j...@nasby.net wrote: On 1/13/14, 2:19 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com wrote: On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. There you have a much harder algorithmic problem. You can basically control duplication with fadvise and WONTNEED. The problem here is not the kernel and whether or not it allows postgres to be smart about it. The problem is... what kind of smarts (algorithm) to use. Isn't this a fairly simple matter of when we read a page into shared buffers tell the kernel to forget that page? And a corollary to that for when we dump a page out of shared_buffers (here kernel, please put this back into your cache). That's my point. In terms of kernel-postgres interaction, it's fairly simple. What's not so simple, is figuring out what policy to use. Remember, you cannot tell the kernel to put some page in its page cache without reading it or writing it. So, once you make the kernel forget a page, evicting it from shared buffers becomes quite expensive. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Planning time in explain/explain analyze
On Mon, Jan 13, 2014 at 3:23 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jan 9, 2014 at 11:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Uh, no, wasn't my suggestion. Doesn't that design imply measuring *every* planning cycle, explain or no? I was thinking more of just putting the timing calls into explain.c. Currently the patch includes changes to prepare.c which is what seems odd to me. I think it'd be fine to say, hey, I can't give you the planning time in this EXPLAIN ANALYZE because I just used a cached plan and did not re-plan. But saying, hey, the planning time is $TINYVALUE, when what we really mean is that looking up the previously-cached plan took only that long, seems actively misleading to me. Meh. Why? This would only come into play for EXPLAIN EXECUTE stmtname. I don't think users would be surprised to see a report of minimal planning time for that. In fact, it might be a good thing, as it would make it easier to tell the difference between whether you were seeing a generic plan or a custom plan for the prepared statement. It would also make it easier to be wrong. If you want to display that information explicitly, fine. But asking the user to use the elapsed time to guess whether or not we really planned anything is just going to confuse people who don't have enough experience with the system to know what the boundary is between the largest time that could be a cache lookup and the smallest time that could be real planning activity. And that means virtually everyone, me included. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/13/14, 2:27 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby j...@nasby.net wrote: On 1/13/14, 2:19 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com wrote: On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. There you have a much harder algorithmic problem. You can basically control duplication with fadvise and WONTNEED. The problem here is not the kernel and whether or not it allows postgres to be smart about it. The problem is... what kind of smarts (algorithm) to use. Isn't this a fairly simple matter of when we read a page into shared buffers tell the kernel to forget that page? And a corollary to that for when we dump a page out of shared_buffers (here kernel, please put this back into your cache). That's my point. In terms of kernel-postgres interaction, it's fairly simple. What's not so simple, is figuring out what policy to use. Remember, you cannot tell the kernel to put some page in its page cache without reading it or writing it. So, once you make the kernel forget a page, evicting it from shared buffers becomes quite expensive. Well, if we were to collaborate with the kernel community on this then presumably we can do better than that for eviction... even to the extent of here's some data from this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on this. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 5:32 PM, Jim Nasby j...@nasby.net wrote: That's my point. In terms of kernel-postgres interaction, it's fairly simple. What's not so simple, is figuring out what policy to use. Remember, you cannot tell the kernel to put some page in its page cache without reading it or writing it. So, once you make the kernel forget a page, evicting it from shared buffers becomes quite expensive. Well, if we were to collaborate with the kernel community on this then presumably we can do better than that for eviction... even to the extent of here's some data from this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on this. If I had a kernel developer hat, I'd put it on to say: I don't think allowing that last bit is wise for a kernel. It would violate oh-so-many separation rules and open an oh-so-big can-o-worms. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Planning time in explain/explain analyze
Robert Haas robertmh...@gmail.com writes: On Mon, Jan 13, 2014 at 3:23 PM, Tom Lane t...@sss.pgh.pa.us wrote: Meh. Why? This would only come into play for EXPLAIN EXECUTE stmtname. I don't think users would be surprised to see a report of minimal planning time for that. In fact, it might be a good thing, as it would make it easier to tell the difference between whether you were seeing a generic plan or a custom plan for the prepared statement. It would also make it easier to be wrong. If you want to display that information explicitly, fine. But asking the user to use the elapsed time to guess whether or not we really planned anything is just going to confuse people who don't have enough experience with the system to know what the boundary is between the largest time that could be a cache lookup and the smallest time that could be real planning activity. And that means virtually everyone, me included. If you're saying that you'd like EXPLAIN to explicitly mention whether the plan was cached or custom, I don't have any great complaint about that. I'm just not seeing how you arrive at the conclusion that we mustn't report the amount of time EXPLAIN spent to get the plan. If we do what you're proposing we'll just have a different set of confused users, who will be wondering how EXPLAIN could have managed to spend 100 msec planning something when the EXPLAIN only took 10 msec in toto according to psql. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On 2014-01-13 15:15:16 -0500, Robert Haas wrote: On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner kgri...@ymail.com wrote: I notice, Josh, that you didn't mention the problems many people have run into with Transparent Huge Page defrag and with NUMA access. Amen to that. Actually, I think NUMA can be (mostly?) fixed by setting zone_reclaim_mode; is there some other problem besides that? I think that fixes some of the worst instances, but I've seen machines spending horrible amounts of CPU (& bus) time in page reclaim nonetheless. If I analyzed it correctly it's in RAM > working set workloads where RAM is pretty large and most of it is used as page cache. The kernel ends up spending a huge percentage of time finding and potentially defragmenting pages when looking for victim buffers. On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. I've wondered before if there wouldn't be a chance for postgres to say my dear OS, that the file range 0-8192 of file x contains y, no need to reread and do that when we evict a page from s_b but I never dared to actually propose that to kernel people... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
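Spelled out, the call being wished for might look like the following. To be clear, this is entirely hypothetical: the name, signature, and semantics are invented here purely to make the idea concrete.

/*
 * HYPOTHETICAL: no such syscall exists.  This only sketches the
 * "file range 0-8192 of file x contains y" idea above: donate a clean,
 * known-current page image to the kernel page cache without forcing a
 * read or a write.
 */
int cache_donate(int fd, off_t offset, size_t len, const void *data);

/*
 * Intended use, when evicting a clean page from shared_buffers:
 *
 *     cache_donate(fd, blkno * (off_t) 8192, 8192, page_image);
 *
 * How the kernel could ever trust "data" to match what is on disk is
 * the can of worms that kept this from being proposed.
 */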
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila amit.kapil...@gmail.com wrote: Yes, currently this applies to update; what I have in mind is that in future if someone wants to use WAL compression for any other operation like 'full_page_writes', then it can be easily extendible. To be honest, I have not evaluated whether such a flag or compression would make sense for full page writes, but I think it should be possible while doing full page write (BkpBlock has RelFileNode) to check such a flag if it's present. Makes sense. The reason for adding the same chunk at the head of the list is that it uses the same technique as pglz_hist_add. Now in pglz, it will not have repeat steps from c~f, as it has the concept of good_match, which leads to getting this done in one go. That being said, I am really not sure how much real-world data falls in the above category and whether we should try to optimize based on the above example, but yes, it will save some CPU cycles in the current test we are using. In the Rabin algorithm, we shouldn't try to find a longer match. The match should end at the chunk end, period. Otherwise, you lose the shift-resistant property of the algorithm. But I do think there might be a bug here, which is that, unless I'm misinterpreting something, hp is NOT the end of the chunk. After calling pgrb_hash_init(), we've looked at the first FOUR bytes of the input. If we find that we have a zero hash value at that point, shouldn't the chunk size be 4, not 1? And similarly if we find it after sucking in one more byte, shouldn't the chunk size be 5, not 2? Right now, we're deciding where the chunks should end based on the data in the chunk plus the following 3 bytes, and that seems wonky. I would expect us to include all of those bytes in the chunk. It depends on how we define chunk; basically the chunk size will be based on the byte for which we consider the hindex. The hindex for any byte is calculated considering that byte and the following 3 bytes, so after calling pgrb_hash_init(), even though we have looked at 4 bytes, the hindex is still for the first byte and that's why it considers the chunk size as 1, not 4. Isn't it similar to how the current pglz works? Basically it also uses the next 4 bytes to calculate the index (pglz_hist_idx) but still does byte-by-byte comparison; here, if we try to map to Rabin's delta encoding, then the chunk size is always 1. I don't quite understand this. The point of the Rabin algorithm is to split the old tuple up into chunks and then search for those chunks in the new tuple. For example, suppose the old tuple is abcdefghijklmnopqrstuvwxyz. It might get split like this: abcdefg hijklmnopqrstuvw xyz. If any of those three chunks appear in the new tuple, then we'll use them for compression. If not, we'll just copy the literal bytes. If the chunks appear in the new tuple reordered or shifted or with stuff inserted between one chunk and the next, we'll still find them. Unless I'm confused, which is possible, what you're doing is essentially looking at the string and splitting it at those three places, but then recording the chunks as being three bytes shorter than they really are. I don't see how that can be right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
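To make the dispute concrete, here is a generic content-defined chunking sketch. It is not the patch's pgrb_* code; the window size, mask, and toy rolling hash are arbitrary illustrative choices. The point of contention above is visible in the boundary test: the chunk is taken to end at the byte whose trailing 4-byte window fired, so the whole window lands inside the chunk.

/* Generic content-defined chunking sketch (illustrative, not pgrb_*). */
#include <stddef.h>
#include <stdint.h>

#define WINDOW  4           /* rolling-hash window, as in the thread */
#define MASK    0x1F        /* ~1-in-32 boundary probability, arbitrary */

/* Record chunk boundaries in bounds[]; returns number of chunks found. */
static int
chunkify(const unsigned char *data, size_t len, size_t *bounds, int max)
{
    int       nchunks = 0;
    uint32_t  hash = 0;
    size_t    i;

    for (i = 0; i < len && nchunks < max; i++)
    {
        hash = (hash << 1) ^ data[i];           /* slide new byte in */
        if (i >= WINDOW)                        /* slide old byte out */
            hash ^= (uint32_t) data[i - WINDOW] << WINDOW;

        /*
         * Boundary test: the chunk ends AT byte i, so the last WINDOW
         * bytes (data[i-3..i]) are inside the chunk.  Ending it three
         * bytes earlier, as debated above, would make chunk boundaries
         * depend on bytes outside the chunk.
         */
        if (i + 1 >= WINDOW && (hash & MASK) == 0)
            bounds[nchunks++] = i + 1;          /* offset past chunk end */
    }
    return nchunks;
}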
Re: [HACKERS] Planning time in explain/explain analyze
On Mon, Jan 13, 2014 at 3:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Jan 13, 2014 at 3:23 PM, Tom Lane t...@sss.pgh.pa.us wrote: Meh. Why? This would only come into play for EXPLAIN EXECUTE stmtname. I don't think users would be surprised to see a report of minimal planning time for that. In fact, it might be a good thing, as it would make it easier to tell the difference between whether you were seeing a generic plan or a custom plan for the prepared statement. It would also make it easier to be wrong. If you want to display that information explicitly, fine. But asking the user to use the elapsed time to guess whether or not we really planned anything is just going to confuse people who don't have enough experience with the system to know what the boundary is between the largest time that could be a cache lookup and the smallest time that could be real planning activity. And that means virtually everyone, me included. If you're saying that you'd like EXPLAIN to explicitly mention whether the plan was cached or custom, I don't have any great complaint about that. I'm just not seeing how you arrive at the conclusion that we mustn't report the amount of time EXPLAIN spent to get the plan. If we do what you're proposing we'll just have a different set of confused users, who will be wondering how EXPLAIN could have managed to spend 100 msec planning something when the EXPLAIN only took 10 msec in toto according to psql. What I'm saying is that if EXPLAIN reports something that's labelled Planning Time, it should *be* the planning time, and not anything else. When we retrieve a plan from cache, it would be sensible not to report the planning time at all, and IMHO it would also be sensible to report the time it actually took to plan whenever we originally did it. But reporting a value that is not the planning time and calling it the planning time does not seem like a good idea to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas robertmh...@gmail.com wrote: For what it's worth, I agree with Heikki. There's probably nothing sensible an upsert can do if it conflicts with more than one tuple, but if it conflicts with just exactly one, it oughta be OK. If there is exactly one, *and* the existing value is exactly the same as the value proposed for insertion (or, I suppose, a subset of the existing value, but that's so narrow that it might as well not apply). In short, when you're using an exclusion constraint as a unique constraint. Which is very narrow indeed. Weighing the costs and the benefits, that seems like far more cost than benefit, before we even consider anything beyond simply explaining the applicability and limitations of upserting with exclusion constraints. It's generally far cleaner to define speculative insertion as something that happens with unique indexes only. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Partial sort
Hi Alexander, First, thanks a lot for working on this feature. This PostgreSQL shortcoming crops up all the time in web applications that implement paging by multiple sorted columns. I've been trying it out in a few situations. I implemented a new enable_partialsort GUC to make it easier to turn on/off, this way it's a lot easier to test. The attached patch applies on top of partial-sort-5.patch. I will spend more time reviewing the patch, but some of this planner code is over my head. If there's any way I can help to make sure this lands in the next version, let me know. The patch performs just as well as I would expect it to:

marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 9.830 ms
marti=# set enable_partialsort = off;
marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 1442.815 ms

A difference of almost 150x! There's a missed opportunity in that the code doesn't consider pushing new Sort steps into subplans. For example, if there's no index on language(name) then this query cannot take advantage of partial sorts:

marti=# explain select l.name, r.name from language l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
 Limit  (cost=123203.20..123205.70 rows=1000 width=32)
   ->  Sort  (cost=123203.20..126154.27 rows=1180430 width=32)
         Sort Key: l.name, r.name
         ->  Hash Join  (cost=229.47..58481.49 rows=1180430 width=32)
               Hash Cond: (r.language = l.id)
               ->  Seq Scan on release r  (cost=0.00..31040.10 rows=1232610 width=26)
               ->  Hash  (cost=131.43..131.43 rows=7843 width=14)
                     ->  Seq Scan on language l  (cost=0.00..131.43 rows=7843 width=14)

But because there are only so few languages, it would be a lot faster to sort languages in advance and then do a partial sort:

 Limit  (rows=1000 width=31)
   ->  Partial sort  (rows=1180881 width=31)
         Sort Key: l.name, r.name
         Presorted Key: l.name
         ->  Nested Loop  (rows=1180881 width=31)
               ->  Sort  (rows=7843 width=10)
                     Sort Key: name
                     ->  Seq Scan on language  (rows=7843 width=14)
               ->  Index Scan using release_language_idx on release r  (rows=11246 width=25)
                     Index Cond: (language = l.id)

Even an explicit sorted CTE cannot take advantage of partial sorts:

marti=# explain with sorted_lang as (select id, name from language order by name)
marti-# select l.name, r.name from sorted_lang l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
 Limit  (cost=3324368.83..3324371.33 rows=1000 width=240)
   CTE sorted_lang
     ->  Sort  (cost=638.76..658.37 rows=7843 width=14)
           Sort Key: language.name
           ->  Seq Scan on language  (cost=0.00..131.43 rows=7843 width=14)
   ->  Sort  (cost=3323710.46..3439436.82 rows=46290543 width=240)
         Sort Key: l.name, r.name
         ->  Merge Join  (cost=664.62..785649.92 rows=46290543 width=240)
               Merge Cond: (r.language = l.id)
               ->  Index Scan using release_language_idx on release r  (cost=0.43..87546.06 rows=1232610 width=26)
               ->  Sort  (cost=664.19..683.80 rows=7843 width=222)
                     Sort Key: l.id
                     ->  CTE Scan on sorted_lang l  (cost=0.00..156.86 rows=7843 width=222)

But even with these limitations, this will easily be the killer feature of the next release, for me at least.
Regards, Marti On Mon, Jan 13, 2014 at 8:01 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson andr...@proxel.se wrote: On 12/29/2013 08:24 AM, David Rowley wrote: If it was possible to devise some way to reuse any previous tuplesortstate perhaps just inventing a reset method which clears out tuples, then we could see performance exceed the standard seqscan -> sort. The code the way it is seems to lookup the sort functions from the syscache for each group then allocate some sort space, so quite a bit of time is also spent in palloc0() and pfree(). If it was not possible to do this then maybe adding a cost to the number of sort groups would be better so that the optimization is skipped if there are too many sort groups. It should be possible. I have hacked a quick proof of concept for reusing the tuplesort state. Can you try it and see if the performance regression is fixed by this? One thing which has to be fixed with my patch is that we probably want to close the tuplesort once we have returned the last tuple from ExecSort(). I have attached my patch and the incremental patch on Alexander's patch. Thanks. It's included into the attached version of the patch, as well as estimation improvements, more comments and a regression test fix. -- With best regards, Alexander Korotkov. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/13/14, 2:37 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:32 PM, Jim Nasby j...@nasby.net wrote: That's my point. In terms of kernel-postgres interaction, it's fairly simple. What's not so simple, is figuring out what policy to use. Remember, you cannot tell the kernel to put some page in its page cache without reading it or writing it. So, once you make the kernel forget a page, evicting it from shared buffers becomes quite expensive. Well, if we were to collaborate with the kernel community on this then presumably we can do better than that for eviction... even to the extent of here's some data from this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on this. If I had a kernel developer hat, I'd put it on to say: I don't think allowing that last bit is wise for a kernel. It would violate oh-so-many separation rules and open an oh-so-big can-o-worms. Yeah, if it were me I'd probably want to keep a hash of the page and its address and only accept putting a page back into the kernel if it matched my hash. Otherwise you'd just have to treat it as a write. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/13/2014 10:53 PM, Peter Geoghegan wrote: On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas robertmh...@gmail.com wrote: For what it's worth, I agree with Heikki. There's probably nothing sensible an upsert can do if it conflicts with more than one tuple, but if it conflicts with just exactly one, it oughta be OK. If there is exactly one, *and* the existing value is exactly the same as the value proposed for insertion (or, I suppose, a subset of the existing value, but that's so narrow that it might as well not apply). In short, when you're using an exclusion constraint as a unique constraint. Which is very narrow indeed. Weighing the costs and the benefits, that seems like far more cost than benefit, before we even consider anything beyond simply explaining the applicability and limitations of upserting with exclusion constraints. It's generally far cleaner to define speculative insertion as something that happens with unique indexes only. Well, even if you don't agree that locking all the conflicting rows for update is sensible, it's still perfectly sensible to return the rejected rows to the user. For example, you're inserting N rows, and if some of them violate a constraint, you still want to insert the non-conflicting rows instead of rolling back the whole transaction. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 2014-01-13 15:53:36 -0500, Trond Myklebust wrote: I've wondered before if there wouldn't be a chance for postgres to say my dear OS, that the file range 0-8192 of file x contains y, no need to reread and do that when we evict a page from s_b but I never dared to actually propose that to kernel people... O_DIRECT was specifically designed to solve the problem of double buffering between applications and the kernel. Why are you not able to use that in these situations? Because we like to let the OS handle part of postgres' caching. For one, it makes servers with several applications/databases much more realistic without seriously overallocating memory, for another it's a huge chunk of platform dependent code to get good performance everywhere. The above was explicitly not to avoid double buffering but to move a buffer away from postgres' own buffers to the kernel's buffers once it's not 100% clear we need it in buffers anymore. Part of the reason this is being discussed is because previously people suggested going the direct IO route and some people (most prominently J. Corbet in http://archives.postgresql.org/message-id/20131204083345.31c60dd1%40lwn.net ) and others disagreed because that goes the route of reinventing storage layers everywhere without improving the common codepaths. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 12:32 PM, Jim Nasby j...@nasby.net wrote: On 1/13/14, 2:27 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby j...@nasby.net wrote: On 1/13/14, 2:19 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com wrote: On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. There you have a much harder algorithmic problem. You can basically control duplication with fadvise and WONTNEED. The problem here is not the kernel and whether or not it allows postgres to be smart about it. The problem is... what kind of smarts (algorithm) to use. Isn't this a fairly simple matter of when we read a page into shared buffers tell the kernel to forget that page? And a corollary to that for when we dump a page out of shared_buffers (here kernel, please put this back into your cache). That's my point. In terms of kernel-postgres interaction, it's fairly simple. What's not so simple, is figuring out what policy to use. I think the above is pretty simple for both interaction (allow us to inject a clean page into the file page cache) and policy (forget it after you hand it to us, then remember it again when we hand it back to you clean). And I think it would pretty likely be an improvement over what we currently do. But I think it is probably the wrong way to get the improvement. I think the real problem is that we don't trust ourselves to manage more of the memory ourselves. As far as I know, we still don't have a publicly disclosable and readily reproducible test case for the reports of performance degradation when we have more than 8GB in shared_buffers. If we had one of those, we could likely reduce the double buffering problem by fixing our own scalability issues and therefore taking responsibility for more of the data ourselves. Remember, you cannot tell the kernel to put some page in its page cache without reading it or writing it. So, once you make the kernel forget a page, evicting it from shared buffers becomes quite expensive. Well, if we were to collaborate with the kernel community on this then presumably we can do better than that for eviction... even to the extent of here's some data from this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on this. Which, in the case of it being clean, amounts to Here is data we don't want in memory any more because we think it is cold. But we don't trust ourselves, so please hold on to it anyway. That might be a tough sell to the kernel people. Cheers, Jeff
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 3:53 PM, Trond Myklebust tron...@gmail.com wrote: O_DIRECT was specifically designed to solve the problem of double buffering between applications and the kernel. Why are you not able to use that in these situations? O_DIRECT was apparently designed by a deranged monkey on some serious mind-controlling substances. But don't take it from me, I have it on good authority: http://yarchive.net/comp/linux/o_direct.html One might even say the best authority. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
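Beyond the philosophical objections in that link, O_DIRECT is mechanically painful too: buffer address, file offset, and length must all satisfy device-dependent alignment. A minimal sketch of what a single direct read entails on Linux (4096 is a common alignment, assumed here for illustration only):

/*
 * Sketch of the alignment bookkeeping O_DIRECT imposes on Linux.
 * Illustrative only: 4096 is a common logical block size, but the
 * required alignment is device-dependent and must be discovered.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096

static ssize_t
direct_read(const char *path, off_t offset, size_t len, void **out)
{
    void   *buf;
    ssize_t n;
    int     fd;

    *out = NULL;

    /* offset and length must be block-aligned, or the read fails */
    if (offset % ALIGN != 0 || len % ALIGN != 0)
        return -1;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    /* ... and so must the buffer's address */
    if (posix_memalign(&buf, ALIGN, len) != 0)
    {
        close(fd);
        return -1;
    }

    n = pread(fd, buf, len, offset);
    close(fd);

    *out = buf;             /* caller frees; the page cache was bypassed */
    return n;
}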
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote: On 1/13/14, 2:27 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby j...@nasby.net wrote: On 1/13/14, 2:19 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com wrote: On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. There you have a much harder algorithmic problem. You can basically control duplication with fadvise and WONTNEED. The problem here is not the kernel and whether or not it allows postgres to be smart about it. The problem is... what kind of smarts (algorithm) to use. Isn't this a fairly simple matter of when we read a page into shared buffers tell the kernel to forget that page? And a corollary to that for when we dump a page out of shared_buffers (here kernel, please put this back into your cache). That's my point. In terms of kernel-postgres interaction, it's fairly simple. What's not so simple, is figuring out what policy to use. Remember, you cannot tell the kernel to put some page in its page cache without reading it or writing it. So, once you make the kernel forget a page, evicting it from shared buffers becomes quite expensive. Well, if we were to collaborate with the kernel community on this then presumably we can do better than that for eviction... even to the extent of here's some data from this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on this. This should be the madvise() interface (with MADV_WILLNEED and MADV_DONTNEED); is there something in that interface that is insufficient? James -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
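One concrete source of friction: madvise() operates on mappings, not file descriptors, so hinting an 8kb page that PostgreSQL accesses via plain read()/pread() would need its own mmap()/madvise()/munmap() round trip, roughly as sketched below (illustrative only; this per-page pattern is exactly what is reported to scale poorly in the reply further down):

/*
 * Sketch of per-page madvise() hinting.  Note also that for a
 * file-backed MAP_SHARED mapping, MADV_DONTNEED drops the mapping,
 * not necessarily the page-cache copy, which is part of why the
 * interface does not quite fit.
 */
#include <sys/mman.h>

#define BLCKSZ 8192

static int
hint_page_unneeded(int fd, off_t offset)
{
    void *p = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, offset);

    if (p == MAP_FAILED)
        return -1;

    (void) madvise(p, BLCKSZ, MADV_DONTNEED);
    return munmap(p, BLCKSZ);
}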
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Jan 13, 2014, at 15:40, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-13 15:15:16 -0500, Robert Haas wrote: On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner kgri...@ymail.com wrote: I notice, Josh, that you didn't mention the problems many people have run into with Transparent Huge Page defrag and with NUMA access. Amen to that. Actually, I think NUMA can be (mostly?) fixed by setting zone_reclaim_mode; is there some other problem besides that? I think that fixes some of the worst instances, but I've seen machines spending horrible amounts of CPU ( BUS) time in page reclaim nonetheless. If I analyzed it correctly it's in RAM working set workloads where RAM is pretty large and most of it is used as page cache. The kernel ends up spending a huge percentage of time finding and potentially defragmenting pages when looking for victim buffers. On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. I've wondered before if there wouldn't be a chance for postgres to say my dear OS, that the file range 0-8192 of file x contains y, no need to reread and do that when we evict a page from s_b but I never dared to actually propose that to kernel people... O_DIRECT was specifically designed to solve the problem of double buffering between applications and the kernel. Why are you not able to use that in these situations? Cheers, Trond -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
Josh Berkus j...@agliodbs.com wrote: Wanna go to Collab? I don't think that works out for me, but thanks for suggesting it. I'd be happy to brainstorm with anyone who does go about issues to discuss; although the ones I keep running into have already been mentioned. Regarding the problems others have mentioned, there are a few features that might be a very big plus for us. Additional ways of hinting pages might be very useful. If we had a way to specify how many dirty pages were cached in PostgreSQL, the OS would count those in its calculations for writing dirty pages, and we could avoid the write avalanche which is currently so tricky to avoid without causing repeated writes to the same page. Or perhaps instead a way to hint a page as dirty so that the OS could not only count those, but discard the obsolete data from its cache if it is not already dirty at the OS level, and lower the write priority if it is dirty (to improve the odds of collapsing multiple writes). If there was a way to use DONTNEED or something similar with the ability to rescind it if the page still happened to be in the OS cache, that might help for when we discard a still-clean page from our buffers. And I seem to have a vague memory of there being cases where the OS is first reading pages when we ask to write them, which seems like avoidable I/O. (I'm not sure about that one, though.) Also, something like THP support should really have sysctl support rather than requiring people to put echo commands into scripts and tie those into runlevel changes. That's pretty ugly for something which has turned out to be necessary so often. I don't get too excited about changes to the default schedulers -- it's been pretty widely known for a long time that DEADLINE or NOOP perform better than any alternatives for most database loads. Anyone with a job setting up Linux machines to be used for database servers should know to cover that. As long as those two don't get broken, I'm good. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 2014-01-13 12:34:35 -0800, James Bottomley wrote: On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote: Well, if we were to collaborate with the kernel community on this then presumably we can do better than that for eviction... even to the extent of here's some data from this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on this. This should be the madvise() interface (with MADV_WILLNEED and MADV_DONTNEED) is there something in that interface that is insufficient? For one, postgres doesn't use mmap for files (and can't without major new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has horrible consequences for performance/scalability - very quickly you contend on locks in the kernel. Also, that will mark that page dirty, which isn't what we want in this case. One major usecase is transplanting a page comming from postgres' buffers into the kernel's buffercache because the latter has a much better chance of properly allocating system resources across independent applications running. Oh, and the kernel's page-cache management while far from perfect, actually scales much better than postgres'. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On 1/13/14, 12:21 PM, Joshua D. Drake wrote: On 01/13/2014 10:12 AM, Hannu Krosing wrote: In other words, if we're going to have auto-degrade, the most intelligent place for it is in RepMgr/HandyRep/OmniPITR/pgPoolII/whatever. It's also the *easiest* place. Anything we do *inside* Postgres is going to have a really, really hard time determining when to degrade. +1 This is also how 2PC works, btw - the database provides the building blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager to deal with issues that require a whole-cluster perspective. ++1 +1 Josh, what do you think of the upthread idea of being able to recover in-progress transactions that are waiting when we turn off sync rep? I'm thinking that would be a very good feature to have... and it's not something you can easily do externally. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On 2014-01-13 15:14:21 -0600, Jim Nasby wrote: On 1/13/14, 12:21 PM, Joshua D. Drake wrote: On 01/13/2014 10:12 AM, Hannu Krosing wrote: In other words, if we're going to have auto-degrade, the most intelligent place for it is in RepMgr/HandyRep/OmniPITR/pgPoolII/whatever. It's also the *easiest* place. Anything we do *inside* Postgres is going to have a really, really hard time determining when to degrade. +1 This is also how 2PC works, btw - the database provides the building blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager to deal with issues that require a whole-cluster perspective. ++1 +1 Josh, what do you think of the upthread idea of being able to recover in-progress transactions that are waiting when we turn off sync rep? I'm thinking that would be a very good feature to have... and it's not something you can easily do externally. I think it'd be a fairly simple patch to re-check the state of syncrep config in SyncRepWaitForLsn(). Alternatively you can just write code that iterates over the procarray and sets each proc->syncRepState to SYNC_REP_WAIT_CANCELLED or such. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
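A rough sketch of the shape the first suggestion could take inside SyncRepWaitForLsn()'s wait loop. This leans on PostgreSQL internals (MyProc, SyncRepRequested(), the latch API) and elides the locking around syncRepState, so treat it as an illustration of the idea rather than a tested patch.

    for (;;)
    {
        /* A standby acknowledged our commit LSN: done waiting. */
        if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
            break;

        /*
         * Re-check the syncrep configuration.  If sync rep has been
         * switched off since we started waiting, stop waiting and let
         * the (already locally committed) transaction finish as if it
         * were asynchronous.
         */
        if (!SyncRepRequested())
        {
            MyProc->syncRepState = SYNC_REP_NOT_WAITING;
            break;
        }

        WaitLatch(&MyProc->procLatch,
                  WL_LATCH_SET | WL_POSTMASTER_DEATH, -1L);
        ResetLatch(&MyProc->procLatch);
    }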
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On 1/13/14, 10:40 AM, Merlin Moncure wrote: On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer cr...@2ndquadrant.com wrote: Implicit casts to text, anybody? This backward compatibility break orphaned the company I work for on 8.1 until last year and very nearly caused postgres to be summarily extirpated (only rescued at the last minute by my arrival). It cost hundreds of thousands of dollars to qualify a sprawling java code base so that it could be moved back into a supported version. Breaking compatibility sucks -- it hurts your users and costs people money. Hacking type casts may not have been a mistake, but the arbitrary introduction of the breakage certainly was. IMHO, I see no reason we need to plan on removing support. Presumably it's not that much burden on our codebase; it's only a PITA for users writing correct code. (It'd be very interesting to see how much user code would blow up if presented with anything other than 1 as the lower bound...) I'd be perfectly happy with an initdb option to allow for lower bound support if you wanted it, disabled by default. People that legitimately want/need non-1 lower bounds can set that up, but our general user population will end up protected from a class of heisenbugs. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote: For one, postgres doesn't use mmap for files (and can't without major new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has horrible consequences for performance/scalability - very quickly you contend on locks in the kernel. I may as well dump this in this thread. We've discussed this in person a few times, including at least once with Ted Ts'o when he visited Dublin last year. The fundamental conflict is that the kernel better understands the hardware and the other software using the same resources, while Postgres better understands its own access patterns. We need to either add interfaces so Postgres can teach the kernel what it needs to know about its access patterns, or add interfaces so Postgres can find out what it needs to know about the hardware context. The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. This is absolutely necessary to maintain consistency. Postgres would need to be able to mark pages as unflushable until some point in time in the future when the journal is flushed. We discussed various ways that interface could work, but it would be tricky to keep it low enough overhead to be workable. The less exciting, more conservative option would be to add kernel interfaces to teach Postgres about things like raid geometries. Then Postgres could use directio and decide to do prefetching based on the raid geometry, how much I/O bandwidth and how many IOPS are available, etc. Reimplementing i/o schedulers and all the rest of the work that the kernel provides inside Postgres just seems like something outside our competency and that none of us is really excited about doing. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
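A sketch of the conservative direction Greg mentions: bypassing the page cache with O_DIRECT and doing our own aligned reads. The 4096-byte alignment and the file path are assumptions (real code would query the device and handle errors properly); this just shows the mechanics that direct I/O imposes.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        ssize_t n;
        int fd = open("datafile", O_RDONLY | O_DIRECT);  /* stand-in path */
        if (fd < 0)
            return 1;
        /* O_DIRECT requires buffer, offset and length to be aligned. */
        if (posix_memalign(&buf, 4096, 8192) != 0)
            return 1;
        n = pread(fd, buf, 8192, 0);   /* reads straight from the device */
        (void) n;
        free(buf);
        close(fd);
        return 0;
    }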
Re: [HACKERS] Standalone synchronous master
On 01/13/2014 01:14 PM, Jim Nasby wrote: On 1/13/14, 12:21 PM, Joshua D. Drake wrote: On 01/13/2014 10:12 AM, Hannu Krosing wrote: In other words, if we're going to have auto-degrade, the most intelligent place for it is in RepMgr/HandyRep/OmniPITR/pgPoolII/whatever. It's also the *easiest* place. Anything we do *inside* Postgres is going to have a really, really hard time determining when to degrade. +1 This is also how 2PC works, btw - the database provides the building blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager to deal with issues that require a whole-cluster perspective. ++1 +1 Josh, what do you think of the upthread idea of being able to recover in-progress transactions that are waiting when we turn off sync rep? I'm thinking that would be a very good feature to have... and it's not something you can easily do externally. I think it is extremely valuable, else we have lost those transactions which is exactly what we don't want. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc "In a time of universal deceit - telling the truth is a revolutionary act." - George Orwell -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote: On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer cr...@2ndquadrant.com wrote: Implicit casts to text, anybody? This backward compatibility break orphaned the company I work for on 8.1 until last year and very nearly caused postgres to be summarily extirpated (only rescued at the last minute by my arrival). It cost hundreds of thousands of dollars to qualify a sprawling java code base so that it could be moved back into a supported version. Breaking compatibility sucks -- it hurts your users and costs people money. Hacking type casts may not have been a mistake, but the arbitrary introduction of the breakage certainly was. With utmost respect, it was not. Databases are no good if there are fixable things in them that cause them to produce incorrect results at random, as auto-casting to text did. This project has no deprecation policy, I believe it actually does, although it's not a formal, written policy. Would you like to help draft one up? and I'd argue we'd need one before considering breaking changes. For example, maybe we could pull out an occasional release for longer term support to help users that get caught out. But really, the better way to go IMNSHO is to take a hard line on compatibility issues pretty much always -- consider the case of libc and win32 api. Could you please help remind us what that was? There are certain limited exceptions to this rule -- for example security problems Probably not. or gross violations of the standard We're definitely there on lower bounds of arrays. The standard, for a wonder, is clear and unambiguous about them. Whether we should go there on the rest of our array implementation is a question for another thread. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.consistent_into
On 1/13/14, 1:44 AM, Pavel Stehule wrote: 2014/1/12 Florian Pflug f...@phlo.org On Jan12, 2014, at 22:37 , Pavel Stehule pavel.steh...@gmail.com wrote: There is a GUC for variable_conflict already too. In this case I would like to enable this functionality everywhere (it is a tool to simply eliminate some kinds of strange bugs), so it needs a GUC. We have had the GUC for plpgsql.variable_conflict for three years and I don't know about any problem. I must say I hate behaviour-changing GUCs with quite some passion. IMHO they tend to cause bugs, not avoid them, in the long run. The pattern usually is 1) Code gets written, depends on some particular set of settings to work correctly 2) Code gets reused, with little further testing since it's supposed to be battle-proven anyway. Settings get dropped. 3) Code blows up for those corner-cases where the settings actually matter. Debugging is hell, because you effectively have to go over the code line-by-line and check if it might be affected by some GUC or another. Only a few days ago I spent more than an hour tracking down a bug which, as it turned out, was caused by a regex which subtly changed its meaning depending on whether standard_conforming_strings is on or off. Some GUCs are unavoidable - standard_conforming_strings, for example, probably still was a good idea, since the alternative would have been to stick with the historical, non-standard behaviour forever. But in this case, my feeling is that the trouble such a GUC may cause outweighs the potential benefits. I'm all for having a directive like #consistent_into (though I feel that the name could convey the meaning better). If we *really* think that this ought to be the default from 9.4 onward, then we should *) Change it to always complain, except if the function explicitly specifies #consistent_into on or whatever. *) Have pg_dump add that to all plpgsql functions if the server version is 9.4 or whatever major release this ends up in That's all just my opinion of course. I am thinking a GUC and a plpgsql option can live together. If you want to emphasize some behaviour, then you can use the plpgsql option. On the other hand, I would like to use functionality that is safe without dirtying the source code with repeated options, but then I have to check (and accept the risk of) the GUC settings. One idea: a required GUC? It could be nice to have a way to ensure certain GUC settings, restoring those values or raising a warning otherwise. Back to the main topic. The described feature doesn't change the behaviour of the INTO clause. I can enable or disable this functionality, and well-written code should work without change (and without problems). When the check is disabled, execution is just less safe. So in this case, the impact of the GUC is significantly less than the issues you described. Does anybody know a use case where this check should be disabled? Probably we have different experiences with GUCs. I had a problem with standard_conforming_strings and the bytea format some years ago, and now I am preparing a document about required settings. But I can see (from my experience in the Czech area) more frequent problems related to effective_cache_size or from_collapse_limit and similar GUCs. These parameters are beyond the knowledge (and visibility) of the typical user. ISTM that in this case, it should be safe to make the new default behavior STRICT; if you forget to set the GUC to disable it then you'll get an error that points directly at the problem, at which point you'll go "Oh, yeah... I forgot to set X..."
Outside of the GUC, I believe the default should definitely be STRICT. If your app is relying on non-strict then you need to be made aware of that. We should be able to provide a DO block that will change this setting for every function you've got if someone isn't happy with STRICT mode. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On Mon, Jan 13, 2014 at 4:45 PM, David Fetter da...@fetter.org wrote: On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote: On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer cr...@2ndquadrant.com wrote: Implicit casts to text, anybody? This backward compatibility break orphaned the company I work for on 8.1 until last year and very nearly caused postgres to be summarily extirpated (only rescued at the last minute by my arrival). It cost hundreds of thousands of dollars to qualify a sprawling java code base so that it could be moved back into a supported version. Breaking compatibility sucks -- it hurts your users and costs people money. Hacking type casts may not have been a mistake, but the arbitrary introduction of the breakage certainly was. With utmost respect, it was not. Databases are no good if there are fixable things in them that cause them to produce incorrect results at random, as auto-casting to text did. With a precisely equal level of respect, that's a load of bunk. As has been discussed here many times in the past, those changes broke many applications that were just fine. Mine included, EnterpriseDB's included. It was designed to maximize rather than minimize breakage in ways that were completely unnecessary, and every time anyone dares to argue that the change was less than perfect in every respect, the responses evince a level of dubiousness normally reserved for the parole hearings of convicted murderers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
Everyone, I am looking for one or more hackers to go to Collab with me to discuss this. If you think that might be you, please let me know and I'll look for funding for your travel. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 1/12/14, 9:35 PM, Andreas Karlsson wrote: On 01/12/2014 11:20 PM, Peter Geoghegan wrote: On Sun, Jan 12, 2014 at 8:12 AM, Andreas Karlsson andr...@proxel.se wrote: On 01/11/2014 11:42 PM, Peter Geoghegan wrote: I recently suggested that rather than RETURNING REJECTS, we could have a REJECTING clause, which would see a DML statement project strictly the complement of what RETURNING projects in the same context. So perhaps you could also see what RETURNING would not have projected because a before row trigger returned NULL (i.e. when a before trigger indicates to not proceed with insertion). That is certainly more general, and so is perhaps preferable. It's also less verbose, and it seems less likely to matter that we'll need to make REJECTING a fully reserved keyword, as compared to REJECTS. (RETURNING is already a fully reserved keyword not described by the standard, so this makes a certain amount of sense to me). If nothing else, REJECTING is more terse than RETURNING REJECTS. I do not entirely understand what you are proposing here. Any example of how this would look compared to your RETURNING REJECTS proposal? It's very similar - REJECTING is a total generalization of what I already have. The difference is only that REJECTING is accepted in all contexts that RETURNING is, and not just with INSERT...ON DUPLICATE KEY LOCK FOR UPDATE. So you could potentially have REJECTING project the slot proposed for insertion on an UPDATE where RETURNING would not. If for example a BEFORE ROW trigger fired, and returned NULL, perhaps it'd then be possible to project the slot as it was before being passed to the trigger. Perhaps there is no real demand for that, but, as I said, from a usability perspective it may be easier to reason about a feature that projects strictly the complement of what RETURNING would project in the same context. So simply this? WITH rej AS ( INSERT INTO foo (a, b, c) VALUES (44, 1078, 'insert'), (55, 1088, 'insert') REJECTING a) UPDATE foo SET c = 'update' FROM rej WHERE foo.a = rej.a; Another question: have you given any thought to the case where you want to use both the successfully inserted tuples and the rejected ones in the CTE? Is that even something anyone would want? Would perhaps MERGE be more suited for that? Well, a common case for INSERT RETURNING is to get your set of surrogate keys back; so I think users would want the ability to RETURN what finally made it into the table. Also, if we want to support the case of identifying tuples where a BEFORE trigger disallowed the insert, we probably want to expose that that's why those tuples were rejected (as opposed to them being rejected due to a duplicate key violation). -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On Mon, Jan 13, 2014 at 3:45 PM, David Fetter da...@fetter.org wrote: On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote: This project has no deprecation policy, I believe it actually does, although it's not a formal, written policy. Would you like to help draft one up? Lack of a 'formal, written policy' is equivalent to 'no policy'. Regardless, the way things were done in the 7.x/8.x series may no longer apply today; the project has grown up and we need to be more serious about things, at least, IMNSHO. and I'd argue we'd need one before considering breaking changes. For example, maybe we could pull out an occasional release for longer term support to help users that get caught out. But really, the better way to go IMNSHO is to take a hard line on compatibility issues pretty much always -- consider the case of libc and win32 api. Could you please help remind us what that was? Let's take gets() for example. C11 finally ditched it 12 years (!) after it was formally deprecated in C99 and informally deprecated in endless man pages (don't use this!) for decades before that. And even then most compilers, at least the decent ones, should allow you to request previous standards for some time beyond that. The win32 API is also remarkably stable; ancient code written for it beyond the dim horizon of time will still compile and execute today. These are probably strong contenders for most popular APIs ever made -- see the connection? Now, comparing C APIs to an SQL implementation for deprecation purposes isn't quite apples to apples, but I'll stand by the analogy. or gross violations of the standard We're definitely there on lower bounds of arrays. The standard, for a wonder, is clear and unambiguous about them. Whether we should go there on the rest of our array implementation is a question for another thread. The SQL standard requests that standard syntax gives standard behavior. Alternate bounds is non-standard syntax giving non-standard behavior and is thus excepted. Naturally, non-standard syntax is dangerous because the standard may later implement it, in which case you then have a real problem (that may be the case here: I don't know). Our array implementation is a real mess on multiple levels, but at least it's an internally consistent mess. Maybe it really should be 'fixed', but not before the super un-fun discussion of how to ease the path for our hapless users happens first. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote: On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner kgri...@ymail.com wrote: I notice, Josh, that you didn't mention the problems many people have run into with Transparent Huge Page defrag and with NUMA access. Ok, there are at least three potential problems there that you may or may not have run into. First, THP when it was first introduced was a bit of a disaster. In 3.0, it was *very* heavy-handed and would trash the system reclaiming memory to satisfy an allocation. When it did this, it would also writeback a bunch of data and block on it to boot. It was not the smartest move of all time but was improved over time, and in some cases the patches were also backported by 3.0.101. This is a problem that should have been alleviated over time. The general symptoms of the problem would be massive stalls, and monitoring the /proc/PID/stack of interesting processes would show it to be somewhere in do_huge_pmd_anonymous_page -> alloc_pages_nodemask -> try_to_free_pages -> migrate_pages or something similar. You may have worked around it by disabling THP with a command line switch or /sys/kernel/mm/transparent_hugepage/enabled in the past. This is not meant to happen any more, or at least it has been a while since a bug was filed against me in this area. There are corner cases though. If the underlying filesystem is NFS, the problem might still be experienced. That is the simple case. You might have also hit the case where THPages filled with zeros did not use the zero page. That would have looked like a larger footprint than anticipated and led to another range of problems. This has also been addressed since, but maybe not recently enough. It's less likely this is your problem though, as I expect you actually use your buffers, not leave them filled with zeros. You mention NUMA, but that problem is trickier to figure out without more context. THP can cause unexpected interleaving between NUMA nodes. Memory that would have been local on a 4K page boundary becomes remote accesses when THP is enabled, and performance would be hit (maybe 3-5% depending on the machine). It's not the only possibility though. If memory was being used sparsely and THP was in use then the overall memory footprint may be higher than it should be. This potentially would cause allocations to spill over to remote nodes while kswapd wakes up to reclaim local memory. That would lead to weird buffer aging inversion problems. This is a hell of a lot of guessing though, and we'd need a better handle on the reproduction case to pin it down. Amen to that. Actually, I think NUMA can be (mostly?) fixed by setting zone_reclaim_mode; is there some other problem besides that? Really? zone_reclaim_mode is often a complete disaster unless the workload is partitioned to fit within NUMA nodes. On older kernels enabling it would sometimes cause massive stalls. I'm actually very surprised to hear it fixes anything and would be interested in hearing more about what sort of circumstances would convince you to enable that thing. The other thing that comes to mind is the kernel's caching behavior. We've talked a lot over the years about the difficulties of getting the kernel to write data out when we want it to and to not write data out when we don't want it to. Is sync_file_range() broken? When it writes data back to disk too aggressively, we get lousy throughput because the same page can get written more than once when caching it for longer would have allowed write-combining.
Do you think that is related to dirty_ratio or dirty_writeback_centisecs? If it's dirty_writeback_centisecs then that would be particularly tricky, because poor interactions there would basically come down to luck. When it doesn't write data to disk aggressively enough, we get huge latency spikes at checkpoint time when we call fsync() and the kernel says "uh, what? you wanted that data *on the disk*? sorry boss!" and then proceeds to destroy the world by starving the rest of the system for I/O for many seconds or minutes at a time. Ok, parts of that are somewhat expected. It *may* depend on the underlying filesystem. Some of them handle fsync better than others. If you are syncing the whole file though when you call fsync, then you are potentially burned by having to writeback dirty_ratio amounts of memory, which could take a substantial amount of time. We've made some desultory attempts to use sync_file_range() to improve things here, but I'm not sure that's really the right tool, and if it is we don't know how to use it well enough to obtain consistent positive results. That implies that sync_file_range() is broken in some fashion we (or at least I) are not aware of, and that needs kicking. On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow.
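For reference, the sync_file_range() interface under discussion, in the two-phase shape these attempts usually take: initiate writeback early without waiting, then force completion at checkpoint time instead of issuing one giant fsync(). The helper names and staging policy are invented; note also that the call only schedules data writeback -- it gives no metadata or journal durability guarantee, so it cannot replace the final fsync().

    #define _GNU_SOURCE
    #include <fcntl.h>

    static void
    stage_writeback(int fd, off_t offset, off_t nbytes)
    {
        /* Start I/O on dirty pages in the range; do not wait for it. */
        sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    static void
    complete_writeback(int fd, off_t offset, off_t nbytes)
    {
        /*
         * Wait for previously initiated writeback, write whatever is
         * still dirty, and wait again: the range's data is on disk
         * when this returns.
         */
        sync_file_range(fd, offset, nbytes,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
    }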
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby j...@nasby.net wrote: On 1/13/14, 2:19 PM, Claudio Freire wrote: On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com wrote: On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. There you have a much harder algorithmic problem. You can basically control duplication with fadvise and WONTNEED. The problem here is not the kernel and whether or not it allows postgres to be smart about it. The problem is... what kind of smarts (algorithm) to use. Isn't this a fairly simple matter of, when we read a page into shared buffers, telling the kernel to forget that page? And a corollary to that for when we dump a page out of shared_buffers (here, kernel, please put this back into your cache). That's my point. In terms of kernel-postgres interaction, it's fairly simple. What's not so simple, is figuring out what policy to use. Remember, you cannot tell the kernel to put some page in its page cache without reading it or writing it. So, once you make the kernel forget a page, evicting it from shared buffers becomes quite expensive. posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by forcing readahead. If you evict it prematurely then you do get kinda screwed, because you pay the IO cost to read it back in again even if you had enough memory to cache it. Maybe this is the type of kernel-postgres interaction that is annoying you. If you don't evict, the kernel eventually steps in and evicts the wrong thing. If you do evict and it was unnecessary, you pay an IO cost. That could be something we look at. There are cases buried deep in the VM where pages get shuffled to the end of the LRU and get tagged for reclaim as soon as possible. Maybe you need access to something like that via posix_fadvise to say "reclaim this page if you need memory, but leave it resident if there is no memory pressure" or something similar. Not exactly sure what that interface would look like or, offhand, how it could be reliably implemented. -- Mel Gorman SUSE Labs -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
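A sketch of the cost asymmetry Mel describes, using the real advice values; the helper names, and the idea of pairing them around shared_buffers eviction, are hypothetical. The missing piece he points at -- "reclaim under pressure, else keep resident" -- has no advice value today.

    #include <fcntl.h>

    /*
     * Cheap but irreversible: once the kernel forgets the page, getting
     * it back costs a real read even if memory stayed plentiful.
     */
    static void
    forget_block(int fd, off_t off, off_t len)
    {
        posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
    }

    /*
     * Forced readahead; this is the price paid whenever forget_block()
     * was called prematurely.
     */
    static void
    reload_block(int fd, off_t off, off_t len)
    {
        posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
    }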
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Well, even if you don't agree that locking all the conflicting rows for update is sensible, it's still perfectly sensible to return the rejected rows to the user. For example, you're inserting N rows, and if some of them violate a constraint, you still want to insert the non-conflicting rows instead of rolling back the whole transaction. Right, but with your approach, can you really be sure that you have the right rejecting tuple ctid (not reject)? In other words, as you wait for the exclusion constraint to conclusively indicate that there is a conflict, minutes may have passed in which time other conflicts may emerge in earlier unique indexes. Whereas with an approach where values are locked, you are guaranteed that earlier unique indexes have no conflicting values. Maintaining that property seems useful, since we check in a well-defined order, and we're still projecting a ctid. Unlike when row locking is involved, we can make no assumptions or generalizations around where conflicts will occur. Although that may also be a general concern with your approach when row locking, for multi-master replication use-cases. There may be some value in knowing it cannot have been earlier unique indexes (and so the existing values for those unique indexes in the locked row should stay the same - don't many conflict resolution policies work that way?). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 2014-01-13 14:19:56 -0800, James Bottomley wrote: Frequently mmap()/madvise()/munmap()ing 8kb chunks has horrible consequences for performance/scalability - very quickly you contend on locks in the kernel. Is this because of problems in the mmap_sem? It's been a while since I looked at it, but yes, mmap_sem was part of it. I also seem to recall the amount of IPIs increasing far too much for it to be practical, but I am not sure anymore. Also, that will mark that page dirty, which isn't what we want in this case. You mean madvise (page_addr)? It shouldn't ... the state of the dirty bit should only be updated by actual writes. Which MADV_ primitive is causing the dirty marking, because we might be able to fix it (unless there's some weird corner case I don't know about). Not the madvise() itself, but transplanting the buffer from postgres' buffers to the mmap() area of the underlying file would, right? We also do have a way of transplanting pages: it's called splice. How do the semantics of splice differ from what you need? Hm. I don't really see how splice would allow us to seed the kernel's pagecache with content *without* marking the page as dirty in the kernel. We don't need zero-copy IO here; the important thing is just to fill the pagecache with content without a) rereading the page from disk or b) marking the page as dirty. One major usecase is transplanting a page coming from postgres' buffers into the kernel's buffercache, because the latter has a much better chance of properly allocating system resources across independent applications running. If you want to share pages between the application and the page cache, the only known interface is mmap ... perhaps we can discuss how better to improve mmap for you? I think purely using mmap() is pretty unlikely to work out - there are just too many constraints about when a page is allowed to be written out (e.g. it's interlocked with postgres' write ahead log). I also think that for many practical purposes using mmap() would result in an absurd number of mappings or mapping way too huge areas; e.g. large btree indexes are usually accessed in a quite fragmented manner. Oh, and the kernel's page-cache management while far from perfect, actually scales much better than postgres'. Well, then, it sounds like the best way forward would be to get postgres to use the kernel page cache more efficiently. No arguments there, although working on postgres scalability is a good idea as well ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
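Since splice(2) keeps coming up, a minimal sketch of what it actually provides today: moving data between a file and a pipe without a user-space copy (the helper is hypothetical). Note that Andres' objection stands -- splicing *into* a file is a write, so the target pages end up dirty, which is the opposite of what seeding the cache with already-on-disk data needs.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int
    copy_range_via_splice(int src_fd, int dst_fd, loff_t off, size_t len)
    {
        int pipefd[2];
        loff_t in = off, out = off;

        if (pipe(pipefd) < 0)
            return -1;
        while (len > 0)
        {
            /* File -> pipe, then pipe -> file, with no user-space buffer. */
            ssize_t n = splice(src_fd, &in, pipefd[1], NULL, len,
                               SPLICE_F_MOVE);
            if (n <= 0)
                break;
            splice(pipefd[0], NULL, dst_fd, &out, (size_t) n,
                   SPLICE_F_MOVE);
            len -= (size_t) n;
        }
        close(pipefd[0]);
        close(pipefd[1]);
        return 0;
    }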
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 11:38:44PM +0100, Jan Kara wrote: On Mon 13-01-14 22:26:45, Mel Gorman wrote: The flipside is also meant to hold true. If you know data will be needed in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at the implementation it does a forced read-ahead on the range of pages of interest. It doesn't look like it would block. That's not quite true. POSIX_FADV_WILLNEED still needs to map logical file offsets to physical disk blocks and create IO requests. This happens synchronously. So if your disk is congested and relevant metadata is out of cache, or we simply run out of free IO requests, POSIX_FADV_WILLNEED can block for a significant amount of time. Umm, yes, you're right. It also potentially stalls allocating the pages up front even though it will only try and direct reclaim pages once. That can stall in some circumstances, particularly if there are a number of processes trying to reclaim memory. That kinda sucks though. One point of discussion would be to check if this is an interface that can be used and if so, is it required to never block and if so is there something we can do about it -- queue the IO asynchronously if you can but if the kernel would block then do not bother. That does mean that fadvise is not guaranteeing that the pages will be resident in the future but it was not the intent of the interface anyway. -- Mel Gorman SUSE Labs -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
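Until an interface with those semantics exists, the only user-space workaround is to issue the advice from a context that is allowed to block. A sketch using a detached thread per request follows (helper names hypothetical, no worker pool, minimal error handling -- a real backend would batch these).

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>

    struct fadvise_req { int fd; off_t off; off_t len; };

    static void *
    fadvise_worker(void *arg)
    {
        struct fadvise_req *r = arg;

        /* May block on metadata lookups or IO request allocation --
         * but only this helper thread stalls, not the caller. */
        posix_fadvise(r->fd, r->off, r->len, POSIX_FADV_WILLNEED);
        free(r);
        return NULL;
    }

    static void
    prefetch_async(int fd, off_t off, off_t len)
    {
        pthread_t tid;
        struct fadvise_req *r = malloc(sizeof(*r));

        if (r == NULL)
            return;
        r->fd = fd; r->off = off; r->len = len;
        if (pthread_create(&tid, NULL, fadvise_worker, r) == 0)
            pthread_detach(tid);
        else
            free(r);
    }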
Re: [HACKERS] Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Mon, Jan 13, 2014 at 2:20 PM, Jim Nasby j...@nasby.net wrote: Well, a common case for INSERT RETURNING is to get your set of surrogate keys back; so I think users would want the ability to RETURN what finally made it into the table. Your update can also have a RETURNING clause. I'm not necessarily that attached to fully generalizing RETURNING REJECTS as REJECTING. It was just an idea. When an insert is rejected and you lock a conflicting row, it hardly matters what your surrogate key might have been had that insert succeeded. To get the surrogate key when the upsert inserts, do a regular INSERT ... RETURNING ..., and break the work up into multiple commands. That will almost always be sufficient, because you'll almost always know ahead of time where the conflict might be (certainly, the MySQL feature mandates that you do know). Also, if we want to support the case of identifying tuples where a BEFORE trigger disallowed the insert, we probably want to expose that that's why those tuples were rejected (as opposed to them being rejected due to a duplicate key violation). The ctid *won't* indicate a specific rejecting row then, I guess, which will do it. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On Jan13, 2014, at 22:30 , Joshua D. Drake j...@commandprompt.com wrote: On 01/13/2014 01:14 PM, Jim Nasby wrote: On 1/13/14, 12:21 PM, Joshua D. Drake wrote: On 01/13/2014 10:12 AM, Hannu Krosing wrote: In other words, if we're going to have auto-degrade, the most intelligent place for it is in RepMgr/HandyRep/OmniPITR/pgPoolII/whatever. It's also the *easiest* place. Anything we do *inside* Postgres is going to have a really, really hard time determining when to degrade. +1 This is also how 2PC works, btw - the database provides the building blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager to deal with issues that require a whole-cluster perspective. ++1 +1 Josh, what do you think of the upthread idea of being able to recover in-progress transactions that are waiting when we turn off sync rep? I'm thinking that would be a very good feature to have... and it's not something you can easily do externally. I think it is extremely valuable, else we have lost those transactions which is exactly what we don't want. We *have* to "recover" waiting transactions upon switching off sync rep. A transaction that waits for a sync standby to respond has already committed locally (i.e., updated the clog), it just hasn't updated the proc array yet, and thus is still seen as in-progress by the rest of the system. But rolling back the transaction is nevertheless *impossible* at that point (except by PITR, hence the quotes around "recover"). So the only alternative to recovering them, i.e. having them abort their waiting, is to let them linger indefinitely, still holding their locks, preventing xmin from advancing, etc, until either the client disconnects or the server is restarted. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/13/2014 02:26 PM, Mel Gorman wrote: Really? zone_reclaim_mode is often a complete disaster unless the workload is partitioned to fit within NUMA nodes. On older kernels enabling it would sometimes cause massive stalls. I'm actually very surprised to hear it fixes anything and would be interested in hearing more about what sort of circumstnaces would convince you to enable that thing. So the problem with the default setting is that it pretty much isolates all FS cache for PostgreSQL to whichever socket the postmaster is running on, and makes the other FS cache unavailable. This means that, for example, if you have two memory banks, then only one of them is available for PostgreSQL filesystem caching ... essentially cutting your available cache in half. And however slow moving cached pages between memory banks is, it's an order of magnitude faster than moving them from disk. But this isn't how the NUMA stuff is configured; it seems to assume that it's less expensive to get pages from disk than to move them between banks, so whatever you've got cached on the other bank, it flushes it to disk as fast as possible. I understand the goal was to make memory usage local to the processors stuff was running on, but that includes an implicit assumption that no individual process will ever want more than one memory bank worth of cache. So disabling all of the NUMA optimizations is the way to go for any workload I personally deal with. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
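Short of disabling the NUMA optimizations wholesale, the usual mitigation for the single-bank effect Josh describes is to interleave the big shared region explicitly across nodes -- what numactl --interleave=all does for the whole process. A sketch using libnuma (the region size is a stand-in, and treating the allocation as the buffer pool is an assumption):

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        void *pool;
        size_t sz = (size_t) 1 << 30;   /* 1 GB, stand-in size */

        if (numa_available() < 0)
        {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        /* Pages are spread round-robin over all nodes, so no single
         * memory bank fills up with the entire buffer pool. */
        pool = numa_alloc_interleaved(sz);
        if (pool == NULL)
            return 1;
        /* ... use as the shared buffer pool ... */
        numa_free(pool, sz);
        return 0;
    }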
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On 01/14/2014 12:40 AM, Merlin Moncure wrote: On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer cr...@2ndquadrant.com wrote: Implicit casts to text, anybody? This backward compatibility break orphaned the company I work for on 8.1 until last year and very nearly caused postgres to be summarily extirpated (only rescued at the last minute by my arrival). You're far from the only one, too. Until last year I was still seeing people saying they can't upgrade because of this. OTOH, that was a sudden and drastic change, with no BC switch like the removal of implicit joins had, that affected wide swaths of code. Lets not do that again. Removal of lower bounds for arrays is unlikely to even get noticed by the vast majority of users, and can be done progressively with BC features. The real issue IMO is how to get those few to stop using it so it can be truly removed. Past experience has shown that people just turn the compatibility flag on and forget they're using the deprecated feature. If there are warnings they'll silence them in their application and still forget they're using the deprecated feature. If there are log messages, they'll probably turn logging detail down to hide them and STILL forget they're using the deprecated feature. Then whine about it three years later when it gets removed. So I guess the question is: Is it worth all that hassle to remove a misfeature you have to go out of your way to use? Is support for non-1 lower bounds stopping us from doing something useful and important? Or is it just an irritation that it exists? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On 13.1.2014 18:07, Alexander Korotkov wrote: On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra t...@fuzzy.cz wrote: On 8.1.2014 22:58, Alexander Korotkov wrote: Thanks for reporting. Fixed version is attached. I've tried to rerun the 'archie' benchmark with the current patch, and once again I got PANIC: could not split GIN page, didn't fit I reran it with '--enable-cassert' and with that I got TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0), File: gindatapage.c, Line: 149) LOG: server process (PID 5364) was terminated by signal 6: Aborted DETAIL: Failed process was running: INSERT INTO messages ... so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily reproduce it, but my knowledge in this area is rather limited so I'm not entirely sure what to look for. I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are so: OK, it seems the bug is gone. However now there's a memory leak somewhere. I'm loading pgsql mailing list archives (~600k messages) using this script https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py And after loading about 1/5 of the data, all the memory gets filled by the pgsql backends (loading the data in parallel) and the DB gets killed by the OOM killer. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.consistent_into
On Jan13, 2014, at 22:49 , Jim Nasby j...@nasby.net wrote: ISTM that in this case, it should be safe to make the new default behavior STRICT; if you forget to set the GUC to disable than you'll get an error that points directly at the problem, at which point you'll go Oh, yeah... I forgot to set X... What do you mean by STRICT? STRICT (which we already support) complains if the query doesn't return *exactly* one row. What Marko wants is to raise an error for a plain SELECT ... INTO if more than one row is returned, but to still convert zero rows to NULL. Outside of the GUC, I believe the default should definitely be STRICT. If your app is relying on non-strict then you need to be made aware of that. We should be able to provide a DO block that will change this setting for every function you've got if someone isn't happy with STRICT mode. If you mean that we should change SELECT ... INTO to always behave as if STRICT had been specified - why on earth would we want to do that? That would break perfectly fine code for no good reason whatsoever. In fact, after reading the documentation on SELECT ... INTO, I'm convinced that the whole consistent_into thing is a bad idea. The documentation states clearly that For INSERT/UPDATE/DELETE with RETURNING, PL/pgSQL reports an error for more than one returned row, even when STRICT is not specified. This is because there is no option such as ORDER BY with which to determine which affected row should be returned. It therefore isn't an oversight that SELECT ... INTO allows multiple result rows but INSERT/UPDATE/DELETE forbids them; it's been done that way on purpose and for a reason. We shouldn't be second-guessing ourselves by changing that later - not, at least, unless we have a *very* good reason for it. Which, AFAICS, we don't. (And yeah, personally I'd prefer if we'd complain about multiple rows. But it's IMHO just too late for that) best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.consistent_into
On 1/14/14, 12:41 AM, Florian Pflug wrote: In fact, after reading the documentation on SELECT ... INTO, I'm convinced the the whole consistent_into thing is a bad idea. The documentation states clearly that For INSERT/UPDATE/DELETE with RETURNING, PL/pgSQL reports an error for more than one returned row, even when STRICT is not specified. This is because there is no option such as ORDER BY with which to determine which affected row should be returned. It therefor isn't an oversight that SELECT ... INTO allows multiple result rows but INSERT/UPDATE/DELETE forbids them, it's been done that way on purpose and for a reason. Yeah, it does state that. But it's a BS reason. In addition to ORDER BY, SELECT also has a LIMIT which you can use to get the first row behaviour. There's no way to go to the more sane behaviour from what we have right now. When I've worked with PL/PgSQL, this has been a source of a few bugs that would have been noticed during testing if the behaviour of INTO wasn't as dangerous as it is right now. Yes, it breaks backwards compatibility, but that's why there's a nice GUC. If we're not going to scrap PL/PgSQL and start over again, we are going to have to do changes like this to make the language better. Also I think that out of all the things we could do to break backwards compatibility, this is closer to harmless than a pain in the butt. Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.consistent_into
On 01/13/2014 03:41 PM, Florian Pflug wrote: It therefor isn't an oversight that SELECT ... INTO allows multiple result rows but INSERT/UPDATE/DELETE forbids them, it's been done that way on purpose and for a reason. We shouldn't be second-guessing ourselves by changing that later - not, at least, unless we have a *very* good reason for it. Which, AFAICS, we don't. (And yeah, personally I'd prefer if we'd complain about multiple rows. But it's IMHO just too late for that) I *really* don't want to go through all my old code to find places where I used SELECT ... INTO just to pop off the first row and ignore the rest. I doubt anyone else does, either. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/13/2014 09:53 PM, Trond Myklebust wrote: On Jan 13, 2014, at 15:40, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-13 15:15:16 -0500, Robert Haas wrote: On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner kgri...@ymail.com wrote: I notice, Josh, that you didn't mention the problems many people have run into with Transparent Huge Page defrag and with NUMA access. Amen to that. Actually, I think NUMA can be (mostly?) fixed by setting zone_reclaim_mode; is there some other problem besides that? I think that fixes some of the worst instances, but I've seen machines spending horrible amounts of CPU (& bus) time in page reclaim nonetheless. If I analyzed it correctly it's in-RAM working set workloads where RAM is pretty large and most of it is used as page cache. The kernel ends up spending a huge percentage of time finding and potentially defragmenting pages when looking for victim buffers. On a related note, there's also the problem of double-buffering. When we read a page into shared_buffers, we leave a copy behind in the OS buffers, and similarly on write-out. It's very unclear what to do about this, since the kernel and PostgreSQL don't have intimate knowledge of what each other are doing, but it would be nice to solve somehow. I've wondered before if there wouldn't be a chance for postgres to say "my dear OS, the file range 0-8192 of file x contains y, no need to reread" and do that when we evict a page from s_b, but I never dared to actually propose that to kernel people... O_DIRECT was specifically designed to solve the problem of double buffering between applications and the kernel. Why are you not able to use that in these situations? What is asked for is the opposite of O_DIRECT - a write from a buffer inside PostgreSQL to the Linux *buffer cache*, telling Linux that it is the same as what is currently on disk, so it should never bother to write it back. This would avoid the current double-buffering between the PostgreSQL and Linux buffer caches while still making use of the Linux cache when possible. The use case is pages that PostgreSQL has moved into its buffer cache but which it has not modified. They will at some point be evicted from the PostgreSQL cache, but it is likely that they will still be needed sometime soon, so what is required is writing them back to the original file, only they should not really be written - or marked dirty for later writeback - at any level deeper than the Linux cache, as they *already* are on the disk. It is probably ok to put them in the LRU position as they are written out from PostgreSQL, though it may be better if we get some more control over where in the LRU order they would be placed. It may make sense to put them there based on when they were last read while residing inside the PostgreSQL cache. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On 01/14/2014 12:33 AM, Craig Ringer wrote: On 01/14/2014 12:40 AM, Merlin Moncure wrote: On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer cr...@2ndquadrant.com wrote: Implicit casts to text, anybody? This backward compatibility break orphaned the company I work for on 8.1 until last year and very nearly caused postgres to be summarily extirpated (only rescued at the last minute by my arrival). You're far from the only one, too. Until last year I was still seeing people saying they can't upgrade because of this. OTOH, that was a sudden and drastic change, with no BC switch like the removal of implicit joins had, that affected wide swaths of code. Lets not do that again. Removal of lower bounds for arrays is unlikely to even get noticed by the vast majority of users, and can be done progressively with BC features. The real issue IMO is how to get those few to stop using it so it can be truly removed. Past experience has shown that people just turn the compatibility flag on and forget they're using the deprecated feature. If there are warnings they'll silence them in their application and still forget they're using the deprecated feature. If there are log messages, they'll probably turn logging detail down to hide them and STILL forget they're using the deprecated feature. Then whine about it three years later when it gets removed. So I guess the question is: Is it worth all that hassle to remove a misfeature you have to go out of your way to use? Is support for non-1 lower bounds stopping us from doing something useful and important? Or is it just an irritation that it exists? Let's just add a user-defined operator for '[]' (weirdly positioned but 2-argument, almost infix :) ) and add that to JSON arrays to get 0-based ones into postgres ;) Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 1/13/14, 5:05 PM, Peter Geoghegan wrote: On Mon, Jan 13, 2014 at 2:20 PM, Jim Nasby j...@nasby.net wrote: Well, a common case for INSERT RETURNING is to get your set of surrogate keys back; so I think users would want the ability to RETURN what finally made it into the table. Your update can also have a RETURNING clause. I'm not necessarily that attached to fully generalizing RETURNING REJECTS as REJECTING. It was just an idea. When an insert is rejected and you lock a conflicting row, it hardly matters what your surrogate key might have been had that insert succeeded. To get the surrogate key when it upsert inserts, do a regular INSERTRETURNING..., and break the work up into multiple commands. That will almost always be sufficient, because you'll almost always know ahead of time where the conflict might be (certainly, the MySQL feature mandates that you do know). As long as there's a way to get back what was ultimately inserted or updated that'd work... there might be some cases where you'd actually want to know what the result of the REJECTING command was (ie: did the update do something fancy?). Actually, you'd also want to know if triggers did anything. So we definitely want to keep the existing RETURNING behavior (sorry, I don't know offhand if you've kept that or not). Also, if we want to support the case of identifying tuples where a BEFORE trigger disallowed the insert, we probably want to expose that that's why those tuples were rejected (as opposed to them being rejected due to a duplicate key violation). The ctid *won't* indicate a specific rejecting row then, I guess, which will do it. Yeah, the only other thing you might want is the name of the trigger that returned NULL... that would allow you to do something different based on which trigger it was. Part of me thinks that'd be useful... part of me thinks it's just a foot-gun... -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.consistent_into
On Jan14, 2014, at 00:52 , Marko Tiikkaja ma...@joh.to wrote: When I've worked with PL/PgSQL, this has been a source of a few bugs that would have been noticed during testing if the behaviour of INTO wasn't as dangerous as it is right now. The question is, how many bugs stemmed from wrong SQL queries, and what percentage of those would have been caught by this? The way I see it, there are thousands of ways to screw up a query, and having it return multiple rows instead of one is just one of them. Yes, it breaks backwards compatibility, but that's why there's a nice GUC. Which doesn't help, because the GUC isn't tied to the code. This *adds* an error case rather than removing one - now, instead of getting your code correct, you *also* have to get the GUC correct. If you even *know* that such a GUC exists. If we're not going to scrap PL/PgSQL and start over again, we are going to have to do changes like this to make the language better. Also I think that out of all the things we could do to break backwards compatibility, this is closer to harmless than a pain in the butt. I very strongly believe that languages don't get better by adding a thousand little knobs which subtly change semantics. Look at the mess that is PHP - we absolutely, certainly don't want to go there. The most important rule in language design is, in my opinion, "stick with your choices". C, C++ and Java all seem to follow this, and it's one of the reasons these languages are popular for big projects, I think. The way I see it, the only OK way to change existing behaviour is to have the concept of a language version, and force code to indicate the language version it expects. The important thing is that the language version is an attribute of code, not some global setting that you can change without ever looking at the code it'd affect. So if we really want to change this, I think we need to have a LANGUAGE_VERSION attribute on functions. Each time a major postgres release changes the behaviour of one of the procedural languages, we'd increment that language's version, and enable the old behaviour for all functions tagged with an older one. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] nested hstore patch
On Mon, January 13, 2014 18:30, Andrew Dunstan wrote:

On 01/13/2014 11:16 AM, Oleg Bartunov wrote: Andrew, did you run the perl script? Actually, I found that the operator table needs to be fixed.

No. My build machine doesn't actually have DBD::Pg installed. Can you send me a patch if you don't want to push it yourself, or maybe Erik can send a patch to adjust the table.

[ nested_hstore_and_jsonb-2.patch ] (CentOS 6.5, gcc 4.8.2.) The patch applies and compiles with warnings (see below). The opr_sanity test fails during make check; regression.diffs attached. Also attached are changes to hstore.sgml: fixes to the operator and function tables, plus some typo fixes.

Thanks, Erik Rijkers

make:

jsonfuncs.c: In function 'each_object_field_end_jsonb':
jsonfuncs.c:1328:7: warning: assignment from incompatible pointer type [enabled by default]
    val = DatumGetPointer(DirectFunctionCall1(jsonb_in, CStringGetDatum(cstr)));

jsonfuncs.c: In function 'elements_array_element_end_jsonb':
jsonfuncs.c:1530:8: warning: assignment from incompatible pointer type [enabled by default]
    jbval = DatumGetPointer(DirectFunctionCall1(jsonb_in, CStringGetDatum(cstr)));

make contrib:

hstore_io.c: In function 'array_to_hstore':
hstore_io.c:1694:30: warning: 'result' may be used uninitialized in this function [-Wmaybe-uninitialized]
    PG_RETURN_POINTER(hstoreDump(result));

[Attachment: regression.diffs]

[Attachment: diff against doc/src/sgml/hstore.sgml. The recoverable changes correct the #- and replace() examples to use text[] path literals such as '{a,b,c,0}' and '{b,d}' in place of '[a,b,c,0]' and '[b,d]', and fix the replace() example output to a=>1, b=>{c=>3, d=>1}; the remaining hunks adjust other operator and function table entries.]
Re: [HACKERS] plpgsql.consistent_into
On 1/13/14, 5:57 PM, Josh Berkus wrote:

On 01/13/2014 03:41 PM, Florian Pflug wrote: It therefore isn't an oversight that SELECT ... INTO allows multiple result rows while INSERT/UPDATE/DELETE forbids them; it's been done that way on purpose and for a reason. We shouldn't be second-guessing ourselves by changing that later, at least not unless we have a *very* good reason for it. Which, AFAICS, we don't. (And yeah, personally I'd prefer it if we complained about multiple rows. But it's IMHO just too late for that.)

I *really* don't want to go through all my old code to find places where I used SELECT ... INTO just to pop off the first row and ignored the rest. I doubt anyone else does, either.

Do you regularly have use cases where you actually want just one RANDOM row? I suspect the far more likely scenario is that people write code assuming they'll get only one row, and they'll end up with extremely hard-to-trace bugs if that assumption is ever wrong.

--
Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
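To make the failure mode concrete, a small illustration, assuming a hypothetical orders(customer_id int, total numeric) table:

DO $$
DECLARE
    v_total numeric;
BEGIN
    -- Non-strict INTO: if customer 42 has three orders, this silently
    -- assigns whichever row the executor happens to produce first.
    SELECT total INTO v_total FROM orders WHERE customer_id = 42;
    RAISE NOTICE 'total = %', v_total;

    -- If "just one row" really is the intent, say so explicitly:
    SELECT total INTO v_total
      FROM orders
     WHERE customer_id = 42
     ORDER BY total DESC
     LIMIT 1;
END;
$$;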
Re: [HACKERS] plpgsql.consistent_into
On 1/13/14, 6:16 PM, Florian Pflug wrote:

On Jan14, 2014, at 00:52 , Marko Tiikkaja ma...@joh.to wrote: When I've worked with PL/PgSQL, this has been a source of a few bugs that would have been noticed during testing if the behaviour of INTO wasn't as dangerous as it is right now.

The question is, how many bugs stemmed from wrong SQL queries, and what percentage of those would have been caught by this? The way I see it, there are thousands of ways to screw up a query, and having it return multiple rows instead of one is just one of them.

A query that's simply wrong is more likely to fail consistently. Non-strict use of INTO is going to fail in very subtle ways (unless you actually DO want just the first row, in which case you should explicitly use LIMIT 1).

So if we really want to change this, I think we need a LANGUAGE_VERSION attribute on functions. Each time a major postgres release changes the behaviour of one of the procedural languages, we'd increment that language's version and enable the old behaviour for all functions tagged with an older one.

I like that idea. It allows us to fix past decisions that were ill-considered without hosing all existing code.

BTW, have we always had support for STRICT, or was it added at some point? It's in 8.4, but I don't know how far back it goes. And if we've always had it, why on earth didn't we make STRICT the default behavior?

--
Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
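For reference, a minimal sketch of what STRICT buys you, against the same hypothetical orders table as above:

DO $$
DECLARE
    v_total numeric;
BEGIN
    -- STRICT turns the silent pick-the-first-row behaviour into an error.
    SELECT total INTO STRICT v_total FROM orders WHERE customer_id = 42;
    RAISE NOTICE 'total = %', v_total;
EXCEPTION
    WHEN no_data_found THEN
        RAISE NOTICE 'customer 42 has no orders';
    WHEN too_many_rows THEN
        RAISE NOTICE 'customer 42 has more than one order';
END;
$$;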
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 7:36 PM, Mel Gorman mgor...@suse.de wrote: That could be something we look at. There are cases buried deep in the VM where pages get shuffled to the end of the LRU and get tagged for reclaim as soon as possible. Maybe you need access to something like that via posix_fadvise, to say "reclaim this page if you need memory, but leave it resident if there is no memory pressure", or something similar. Not exactly sure what that interface would look like, or offhand how it could be reliably implemented.

I don't see a reason not to make this behavior the default for POSIX_FADV_DONTNEED.
Re: [HACKERS] Disallow arrays with non-standard lower bounds
On 1/13/14, 5:33 PM, Craig Ringer wrote: So I guess the question is: Is it worth all that hassle to remove a misfeature you have to go out of your way to use? Is support for non-1 lower bounds stopping us from doing something useful and important? Or is it just an irritation that it exists?

It's not an irritation to -hackers, but it is an irritation for anyone who cares about data quality, because you're forced to write all of your code to always look at array_lower(). Actually, now that I think about it, if you want to be really safe you would deliberately use a different lower bound yourself, so you're more likely to discover code that's broken. So it really is a big pain for users who know what's going on. And it will become a big pain for users who don't know, if they ever accidentally end up with arrays whose lower bound isn't 1. :)

--
Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
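A quick demonstration of the trap, using the array literal syntax that lets you spell out explicit bounds:

SELECT ('[0:2]={10,20,30}'::int[])[1];
-- returns 20: with a lower bound of 0, subscript 1 is the *second* element

SELECT arr[array_lower(arr, 1)] AS first_elem
  FROM (SELECT '[0:2]={10,20,30}'::int[] AS arr) s;
-- returns 10: robust code has to ask for the lower bound instead of assuming 1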