Re: [HACKERS] Writing new unit tests with PostgresNode
At 2016-02-22 07:45:37 +, e...@xs4all.nl wrote:
> I think what's needed is:
>
>     use 5.008_008;

That is meant to fail if the code is run on a version of Perl older than 5.8.8, not fail if the code uses language features not present in 5.8.8. It is little help for those trying to maintain backwards compatibility.

-- 
Abhijit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Writing new unit tests with PostgresNode
>> - All the core routines used should be compatible down to perl 5.8.8.
>
> Ugh. So not just Perl, ancient perl.
>
> I don't suppose Perl offers any kind of "compatible(5.8.8)" statement or
> anything? Do I have to compile a ten-year-old Perl and its dependencies
> to work on PostgreSQL tests?
>
> http://search.cpan.org/dist/Perl-MinimumVersion/lib/Perl/MinimumVersion.pm
> looks useful; do you think it's reasonable for code that passes that
> check to just be thrown at the buildfarm?

I think what's needed is:

    use 5.008_008;

( see: http://perldoc.perl.org/functions/use.html )
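For reference, a minimal sketch of the kind of check being suggested here, assuming the CPAN module Perl::MinimumVersion is installed (it is not part of core Perl), pointed at a hypothetical test file name:

```perl
# Report the minimum Perl version a test script appears to require,
# via the CPAN module Perl::MinimumVersion discussed in this thread.
# 't/001_sometest.pl' is a hypothetical file name.
use strict;
use warnings;
use Perl::MinimumVersion;

my $pmv = Perl::MinimumVersion->new('t/001_sometest.pl')
    or die "could not parse file";
print "minimum version: ", $pmv->minimum_version, "\n";

# By contrast, a pragma like the following only makes the script die
# on perls OLDER than 5.8.8; it does not detect use of newer language
# features, which is the limitation noted above:
#   use 5.008_008;
```

This is a static (PPI-based) analysis, so it catches newer syntax but not, for example, a core module routine that only grew a given behavior in a later release.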
Re: [HACKERS] Writing new unit tests with PostgresNode
On Mon, Feb 22, 2016 at 4:24 PM, Craig Ringer wrote:
> On 22 February 2016 at 14:30, Michael Paquier wrote:
>> - All the core routines used should be compatible down to perl 5.8.8.
>
> Ugh. So not just Perl, ancient perl.
>
> I don't suppose Perl offers any kind of "compatible(5.8.8)" statement or
> anything? Do I have to compile a ten-year-old Perl and its dependencies to
> work on PostgreSQL tests?
> http://search.cpan.org/dist/Perl-MinimumVersion/lib/Perl/MinimumVersion.pm
> looks useful; do you think it's reasonable for code that passes that check
> to just be thrown at the buildfarm?

No, I don't use those old versions :) Module::CoreList->first_release is one way to do that for a module:

$ perl -MModule::CoreList -e '
    print Module::CoreList->first_release('Test::More'), "\n";'
5.006002

Last time I was dealing with that I had as well a look at http://perldoc.perl.org/, which was quite helpful by browsing through each version.

>>> Sound about right? I can tidy that up a bit and turn it into a README
>>> and add a reference to that to the public tap docs to tell users where
>>> to go if they want to write more tests.
>>
>> Yes, please.
>
> Will do that now.

This is definitely independent from the efforts of the other patches.

>>> I don't know how many suites we'll want - whether it'll be desirable to
>>> have a few suites with lots of tests or to have lots of suites with just
>>> a few tests. I'm planning on starting by adding a suite named
>>> 'replication' and putting some tests for failover slots in there.
>>> Reasonable?
>>
>> It seems to me that the failover slot tests should be part of the
>> recovery test suite I have proposed already. Those are located in
>> src/test/recovery, introducing as well a set of routines in
>> PostgresNode.pm that allows one to pass options to PostgresNode::init
>> to enable archive or streaming on a given node. I would believe that
>> any replication suite is going to need that, and I have done a bunch
>> of legwork to make sure that this works on Windows as well.
>
> Not committed yet, I see. That's https://commitfest.postgresql.org/9/438/
> right?

Yeah... That's life.

> I'd rather not duplicate your work there, so I should build on that. Is
> there a public git tree for that?

The latest set of patches is here:
https://github.com/michaelpq/postgres/commits/recovery_regression

Those are the ones I sent a couple of days back on -hackers. Hopefully those will get integrated into the core code.

-- 
Michael
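As a rough illustration of what a replication test built on those proposed PostgresNode::init options might look like — the option and helper names here (allows_streaming, backup, init_from_backup, has_streaming) are guesses based on the description in this thread, not a committed API:

```perl
# Hypothetical sketch of a streaming-replication TAP test using the
# init() options described above. Assumes the in-tree PostgresNode and
# TestLib modules from a PostgreSQL source tree are on PERL5LIB, and
# that the recovery-suite patches are applied.
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Master initialized with streaming replication enabled.
my $master = get_new_node('master');
$master->init(allows_streaming => 1);
$master->start;

# Standby created from a base backup of the master.
$master->backup('my_backup');
my $standby = get_new_node('standby');
$standby->init_from_backup($master, 'my_backup', has_streaming => 1);
$standby->start;

# Write on the master, then check the row shows up on the standby.
$master->psql('postgres', 'CREATE TABLE t AS SELECT 1 AS x');
sleep 2;    # crude; a real test would poll for WAL replay instead
my $ret = $standby->psql('postgres', 'SELECT x FROM t');
is($ret, '1', 'row replicated to standby');
```

The fixed sleep is only for brevity; a robust version would compare the master's WAL location against the standby's replay location in a retry loop.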
Re: [HACKERS] Writing new unit tests with PostgresNode
On 22 February 2016 at 14:30, Michael Paquier wrote:

> +1. I will be happy to contribute into that.

Thanks. Hopefully what I wrote will be useful as a starting point. As I read through the modules I'll write some draft Perldoc too.

>> The docs at http://www.postgresql.org/docs/devel/static/regress-tap.html
>> mention the use of tap for client programs but don't have anything on
>> writing/using the framework. It doesn't seem to be very close to /
>> compatible with http://pgtap.org/ unless I'm missing something obvious.
>>
>> I want to add tests for failover slots but I'm not sure where to put
>> them or how to set up the test skeleton.
>
> src/test/modules

Could use a README to explain what the directory is for. It looks like it contains a bunch of the extensions that used to be in contrib/. So that'd be where I'd want to put any extension needed as part of failover slots testing, but not the Perl test control code for setting up the nodes, etc. This testing cannot usefully be done with pg_regress and the part that can be is already added to contrib/test_decoding.

> A couple of other things that I am aware of:
> - including strict and warnings is as well mandatory to ensure that
> test grammar is sane, and the tests written should have no warnings.

Yep, did that.

> - All the core routines used should be compatible down to perl 5.8.8.

Ugh. So not just Perl, ancient perl.

I don't suppose Perl offers any kind of "compatible(5.8.8)" statement or anything? Do I have to compile a ten-year-old Perl and its dependencies to work on PostgreSQL tests?

http://search.cpan.org/dist/Perl-MinimumVersion/lib/Perl/MinimumVersion.pm looks useful; do you think it's reasonable for code that passes that check to just be thrown at the buildfarm?

> - The code produced should be sanitized with perltidy before commit.

Makes sense. Keeps things consistent.

>> The suite should be added to src/test/Makefile's ALWAYS_SUBDIRS entry.
>
> SUBDIRS is more suited for tests that do not represent a security
> risk. The ssl suite is part of ALWAYS_SUBDIRS because it enforces the
> use of 127.0.0.1.

Ok, that makes sense.

>> Sound about right? I can tidy that up a bit and turn it into a README
>> and add a reference to that to the public tap docs to tell users where
>> to go if they want to write more tests.
>
> Yes, please.

Will do that now.

>> I don't know how many suites we'll want - whether it'll be desirable to
>> have a few suites with lots of tests or to have lots of suites with just
>> a few tests. I'm planning on starting by adding a suite named
>> 'replication' and putting some tests for failover slots in there.
>> Reasonable?
>
> It seems to me that the failover slot tests should be part of the
> recovery test suite I have proposed already. Those are located in
> src/test/recovery, introducing as well a set of routines in
> PostgresNode.pm that allows one to pass options to PostgresNode::init
> to enable archive or streaming on a given node. I would believe that
> any replication suite is going to need that, and I have done a bunch
> of legwork to make sure that this works on Windows as well.

Not committed yet, I see. That's https://commitfest.postgresql.org/9/438/ right?

I'd rather not duplicate your work there, so I should build on that. Is there a public git tree for that?

-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add generate_series(date,date) and generate_series(date,date,integer)
On Mon, Feb 22, 2016 at 6:52 AM, Tom Lane wrote:
> Oooh ... actually, that works today if you consider the SRF-in-targetlist
> case:
>
> regression=# select generate_series(now(), 'infinity', '1 day') limit 10;
>        generate_series
> -------------------------------
>  2016-02-21 16:51:03.303064-05
>  2016-02-22 16:51:03.303064-05
>  2016-02-23 16:51:03.303064-05
>  2016-02-24 16:51:03.303064-05
>  2016-02-25 16:51:03.303064-05
>  2016-02-26 16:51:03.303064-05
>  2016-02-27 16:51:03.303064-05
>  2016-02-28 16:51:03.303064-05
>  2016-02-29 16:51:03.303064-05
>  2016-03-01 16:51:03.303064-05
> (10 rows)
>
> Time: 8.457 ms
>
> Given that counterexample, I think we not only shouldn't back-patch such a
> change but should reject it altogether.

Ouch, good point. The overflows are a different problem that we had better address though (still on my own TODO list)...

-- 
Michael
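Tom's counterexample could also be exercised from the TAP infrastructure discussed elsewhere in this digest; a hypothetical sketch (assumes the in-tree PostgresNode/TestLib modules from a PostgreSQL source tree):

```perl
# Probe the SRF-in-targetlist behaviour: LIMIT stops the otherwise
# endless series, which is what the counterexample above relies on.
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

my $node = get_new_node('master');
$node->init;
$node->start;

# Wrap the targetlist SRF in a subquery so we can count the rows that
# actually come out once LIMIT cuts the series off.
my $ret = $node->psql('postgres',
    "SELECT count(*) FROM (SELECT generate_series(now(), 'infinity', '1 day') LIMIT 10) s");
is($ret, '10', 'endless generate_series is usable under LIMIT');

$node->teardown_node;
```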
Re: [HACKERS] about google summer of code 2016
On 02/21/16 23:10, Tom Lane wrote:
> Another variable is that your answers might depend on what format you
> assume the client is trying to convert from/to. (It's presumably not
> text JSON, but then what is it?)

This connects tangentially to a question I've been meaning to ask for a while, since I was looking at the representation of XML.

As far as I can tell, XML is simply stored in its character-serialized representation (very likely compressed, if large enough to TOAST), and the text in/out methods simply deal in that representation. The 'binary' send/recv methods seem to differ only in possibly using a different character encoding on the wire.

Now, also as I understand it, there's no requirement that a type even /have/ binary send/recv methods. Text in/out it always needs, but send/recv only if they are interesting enough to buy you something. I'm not sure the XML send/recv really do buy anything. It is not as if they present the XML in any more structured or tokenized form. If they buy anything at all, it may be only an extra transcoding that the other end will probably immediately do in reverse.

So, if that's the situation, is there some other, really simple, choice for what XML send/recv might usefully do, that would buy more than what they do now? Well, PGLZ is in libpqcommon now, right? What if xml send wrote a flag to indicate compressed or not, and then if the value is compressed TOAST, streamed it right out as is, with no expansion on the server? I could see that being a worthwhile win, /without even having to devise some XML-specific encoding/. (XML has a big expansion ratio.)

And, since that idea is not inherently XML-specific ... does the JSONB representation have the same properties? How about even text or bytea?

The XML question has a related, JDBC-specific part. JDBC presents XML via interfaces that can deal in Source and Result objects, and these come in different flavors (DOMSource, an all-in-memory tree; SAXSource and StAXSource, both streaming tokenized forms; or StreamSource, a streaming, character-serialized form). Client code can ask for one of those forms explicitly, or use null to say it doesn't care. In the doesn't-care case, the driver is expected to choose the form closest to what it's got under the hood; the client can convert if necessary, and if it had any other preference, it would have said so.

For PGJDBC, that choice would naturally be the character StreamSource, because that /is/ the form it's got under the hood, but for reasons mysterious to me, pgjdbc actually chooses DOMSource in the don't-care case, and then expends the full effort of turning the serialized stream it does have into a full in-memory DOM that the client hasn't asked for and might not even want.

I know this is more a PGJDBC question, but I mention it here just because it's so much like the what-should-send/recv-do question, repeated at another level.

-Chap
Re: [HACKERS] Writing new unit tests with PostgresNode
On Mon, Feb 22, 2016 at 2:19 PM, Craig Ringer wrote:
> I've been taking a look at the Perl test infrastructure ( src/test/perl )
> for writing multi-node tests, starting with PostgresNode.pm and I have a
> few comments based on my first approach to the code "cold".
>
> I think a README in src/test/perl/ would be very helpful. It's currently
> not at all clear how to go about using the new test infrastructure when
> first looking at it, particularly adding new test suites. Also a summary
> of available test facilities and some short examples or pointers to where
> they can be found ( src/bin/, src/test/ssl, etc).

+1. I will be happy to contribute into that.

> The docs at http://www.postgresql.org/docs/devel/static/regress-tap.html
> mention the use of tap for client programs but don't have anything on
> writing/using the framework. It doesn't seem to be very close to /
> compatible with http://pgtap.org/ unless I'm missing something obvious.
>
> I want to add tests for failover slots but I'm not sure where to put them
> or how to set up the test skeleton. So I went looking, and I could use
> confirmation that the below is roughly right so I can write it up for
> some quickstart docs.

src/test/modules

> It *looks* like one needs to create a Makefile with:
>
> subdir = src/test/mytests
> top_builddir = ../../..
> include $(top_builddir)/src/Makefile.global
>
> check:
>     $(prove_check)
>
> clean:
>     rm -rf ./tmp_check

Yep.

> Then create a subdir src/test/mytests/t and in it put files named
> 001_sometest.pl etc. Each of which should import PostgresNode and
> generally then create new instances indirectly via the
> PostgresNode::get_new_node function rather than by using the constructor.
> Then init and start the server and run tests. Like:

A couple of other things that I am aware of:
- including strict and warnings is as well mandatory to ensure that
test grammar is sane, and the tests written should have no warnings.
- All the core routines used should be compatible down to perl 5.8.8.
- The code produced should be sanitized with perltidy before commit.

> use strict;
> use warnings;
> use PostgresNode;
> use TestLib;
> use Test::More tests => 1;
>
> diag 'Setting up node';
> my $node = get_new_node('master');
> $node->init;
> $node->start;
>
> my $ret = $node->psql('postgres', 'SELECT 1;');
> is($ret, '1', 'simple SELECT');
>
> $node->teardown_node;

Teardown is not mandatory at the end, though I would recommend having it, PostgresNode::DESTROY enforcing the nodes to stop at the end of a test, and the temporary paths are unconditionally removed (those should not be removed in case of a failure honestly).

> Knowledge of perl's Test::More is required since most of the suite is
> built on it.
>
> The suite should be added to src/test/Makefile's ALWAYS_SUBDIRS entry.

SUBDIRS is more suited for tests that do not represent a security risk. The ssl suite is part of ALWAYS_SUBDIRS because it enforces the use of 127.0.0.1.

> Sound about right? I can tidy that up a bit and turn it into a README and
> add a reference to that to the public tap docs to tell users where to go
> if they want to write more tests.

Yes, please.

> I don't know how many suites we'll want - whether it'll be desirable to
> have a few suites with lots of tests or to have lots of suites with just
> a few tests. I'm planning on starting by adding a suite named
> 'replication' and putting some tests for failover slots in there.
> Reasonable?

It seems to me that the failover slot tests should be part of the recovery test suite I have proposed already. Those are located in src/test/recovery, introducing as well a set of routines in PostgresNode.pm that allows one to pass options to PostgresNode::init to enable archive or streaming on a given node. I would believe that any replication suite is going to need that, and I have done a bunch of legwork to make sure that this works on Windows as well.

> (BTW, the ssl tests don't seem to use a bunch of the facilities provided
> by PostgresNode, instead rolling their own, so they don't serve as a good
> example.)

Yep.

-- 
Michael
Re: [HACKERS] proposal: make NOTIFY list de-duplication optional
On Sat, Feb 20, 2016 at 2:00 PM, Filip Rembiałkowski wrote:
> I was stuck because both syntaxes have their ugliness. NOTIFY allows the
> payload to be NULL:
>
> NOTIFY chan01;
>
> How would this look like in "never" mode?
>
> NOTIFY chan01, NULL, 'never'; -- seems very cryptic.

The docs say: "The information passed to the client for a notification event includes the notification channel name, the notifying session's server process PID, and the payload string, which is an empty string if it has not been specified."

So a missing payload is not a SQL NULL but an empty string. This means you would have:

NOTIFY chan01;
NOTIFY chan01, '';          -- same as above
NOTIFY chan01, '', 'maybe'; -- same as above
NOTIFY chan01, '', 'never'; -- send this all the time

Seems ok to me.
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Tue, Feb 16, 2016 at 12:12 PM, Noah Misch wrote:
> On Mon, Feb 15, 2016 at 06:07:48PM -0500, Tom Lane wrote:
>> Noah Misch writes:
>>> I configured a copy of animal "mandrill" that way and launched a test
>>> run. The postgres_fdw suite failed as attached. A manual "make -C
>>> contrib installcheck" fails the same way on a ppc64 GNU/Linux box, but
>>> it passes on x86_64 and aarch64. Since contrib test suites don't
>>> recognize TEMP_CONFIG, check-world passes everywhere.
>>
>> Hm, is this with or without the ppc-related atomics fix you just found?
>
> Without those. The ppc64 GNU/Linux configuration used gcc, though, and the
> atomics change affects xlC only. Also, the postgres_fdw behavior does not
> appear probabilistic; it failed twenty times in a row.

The postgres_fdw failure is a visibility-of-my-own-uncommitted-work problem. The first command in a transaction updates a row via an FDW, and then the SELECT expects to see the effects, but when run in a background worker it creates a new FDW connection that can't see the uncommitted UPDATE. I wonder if parallelism of queries involving an FDW should not be allowed if your transaction has written through the FDW.

-- 
Thomas Munro
http://www.enterprisedb.com
[HACKERS] Writing new unit tests with PostgresNode
Hi all

I've been taking a look at the Perl test infrastructure ( src/test/perl ) for writing multi-node tests, starting with PostgresNode.pm, and I have a few comments based on my first approach to the code "cold".

I think a README in src/test/perl/ would be very helpful. It's currently not at all clear how to go about using the new test infrastructure when first looking at it, particularly adding new test suites. Also a summary of available test facilities and some short examples or pointers to where they can be found ( src/bin/, src/test/ssl, etc).

The docs at http://www.postgresql.org/docs/devel/static/regress-tap.html mention the use of tap for client programs but don't have anything on writing/using the framework. It doesn't seem to be very close to / compatible with http://pgtap.org/ unless I'm missing something obvious.

I want to add tests for failover slots but I'm not sure where to put them or how to set up the test skeleton. So I went looking, and I could use confirmation that the below is roughly right so I can write it up for some quickstart docs.

It *looks* like one needs to create a Makefile with:

    subdir = src/test/mytests
    top_builddir = ../../..
    include $(top_builddir)/src/Makefile.global

    check:
            $(prove_check)

    clean:
            rm -rf ./tmp_check

Then create a subdir src/test/mytests/t and in it put files named 001_sometest.pl etc. Each of which should import PostgresNode and generally then create new instances indirectly via the PostgresNode::get_new_node function rather than by using the constructor. Then init and start the server and run tests. Like:

    use strict;
    use warnings;
    use PostgresNode;
    use TestLib;
    use Test::More tests => 1;

    diag 'Setting up node';
    my $node = get_new_node('master');
    $node->init;
    $node->start;

    my $ret = $node->psql('postgres', 'SELECT 1;');
    is($ret, '1', 'simple SELECT');

    $node->teardown_node;

Knowledge of perl's Test::More is required since most of the suite is built on it.

The suite should be added to src/test/Makefile's ALWAYS_SUBDIRS entry.

Sound about right? I can tidy that up a bit and turn it into a README and add a reference to that to the public tap docs to tell users where to go if they want to write more tests.

I don't know how many suites we'll want - whether it'll be desirable to have a few suites with lots of tests or to have lots of suites with just a few tests. I'm planning on starting by adding a suite named 'replication' and putting some tests for failover slots in there. Reasonable?

(BTW, the ssl tests don't seem to use a bunch of the facilities provided by PostgresNode, instead rolling their own, so they don't serve as a good example.)

-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Handling changes to default type transformations in PLs
> 4) Create a mechanism for specifying default TRANSFORMs for a PL, and
> essentially "solve" these issues by supplying a built-in transform.
>
> I think default transforms (4) are worth doing no matter what. Having to
> manually remember to add potentially multiple TRANSFORMs is a PITA. But
> I'm not sure TRANSFORMS would actually fix all issues. For example, you
> can't specify a transform for an array type, so this probably wouldn't
> work for one of the plpython problems.

Yesterday I wrote some docs about TRANSFORMs, and it is not possible. The transformation cannot be transparent to a PL function, because you have to know what your parameters are inside the function, and whether these parameters are original or transformed. This is not an issue for "smart" object types, but generally it cannot be ensured for basic types. So a default transformation is not a good idea. You can have a transform for the C language, and it is really different from Python's.

Maybe the concept of TRANSFORMs is too general, and some PL-specific extensions would be better. That needs some concept of persistent options that can be used in these extensions:

CREATE OR REPLACE FUNCTION ... LANGUAGE plpython WITH OPTIONS ('transform_dictionary', ...)

Execution should be stopped if any option is not processed. With such extensions you can do anything, and it can work (I hope). But the complexity of our PLs will be significantly higher, and then Tom's proposal may be better. It can help with faster adoption of new features, and it is a relatively simple solution.

Regards

Pavel
Re: [HACKERS] Handling changes to default type transformations in PLs
>> 3) Add the concept of PL API versions. This would allow users to specify
>
> So that leaves #3, which doesn't seem all that unreasonable from here.
> We don't have a problem with bundling a bunch of unrelated changes
> into any one major PG revision. The scripting languages we're talking
> about calling do similar things. So why not for the semantics of the
> glue layer?
>
> It seems like you really need to be able to specify this at the
> per-function level, which makes me think that specifying
> "LANGUAGE plpython_2" or "LANGUAGE plperl_3" might be the right
> kind of API.

I am not a big fan of this proposal. Users usually want to choose only some preferred features, and this design may have too little granularity.

Objections:

* Usually the keyword REVISION is used, so the syntax could be LANGUAGE plpython REVISION 3. It is more readable. You need to specify a preferred revision for any language, and the revision is persistent. The behavior is the same as in Tom's proposal, but I hope this is more readable and understandable.

regards

Pavel
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Feb 3, 2016 at 8:59 AM, Robert Haas wrote:
> On Tue, Feb 2, 2016 at 10:27 PM, Amit Kapila wrote:
>> So, let's leave adding any additional column, but Alexander has brought
>> up a good point about storing the wait_type and actual wait_event
>> information into four bytes. Currently I have stored wait_type (aka
>> classId) in first byte and then two bytes for wait_event (eventId) and
>> remaining one-byte can be used in future if required, however Alexander
>> is proposing to combine both these (classId and eventId) into two-bytes
>> which sounds reasonable to me apart from the fact that it might add
>> operation or two extra in this path. Do you or anyone else have any
>> preference over this point?
>
> I wouldn't bother tinkering with it at this point. The value isn't
> going to be recorded on disk anywhere, so it will be easy to change
> the way it's computed in the future if we ever need to do that.

Okay. Find the rebased patch attached with this mail. I have moved this patch to upcoming CF.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

extend_pg_stat_activity_v10.patch
Description: Binary data
Re: [HACKERS] How are CREATE EXTENSION ... VERSION or ALTER EXTENSION ... UPDATE TO ... intended to work?
On 02/16/16 22:44, Tom Lane wrote:
> Chapman Flack writes:
>> been several releases of an extension, and the extensions directory
>> is now populated with a bunch of files quux--1.0.sql, quux--1.1.sql,
>> quux--1.0--1.1.sql, quux--1.1--1.0.sql, ..., quux.control.
>> And somewhere in $libdir there are quux-1.0.so, quux-1.1.so.
>
> Well, at least so far as the existing extensions in contrib are concerned,
> there are *not* version numbers in the .so filenames. This means you
> can't have more than one version of the .so installed at once, but we've
> not really found a need for that. It's usually feasible --- and desirable
> --- to keep ABI compatibility to the extent that the new .so can be
> swapped in for the old without needing to change the SQL function
> definitions immediately.

It's true enough that in PL/Java's case, the ABIs / interfaces between the SQL function definitions and their implementations in the .so have been quite stable for years, so there might be no immediate problem.

On the other hand, other details of the implementation (bugs come to mind) do change ... letting a version-specific CREATE EXTENSION load an unversioned .so could lead to surprises in that area, because it would /appear to succeed/, meaning pg_extension.extversion would get changed to what you /thought/ you had loaded (and what you would probably answer, if asked "what version are you reporting this bug against?"), while what's actually running could be different, and if I squint I see diagnostic headaches lying not far around that bend.

> into the field. Avoiding MODULE_PATHNAME in favor of writing out
> the versioned .so name in the .sql files is probably the path of
> least resistance.

Agreed ... for once maybe I'll follow it.

-Chap
Re: [HACKERS] Handling changes to default type transformations in PLs
Jim Nasby writes:
> Some of our PLs have the unfortunate problem of making a weak effort
> with converting types to and from the PL and Postgres. For example,
> plpythonu will correctly transform a complex type to a dict and an array
> to a list, but it punts back to text for an array contained inside a
> complex type. I know plTCL has similar issues; presumably other PLs are
> affected as well.
>
> While it's a SMOC to fix this, what's not simple is the backwards
> compatibility: users that are currently using these types are expecting
> to be handed strings created by the type's output function, so we can't
> just drop these changes in without breaking user code.
>
> It might be possible to work around this with TRANSFORMs, but that's
> just ugly: first you'd have to write a bunch of non-trivial C code, then
> you'd need to forever remember to specify TRANSFORM FOR TYPE blah.
>
> Some ways to handle this:
>
> 1) Use a PL-specific GUC for each case where we need backwards
> compatibility. For plpython we'd need 2. plTCL would need 1 or 2.
>
> 2) Use a single all-or-nothing GUC. Downside is that if we later decide
> to expand automatic conversion again we'd need yet another GUC.
>
> 3) Add the concept of PL API versions. This would allow users to specify
> what range of API versions they support. I think this would have been
> helpful with the plpython elog() patch.
>
> 4) Create a mechanism for specifying default TRANSFORMs for a PL, and
> essentially "solve" these issues by supplying a built-in transform.
>
> I think default transforms (4) are worth doing no matter what. Having to
> manually remember to add potentially multiple TRANSFORMs is a PITA. But
> I'm not sure TRANSFORMS would actually fix all issues. For example, you
> can't specify a transform for an array type, so this probably wouldn't
> work for one of the plpython problems.
>
> 3 is interesting, but maybe it would be bad to tie multiple unrelated
> API changes together.
>
> So I'm leaning towards 1. It means potentially adding a fair number of
> new GUCs, but these would all be custom GUCs, so maybe it's not that
> bad. The other downside to GUCs is I think it'd be nice to be able to
> set this at a schema level, which you can't currently do with GUCs.
>
> Thoughts?

I think harsh experience has taught us to distrust GUCs that change code semantics. So I'm not very attracted by option #1, much less option #2.

I'm not sure about option #4 --- it smells like it would have the same problems as a GUC, namely that it would be action-at-a-distance on the semantics of a PL function's arguments, with insufficient ability to control the scope of the effects.

So that leaves #3, which doesn't seem all that unreasonable from here. We don't have a problem with bundling a bunch of unrelated changes into any one major PG revision. The scripting languages we're talking about calling do similar things. So why not for the semantics of the glue layer?

It seems like you really need to be able to specify this at the per-function level, which makes me think that specifying "LANGUAGE plpython_2" or "LANGUAGE plperl_3" might be the right kind of API.

regards, tom lane
Re: [HACKERS] about google summer of code 2016
Heikki Linnakangas writes:
> On 19/02/16 10:10, Álvaro Hernández Tortosa wrote:
>> Oleg and I discussed recently that a really good addition to a GSoC
>> item would be to study whether it's convenient to have a binary
>> serialization format for jsonb over the wire.
>
> Seems a bit risky for a GSoC project. We don't know if a different
> serialization format will be a win, or whether we want to do it in the
> end, until the benchmarking is done. It's also not clear what we're
> trying to achieve with the serialization format: smaller on-the-wire
> size, faster serialization in the server, faster parsing in the client,
> or what?

Another variable is that your answers might depend on what format you assume the client is trying to convert from/to. (It's presumably not text JSON, but then what is it?)

Having said that, I'm not sure that risk is a blocking factor here. History says that a large fraction of our GSoC projects don't result in a commit to core PG. As long as we're clear that "success" in this project isn't measured by getting a feature committed, it doesn't seem riskier than any other one. Maybe it's even less risky, because there's less of the success condition that's not under the GSoC student's control.

regards, tom lane
[HACKERS] Handling changes to default type transformations in PLs
Some of our PLs have the unfortunate problem of making a weak effort with converting types to and from the PL and Postgres. For example, plpythonu will correctly transform a complex type to a dict and an array to a list, but it punts back to text for an array contained inside a complex type. I know plTCL has similar issues; presumably other PLs are affected as well. While it's a SMOC to fix this, what's not simple is the backwards compatibility: users that are currently using these types are expecting to be handed strings created by the type's output function, so we can't just drop these changes in without breaking user code. It might be possible to work around this with TRANSFORMs, but that's just ugly: first you'd have to write a bunch of non-trivial C code, then you'd need to forever remember to specify TRANSFORM FOR TYPE blah. Some ways to handle this: 1) Use a PL-specific GUC for each case where we need backwards compatibility. For plpython we'd need 2. plTCL would need 1 or 2. 2) Use a single all-or-nothing GUC. Downside is that if we later decide to expand automatic conversion again we'd need yet another GUC. 3) Add the concept of PL API versions. This would allow users to specify what range of API versions they support. I think this would have been helpful with the plpython elog() patch. 4) Create a mechanism for specifying default TRANSFORMs for a PL, and essentially "solve" these issues by supplying a built-in transform. I think default transforms (4) are worth doing no matter what. Having to manually remember to add potentially multiple TRANSFORMs is a PITA. But I'm not sure TRANSFORMS would actually fix all issues. For example, you can't specify a transform for an array type, so this probably wouldn't work for one of the plpython problems. 3 is interesting, but maybe it would be bad to tie multiple unrelated API changes together. So I'm leaning towards 1.
It means potentially adding a fair number of new GUCs, but these would all be custom GUCs, so maybe it's not that bad. The other downside to GUCs is I think it'd be nice to be able to set this at a schema level, which you can't currently do with GUCs. Thoughts? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
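To make the punt-to-text problem above concrete, here is an illustrative Python sketch (not plpython itself; the parse_pg_array helper and the sample row are hypothetical): when a composite value reaches plpythonu today, an array member arrives as the text produced by the array type's output function (e.g. '{1,2,3}'), so user code has to parse that string by hand — and that is exactly the code that would break if the PL suddenly started handing over a real list, which is why the compatibility knobs are being debated.

```python
def parse_pg_array(text):
    """Hand-rolled parser for a one-dimensional integer array literal
    such as '{1,2,3}' -- the kind of string user code currently gets
    for an array nested inside a composite type."""
    stripped = text.strip()
    if not (stripped.startswith('{') and stripped.endswith('}')):
        raise ValueError('not an array literal: %r' % text)
    inner = stripped[1:-1]
    return [int(tok) for tok in inner.split(',')] if inner else []

# User code written against today's behavior: the composite arrives as a
# dict, but its array member is still the text output-function form.
row = {'id': 42, 'tags': '{1,2,3}'}
tags = parse_pg_array(row['tags'])
print(tags)
```

If the PL were changed to convert the nested array to a list automatically, parse_pg_array would start raising on its input — the breakage that options 1-4 above try to let users opt out of.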
Re: [HACKERS] psql metaqueries with \gexec
On 2/19/16 7:32 PM, Corey Huinker wrote: Wow, and I thought *I* liked metaprogramming! :) I like what you've proposed, though I am wondering if you considered doing something server-side instead? It seems a shame to do all this work and exclude all other tools. I frequently find myself creating a function that is just a wrapper on EXECUTE for this purpose, but obviously that has transactional limitations. FWIW, I also wish we had something better than format() for this stuff. I did create [1] towards that end, but it currently depends on some C code, which is cumbersome. I am not sure that "gexec" is the right name for this command. Others considered were \execute_each, \meta, \gmeta, \geach, as well as adding a "<" parameter to the \g command. \gexec sounds fine to me. I would think \g < would be something done at the shell level... [1] https://github.com/decibel/trunklet-format/blob/master/doc/trunklet-format.asc -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Allow to specify (auto-)vacuum cost limits relative to the database/cluster size?
On 1/12/16 6:42 AM, Andres Freund wrote: Somehow computing the speed in relation to the cluster/database size is probably possible, but I wonder how we can do so without constantly re-computing something relatively expensive? ISTM relpages would probably be good enough for this, if we take the extra step of getting actual relation size when relpages is 0. I'm not sure a straight scale factor is the way to go though... it seems that might be problematic? I think we'd at least want a minimum default value; you certainly don't want even a small system running vacuum at 1kB/s... -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] about google summer of code 2016
On 19/02/16 10:10, Álvaro Hernández Tortosa wrote: Oleg and I discussed recently that a really good addition to a GSoC item would be to study whether it's convenient to have a binary serialization format for jsonb over the wire. Some argue this should be benchmarked first. So the scope for this project would be to benchmark and analyze the potential improvements and then agree on which format jsonb could be serialized to (apart from the current on-disk format, there are many json or nested k-v formats that could be used for sending over the wire). Seems a bit risky for a GSoC project. We don't know if a different serialization format will be a win, or whether we want to do it in the end, until the benchmarking is done. It's also not clear what we're trying to achieve with the serialization format: smaller on-the-wire size, faster serialization in the server, faster parsing in the client, or what? - Heikki
Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data
On 21 February 2016 at 23:18, Thomas Munro wrote: > On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown wrote: >> On 3 February 2016 at 10:46, Thomas Munro >> wrote: >>> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote >>> wrote: There seems to be a copy-pasto there - shouldn't that be: + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush) >>> >>> Indeed, thanks! New patch attached. >> >> I've given this a test drive, and it works exactly as described. > > Thanks for trying it out! > >> But one thing which confuses me is when a standby, with causal_reads >> enabled, has just finished starting up. I can't connect to it >> because: >> >> FATAL: standby is not available for causal reads >> >> However, this is the same message when I'm successfully connected, but >> it's lagging, and the primary is still waiting for the standby to >> catch up: >> >> ERROR: standby is not available for causal reads >> >> What is the difference here? The problem being reported appears to be >> identical, but in the first case I can't connect, but in the second >> case I can (although I still can't issue queries). > > Right, you get the error at login before it has managed to connect to > the primary, and for a short time after while it's in 'joining' state, > or potentially longer if there is a backlog of WAL to apply. > > The reason is that when causal_reads = on in postgresql.conf (as > opposed to being set for an individual session or role), causal reads > logic is used for snapshots taken during authentication (in fact the > error is generated when trying to take a snapshot slightly before > authentication proper begins, in InitPostgres).
I think that's a > desirable feature: if you have causal reads on and you create/alter a > database/role (for example setting a new password) and commit, and > then you immediately try to connect to that database/role on a standby > where you have causal reads enabled system-wide, then you get the > causal reads guarantee during authentication: you either see the > effects of your earlier transaction or you see the error. As you have > discovered, there is a small window after a standby comes up where it > will show the error because it hasn't got a lease yet so it can't let > you log in yet because it could be seeing a stale catalog (your user > may not exist on the standby yet, or have been altered in some way, or > your database may not exist yet, etc). > > Does that make sense? Ah, all clear. Yes, that makes sense now. I've been trying to break it the past few days, and this was the only thing which I wasn't clear on. The parameters all work as described. The replay_lag is particularly cool. Didn't think it was possible to glean this information on the primary, but the timings are correct in my tests. +1 for this patch. Looks like this solves the problem that semi-synchronous replication tries to solve, although arguably in a more sensible way. Thom
Re: [HACKERS] Relaxing SSL key permission checks
Stephen Frost writes: > Just to be clear, I'm not really against this patch as-is, but it > shouldn't be a precedent or limit us from supporting more permissive > permissions in other areas (or even here) if there are sensible > use-cases for more permissive permissions. OK, and to be clear, I'm not against considering other use-cases and trying to do something appropriate for them. I just reject the idea that it's unnecessary or inappropriate for us to be concerned about whether secret-holding files are secure. regards, tom lane
Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data
On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown wrote: > On 3 February 2016 at 10:46, Thomas Munro > wrote: >> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote >> wrote: >>> There seems to be a copy-pasto there - shouldn't that be: >>> >>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush) >> >> Indeed, thanks! New patch attached. > > I've given this a test drive, and it works exactly as described. Thanks for trying it out! > But one thing which confuses me is when a standby, with causal_reads > enabled, has just finished starting up. I can't connect to it > because: > > FATAL: standby is not available for causal reads > > However, this is the same message when I'm successfully connected, but > it's lagging, and the primary is still waiting for the standby to > catch up: > > ERROR: standby is not available for causal reads > > What is the difference here? The problem being reported appears to be > identical, but in the first case I can't connect, but in the second > case I can (although I still can't issue queries). Right, you get the error at login before it has managed to connect to the primary, and for a short time after while it's in 'joining' state, or potentially longer if there is a backlog of WAL to apply. The reason is that when causal_reads = on in postgresql.conf (as opposed to being set for an individual session or role), causal reads logic is used for snapshots taken during authentication (in fact the error is generated when trying to take a snapshot slightly before authentication proper begins, in InitPostgres). I think that's a desirable feature: if you have causal reads on and you create/alter a database/role (for example setting a new password) and commit, and then you immediately try to connect to that database/role on a standby where you have causal reads enabled system-wide, then you get the causal reads guarantee during authentication: you either see the effects of your earlier transaction or you see the error.
As you have discovered, there is a small window after a standby comes up where it will show the error because it hasn't got a lease yet so it can't let you log in yet because it could be seeing a stale catalog (your user may not exist on the standby yet, or have been altered in some way, or your database may not exist yet, etc). Does that make sense? -- Thomas Munro http://www.enterprisedb.com
[HACKERS] significant semi join overestimates (with CTEs)
Hi, while investigating a poor query plan choice, I've noticed suspicious semi join estimates looking like this:

    ->  Nested Loop  (cost=35.52..1787.31 rows=281653470 width=566)
          Buffers: shared hit=9305
          ->  HashAggregate  (cost=34.81..36.81 rows=200 width=32)
                Group Key: data.column1
                ->  CTE Scan on data  (cost=0.00..30.94 rows=1547 width=32)
          ->  Index Scan using t_pkey on t  (cost=0.71..8.74 rows=1 ...)
                Index Cond: ((a = data.column1) AND (...))
                Buffers: shared hit=9305

Notice the cardinality estimates - we expect the outer relation to have 200 rows, and for each of the 200 rows 1 matching row in the inner rel. Yet we estimate the join cardinality to be ~280M rows, which is rather excessive (it even exceeds the cartesian product cardinality). So, how could this happen? Quite easily, actually ... Consider a simple table

    CREATE TABLE t (a INT PRIMARY KEY, ...);

and this query:

    WITH data AS (VALUES (1), (2), (3), ... list of more IDs ...)
    SELECT * FROM t WHERE a IN (SELECT column1 FROM data);

This is a simple semi join - the planner will do a HashAggregate on the CTE to find the list of distinct values, and then it will do a lookup on the 't' table to find matching rows. There may be additional conditions, as suggested by the initial plan, but let's ignore that for now. Now, the CTE actually has 1547 values in it, but the HashAggregate is estimated to output just 200 values. Where does that come from? Turns out, when there are no stats for the relation (e.g. when it's a CTE), get_variable_numdistinct() behaves differently depending on the cardinality of the relation. Essentially, this happens:

    if (rows < DEFAULT_NUM_DISTINCT)
    {
        *isdefault = false;
        return rows;
    }
    else
    {
        *isdefault = true;
        return DEFAULT_NUM_DISTINCT;
    }

i.e. we cap the ndistinct estimate to DEFAULT_NUM_DISTINCT and for some reason return a different value for the 'isdefault' flag.
Then, eqjoinsel_semi() falls through to this part of the code, as there is no MCV for the CTE table:

    if (!isdefault1 && !isdefault2)
    {
        if (nd1 <= nd2 || nd2 < 0)
            selec = 1.0 - nullfrac1;
        else
            selec = (nd2 / nd1) * (1.0 - nullfrac1);
    }
    else
        selec = 0.5 * (1.0 - nullfrac1);

Assuming the other side of the join is a regular table with proper stats (isdefault=false), depending on the CTE cardinality we fall into different branches of the if condition. When rows < 200, we end up with isdefault=false and use e.g.

    selec = (nd2 / nd1) * (1.0 - nullfrac1);

which produces a sane estimate. But with rows >= 200, we suddenly switch to the other estimate

    selec = 0.5 * (1.0 - nullfrac1);

which completely ignores the ndistinct values and instead evaluates to 0.5 (when nullfrac1=0.0). This is the source of overestimates, because the actual join selectivity is way lower than 0.5 and we simply do

    rows = outer_rows * jselec;

in calc_joinrel_size_estimate() for JOIN_SEMI. It's not difficult to construct examples of those over-estimates - attached are two scripts with examples demonstrating the issue, differing in the scale of the over-estimates. It's also possible to construct examples for the inner if condition (see example-3.sql):

    if (nd1 <= nd2 || nd2 < 0)
        selec = 1.0 - nullfrac1;
    else
        selec = (nd2 / nd1) * (1.0 - nullfrac1);

but in this case we only switch between selec=1.0 and 0.5 (again, assuming nullfrac=0.0). While this sudden jump in estimates is not great, it can't result in such excessive estimates. Clearly, this use of CTEs is a bit unnecessary and it's trivial to replace it with a simple IN() clause. Moreover we can't really expect good estimates without stats, and I don't think changing DEFAULT_NUM_DISTINCT from one arbitrary value to some other arbitrary value would help much - we'd only see the strange plan changes with different numbers of values in the CTE. But perhaps we can do at least a bit better.
Firstly, we could clamp the join cardinality estimate to the size of cartesian product, eliminating at least the most obvious over-estimates. Secondly, I think it'd be nice to somehow eliminate the sudden jumps in the estimates - I'm not quite sure why we need the three different formulas in eqjoinsel_semi, so maybe we can get rid of some of them? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services Attachments: semijoin-example-1.sql, semijoin-example-2.sql, semijoin-example-3.sql
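The discontinuity described above can be sketched numerically. This is a simplified, illustrative Python model of the quoted logic from get_variable_numdistinct() and eqjoinsel_semi() — not the actual planner code; nullfrac and MCV handling are omitted, and the outer-side statistics are invented round numbers — showing the selectivity estimate jumping to 0.5 the moment the CTE reaches 200 rows:

```python
DEFAULT_NUM_DISTINCT = 200

def numdistinct_for_cte(rows):
    # No stats for a CTE: ndistinct is the row count, capped at
    # DEFAULT_NUM_DISTINCT -- and crossing the cap also flips 'isdefault'.
    if rows < DEFAULT_NUM_DISTINCT:
        return rows, False          # (ndistinct, isdefault)
    return DEFAULT_NUM_DISTINCT, True

def semi_join_selectivity(nd_outer, cte_rows):
    # Outer side of the semi join = table 't' with real stats;
    # inner side = the CTE (simplified: nullfrac = 0, no MCVs).
    nd1, isdefault1 = nd_outer, False
    nd2, isdefault2 = numdistinct_for_cte(cte_rows)
    if not isdefault1 and not isdefault2:
        if nd1 <= nd2:
            return 1.0
        return nd2 / nd1
    return 0.5                      # fallback once an ndistinct is a default

outer_rows, nd_outer = 563_000_000, 500_000_000
for cte_rows in (150, 199, 200, 1547):
    selec = semi_join_selectivity(nd_outer, cte_rows)
    # rows = outer_rows * jselec, as in calc_joinrel_size_estimate()
    print(cte_rows, selec, int(outer_rows * selec))
```

With 199 values the estimate stays proportional to nd2/nd1 (a few hundred rows); with 200 values it jumps to half the outer relation — the ~280M-row shape of the reported plan.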
Re: [HACKERS] Relaxing SSL key permission checks
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Christoph Berg writes: > > The attached patch has successfully been included in the 9.6 Debian > > package, passed the regression tests there, and I've also done some > > chmod/chown tests on the filesystem to verify it indeed catches the > > cases laid out by Tom. > > Please add this to the upcoming commitfest. We don't seem to have > enough consensus to just apply it, but perhaps the review process > will produce some agreement. Just to be clear, I'm not really against this patch as-is, but it shouldn't be a precedent or limit us from supporting more permissive permissions in other areas (or even here) if there are sensible use-cases for more permissive permissions. Thanks! Stephen
Re: [HACKERS] checkpointer continuous flushing - V18
ISTM that "progress" and "progress_slice" only depend on num_scanned and per-tablespace num_to_scan and total num_to_scan, so they are somehow redundant and the progress could be recomputed from the initial figures when needed. They don't cause much space usage, and we access the values frequently. So why not store them? The same question would work the other way around: these values are one division away, why not compute them when needed? No big deal. [...] Given realistic amounts of memory the max potential "skew" seems fairly small with float8. If we ever flush one buffer "too much" for a tablespace it's pretty much harmless. I do agree. I'm suggesting that a comment should be added to justify why float8 accuracy is okay. I see a binary_heap_allocate but no corresponding deallocation, this looks like a memory leak... or is there some magic involved? Hm. I think we really should use a memory context for all of this - we could after all error out somewhere in the middle... I'm not sure that a memory context is justified here, there are only two mallocs and the checkpointer works for very long times. I think that it is simpler to just get the malloc/free right. [...] I'm not arguing for ripping it out, what I mean is that we don't set a nondefault value for the GUCs on platforms with just posix_fadvise available... Ok with that. -- Fabien.
Re: [HACKERS] Add generate_series(date,date) and generate_series(date,date,integer)
Christoph Berg writes: > Re: David Fetter 2016-01-26 <20160126180011.ga16...@fetter.org> >> +1 for back-patching. There's literally no case where an infinite >> input could be correct as the start or end of an interval for >> generate_series. > select * from generate_series(now(), 'infinity', '1 day') limit 10; > ... seems pretty legit to me. If limit pushdown into SRFs happened to > work some day, it'd be a pity if the above query raised an error. Oooh ... actually, that works today if you consider the SRF-in-targetlist case:

    regression=# select generate_series(now(), 'infinity', '1 day') limit 10;
            generate_series
    -------------------------------
     2016-02-21 16:51:03.303064-05
     2016-02-22 16:51:03.303064-05
     2016-02-23 16:51:03.303064-05
     2016-02-24 16:51:03.303064-05
     2016-02-25 16:51:03.303064-05
     2016-02-26 16:51:03.303064-05
     2016-02-27 16:51:03.303064-05
     2016-02-28 16:51:03.303064-05
     2016-02-29 16:51:03.303064-05
     2016-03-01 16:51:03.303064-05
    (10 rows)

    Time: 8.457 ms

Given that counterexample, I think we not only shouldn't back-patch such a change but should reject it altogether. regards, tom lane
Re: [HACKERS] checkpointer continuous flushing - V18
[...] I do think that this whole writeback logic really does make sense *per table space*, Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces. I do not understand/buy this argument. The underlying IO queue is per device, and table spaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. Calling sync_file_range on distinct devices should probably be issued more or less randomly, and should not interfere one with the other. The kernel's dirty buffer accounting is global, not per block device. Sure, but this is not my point. My point is that "sync_file_range" moves buffers to the device io queues, which are per device. If there is one queue in pg and many queues on many devices, the whole point of coalescing to get sequential writes is somehow lost. It's also actually rather common to have multiple tablespaces on a single block device. Especially if SANs and such are involved; where you don't even know which partitions are on which disks. Ok, some people would not benefit if they use many tablespaces on one device, too bad but that does not look like a very useful setting anyway, and I do not think it would harm much in this case. If you use just one context, the more table spaces the less performance gains, because there is less and less aggregation thus sequential writes per device. So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed. That'd be much easier to do by just keeping the context in the per-tablespace struct.
But anyway, I'm really doubtful about going for that; I had it that way earlier, and observing IO showed it not being beneficial. ISTM that you would need a significant number of tablespaces to see the benefit. If you do not do that, the more table spaces the more random the IOs, which is disappointing. Also, "the cost is marginal", so I do not see any good argument not to do it. -- Fabien.
Re: [HACKERS] Add generate_series(date,date) and generate_series(date,date,integer)
Re: David Fetter 2016-01-26 <20160126180011.ga16...@fetter.org> > +1 for back-patching. There's literally no case where an infinite > input could be correct as the start or end of an interval for > generate_series. select * from generate_series(now(), 'infinity', '1 day') limit 10; ... seems pretty legit to me. If limit pushdown into SRFs happened to work some day, it'd be a pity if the above query raised an error. Christoph -- c...@df7cb.de | http://www.df7cb.de/
Re: [HACKERS] Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.
I wrote: > Robert Haas writes: >> As for the patch itself, I'm having trouble grokking what it's trying >> to do. I think it might be worth having a comment defining precisely >> what we mean by "A blocks B". I would define "A blocks B" in general >> as either A holds a lock which conflicts with one sought by B >> (hard-blocked) or A awaits a lock which conflicts with one sought by B >> and precedes it in the wait queue (soft-blocked). > Yes, that is exactly what I implemented ... and it's something you can't > find out from pg_locks. I'm not sure how that view could be made to > expose wait-queue ordering. Here's an updated version of this patch, now with user-facing docs. I decided that "pg_blocking_pids()" is a better function name than "pg_blocker_pids()". The code's otherwise the same, although I revisited some of the comments. I also changed quite a few references to "transaction" into "process" in the discussion of pg_locks. The previous choice to conflate processes with transactions was never terribly wise in my view, and it's certainly completely broken by parallel query. regards, tom lane

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index d77e999..d3270e4 100644
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***
*** 8015,8030
The view pg_locks provides access to
! information about the locks held by open transactions within the
database server. See for more discussion of locking. pg_locks contains one row per active lockable
! object, requested lock mode, and relevant transaction. Thus, the same lockable object might
! appear many times, if multiple transactions are holding or waiting for
locks on it. However, an object that currently has no locks on it will not appear at all.
--- 8015,8030
The view pg_locks provides access to
! information about the locks held by active processes within the
database server. See for more discussion of locking. pg_locks contains one row per active lockable
! object, requested lock mode, and relevant process. Thus, the same lockable object might
! appear many times, if multiple processes are holding or waiting for
locks on it. However, an object that currently has no locks on it will not appear at all.
***
*** 8200,8210
granted is true in a row representing a lock
! held by the indicated transaction. False indicates that this transaction is
! currently waiting to acquire this lock, which implies that some other
! transaction is holding a conflicting lock mode on the same lockable object.
! The waiting transaction will sleep until the other lock is released (or a
! deadlock situation is detected). A single transaction can be waiting to acquire at most one lock at a time.
--- 8200,8210
granted is true in a row representing a lock
! held by the indicated process. False indicates that this process is
! currently waiting to acquire this lock, which implies that at least one other
! process is holding a conflicting lock mode on the same lockable object.
! The waiting process will sleep until the other lock is released (or a
! deadlock situation is detected). A single process can be waiting to acquire at most one lock at a time.
***
*** 8224,8230
Although tuples are a lockable type of object, information about row-level locks is stored on disk, not in memory, and therefore row-level locks normally do not appear in this view.
! If a transaction is waiting for a row-level lock, it will usually appear in the view as waiting for the permanent transaction ID of the current holder of that row lock.
--- 8224,8230
Although tuples are a lockable type of object, information about row-level locks is stored on disk, not in memory, and therefore row-level locks normally do not appear in this view.
! If a process is waiting for a row-level lock, it will usually appear in the view as waiting for the permanent transaction ID of the current holder of that row lock.
***
SELECT * FROM pg_locks pl LEFT JOIN pg_p
*** 8281,8286
--- 8281,8300
+
+ While it is possible to obtain information about which processes block
+ which other processes by joining pg_locks against
+ itself, this is very difficult to get right in detail. Such a query would
+ have to encode knowledge about which lock modes conflict with which
+ others. Worse, the pg_locks view does not expose
+ information about which processes are ahead of which others in lock wait
+ queues, nor information about which processes are parallel workers running
+ on behalf of which other client sessions. It is better to use
+
Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data
On 3 February 2016 at 10:46, Thomas Munro wrote: > On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote > wrote: >> There seems to be a copy-pasto there - shouldn't that be: >> >> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush) > > Indeed, thanks! New patch attached. I've given this a test drive, and it works exactly as described. But one thing which confuses me is when a standby, with causal_reads enabled, has just finished starting up. I can't connect to it because: FATAL: standby is not available for causal reads However, this is the same message when I'm successfully connected, but it's lagging, and the primary is still waiting for the standby to catch up: ERROR: standby is not available for causal reads What is the difference here? The problem being reported appears to be identical, but in the first case I can't connect, but in the second case I can (although I still can't issue queries). Thom
Re: [HACKERS] checkpointer continuous flushing - V18
Hi, On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote: > * CkptSortItem: > > I think that allocating 20 bytes per buffer in shared memory is a little on > the heavy side. Some compression can be achieved: sizeof(ForkNum) is 4 bytes > to hold 4 values, could be one byte or even 2 bits somewhere. Also, there > are very few tablespaces, they could be given a small number and this number > could be used instead of the Oid, so the space requirement could be reduced > to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping > 4 bytes alignment and also getting 8 byte alignment... If this is too > much, I have shown that it can work with only 4 bytes per buffer, as the > sorting is really just a performance optimisation and is not broken if some > stuff changes between sorting & writeback, but you did not like the idea. If > the amount of shared memory required is a significant concern, it could be > resurrected, though. This is less than 0.2 % of memory related to shared buffers. We have the same amount of memory allocated in CheckpointerShmemSize(), and nobody has complained so far. And sorry, going back to the previous approach isn't going to fly, and I've no desire to discuss that *again*. > ISTM that "progress" and "progress_slice" only depend on num_scanned and > per-tablespace num_to_scan and total num_to_scan, so they are somehow > redundant and the progress could be recomputed from the initial figures > when needed. They don't cause much space usage, and we access the values frequently. So why not store them? > If these fields are kept, I think that a comment should justify why float8 > precision is okay for the purpose. I think it is quite certainly fine in the > worst case with 32 bits buffer_ids, but it would not be if this size is > changed someday. That seems pretty much unrelated to having the fields - the question of accuracy plays a role regardless, no?
Given realistic amounts of memory the max potential "skew" seems fairly small with float8. If we ever flush one buffer "too much" for a tablespace it's pretty much harmless. > ISTM that nearly all of the collected data on the second sweep could be > collected on the first sweep, so that this second sweep could be avoided > altogether. The only missing data is the index of the first buffer in the > array, which can be computed by considering tablespaces only, sweeping over > buffers is not needed. That would suggest creating the heap or using a hash > in the initial buffer sweep to keep this information. This would also > provide a point where to number tablespaces for compressing the CkptSortItem > struct. Doesn't seem worth the complexity to me. > I'm wondering about calling CheckpointWriteDelay on each round, maybe > a minimum amount of write would make sense. Why? There's not really much benefit of doing more work than needed. I think we should sleep far shorter in many cases, but that's indeed a separate issue. > I see a binary_heap_allocate but no corresponding deallocation, this > looks like a memory leak... or is there some magic involved? Hm. I think we really should use a memory context for all of this - we could after all error out somewhere in the middle... > >I think this patch primarily needs: > >* Benchmarking on FreeBSD/OSX to see whether we should enable the > > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm > > inclined to leave it off till then. > > I do not have that. As "msync" seems available on Linux, it is possible to > force using it with a "ifdef 0" to skip sync_file_range and check whether it > does some good there. Unfortunately it doesn't work well on linux: * On many OSs msync() on a mmap'ed file triggers writeback. On linux * it only does so when MS_SYNC is specified, but then it does the * writeback synchronously. Luckily all common linux systems have * sync_file_range(). 
This is preferable over FADV_DONTNEED because * it doesn't flush out clean data. I've verified beforehand, with a simple demo program, that msync(MS_ASYNC) does something reasonable on FreeBSD... > Idem for the "posix_fadvise" stuff. I can try to do > that, but it takes time to do so; if someone can test on other OSes it would > be much better. I think that if it works it should be kept in, so it is just > a matter of testing it. I'm not arguing for ripping it out; what I mean is that we don't set a non-default value for the GUCs on platforms with just posix_fadvise available... Greetings, Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote: > >>In the discussion in the wal section, I'm not sure about the effect of > >>setting writebacks on SSD, [...] > > > >Yea, that paragraph needs some editing. I think we should basically > >remove that last sentence. > > Ok, fine with me. Does that mean that flushing has a significant positive > impact on SSD in your tests? Yes. The reason we need flushing is that the kernel amasses dirty pages, and then flushes them all at once. That hurts for both SSDs and rotational media. Sorting is the bigger question, but I've seen it have clearly beneficial performance impacts. I guess if you look at devices with an internal block size bigger than 8k, you'd see even larger differences. > >>Maybe the merging strategy could be more aggressive than just strict > >>neighbors? > > > >I don't think so. If you flush more than neighbouring writes you'll > >often end up flushing buffers dirtied by another backend, causing > >additional stalls. > > Ok. Maybe the neighbor definition could be relaxed just a little bit so > that small holes are overtaken, but not large holes? If there are only a few > pages in between, even if written by another process, then writing them > together should be better? Well, this can wait for a clear case, because > hopefully the OS will recoalesce them behind anyway. I'm against doing so without clear measurements of a benefit. > >Also because the infrastructure is used for more than checkpoint > >writes. There are absolutely no ordering guarantees there. > > Yep, but not much benefit to expect from a few dozen random pages either. Actually, there's kinda frequently a benefit observable. Even if few requests can be merged, doing IO requests in an order more likely doable within a few rotations is beneficial. Also, the cost is marginal, so why worry? > >>[...] 
I do think that this whole writeback logic really does make > >>sense *per table space*, > > > >Leads to less regular IO, because if your tablespaces are evenly sized > >(somewhat common) you'll sometimes end up issuing sync_file_range's > >shortly after each other. For latency outside checkpoints it's > >important to control the total amount of dirty buffers, and that's > >obviously independent of tablespaces. > > I do not understand/buy this argument. > > The underlying IO queue is per device, and tablespaces should be per device > as well (otherwise what's the point?), so you should want to coalesce and > "writeback" pages per device as well. Calls to sync_file_range on distinct > devices should probably be issued more or less randomly, and should not > interfere with one another. The kernel's dirty buffer accounting is global, not per block device. It's also actually rather common to have multiple tablespaces on a single block device, especially if SANs and such are involved, where you don't even know which partitions are on which disks. > If you use just one context, the more tablespaces the smaller the performance > gains, because there is less and less aggregation and thus fewer sequential writes per > device. > > So for me there should really be one context per tablespace. That would > suggest a hashtable or some other structure to keep and retrieve them, which > would not be that bad, and I think that it is what is needed. That'd be much easier to do by just keeping the context in the per-tablespace struct. But anyway, I'm really doubtful about going for that; I had it that way earlier, and observing IO showed it not being beneficial. > >>For the checkpointer, a key aspect is that the scheduling process goes > >>to sleep from time to time, and this sleep time looked like a great > >>opportunity to do this kind of flushing. You chose not to take advantage > >>of the behavior, why? 
> > > >Several reasons: Most importantly there's absolutely no guarantee that > >you'll ever end up sleeping, it's quite common to happen only seldomly. > > Well, that would be under a situation when pg is completely unresponsive. > More so, this behavior *makes* pg unresponsive. No. The checkpointer being bottlenecked on actual IO performance doesn't impact production that badly. It'll just sometimes block in sync_file_range(), but the IO queues will have enough space to frequently give way to other backends, particularly to synchronous reads (most pg reads) and synchronous writes (fdatasync()). So a single checkpoint will take a bit longer, but otherwise the system will mostly keep up the work in a regular manner. Without the sync_file_range() calls the kernel will amass dirty buffers until global dirty limits are reached, which then will bring the whole system to a standstill. It's pretty common that checkpoint_timeout is too short to be able to write all shared_buffers out, in that case it's much better to slow down the whole checkpoint, instead of being incredibly slow at the end. > >I also don't really believe it helps that much, although that's a complex > >argument to make. > > Yep. My
Re: [HACKERS] Re: Add generate_series(date, date) and generate_series(date, date, integer)
On 02/21/2016 07:56 PM, Corey Huinker wrote: > > Other than that, the only difference is the ::date part. Is it > really worth adding extra code just for that? I would say not. > > > I would argue it belongs for the sake of completeness. So would I. +1 for adding this missing function. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Relaxing SSL key permission checks
Christoph Berg writes: > The attached patch has successfully been included in the 9.6 Debian > package, passed the regression tests there, and I've also done some > chmod/chown tests on the filesystem to verify it indeed catches the > cases laid out by Tom. Please add this to the upcoming commitfest. We don't seem to have enough consensus to just apply it, but perhaps the review process will produce some agreement. regards, tom lane
Re: [HACKERS] Re: Add generate_series(date, date) and generate_series(date, date, integer)
> > > [step] is in days, but is not documented as such. > > It is in days, and is not documented as such, but since a day is the smallest unit of time for a date, I felt there was no other interpretation. > My understanding is you want to replace this > > SELECT d.dt::date as dt > FROM generate_series('2015-01-01'::date, > '2016-01-04'::date, > interval '1 day') AS d(dt); > > > with this > > SELECT d.dt > FROM generate_series('2015-01-01'::date, > '2016-01-04'::date, > 7) as d(dt); > > I'd also like to be able to join the values of d.dt without typecasting them. To me it's as awkward as (select 1::double + 1::double)::integer > > Personally, I think writing INTERVAL '7 days' to be clearer than just > typing 7. > Well, nearly all my use cases involve the step being 1 (and thus omitted) or -1. Maybe this example will appeal to you SELECT d.birth_date, COUNT(r.kiddo_name) FROM generate_series('2016-01-01'::date,'2016-01-10'::date) as d(birth_date) LEFT OUTER JOIN birth_records r ON r.birth_date = d.birth_date GROUP BY 1 ORDER BY 1; > Other than that, the only difference is the ::date part. Is it really > worth adding extra code just for that? I would say not. > I would argue it belongs for the sake of completeness. We added generate_series for numerics when generate_series for floats already existed. No comments on the patch itself, which seems to do the job, so apologies to > give this opinion on your work, I do hope it doesn't put you off further > contributions. > Thanks. I appreciate that.
Re: [HACKERS] [COMMITTERS] pgsql: Cosmetic improvements in new config_info code.
Stephen Frost writes: > * Joe Conway (m...@joeconway.com) wrote: >> FWIW, strcpy() was being used in src/bin/pg_config/pg_config.c that I >> started with -- does that mean we are not getting Coverity coverage of >> src/bin? > Coverity does run against src/bin also. It's possible this was > identified as an issue in pg_config.c, but, as Tom notes, it may not be > an actual bug and might have been marked as a non-bug in Coverity. It looks to me like the previous coding was

    static char mypath[MAXPGPATH];
    ...
    char path[MAXPGPATH];
    ...
    strcpy(path, mypath);

so Coverity probably didn't complain because it could see that the source was also a buffer of size MAXPGPATH. With the new arrangement it was probably using an assumption that get_configdata() could be called with any length of string. I am not sure how much cross-file analysis Coverity does --- it seems to do some, but in other cases it acts like it doesn't know anything about the call sites. It's possible that it did know that the existing callers all use MAXPGPATH-sized buffers, but whined anyway on the not-unreasonable grounds that global functions should be prepared for new callers. regards, tom lane
[HACKERS] Re: [COMMITTERS] pgsql: Cosmetic improvements in new config_info code.
Joe, all, * Joe Conway (m...@joeconway.com) wrote: > On 02/21/2016 08:38 AM, Tom Lane wrote: > > Coverity griped about use of unchecked strcpy() into a local variable. > > There's unlikely to be any actual bug there, since no caller would be > > passing a path longer than MAXPGPATH, but nonetheless use of strlcpy() > > seems preferable. > > FWIW, strcpy() was being used in src/bin/pg_config/pg_config.c that I > started with -- does that mean we are not getting Coverity coverage of > src/bin? Coverity does run against src/bin also. It's possible this was identified as an issue in pg_config.c, but, as Tom notes, it may not be an actual bug and might have been marked as a non-bug in Coverity. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] a raft of parallelism-related bug fixes
Robert Haaswrites: > On Mon, Feb 8, 2016 at 2:36 PM, Joshua D. Drake > wrote: >> I have no problem running any test cases you wish on a branch in a loop for >> the next week and reporting back any errors. > Well, what I've done is push into the buildfarm code that will allow > us to do *the most exhaustive* testing that I know how to do in an > automated fashion. Which is to create a file that says this: > force_parallel_mode=regress > max_parallel_degree=2 I did a few dozen runs of the core regression tests with those settings (using current HEAD plus my lockGroupLeaderIdentifier-ectomy patch). Roughly one time in ten, it fails in the stats test, with diffs as attached. I interpret this as meaning that parallel workers don't reliably transmit stats to the stats collector, though maybe there is something else happening. regards, tom lane *** /home/postgres/pgsql/src/test/regress/expected/stats.out Wed Mar 4 00:55:25 2015 --- /home/postgres/pgsql/src/test/regress/results/stats.out Sun Feb 21 12:59:27 2016 *** *** 148,158 WHERE relname like 'trunc_stats_test%' order by relname; relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup ---+---+---+---++ ! trunc_stats_test | 3 | 0 | 0 | 0 | 0 ! trunc_stats_test1 | 4 | 2 | 1 | 1 | 0 ! trunc_stats_test2 | 1 | 0 | 0 | 1 | 0 ! trunc_stats_test3 | 4 | 0 | 0 | 2 | 2 ! trunc_stats_test4 | 2 | 0 | 0 | 0 | 2 (5 rows) SELECT st.seq_scan >= pr.seq_scan + 1, --- 148,158 WHERE relname like 'trunc_stats_test%' order by relname; relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup ---+---+---+---++ ! trunc_stats_test | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test1 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test2 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test3 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test4 | 0 | 0 | 0 | 0 | 0 (5 rows) SELECT st.seq_scan >= pr.seq_scan + 1, == -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Introduce group locking to prevent parallel processes from deadl
I wrote: > Robert Haaswrites: >> Yeah, you're right. Attached is a draft patch that tries to clean up >> that and a bunch of other things that you raised. > I've been reviewing this patch, and I had a question: why do we need to > bother with a lockGroupLeaderIdentifier field? It seems totally redundant > with the leader's pid field, ie, why doesn't BecomeLockGroupMember simply > compare leader->pid with the PID it's passed? For some more safety, it > could also verify that leader->lockGroupLeader == leader; but I don't > see what the extra field is buying us. Here is a proposed patch that gets rid of lockGroupLeaderIdentifier. I concluded that the "leader->lockGroupLeader == leader" test is necessary, because we don't ever reset the pid field of a dead PGPROC until it's re-used. However, with that check, this seems isomorphic to the existing code because lockGroupLeader is reset to NULL at all the same places that lockGroupLeaderIdentifier is reset. So we do not need both of those fields. regards, tom lane diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README index c399618..56b0a12 100644 *** a/src/backend/storage/lmgr/README --- b/src/backend/storage/lmgr/README *** those cases so that they no longer use h *** 650,676 (which is not a crazy idea, given that such lock acquisitions are not expected to deadlock and that heavyweight lock acquisition is fairly slow anyway). ! Group locking adds four new members to each PGPROC: lockGroupLeaderIdentifier, ! lockGroupLeader, lockGroupMembers, and lockGroupLink. The first is simply a ! safety mechanism. A newly started parallel worker has to try to join the ! leader's lock group, but it has no guarantee that the group leader is still ! alive by the time it gets started. We try to ensure that the parallel leader ! dies after all workers in normal cases, but also that the system could survive ! relatively intact if that somehow fails to happen. This is one of the ! 
precautions against such a scenario: the leader relays its PGPROC and also its ! PID to the worker, and the worker fails to join the lock group unless the ! given PGPROC still has the same PID. We assume that PIDs are not recycled ! quickly enough for this interlock to fail. ! ! A PGPROC's lockGroupLeader is NULL for processes not involved in parallel ! query. When a process wants to cooperate with parallel workers, it becomes a ! lock group leader, which means setting this field to point to its own ! PGPROC. When a parallel worker starts up, it points this field at the leader, ! with the above-mentioned interlock. The lockGroupMembers field is only used in the leader; it is a list of the member PGPROCs of the lock group (the leader and all workers). The lockGroupLink field is the list link for this list. ! All four of these fields are considered to be protected by a lock manager partition lock. The partition lock that protects these fields within a given lock group is chosen by taking the leader's pgprocno modulo the number of lock manager partitions. This unusual arrangement has a major advantage: the --- 650,665 (which is not a crazy idea, given that such lock acquisitions are not expected to deadlock and that heavyweight lock acquisition is fairly slow anyway). ! Group locking adds three new members to each PGPROC: lockGroupLeader, ! lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for ! processes not involved in parallel query. When a process wants to cooperate ! with parallel workers, it becomes a lock group leader, which means setting ! this field to point to its own PGPROC. When a parallel worker starts up, it ! points this field at the leader. The lockGroupMembers field is only used in the leader; it is a list of the member PGPROCs of the lock group (the leader and all workers). The lockGroupLink field is the list link for this list. ! All three of these fields are considered to be protected by a lock manager partition lock. 
The partition lock that protects these fields within a given lock group is chosen by taking the leader's pgprocno modulo the number of lock manager partitions. This unusual arrangement has a major advantage: the *** change while the deadlock detector is ru *** 679,684 --- 668,685 all the lock manager locks. Also, holding this single lock allows safe manipulation of the lockGroupMembers list for the lock group. + We need an additional interlock when setting these fields, because a newly + started parallel worker has to try to join the leader's lock group, but it + has no guarantee that the group leader is still alive by the time it gets + started. We try to ensure that the parallel leader dies after all workers + in normal cases, but also that the system could survive relatively intact + if that somehow fails to happen. This is one of the precautions against + such a
Re: [HACKERS] proposal: enhancing slow query log, and autoexplain log about waiting on lock before query exec time
Hi >> What would be useful logging-wise is if the log line for the query itself >> could contain lock wait time, but that doesn't sound like what you're >> proposing? >> > > I hope so; that is why I propose this idea. First I wanted to talk about the idea; > the next step is to talk about the format. > The enhancement should have two parts: a) introduce a new function GetCurrentStatementStartExecutionTimestamp(). This function can be used to calculate the initial lock waiting duration. b) introduce a new GUC, log_min_lock_duration_statement. It is similar to log_min_duration_statement. When any statement waits longer than this number, the statement is logged like other slow statements. Example: log_min_duration_statement = 1000 .. log anything slower than 1 sec log_min_lock_duration_statement = 100 .. log when waiting is longer than 100 ms Entry in log: LOG: duration: 843 ms start lock duration: 156 ms statement: . This logic is simple - for more complicated models, we can introduce monitoring log hooks, and complex custom logging can be solved in an extension. When log_min_lock_duration_statement is -1, "start lock duration" isn't printed to the log. Comments, notes? Regards Pavel
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Sun, Feb 21, 2016 at 2:28 PM, Robert Haas wrote: > On Sun, Feb 21, 2016 at 12:24 PM, Amit Kapila > wrote: > >> On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas >> wrote: > >>> On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila >>> wrote: >>>
>>> Client_Count/Patch_Ver     1     64    128    256
>>> HEAD(481725c0)           963  28145  28593  26447
>>> Patch-1                  938  28152  31703  29402
>>>
We can see a 10~11% performance improvement, as observed previously. You might see a 0.02% performance difference with the patch as a regression, but that is just run-to-run variation. >>> >>> Don't the single-client numbers show about a 3% regression? Surely not >>> 0.02%. >>> >> >> >> Sorry, you are right, it is ~2.66%, but in read-write pgbench tests, I >> could see such fluctuation. Also, the patch doesn't change the single-client >> case. However, if you still feel that there could be an impact from the patch, >> I can re-run the single-client case once again with different combinations, >> like first with HEAD and then the patch, and vice versa. >> > > Are these results from a single run, or median-of-three? > > This was median-of-three, but the highest tps with the patch is 1119 and with HEAD it is 969, which shows a gain at the single client count. Sometimes I see such differences; they could be due to autovacuum getting triggered in some situations, which leads to such variations. However, if I try 2-3 times, the difference generally disappears unless there is some real regression, or if I just switch off autovacuum and do a manual vacuum after each run. This time, I haven't run the tests multiple times. > I mean, my basic feeling is that I would not accept a 2-3% regression in > the single client case to get a 10% speedup in the case where we have 128 > clients. > I understand your point. To verify whether it is run-to-run variation or an actual regression, I will re-run these tests on a single client multiple times and post the result. > A lot of people will not have 128 clients; quite a few will have a > single session, or just a few. 
Sometimes just making the code more complex > can hurt performance in subtle ways, e.g. by making it fit into the L1 > instruction cache less well. If the numbers you have here are accurate, > I'd vote to reject the patch. > > One point to note is that this patch, along with the first patch I posted in this thread to increase clog buffers, can significantly reduce contention on CLogControlLock. OTOH, I think introducing a regression at single-client is also not a sane thing to do, so let's first try to find out whether there is actually any regression, and if there is, whether we can mitigate it by writing the code with somewhat fewer instructions or in a slightly different way; then we can decide whether it is good to reject the patch or not. Does that sound reasonable to you? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] exposing pg_controldata and pg_config as functions
On Sat, Feb 20, 2016 at 12:12 PM, Joe Conway wrote: > On 01/17/2016 04:10 PM, Joe Conway wrote: >> On 01/16/2016 06:02 AM, Michael Paquier wrote: >>> On Wed, Dec 30, 2015 at 9:08 AM, Joe Conway wrote: 3) Adds new functions, more or less in line with previous discussions: * pg_checkpoint_state() * pg_controldata_state() * pg_recovery_state() * pg_init_state() >>> >>> Taking the opposite direction of Josh upthread, why is this split >>> actually necessary? Isn't the idea to provide a SQL interface to what >>> pg_controldata shows? If this split proves to be useful, shouldn't we >>> do it as well for pg_controldata? >> >> The last discussion moved strongly in the direction of separate >> functions, and that being different from pg_controldata was not a bad >> thing. That said, I'm still of the opinion that there are legitimate >> reasons to want the command-line pg_controldata and the SQL functions to >> produce equivalent, if not identical, results. I just wish we could get >> a clear consensus one way or the other. > > I've assumed that we are sticking with the separate functions. As such, > here is a rebased patch, with documentation and other fixes such as > Copyright year, Mkvcbuild support, and some cruft removal. Looking again at this thread I guess that this is the consensus, based on the proposal from Josh and seeing no other ideas around. Another idea would be to group all the fields into a single function pg_control_data(). > Is there general consensus that we want this feature, and that we want > it in this form? Any other comments? I had a look at this patch. 
+ +pg_checkpoint_state + + +pg_checkpoint_state returns a record containing +checkpoint_location, prior_location, redo_location, redo_wal_file, +timeline_id, prev_timeline_id, full_page_writes, next_xid, next_oid, +next_multixact_id, next_multi_offset, oldest_xid, oldest_xid_dbid, +oldest_active_xid, oldest_multi_xid, oldest_multi_dbid, +oldest_commit_ts_xid, newest_commit_ts_xid, and checkpoint_time. + This is bit unreadable. The only entry in the documentation that adopts a similar style is pg_stat_file, and with six fields that feels as being enough. I would suggest using a table instead with the type of the field and its name. Regarding the naming of the functions, I think that it would be good to get something consistent with the concept of those being "Control Data functions" by having them share the same prefix, say pg_control_ - pg_control_checkpoint - pg_control_init - pg_control_system - pg_control_recovery + snprintf (controldata_name, CONTROLDATANAME_LEN, + "%s:", controldata[i].name); Nitpick: extra space. +static const char *const controldata_names[] = +{ + gettext_noop("pg_control version number"), + gettext_noop("Catalog version number"), + gettext_noop("Database system identifier"), Is this complication really necessary? Those identifiers are used only in the frontend and the footprint of this patch on pg_controldata is really large. What I think we should do is have in src/common the following set of routines that work directly on ControlFileData: - checkControlFile, to perform basic sanity checks on the control file (CRC, see for example pg_rewind.c) - getControlFile(dataDir), that simply returns a palloc'd ControlFileData to the caller after looking at global/pg_control. pg_rewind could perhaps make use of the one to check the control file CRC, to fetch ControlFileData there is some parallel logic for the source server if it is either remote or local so it would be better to not use getControlFile in this case. 
-- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
On Fri, Feb 19, 2016 at 10:33 PM, Amit Kapila wrote: > On Fri, Feb 12, 2016 at 6:57 PM, Michael Paquier wrote: > You don't seem to have taken care of the typo below in your patch, as > pointed out by me earlier. > > + * to not rely on taking an exclusive lock an all the WAL insertion locks, > > /an all/on all Sorry about that. Hopefully fixed now. > Does this in any way take care of the case when there is activity > on unlogged tables? > I think avoiding checkpoints on an idle system was introduced > in commit 4d14fe0048cf80052a3ba2053560f8aab1bb1b22, whereas > unlogged relations were introduced by commit > 53dbc27c62d8e1b6c5253feba04a5094cb8fe046, which is much later. > Now, one might consider it okay to not do anything for > unlogged tables as it was not done previously, and this patch is > anyway improving the current situation, but I feel that if we agree > that checkpoints will get skipped when the only operations > happening in the system are on unlogged relations, then > it is better to note it in the comments as a possible improvement, even if we > don't want to address it as part of this patch. Nope, nothing is done, just to not complicate the patch more than necessary. I agree though that we had better not update the progress LSN when logging something related to INIT_FORKNUM; there is little point in running non-forced checkpoints in this case, as unlogged relations are wiped out on recovery.
If you > are proposing them for commit, then I think it is worth justifying the need > for them, and we should discuss what the appropriate log level is; > otherwise, I think you can have these in an additional patch just for > verification, as the main patch is now very close to being > ready for committer. I have never intended those logs to be part of the patch that should be committed, and putting them at LOG level avoided having a bunch of garbage in the logs during the tests (I got lazy about grepping for the entries in those tests). I am not against adding a comment; here is one I just came up with:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -448,6 +448,12 @@ typedef struct XLogwrtResult
  * hence reducing the impact of the activity lookup. This takes also
  * advantage to avoid 8-byte torn reads on some platforms by using the
  * fact that each insert lock is located on the same cache line.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */

> Also, I think it is worth taking the performance data once for > write tests (using a pgbench 30-minute run or some other way) with > the minimum checkpoint_timeout (i.e. 30s) to see if the additional locking > has any impact on performance. I think taking locks at intervals > of 15s or 30s should not matter much, but it is better to be safe. I don't think performance will be impacted, but there is no reason not to do any measurements either. I'll try to get some graphs tomorrow with runs on my laptop, mainly looking for any effects of this patch on TPS when checkpoints show up. For now, attached is an updated patch, with a second patch (b) including the logs for testing. 
-- Michael diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 05e49d7..3ab2bc9 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8373,8 +8373,12 @@ CreateCheckPoint(int flags) if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_FORCE)) == 0) { + elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X", + (uint32) (progress_lsn >> 32), (uint32) progress_lsn, + (uint32) (ControlFile->checkPoint >> 32), (uint32) ControlFile->checkPoint); if (progress_lsn == ControlFile->checkPoint) { + elog(LOG, "Checkpoint is skipped"); WALInsertLockRelease(); LWLockRelease(CheckpointLock); END_CRIT_SECTION(); @@ -8541,7 +8545,11 @@ CreateCheckPoint(int flags) * recovery we don't need to write running xact data. */ if (!shutdown && XLogStandbyInfoActive()) - LogStandbySnapshot(); + { + XLogRecPtr lsn = LogStandbySnapshot(); + elog(LOG, "snapshot taken by checkpoint %X/%X", + (uint32) (lsn >> 32), (uint32) lsn); + } START_CRIT_SECTION(); diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index
Re: [HACKERS] Relaxing SSL key permission checks
Re: To Tom Lane 2016-02-19 <20160219115334.gb26...@msg.df7cb.de> > Updated patch attached. *Blush* I thought I had compile-tested the patch, but without --enable-openssl it wasn't built :(. The attached patch has successfully been included in the 9.6 Debian package, passed the regression tests there, and I've also done some chmod/chown tests on the filesystem to verify it indeed catches the cases laid out by Tom. Christoph -- c...@df7cb.de | http://www.df7cb.de/ diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c new file mode 100644 index 1e3dfb6..a6c4ba0 *** a/src/backend/libpq/be-secure-openssl.c --- b/src/backend/libpq/be-secure-openssl.c *** be_tls_init(void) *** 207,213 ssl_key_file))); /* ! * Require no public access to key file. * * XXX temporarily suppress check when on Windows, because there may * not be proper support for Unix-y file permissions. Need to think --- 207,217 ssl_key_file))); /* ! * Require no public access to key file. If the file is owned by us, ! * require mode 0600 or less. If owned by root, require 0640 or less ! * to allow read access through our gid, or a supplementary gid that ! * allows to read system-wide certificates. Refuse to load files owned ! * by other users. * * XXX temporarily suppress check when on Windows, because there may * not be proper support for Unix-y file permissions. Need to think *** be_tls_init(void) *** 215,226 * directory permission check in postmaster.c) */ #if !defined(WIN32) && !defined(__CYGWIN__) ! if (!S_ISREG(buf.st_mode) || buf.st_mode & (S_IRWXG | S_IRWXO)) ereport(FATAL, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("private key file \"%s\" has group or world access", ssl_key_file), ! errdetail("Permissions should be u=rw (0600) or less."))); #endif if (SSL_CTX_use_PrivateKey_file(SSL_context, --- 219,233 * directory permission check in postmaster.c) */ #if !defined(WIN32) && !defined(__CYGWIN__) ! if (!S_ISREG(buf.st_mode) || !
(buf.st_uid == geteuid() && buf.st_mode & (S_IRWXG | S_IRWXO)) || ! (buf.st_uid == 0 && buf.st_mode & (S_IWGRP | S_IXGRP | S_IRWXO)) || ! (buf.st_uid != geteuid() && buf.st_uid != 0)) ereport(FATAL, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("private key file \"%s\" has group or world access", ssl_key_file), ! errdetail("File must be owned by the database user and have permissions u=rw (0600) or less, or owned by root and have permissions u=rw,g=w (0640) or less."))); #endif if (SSL_CTX_use_PrivateKey_file(SSL_context, -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres,

Here is a review for the second patch.

> For 0002 I've recently changed:
>
> * Removed the sort timing information, we've proven sufficiently that
>   it doesn't take a lot of time.

I put it there initially to demonstrate that there was no cache
performance issue when sorting on just buffer indexes. As it is always
small, I agree that it is not needed. Well, it could still be in
seconds on a very large shared_buffers setting with a very large
checkpoint, but then the checkpoint would be tremendously huge...

> * Minor comment polishing.

Patch applies and checks on Linux.

* CkptSortItem:

I think that allocating 20 bytes per buffer in shared memory is a
little on the heavy side. Some compression can be achieved:
sizeof(ForkNumber) is 4 bytes to hold 4 values; it could be one byte,
or even 2 bits somewhere. Also, there are very few tablespaces; they
could be given a small number, and this number could be used instead of
the Oid, so the space requirement could be reduced to say 16 bytes per
buffer by combining space & fork in 2 shorts, keeping 4-byte alignment
and also getting 8-byte alignment...

If this is too much, I have shown that it can work with only 4 bytes
per buffer, as the sorting is really just a performance optimisation
and is not broken if some stuff changes between sorting & writeback,
but you did not like the idea. If the amount of shared memory required
is a significant concern, it could be resurrected, though.

* CkptTsStatus:

As I suggested in the other mail, I think that this structure should
also keep a per-tablespace WritebackContext so that coalescing is done
per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned,
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant, and the progress could be recomputed from the initial
figures when needed.

If these fields are kept, I think that a comment should justify why
float8 precision is okay for the purpose.
I think it is quite certainly fine in the worst case with 32-bit
buffer_ids, but it would not be if this size is changed someday.

* BufferSync:

After a first sweep to collect buffers to write, they are sorted, and
then those buffers are swept again to compute some per-tablespace data
and organise a heap. ISTM that nearly all of the data collected on the
second sweep could be collected on the first sweep, so that this second
sweep could be avoided altogether. The only missing data is the index
of the first buffer in the array, which can be computed by considering
tablespaces only; sweeping over buffers is not needed. That would
suggest creating the heap, or using a hash, in the initial buffer sweep
to keep this information. This would also provide a point where to
number tablespaces for compressing the CkptSortItem struct.

I'm wondering about calling CheckpointWriteDelay on each round; maybe a
minimum amount of writes would make sense. This remark is independent
of this patch. Probably it works fine because after a sleep the
checkpointer is behind enough that it will write a bunch of buffers
before sleeping again.

I see a binary_heap_allocate but no corresponding deallocation; this
looks like a memory leak... or is there some magic involved?

There is some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations
about sorting in the checkpointer.

> I think this patch primarily needs:
>
> * Benchmarking on FreeBSD/OSX to see whether we should enable the
>   mmap()/msync(MS_ASYNC) method by default. Unless somebody does so,
>   I'm inclined to leave it off till then.

I do not have that. As "msync" seems available on Linux, it is possible
to force using it with an "#if 0" to skip sync_file_range and check
whether it does some good there. Idem for the "posix_fadvise" stuff. I
can try to do that, but it takes time to do so; if someone can test on
other OSes, it would be much better.
I think that if it works it should be kept in, so it is just a matter
of testing it.

--
Fabien.
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Sun, Feb 21, 2016 at 12:24 PM, Amit Kapila wrote:
> On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas wrote:
>> On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila wrote:
>>>
>>> Client_Count/Patch_Ver    1     64    128    256
>>> HEAD(481725c0)          963  28145  28593  26447
>>> Patch-1                 938  28152  31703  29402
>>>
>>> We can see 10~11% performance improvement as observed previously.
>>> You might see the 0.02% performance difference with the patch as a
>>> regression, but that is just run-to-run variation.
>>
>> Don't the single-client numbers show about a 3% regression?  Surely
>> not 0.02%.
>
> Sorry, you are right, it is ~2.66%, but in read-write pgbench tests, I
> could see such fluctuation.  Also the patch doesn't change the
> single-client case.  However, if you still feel that there could be an
> impact from the patch, I can re-run the single-client case once again
> with different combinations, like first with HEAD and then the patch
> and vice versa.

Are these results from a single run, or median-of-three?

I mean, my basic feeling is that I would not accept a 2-3% regression
in the single-client case to get a 10% speedup in the case where we
have 128 clients.  A lot of people will not have 128 clients; quite a
few will have a single session, or just a few.  Sometimes just making
the code more complex can hurt performance in subtle ways, e.g. by
making it fit into the L1 instruction cache less well.  If the numbers
you have here are accurate, I'd vote to reject the patch.

Note that we already have apparently regressed single-client
performance noticeably between 9.0 and 9.5:

http://bonesmoses.org/2016/01/08/pg-phriday-how-far-weve-come/

I bet that wasn't a single patch but a series of patches which made
things more complex to improve concurrency behavior, but in the
process each one made the single-client case a tiny bit slower.  In
the end, that adds up.  I think we need to put some effort into
figuring out if there is a way we can get some of that single-client
performance (and ideally more) back.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)
On Thu, Feb 18, 2016 at 4:52 PM, Ashutosh Bapat wrote:
> If the list in the joining relation changes (maybe because we appended
> parameterized conditions), we would be breaking links on all the upper
> relations in the join tree in an unpredictable manner. The problem may
> not show up now, but it's an avenue for unrecognizable bugs. So, it's
> safer to copy the lists in the state that we want them.

Agreed. The lists figure to be short, so copying them shouldn't be very
expensive, and it's better to do that in all cases than to leave
shared-substructure hazards around for future patch authors to worry
about.

Committed Ashutosh's version of the patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types
Hi

I am sending an updated version - the changes fix the review comments.

2016-02-19 10:41 GMT+01:00 Artur Zakirov:
> It seems all fixes are done. I tested the patch and regression tests
> passed.
>
> I think here Alvaro means that you should keep the original comment
> without the ROW. Like this:
>
> /* XXX perhaps allow REC here? */

I tried rewording this comment.

>>> By the way, these functions are misnamed after this patch. They are
>>> called "wordtype" and "cwordtype" originally because they accept
>>> "word%TYPE" and "compositeword%TYPE", but after the patch they not
>>> only accept TYPE at the right of the percent sign but also
>>> ELEMENTTYPE and ARRAYTYPE. Not sure that this is something we want
>>> to be too strict about.
>>
>> Understand - used name ***reftype instead of ***type
>
> I am not sure, but it seems that the new names are a little worse. I
> think the original names are good too. They accept a word and return
> the PLpgSQL_type structure.

The word "TYPE" in these names was related to the %TYPE syntax, and
because the new syntax allows more constructs, the renaming is correct,
I think. But choosing names is hard work. The new names show a little
more strongly the relation to working with referenced types.

> I noticed a little typo in the comment in derive_type():
>
> /* Return base_type, when it is a array already */
>
> should be:
>
> /* Return base_type, when it is an array already */

fixed

Regards

Pavel

> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 9786242..140c81f
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
*** SELECT merge_fields(t.*) FROM table1 t W
*** 710,715 ****
--- 710,792 ----
+
+    Array Types
+
+       variable%ARRAYTYPE
+
+    %ARRAYTYPE provides the array type from a variable or
+    table column.
+    %ARRAYTYPE is particularly valuable in polymorphic
+    functions, since the data types needed for internal variables can
+    change from one call to the next. Appropriate variables can be
+    created by applying %ARRAYTYPE to the function's
+    arguments or result placeholders:
+
+ CREATE OR REPLACE FUNCTION array_init(v anyelement, size integer)
+ RETURNS anyarray AS $$
+ DECLARE
+     result v%ARRAYTYPE DEFAULT '{}';
+ BEGIN
+     -- prefer builtin function array_fill
+     FOR i IN 1 .. size
+     LOOP
+         result := result || v;
+     END LOOP;
+     RETURN result;
+ END;
+ $$ LANGUAGE plpgsql;
+
+ SELECT array_init(0::numeric, 10);
+ SELECT array_init(''::varchar, 10);
+
+    Array Element Types
+
+       variable%ELEMENTTYPE
+
+    %ELEMENTTYPE provides the element type of a given
+    array. %ELEMENTTYPE is particularly valuable in
+    polymorphic functions, since the data types needed for internal
+    variables can change from one call to the next. Appropriate
+    variables can be created by applying %ELEMENTTYPE
+    to the function's arguments or result placeholders:
+
+ CREATE OR REPLACE FUNCTION bubble_sort(a anyarray)
+ RETURNS anyarray AS $$
+ DECLARE
+     aux a%ELEMENTTYPE;
+     repeat_again boolean DEFAULT true;
+ BEGIN
+     -- Don't use this code for large arrays!
+     -- use builtin sort
+     WHILE repeat_again
+     LOOP
+         repeat_again := false;
+         FOR i IN array_lower(a, 1) .. array_upper(a, 1)
+         LOOP
+             IF a[i] > a[i+1] THEN
+                 aux := a[i+1];
+                 a[i+1] := a[i];
+                 a[i] := aux;
+                 repeat_again := true;
+             END IF;
+         END LOOP;
+     END LOOP;
+     RETURN a;
+ END;
+ $$ LANGUAGE plpgsql;
+
     Collation of PL/pgSQL Variables

diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
new file mode 100644
index ebe152d..07569f7
*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
*** plpgsql_parse_tripword(char *word1, char
*** 1617,1632 ****
  		return false;
  	}

  /* --
! * plpgsql_parse_wordtype	The scanner found word%TYPE. word can be
! *				a variable name or a basetype.
  *
  * Returns datatype struct, or NULL if no match found for word.
  * --
  */
  PLpgSQL_type *
! plpgsql_parse_wordtype(char *ident)
  {
  	PLpgSQL_type *dtype;
  	PLpgSQL_nsitem *nse;
--- 1617,1683 ----
  		return false;
  	}

+ /*
+  * Derive type from any base type controlled by reftype.
+  *
+  * This routine allows taking an array type from an array variable. This
+  * behavior is not consistent with enforce_generic_type_consistency, where
+  * the same task fails with ERROR: 42704: could not find array type for
+  * data type xxx[].
+  */
+ static PLpgSQL_type *
+ derive_type(PLpgSQL_type *base_type, PLpgSQL_reftype reftype)
+ {
+ 	Oid			typoid;
+
+ 	switch (reftype)
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres,

[...]

>> I do think that this whole writeback logic really does make sense
>> *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's
> shortly after each other. For latency outside checkpoints it's
> important to control the total amount of dirty buffers, and that's
> obviously independent of tablespaces.

I do not understand/buy this argument.

The underlying IO queue is per device, and tablespaces should be per
device as well (otherwise what is the point?), so you should want to
coalesce and "writeback" pages per device as well. Calls to
sync_file_range on distinct devices should probably be issued more or
less randomly, and should not interfere with one another.

If you use just one context, the more tablespaces there are, the
smaller the performance gain, because there is less and less
aggregation, thus fewer sequential writes, per device.

So for me there should really be one context per tablespace. That would
suggest a hashtable or some other structure to keep and retrieve them,
which would not be that bad, and I think that it is what is needed.

Note: I think that an easy way to do that in the "checkpoint sort"
patch is simply to keep a WritebackContext in the CkptTsStatus
structure, which is per tablespace in the checkpointer.

For bgwriter & backends it can wait: there is little "writeback"
coalescing because the IO should be pretty random, so it does not
matter much.

--
Fabien.