Re: [HACKERS] Writing new unit tests with PostgresNode
At 2016-02-22 07:45:37 +, e...@xs4all.nl wrote:
> I think what's needed is:
>
>     use 5.008_008;

That is meant to fail if the code is run on a version of Perl older than 5.8.8, not fail if the code uses language features not present in 5.8.8. It is little help for those trying to maintain backwards compatibility.

-- 
Abhijit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Writing new unit tests with PostgresNode
>> - All the core routines used should be compatible down to perl 5.8.8.
>
> Ugh. So not just Perl, ancient perl.
>
> I don't suppose Perl offers any kind of "compatible(5.8.8)" statement or
> anything? Do I have to compile a ten-year-old Perl and its dependencies
> to work on PostgreSQL tests?
>
> http://search.cpan.org/dist/Perl-MinimumVersion/lib/Perl/MinimumVersion.pm
> looks useful; do you think it's reasonable for code that passes that
> check to just be thrown at the buildfarm?

I think what's needed is:

    use 5.008_008;

( see: http://perldoc.perl.org/functions/use.html )
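For reference, a minimal sketch of the kind of check being suggested here, assuming the CPAN module Perl::MinimumVersion is installed (it is not part of core Perl), pointed at a hypothetical test file name:

```perl
# Report the minimum Perl version a test script appears to require,
# via the CPAN module Perl::MinimumVersion discussed in this thread.
# 't/001_sometest.pl' is a hypothetical file name.
use strict;
use warnings;
use Perl::MinimumVersion;

my $pmv = Perl::MinimumVersion->new('t/001_sometest.pl')
    or die "could not parse file";
print "minimum version: ", $pmv->minimum_version, "\n";

# By contrast, a pragma like the following only makes the script die
# on perls OLDER than 5.8.8; it does not detect use of newer language
# features, which is the limitation noted above:
#   use 5.008_008;
```

This is a static (PPI-based) analysis, so it catches newer syntax but not, for example, a core module routine that only grew a given behavior in a later release.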
Re: [HACKERS] Writing new unit tests with PostgresNode
On Mon, Feb 22, 2016 at 4:24 PM, Craig Ringer wrote:
> On 22 February 2016 at 14:30, Michael Paquier wrote:
>> - All the core routines used should be compatible down to perl 5.8.8.
>
> Ugh. So not just Perl, ancient perl.
>
> I don't suppose Perl offers any kind of "compatible(5.8.8)" statement or
> anything? Do I have to compile a ten-year-old Perl and its dependencies to
> work on PostgreSQL tests?
> http://search.cpan.org/dist/Perl-MinimumVersion/lib/Perl/MinimumVersion.pm
> looks useful; do you think it's reasonable for code that passes that check
> to just be thrown at the buildfarm?

No, I don't use those old versions :) Module::CoreList->first_release is one way to do that for a module:

$ perl -MModule::CoreList -e '
    print Module::CoreList->first_release('Test::More'), "\n";'
5.006002

Last time I was dealing with that I had as well a look at http://perldoc.perl.org/, which was quite helpful by browsing through each version.

>>> Sound about right? I can tidy that up a bit and turn it into a README
>>> and add a reference to that to the public tap docs to tell users where
>>> to go if they want to write more tests.
>>
>> Yes, please.
>
> Will do that now.

This is definitely independent from the efforts of the other patches.

>>> I don't know how many suites we'll want - whether it'll be desirable to
>>> have a few suites with lots of tests or to have lots of suites with just
>>> a few tests. I'm planning on starting by adding a suite named
>>> 'replication' and putting some tests for failover slots in there.
>>> Reasonable?
>>
>> It seems to me that the failover slot tests should be part of the
>> recovery test suite I have proposed already. Those are located in
>> src/test/recovery, introducing as well a set of routines in
>> PostgresNode.pm that allows one to pass options to PostgresNode::init
>> to enable archive or streaming on a given node. I would believe that
>> any replication suite is going to need that, and I have done a bunch
>> of legwork to make sure that this works on Windows as well.
>
> Not committed yet, I see. That's https://commitfest.postgresql.org/9/438/
> right?

Yeah... That's life.

> I'd rather not duplicate your work there, so I should build on that. Is
> there a public git tree for that?

The latest set of patches is here:
https://github.com/michaelpq/postgres/commits/recovery_regression

Those are the ones I sent a couple of days back on -hackers. Hopefully those will get integrated into the core code.

-- 
Michael
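As a rough illustration of what a replication test built on those proposed PostgresNode::init options might look like — the option and helper names here (allows_streaming, backup, init_from_backup, has_streaming) are guesses based on the description in this thread, not a committed API:

```perl
# Hypothetical sketch of a streaming-replication TAP test using the
# init() options described above. Assumes the in-tree PostgresNode and
# TestLib modules from a PostgreSQL source tree are on PERL5LIB, and
# that the recovery-suite patches are applied.
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Master initialized with streaming replication enabled.
my $master = get_new_node('master');
$master->init(allows_streaming => 1);
$master->start;

# Standby created from a base backup of the master.
$master->backup('my_backup');
my $standby = get_new_node('standby');
$standby->init_from_backup($master, 'my_backup', has_streaming => 1);
$standby->start;

# Write on the master, then check the row shows up on the standby.
$master->psql('postgres', 'CREATE TABLE t AS SELECT 1 AS x');
sleep 2;    # crude; a real test would poll for WAL replay instead
my $ret = $standby->psql('postgres', 'SELECT x FROM t');
is($ret, '1', 'row replicated to standby');
```

The fixed sleep is only for brevity; a robust version would compare the master's WAL location against the standby's replay location in a retry loop.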
Re: [HACKERS] Writing new unit tests with PostgresNode
On 22 February 2016 at 14:30, Michael Paquier wrote:

> +1. I will be happy to contribute into that.

Thanks. Hopefully what I wrote will be useful as a starting point. As I read through the modules I'll write some draft Perldoc too.

>> The docs at http://www.postgresql.org/docs/devel/static/regress-tap.html
>> mention the use of tap for client programs but don't have anything on
>> writing/using the framework. It doesn't seem to be very close to /
>> compatible with http://pgtap.org/ unless I'm missing something obvious.
>>
>> I want to add tests for failover slots but I'm not sure where to put
>> them or how to set up the test skeleton.
>
> src/test/modules

Could use a README to explain what the directory is for. It looks like it contains a bunch of the extensions that used to be in contrib/. So that'd be where I'd want to put any extension needed as part of failover slots testing, but not the Perl test control code for setting up the nodes, etc. This testing cannot usefully be done with pg_regress and the part that can be is already added to contrib/test_decoding.

> A couple of other things that I am aware of:
> - including strict and warnings is as well mandatory to ensure that
> test grammar is sane, and the tests written should have no warnings.

Yep, did that.

> - All the core routines used should be compatible down to perl 5.8.8.

Ugh. So not just Perl, ancient perl.

I don't suppose Perl offers any kind of "compatible(5.8.8)" statement or anything? Do I have to compile a ten-year-old Perl and its dependencies to work on PostgreSQL tests?

http://search.cpan.org/dist/Perl-MinimumVersion/lib/Perl/MinimumVersion.pm looks useful; do you think it's reasonable for code that passes that check to just be thrown at the buildfarm?

> - The code produced should be sanitized with perltidy before commit.

Makes sense. Keeps things consistent.

>> The suite should be added to src/test/Makefile's ALWAYS_SUBDIRS entry.
>
> SUBDIRS is more suited for tests that do not represent a security
> risk. The ssl suite is part of ALWAYS_SUBDIRS because it enforces the
> use of 127.0.0.1.

Ok, that makes sense.

>> Sound about right? I can tidy that up a bit and turn it into a README
>> and add a reference to that to the public tap docs to tell users where
>> to go if they want to write more tests.
>
> Yes, please.

Will do that now.

>> I don't know how many suites we'll want - whether it'll be desirable to
>> have a few suites with lots of tests or to have lots of suites with just
>> a few tests. I'm planning on starting by adding a suite named
>> 'replication' and putting some tests for failover slots in there.
>> Reasonable?
>
> It seems to me that the failover slot tests should be part of the
> recovery test suite I have proposed already. Those are located in
> src/test/recovery, introducing as well a set of routines in
> PostgresNode.pm that allows one to pass options to PostgresNode::init
> to enable archive or streaming on a given node. I would believe that
> any replication suite is going to need that, and I have done a bunch
> of legwork to make sure that this works on Windows as well.

Not committed yet, I see. That's https://commitfest.postgresql.org/9/438/ right?

I'd rather not duplicate your work there, so I should build on that. Is there a public git tree for that?

-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add generate_series(date,date) and generate_series(date,date,integer)
On Mon, Feb 22, 2016 at 6:52 AM, Tom Lane wrote:
> Oooh ... actually, that works today if you consider the SRF-in-targetlist
> case:
>
> regression=# select generate_series(now(), 'infinity', '1 day') limit 10;
>        generate_series
> -------------------------------
>  2016-02-21 16:51:03.303064-05
>  2016-02-22 16:51:03.303064-05
>  2016-02-23 16:51:03.303064-05
>  2016-02-24 16:51:03.303064-05
>  2016-02-25 16:51:03.303064-05
>  2016-02-26 16:51:03.303064-05
>  2016-02-27 16:51:03.303064-05
>  2016-02-28 16:51:03.303064-05
>  2016-02-29 16:51:03.303064-05
>  2016-03-01 16:51:03.303064-05
> (10 rows)
>
> Time: 8.457 ms
>
> Given that counterexample, I think we not only shouldn't back-patch such a
> change but should reject it altogether.

Ouch, good point. The overflows are a different problem that we had better address though (still on my own TODO list)...

-- 
Michael
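Tom's counterexample could also be exercised from the TAP infrastructure discussed elsewhere in this digest; a hypothetical sketch (assumes the in-tree PostgresNode/TestLib modules from a PostgreSQL source tree):

```perl
# Probe the SRF-in-targetlist behaviour: LIMIT stops the otherwise
# endless series, which is what the counterexample above relies on.
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

my $node = get_new_node('master');
$node->init;
$node->start;

# Wrap the targetlist SRF in a subquery so we can count the rows that
# actually come out once LIMIT cuts the series off.
my $ret = $node->psql('postgres',
    "SELECT count(*) FROM (SELECT generate_series(now(), 'infinity', '1 day') LIMIT 10) s");
is($ret, '10', 'endless generate_series is usable under LIMIT');

$node->teardown_node;
```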
Re: [HACKERS] about google summer of code 2016
On 02/21/16 23:10, Tom Lane wrote:
> Another variable is that your answers might depend on what format you
> assume the client is trying to convert from/to. (It's presumably not
> text JSON, but then what is it?)

This connects tangentially to a question I've been meaning to ask for a while, since I was looking at the representation of XML.

As far as I can tell, XML is simply stored in its character-serialized representation (very likely compressed, if large enough to TOAST), and the text in/out methods simply deal in that representation. The 'binary' send/recv methods seem to differ only in possibly using a different character encoding on the wire.

Now, also as I understand it, there's no requirement that a type even /have/ binary send/recv methods. Text in/out it always needs, but send/recv only if they are interesting enough to buy you something. I'm not sure the XML send/recv really do buy anything. It is not as if they present the XML in any more structured or tokenized form. If they buy anything at all, it may be only an extra transcoding that the other end will probably immediately do in reverse.

So, if that's the situation, is there some other, really simple, choice for what XML send/recv might usefully do, that would buy more than what they do now? Well, PGLZ is in libpqcommon now, right? What if xml send wrote a flag to indicate compressed or not, and then if the value is compressed TOAST, streamed it right out as is, with no expansion on the server? I could see that being a worthwhile win, /without even having to devise some XML-specific encoding/. (XML has a big expansion ratio.)

And, since that idea is not inherently XML-specific ... does the JSONB representation have the same properties? How about even text or bytea?

The XML question has a related, JDBC-specific part. JDBC presents XML via interfaces that can deal in Source and Result objects, and these come in different flavors (DOMSource, an all-in-memory tree; SAXSource and StAXSource, both streaming tokenized forms; or StreamSource, a streaming, character-serialized form). Client code can ask for one of those forms explicitly, or use null to say it doesn't care. In the doesn't-care case, the driver is expected to choose the form closest to what it's got under the hood; the client can convert if necessary, and if it had any other preference, it would have said so.

For PGJDBC, that choice would naturally be the character StreamSource, because that /is/ the form it's got under the hood, but for reasons mysterious to me, pgjdbc actually chooses DOMSource in the don't-care case, and then expends the full effort of turning the serialized stream it does have into a full in-memory DOM that the client hasn't asked for and might not even want.

I know this is more a PGJDBC question, but I mention it here just because it's so much like the what-should-send/recv-do question, repeated at another level.

-Chap
Re: [HACKERS] Writing new unit tests with PostgresNode
On Mon, Feb 22, 2016 at 2:19 PM, Craig Ringer wrote:
> I've been taking a look at the Perl test infrastructure ( src/test/perl )
> for writing multi-node tests, starting with PostgresNode.pm and I have a
> few comments based on my first approach to the code "cold".
>
> I think a README in src/test/perl/ would be very helpful. It's currently
> not at all clear how to go about using the new test infrastructure when
> first looking at it, particularly adding new test suites. Also a summary
> of available test facilities and some short examples or pointers to where
> they can be found ( src/bin/, src/test/ssl, etc).

+1. I will be happy to contribute into that.

> The docs at http://www.postgresql.org/docs/devel/static/regress-tap.html
> mention the use of tap for client programs but don't have anything on
> writing/using the framework. It doesn't seem to be very close to /
> compatible with http://pgtap.org/ unless I'm missing something obvious.
>
> I want to add tests for failover slots but I'm not sure where to put them
> or how to set up the test skeleton. So I went looking, and I could use
> confirmation that the below is roughly right so I can write it up for
> some quickstart docs.

src/test/modules

> It *looks* like one needs to create a Makefile with:
>
> subdir = src/test/mytests
> top_builddir = ../../..
> include $(top_builddir)/src/Makefile.global
>
> check:
>     $(prove_check)
>
> clean:
>     rm -rf ./tmp_check

Yep.

> Then create a subdir src/test/mytests/t and in it put files named
> 001_sometest.pl etc. Each of which should import PostgresNode and
> generally then create new instances indirectly via the
> PostgresNode::get_new_node function rather than by using the constructor.
> Then init and start the server and run tests. Like:

A couple of other things that I am aware of:
- including strict and warnings is as well mandatory to ensure that
test grammar is sane, and the tests written should have no warnings.
- All the core routines used should be compatible down to perl 5.8.8.
- The code produced should be sanitized with perltidy before commit.

> use strict;
> use warnings;
> use PostgresNode;
> use TestLib;
> use Test::More tests => 1;
>
> diag 'Setting up node';
> my $node = get_new_node('master');
> $node->init;
> $node->start;
>
> my $ret = $node->psql('postgres', 'SELECT 1;');
> is($ret, '1', 'simple SELECT');
>
> $node->teardown_node;

Teardown is not mandatory at the end, though I would recommend having it, PostgresNode::DESTROY enforcing the nodes to stop at the end of a test, and the temporary paths are unconditionally removed (those should not be removed in case of a failure honestly).

> Knowledge of perl's Test::More is required since most of the suite is
> built on it.
>
> The suite should be added to src/test/Makefile's ALWAYS_SUBDIRS entry.

SUBDIRS is more suited for tests that do not represent a security risk. The ssl suite is part of ALWAYS_SUBDIRS because it enforces the use of 127.0.0.1.

> Sound about right? I can tidy that up a bit and turn it into a README and
> add a reference to that to the public tap docs to tell users where to go
> if they want to write more tests.

Yes, please.

> I don't know how many suites we'll want - whether it'll be desirable to
> have a few suites with lots of tests or to have lots of suites with just
> a few tests. I'm planning on starting by adding a suite named
> 'replication' and putting some tests for failover slots in there.
> Reasonable?

It seems to me that the failover slot tests should be part of the recovery test suite I have proposed already. Those are located in src/test/recovery, introducing as well a set of routines in PostgresNode.pm that allows one to pass options to PostgresNode::init to enable archive or streaming on a given node. I would believe that any replication suite is going to need that, and I have done a bunch of legwork to make sure that this works on Windows as well.

> (BTW, the ssl tests don't seem to use a bunch of the facilities provided
> by PostgresNode, instead rolling their own, so they don't serve as a good
> example.)

Yep.

-- 
Michael
Re: [HACKERS] proposal: make NOTIFY list de-duplication optional
On Sat, Feb 20, 2016 at 2:00 PM, Filip Rembiałkowski wrote:
> I was stuck because both syntaxes have their ugliness. NOTIFY allows the
> payload to be NULL:
>
> NOTIFY chan01;
>
> How would this look like in "never" mode?
>
> NOTIFY chan01, NULL, 'never'; -- seems very cryptic.

The docs say: "The information passed to the client for a notification event includes the notification channel name, the notifying session's server process PID, and the payload string, which is an empty string if it has not been specified."

So a missing payload is not a SQL NULL but an empty string. This means you would have:

NOTIFY chan01;
NOTIFY chan01, '';          -- same as above
NOTIFY chan01, '', 'maybe'; -- same as above
NOTIFY chan01, '', 'never'; -- send this all the time

Seems ok to me.
Re: [HACKERS] postgres_fdw vs. force_parallel_mode on ppc
On Tue, Feb 16, 2016 at 12:12 PM, Noah Misch wrote:
> On Mon, Feb 15, 2016 at 06:07:48PM -0500, Tom Lane wrote:
>> Noah Misch writes:
>>> I configured a copy of animal "mandrill" that way and launched a test
>>> run. The postgres_fdw suite failed as attached. A manual "make -C
>>> contrib installcheck" fails the same way on a ppc64 GNU/Linux box, but
>>> it passes on x86_64 and aarch64. Since contrib test suites don't
>>> recognize TEMP_CONFIG, check-world passes everywhere.
>>
>> Hm, is this with or without the ppc-related atomics fix you just found?
>
> Without those. The ppc64 GNU/Linux configuration used gcc, though, and the
> atomics change affects xlC only. Also, the postgres_fdw behavior does not
> appear probabilistic; it failed twenty times in a row.

The postgres_fdw failure is a visibility-of-my-own-uncommitted-work problem. The first command in a transaction updates a row via an FDW, and then the SELECT expects to see the effects, but when run in a background worker it creates a new FDW connection that can't see the uncommitted UPDATE. I wonder if parallelism of queries involving an FDW should not be allowed if your transaction has written through the FDW.

-- 
Thomas Munro
http://www.enterprisedb.com
[HACKERS] Writing new unit tests with PostgresNode
Hi all

I've been taking a look at the Perl test infrastructure ( src/test/perl ) for writing multi-node tests, starting with PostgresNode.pm, and I have a few comments based on my first approach to the code "cold".

I think a README in src/test/perl/ would be very helpful. It's currently not at all clear how to go about using the new test infrastructure when first looking at it, particularly adding new test suites. Also a summary of available test facilities and some short examples or pointers to where they can be found ( src/bin/, src/test/ssl, etc).

The docs at http://www.postgresql.org/docs/devel/static/regress-tap.html mention the use of tap for client programs but don't have anything on writing/using the framework. It doesn't seem to be very close to / compatible with http://pgtap.org/ unless I'm missing something obvious.

I want to add tests for failover slots but I'm not sure where to put them or how to set up the test skeleton. So I went looking, and I could use confirmation that the below is roughly right so I can write it up for some quickstart docs.

It *looks* like one needs to create a Makefile with:

    subdir = src/test/mytests
    top_builddir = ../../..
    include $(top_builddir)/src/Makefile.global

    check:
            $(prove_check)

    clean:
            rm -rf ./tmp_check

Then create a subdir src/test/mytests/t and in it put files named 001_sometest.pl etc. Each of which should import PostgresNode and generally then create new instances indirectly via the PostgresNode::get_new_node function rather than by using the constructor. Then init and start the server and run tests. Like:

    use strict;
    use warnings;
    use PostgresNode;
    use TestLib;
    use Test::More tests => 1;

    diag 'Setting up node';
    my $node = get_new_node('master');
    $node->init;
    $node->start;

    my $ret = $node->psql('postgres', 'SELECT 1;');
    is($ret, '1', 'simple SELECT');

    $node->teardown_node;

Knowledge of perl's Test::More is required since most of the suite is built on it.

The suite should be added to src/test/Makefile's ALWAYS_SUBDIRS entry.

Sound about right? I can tidy that up a bit and turn it into a README and add a reference to that to the public tap docs to tell users where to go if they want to write more tests.

I don't know how many suites we'll want - whether it'll be desirable to have a few suites with lots of tests or to have lots of suites with just a few tests. I'm planning on starting by adding a suite named 'replication' and putting some tests for failover slots in there. Reasonable?

(BTW, the ssl tests don't seem to use a bunch of the facilities provided by PostgresNode, instead rolling their own, so they don't serve as a good example.)

-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Handling changes to default type transformations in PLs
> 4) Create a mechanism for specifying default TRANSFORMs for a PL, and
> essentially "solve" these issues by supplying a built-in transform.
>
> I think default transforms (4) are worth doing no matter what. Having to
> manually remember to add potentially multiple TRANSFORMs is a PITA. But
> I'm not sure TRANSFORMS would actually fix all issues. For example, you
> can't specify a transform for an array type, so this probably wouldn't
> work for one of the plpython problems.

Yesterday I wrote some docs about TRANSFORMs, and it is not possible. The transformation cannot be transparent to a PL function, because you have to know what your parameters are inside the function, and whether these parameters are original or transformed. This is not an issue for "smart" object types, but generally it cannot be ensured for basic types. So a default transformation is not a good idea. You can have a transform for the C language, and it is really different from Python's.

Maybe the concept of TRANSFORMs is too general, and some PL-specific extensions would be better. That needs some concept of persistent options that can be used in these extensions:

CREATE OR REPLACE FUNCTION ... LANGUAGE plpython WITH OPTIONS ('transform_dictionary', ...)

Execution should be stopped if any option is not processed. With such extensions you can do anything, and it can work (I hope). But the complexity of our PLs will be significantly higher, and then Tom's proposal may be better. It can help with faster adoption of new features, and it is a relatively simple solution.

Regards

Pavel
Re: [HACKERS] Handling changes to default type transformations in PLs
>> 3) Add the concept of PL API versions. This would allow users to specify
>
> So that leaves #3, which doesn't seem all that unreasonable from here.
> We don't have a problem with bundling a bunch of unrelated changes
> into any one major PG revision. The scripting languages we're talking
> about calling do similar things. So why not for the semantics of the
> glue layer?
>
> It seems like you really need to be able to specify this at the
> per-function level, which makes me think that specifying
> "LANGUAGE plpython_2" or "LANGUAGE plperl_3" might be the right
> kind of API.

I am not a big fan of this proposal. Users usually want to choose only some preferred features, and this design may have too little granularity.

Objections:

* Usually the keyword REVISION is used, so the syntax could be LANGUAGE plpython REVISION 3. It is more readable. You need to specify a preferred revision for any language, and the revision is persistent. The behavior is the same as in Tom's proposal, but I hope this is more readable and understandable.

regards

Pavel
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Feb 3, 2016 at 8:59 AM, Robert Haas wrote:
> On Tue, Feb 2, 2016 at 10:27 PM, Amit Kapila wrote:
>> So, let's leave adding any additional column, but Alexander has brought
>> up a good point about storing the wait_type and actual wait_event
>> information into four bytes. Currently I have stored wait_type (aka
>> classId) in first byte and then two bytes for wait_event (eventId) and
>> remaining one-byte can be used in future if required, however Alexander
>> is proposing to combine both these (classId and eventId) into two-bytes
>> which sounds reasonable to me apart from the fact that it might add
>> operation or two extra in this path. Do you or anyone else have any
>> preference over this point?
>
> I wouldn't bother tinkering with it at this point. The value isn't
> going to be recorded on disk anywhere, so it will be easy to change
> the way it's computed in the future if we ever need to do that.

Okay. Find the rebased patch attached with this mail. I have moved this patch to upcoming CF.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

extend_pg_stat_activity_v10.patch
Description: Binary data
Re: [HACKERS] How are CREATE EXTENSION ... VERSION or ALTER EXTENSION ... UPDATE TO ... intended to work?
On 02/16/16 22:44, Tom Lane wrote:
> Chapman Flack writes:
>> been several releases of an extension, and the extensions directory
>> is now populated with a bunch of files quux--1.0.sql, quux--1.1.sql,
>> quux--1.0--1.1.sql, quux--1.1--1.0.sql, ..., quux.control.
>> And somewhere in $libdir there are quux-1.0.so, quux-1.1.so.
>
> Well, at least so far as the existing extensions in contrib are concerned,
> there are *not* version numbers in the .so filenames. This means you
> can't have more than one version of the .so installed at once, but we've
> not really found a need for that. It's usually feasible --- and desirable
> --- to keep ABI compatibility to the extent that the new .so can be
> swapped in for the old without needing to change the SQL function
> definitions immediately.

It's true enough that in PL/Java's case, the ABIs / interfaces between the SQL function definitions and their implementations in the .so have been quite stable for years, so there might be no immediate problem.

On the other hand, other details of the implementation (bugs come to mind) do change ... letting a version-specific CREATE EXTENSION load an unversioned .so could lead to surprises in that area, because it would /appear to succeed/, meaning pg_extension.extversion would get changed to what you /thought/ you had loaded (and what you would probably answer, if asked "what version are you reporting this bug against?"), while what's actually running could be different, and if I squint I see diagnostic headaches lying not far around that bend.

> into the field. Avoiding MODULE_PATHNAME in favor of writing out
> the versioned .so name in the .sql files is probably the path of
> least resistance.

Agreed ... for once maybe I'll follow it.

-Chap
Re: [HACKERS] Handling changes to default type transformations in PLs
Jim Nasby writes:
> Some of our PLs have the unfortunate problem of making a weak effort
> with converting types to and from the PL and Postgres. For example,
> plpythonu will correctly transform a complex type to a dict and an array
> to a list, but it punts back to text for an array contained inside a
> complex type. I know plTCL has similar issues; presumably other PLs are
> affected as well.
>
> While it's a SMOC to fix this, what's not simple is the backwards
> compatibility: users that are currently using these types are expecting
> to be handed strings created by the type's output function, so we can't
> just drop these changes in without breaking user code.
>
> It might be possible to work around this with TRANSFORMs, but that's
> just ugly: first you'd have to write a bunch of non-trivial C code, then
> you'd need to forever remember to specify TRANSFORM FOR TYPE blah.
>
> Some ways to handle this:
>
> 1) Use a PL-specific GUC for each case where we need backwards
> compatibility. For plpython we'd need 2. plTCL would need 1 or 2.
>
> 2) Use a single all-or-nothing GUC. Downside is that if we later decide
> to expand automatic conversion again we'd need yet another GUC.
>
> 3) Add the concept of PL API versions. This would allow users to specify
> what range of API versions they support. I think this would have been
> helpful with the plpython elog() patch.
>
> 4) Create a mechanism for specifying default TRANSFORMs for a PL, and
> essentially "solve" these issues by supplying a built-in transform.
>
> I think default transforms (4) are worth doing no matter what. Having to
> manually remember to add potentially multiple TRANSFORMs is a PITA. But
> I'm not sure TRANSFORMS would actually fix all issues. For example, you
> can't specify a transform for an array type, so this probably wouldn't
> work for one of the plpython problems.
>
> 3 is interesting, but maybe it would be bad to tie multiple unrelated
> API changes together.
>
> So I'm leaning towards 1. It means potentially adding a fair number of
> new GUCs, but these would all be custom GUCs, so maybe it's not that
> bad. The other downside to GUCs is I think it'd be nice to be able to
> set this at a schema level, which you can't currently do with GUCs.
>
> Thoughts?

I think harsh experience has taught us to distrust GUCs that change code semantics. So I'm not very attracted by option #1, much less option #2.

I'm not sure about option #4 --- it smells like it would have the same problems as a GUC, namely that it would be action-at-a-distance on the semantics of a PL function's arguments, with insufficient ability to control the scope of the effects.

So that leaves #3, which doesn't seem all that unreasonable from here. We don't have a problem with bundling a bunch of unrelated changes into any one major PG revision. The scripting languages we're talking about calling do similar things. So why not for the semantics of the glue layer?

It seems like you really need to be able to specify this at the per-function level, which makes me think that specifying "LANGUAGE plpython_2" or "LANGUAGE plperl_3" might be the right kind of API.

regards, tom lane
Re: [HACKERS] about google summer of code 2016
Heikki Linnakangas writes:
> On 19/02/16 10:10, Álvaro Hernández Tortosa wrote:
>> Oleg and I discussed recently that a really good addition to a GSoC
>> item would be to study whether it's convenient to have a binary
>> serialization format for jsonb over the wire.
>
> Seems a bit risky for a GSoC project. We don't know if a different
> serialization format will be a win, or whether we want to do it in the
> end, until the benchmarking is done. It's also not clear what we're
> trying to achieve with the serialization format: smaller on-the-wire
> size, faster serialization in the server, faster parsing in the client,
> or what?

Another variable is that your answers might depend on what format you assume the client is trying to convert from/to. (It's presumably not text JSON, but then what is it?)

Having said that, I'm not sure that risk is a blocking factor here. History says that a large fraction of our GSoC projects don't result in a commit to core PG. As long as we're clear that "success" in this project isn't measured by getting a feature committed, it doesn't seem riskier than any other one. Maybe it's even less risky, because there's less of the success condition that's not under the GSoC student's control.

regards, tom lane
[HACKERS] Handling changes to default type transformations in PLs
Some of our PLs have the unfortunate problem of making a weak effort with converting types to and from the PL and Postgres. For example, plpythonu will correctly transform a complex type to a dict and an array to a list, but it punts back to text for an array contained inside a complex type. I know plTCL has similar issues; presumably other PLs are affected as well. While it's a SMOC to fix this, what's not simple is the backwards compatibility: users that are currently using these types are expecting to be handed strings created by the type's output function, so we can't just drop these changes in without breaking user code. It might be possible to work around this with TRANSFORMs, but that's just ugly: first you'd have to write a bunch of non-trivial C code, then you'd need to forever remember to specify TRANSFORM FOR TYPE blah. Some ways to handle this: 1) Use a PL-specific GUC for each case where we need backwards compatibility. For plpython we'd need 2. plTCL would need 1 or 2. 2) Use a single all-or-nothing GUC. Downside is that if we later decide to expand automatic conversion again we'd need yet another GUC. 3) Add the concept of PL API versions. This would allow users to specify what range of API versions they support. I think this would have been helpful with the plpython elog() patch. 4) Create a mechanism for specifying default TRANSFORMs for a PL, and essentially "solve" these issues by supplying a built-in transform. I think default transforms (4) are worth doing no matter what. Having to manually remember to add potentially multiple TRANSFORMs is a PITA. But I'm not sure TRANSFORMS would actually fix all issues. For example, you can't specify a transform for an array type, so this probably wouldn't work for one of the plpython problems. 3 is interesting, but maybe it would be bad to tie multiple unrelated API changes together. So I'm leaning towards 1.
It means potentially adding a fair number of new GUCs, but these would all be custom GUCs, so maybe it's not that bad. The other downside to GUCs is I think it'd be nice to be able to set this at a schema level, which you can't currently do with GUCs. Thoughts? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
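To make the punt-to-text problem above concrete, here is an illustrative Python sketch (not plpython itself; the parse_pg_array helper and the sample row are hypothetical): when a composite value reaches plpythonu today, an array member arrives as the text produced by the array type's output function (e.g. '{1,2,3}'), so user code has to parse that string by hand — and that is exactly the code that would break if the PL suddenly started handing over a real list, which is why the compatibility knobs are being debated.

```python
def parse_pg_array(text):
    """Hand-rolled parser for a one-dimensional integer array literal
    such as '{1,2,3}' -- the kind of string user code currently gets
    for an array nested inside a composite type."""
    stripped = text.strip()
    if not (stripped.startswith('{') and stripped.endswith('}')):
        raise ValueError('not an array literal: %r' % text)
    inner = stripped[1:-1]
    return [int(tok) for tok in inner.split(',')] if inner else []

# User code written against today's behavior: the composite arrives as a
# dict, but its array member is still the text output-function form.
row = {'id': 42, 'tags': '{1,2,3}'}
tags = parse_pg_array(row['tags'])
print(tags)
```

If the PL were changed to convert the nested array to a list automatically, parse_pg_array would start raising on its input — the breakage that options 1-4 above try to let users opt out of.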
Re: [HACKERS] psql metaqueries with \gexec
On 2/19/16 7:32 PM, Corey Huinker wrote: Wow, and I thought *I* liked metaprogramming! :) I like what you've proposed, though I am wondering if you considered doing something server-side instead? It seems a shame to do all this work and exclude all other tools. I frequently find myself creating a function that is just a wrapper on EXECUTE for this purpose, but obviously that has transactional limitations. FWIW, I also wish we had something better than format() for this stuff. I did create [1] towards that end, but it currently depends on some C code, which is cumbersome. I am not sure that "gexec" is the right name for this command. Others considered were \execute_each, \meta, \gmeta, \geach, as well as adding a "<" parameter to the \g command. \gexec sounds fine to me. I would think \g < would be something done at the shell level... [1] https://github.com/decibel/trunklet-format/blob/master/doc/trunklet-format.asc -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Allow to specify (auto-)vacuum cost limits relative to the database/cluster size?
On 1/12/16 6:42 AM, Andres Freund wrote: Somehow computing the speed in relation to the cluster/database size is probably possible, but I wonder how we can do so without constantly re-computing something relatively expensive? ISTM relpages would probably be good enough for this, if we take the extra step of getting actual relation size when relpages is 0. I'm not sure a straight scale factor is the way to go though... it seems that might be problematic? I think we'd at least want a minimum default value; you certainly don't want even a small system running vacuum at 1kB/s... -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] about google summer of code 2016
On 19/02/16 10:10, Álvaro Hernández Tortosa wrote: Oleg and I discussed recently that a really good addition to a GSoC item would be to study whether it's convenient to have a binary serialization format for jsonb over the wire. Some argue this should be benchmarked first. So the scope for this project would be to benchmark and analyze the potential improvements and then agree on which format jsonb could be serialized to (apart from the current on-disk format, there are many json or nested k-v formats that could be used for sending over the wire). Seems a bit risky for a GSoC project. We don't know if a different serialization format will be a win, or whether we want to do it in the end, until the benchmarking is done. It's also not clear what we're trying to achieve with the serialization format: smaller on-the-wire size, faster serialization in the server, faster parsing in the client, or what? - Heikki
Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data
On 21 February 2016 at 23:18, Thomas Munro wrote: > On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown wrote: >> On 3 February 2016 at 10:46, Thomas Munro >> wrote: >>> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote >>> wrote: There seems to be a copy-pasto there - shouldn't that be: + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush) >>> >>> Indeed, thanks! New patch attached. >> >> I've given this a test drive, and it works exactly as described. > > Thanks for trying it out! > >> But one thing which confuses me is when a standby, with causal_reads >> enabled, has just finished starting up. I can't connect to it >> because: >> >> FATAL: standby is not available for causal reads >> >> However, this is the same message when I'm successfully connected, but >> it's lagging, and the primary is still waiting for the standby to >> catch up: >> >> ERROR: standby is not available for causal reads >> >> What is the difference here? The problem being reported appears to be >> identical, but in the first case I can't connect, but in the second >> case I can (although I still can't issue queries). > > Right, you get the error at login before it has managed to connect to > the primary, and for a short time after while it's in 'joining' state, > or potentially longer if there is a backlog of WAL to apply. > > The reason is that when causal_reads = on in postgresql.conf (as > opposed to being set for an individual session or role), causal reads > logic is used for snapshots taken during authentication (in fact the > error is generated when trying to take a snapshot slightly before > authentication proper begins, in InitPostgres).
I think that's a > desirable feature: if you have causal reads on and you create/alter a > database/role (for example setting a new password) and commit, and > then you immediately try to connect to that database/role on a standby > where you have causal reads enabled system-wide, then you get the > causal reads guarantee during authentication: you either see the > effects of your earlier transaction or you see the error. As you have > discovered, there is a small window after a standby comes up where it > will show the error because it hasn't got a lease yet so it can't let > you log in yet because it could be seeing a stale catalog (your user > may not exist on the standby yet, or have been altered in some way, or > your database may not exist yet, etc). > > Does that make sense? Ah, all clear. Yes, that makes sense now. I've been trying to break it the past few days, and this was the only thing which I wasn't clear on. The parameters all work as described. The replay_lag is particularly cool. Didn't think it was possible to glean this information on the primary, but the timings are correct in my tests. +1 for this patch. Looks like this solves the problem that semi-synchronous replication tries to solve, although arguably in a more sensible way. Thom
Re: [HACKERS] Relaxing SSL key permission checks
Stephen Frost writes: > Just to be clear, I'm not really against this patch as-is, but it > shouldn't be a precedent or limit us from supporting more permissive > permissions in other areas (or even here) if there are sensible > use-cases for more permissive permissions. OK, and to be clear, I'm not against considering other use-cases and trying to do something appropriate for them. I just reject the idea that it's unnecessary or inappropriate for us to be concerned about whether secret-holding files are secure. regards, tom lane
Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data
On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown wrote: > On 3 February 2016 at 10:46, Thomas Munro > wrote: >> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote >> wrote: >>> There seems to be a copy-pasto there - shouldn't that be: >>> >>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush) >> >> Indeed, thanks! New patch attached. > > I've given this a test drive, and it works exactly as described. Thanks for trying it out! > But one thing which confuses me is when a standby, with causal_reads > enabled, has just finished starting up. I can't connect to it > because: > > FATAL: standby is not available for causal reads > > However, this is the same message when I'm successfully connected, but > it's lagging, and the primary is still waiting for the standby to > catch up: > > ERROR: standby is not available for causal reads > > What is the difference here? The problem being reported appears to be > identical, but in the first case I can't connect, but in the second > case I can (although I still can't issue queries). Right, you get the error at login before it has managed to connect to the primary, and for a short time after while it's in 'joining' state, or potentially longer if there is a backlog of WAL to apply. The reason is that when causal_reads = on in postgresql.conf (as opposed to being set for an individual session or role), causal reads logic is used for snapshots taken during authentication (in fact the error is generated when trying to take a snapshot slightly before authentication proper begins, in InitPostgres). I think that's a desirable feature: if you have causal reads on and you create/alter a database/role (for example setting a new password) and commit, and then you immediately try to connect to that database/role on a standby where you have causal reads enabled system-wide, then you get the causal reads guarantee during authentication: you either see the effects of your earlier transaction or you see the error.
As you have discovered, there is a small window after a standby comes up where it will show the error because it hasn't got a lease yet so it can't let you log in yet because it could be seeing a stale catalog (your user may not exist on the standby yet, or have been altered in some way, or your database may not exist yet, etc). Does that make sense? -- Thomas Munro http://www.enterprisedb.com
[HACKERS] significant semi join overestimates (with CTEs)
Hi, while investigating a poor query plan choice, I've noticed suspicious semi join estimates looking like this:

    ->  Nested Loop  (cost=35.52..1787.31 rows=281653470 width=566)
          Buffers: shared hit=9305
          ->  HashAggregate  (cost=34.81..36.81 rows=200 width=32)
                Group Key: data.column1
                ->  CTE Scan on data  (cost=0.00..30.94 rows=1547 width=32)
          ->  Index Scan using t_pkey on t  (cost=0.71..8.74 rows=1 ...)
                Index Cond: ((a = data.column1) AND (...))
                Buffers: shared hit=9305

Notice the cardinality estimates - we expect the outer relation to have 200 rows, and for each of the 200 rows 1 matching row in the inner rel. Yet we estimate the join cardinality to be ~280M rows, which is rather excessive (it even exceeds the cartesian product cardinality). So, how could this happen? Quite easily, actually ... Consider a simple table

    CREATE TABLE t (a INT PRIMARY KEY, ...);

and this query:

    WITH data AS (VALUES (1), (2), (3), ... list of more IDs ...)
    SELECT * FROM t WHERE a IN (SELECT column1 FROM data);

This is a simple semi join - the planner will do a HashAggregate on the CTE to find the list of distinct values, and then it will do a lookup on the 't' table to find matching rows. There may be additional conditions, as suggested by the initial plan, but let's ignore that for now. Now, the CTE actually has 1547 values in it, but the HashAggregate is estimated to output just 200 values. Where does that come from? Turns out, when there are no stats for the relation (e.g. when it's a CTE), get_variable_numdistinct() behaves differently depending on the cardinality of the relation. Essentially, this happens:

    if (rows < DEFAULT_NUM_DISTINCT)
    {
        *isdefault = false;
        return rows;
    }
    else
    {
        *isdefault = true;
        return DEFAULT_NUM_DISTINCT;
    }

i.e. we cap the ndistinct estimate to DEFAULT_NUM_DISTINCT and for some reason return a different value for the 'isdefault' flag.
Then, eqjoinsel_semi() falls through to this part of the code, as there is no MCV for the CTE table:

    if (!isdefault1 && !isdefault2)
    {
        if (nd1 <= nd2 || nd2 < 0)
            selec = 1.0 - nullfrac1;
        else
            selec = (nd2 / nd1) * (1.0 - nullfrac1);
    }
    else
        selec = 0.5 * (1.0 - nullfrac1);

Assuming the other side of the join is a regular table with proper stats (isdefault=false), depending on the CTE cardinality we fall into different branches of the if condition. When rows < 200, we end up with isdefault=false and use e.g.

    selec = (nd2 / nd1) * (1.0 - nullfrac1);

which produces a sane estimate. But with rows >= 200, we suddenly switch to the other estimate

    selec = 0.5 * (1.0 - nullfrac1);

which completely ignores the ndistinct values and instead evaluates to 0.5 (when nullfrac1=0.0). This is the source of overestimates, because the actual join selectivity is way lower than 0.5 and we simply do

    rows = outer_rows * jselec;

in calc_joinrel_size_estimate() for JOIN_SEMI. It's not difficult to construct examples of those over-estimates - attached are two scripts with examples demonstrating the issue, differing in the scale of the over-estimates. It's also possible to construct examples for the inner if condition (see example-3.sql):

    if (nd1 <= nd2 || nd2 < 0)
        selec = 1.0 - nullfrac1;
    else
        selec = (nd2 / nd1) * (1.0 - nullfrac1);

but in this case we only switch between selec=1.0 and 0.5 (again, assuming nullfrac=0.0). While this sudden jump in estimates is not great, it can't result in such excessive estimates. Clearly, this use of CTEs is a bit unnecessary and it's trivial to replace it with a simple IN() clause. Moreover we can't really expect good estimates without stats, and I don't think changing DEFAULT_NUM_DISTINCT from one arbitrary value to some other arbitrary value would help much - we'd only see the strange plan changes with different numbers of values in the CTE. But perhaps we can do at least a bit better.
Firstly, we could clamp the join cardinality estimate to the size of cartesian product, eliminating at least the most obvious over-estimates. Secondly, I think it'd be nice to somehow eliminate the sudden jumps in the estimates - I'm not quite sure why we need the three different formulas in eqjoinsel_semi, so maybe we can get rid of some of them? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services Attachments: semijoin-example-1.sql, semijoin-example-2.sql, semijoin-example-3.sql
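The discontinuity described above can be sketched numerically. This is a simplified, illustrative Python model of the quoted logic from get_variable_numdistinct() and eqjoinsel_semi() — not the actual planner code; nullfrac and MCV handling are omitted, and the outer-side statistics are invented round numbers — showing the selectivity estimate jumping to 0.5 the moment the CTE reaches 200 rows:

```python
DEFAULT_NUM_DISTINCT = 200

def numdistinct_for_cte(rows):
    # No stats for a CTE: ndistinct is the row count, capped at
    # DEFAULT_NUM_DISTINCT -- and crossing the cap also flips 'isdefault'.
    if rows < DEFAULT_NUM_DISTINCT:
        return rows, False          # (ndistinct, isdefault)
    return DEFAULT_NUM_DISTINCT, True

def semi_join_selectivity(nd_outer, cte_rows):
    # Outer side of the semi join = table 't' with real stats;
    # inner side = the CTE (simplified: nullfrac = 0, no MCVs).
    nd1, isdefault1 = nd_outer, False
    nd2, isdefault2 = numdistinct_for_cte(cte_rows)
    if not isdefault1 and not isdefault2:
        if nd1 <= nd2:
            return 1.0
        return nd2 / nd1
    return 0.5                      # fallback once an ndistinct is a default

outer_rows, nd_outer = 563_000_000, 500_000_000
for cte_rows in (150, 199, 200, 1547):
    selec = semi_join_selectivity(nd_outer, cte_rows)
    # rows = outer_rows * jselec, as in calc_joinrel_size_estimate()
    print(cte_rows, selec, int(outer_rows * selec))
```

With 199 values the estimate stays proportional to nd2/nd1 (a few hundred rows); with 200 values it jumps to half the outer relation — the ~280M-row shape of the reported plan.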
Re: [HACKERS] Relaxing SSL key permission checks
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Christoph Berg writes: > > The attached patch has successfully been included in the 9.6 Debian > > package, passed the regression tests there, and I've also done some > > chmod/chown tests on the filesystem to verify it indeed catches the > > cases laid out by Tom. > > Please add this to the upcoming commitfest. We don't seem to have > enough consensus to just apply it, but perhaps the review process > will produce some agreement. Just to be clear, I'm not really against this patch as-is, but it shouldn't be a precedent or limit us from supporting more permissive permissions in other areas (or even here) if there are sensible use-cases for more permissive permissions. Thanks! Stephen
Re: [HACKERS] checkpointer continuous flushing - V18
ISTM that "progress" and "progress_slice" only depend on num_scanned and per-tablespace num_to_scan and total num_to_scan, so they are somehow redundant and the progress could be recomputed from the initial figures when needed. They don't cause much space usage, and we access the values frequently. So why not store them? The same question would work the other way around: these values are one division away, why not compute them when needed? No big deal. [...] Given realistic amounts of memory the max potential "skew" seems fairly small with float8. If we ever flush one buffer "too much" for a tablespace it's pretty much harmless. I do agree. I'm suggesting that a comment should be added to justify why float8 accuracy is okay. I see a binary_heap_allocate but no corresponding deallocation, this looks like a memory leak... or is there some magic involved? Hm. I think we really should use a memory context for all of this - we could after all error out somewhere in the middle... I'm not sure that a memory context is justified here, there are only two mallocs and the checkpointer works for very long times. I think that it is simpler to just get the malloc/free right. [...] I'm not arguing for ripping it out, what I mean is that we don't set a nondefault value for the GUCs on platforms with just posix_fadvise available... Ok with that. -- Fabien.
Re: [HACKERS] Add generate_series(date,date) and generate_series(date,date,integer)
Christoph Berg writes: > Re: David Fetter 2016-01-26 <20160126180011.ga16...@fetter.org> >> +1 for back-patching. There's literally no case where an infinite >> input could be correct as the start or end of an interval for >> generate_series. > select * from generate_series(now(), 'infinity', '1 day') limit 10; > ... seems pretty legit to me. If limit pushdown into SRFs happened to > work some day, it'd be a pity if the above query raised an error. Oooh ... actually, that works today if you consider the SRF-in-targetlist case:

    regression=# select generate_series(now(), 'infinity', '1 day') limit 10;
            generate_series
    -------------------------------
     2016-02-21 16:51:03.303064-05
     2016-02-22 16:51:03.303064-05
     2016-02-23 16:51:03.303064-05
     2016-02-24 16:51:03.303064-05
     2016-02-25 16:51:03.303064-05
     2016-02-26 16:51:03.303064-05
     2016-02-27 16:51:03.303064-05
     2016-02-28 16:51:03.303064-05
     2016-02-29 16:51:03.303064-05
     2016-03-01 16:51:03.303064-05
    (10 rows)

    Time: 8.457 ms

Given that counterexample, I think we not only shouldn't back-patch such a change but should reject it altogether. regards, tom lane
Re: [HACKERS] checkpointer continuous flushing - V18
[...] I do think that this whole writeback logic really does make sense *per table space*, Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces. I do not understand/buy this argument. The underlying IO queue is per device, and table spaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. Calling sync_file_range on distinct devices should probably be issued more or less randomly, and should not interfere one with the other. The kernel's dirty buffer accounting is global, not per block device. Sure, but this is not my point. My point is that "sync_file_range" moves buffers to the device io queues, which are per device. If there is one queue in pg and many queues on many devices, the whole point of coalescing to get sequential writes is somehow lost. It's also actually rather common to have multiple tablespaces on a single block device. Especially if SANs and such are involved; where you don't even know which partitions are on which disks. Ok, some people would not benefit if they use many tablespaces on one device, too bad but that does not look like a very useful setting anyway, and I do not think it would harm much in this case. If you use just one context, the more table spaces the less performance gains, because there is less and less aggregation thus sequential writes per device. So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed. That'd be much easier to do by just keeping the context in the per-tablespace struct.
But anyway, I'm really doubtful about going for that; I had it that way earlier, and observing IO showed it not being beneficial. ISTM that you would need a significant number of tablespaces to see the benefit. If you do not do that, the more table spaces the more random the IOs, which is disappointing. Also, "the cost is marginal", so I do not see any good argument not to do it. -- Fabien.
Re: [HACKERS] Add generate_series(date,date) and generate_series(date,date,integer)
Re: David Fetter 2016-01-26 <20160126180011.ga16...@fetter.org> > +1 for back-patching. There's literally no case where an infinite > input could be correct as the start or end of an interval for > generate_series. select * from generate_series(now(), 'infinity', '1 day') limit 10; ... seems pretty legit to me. If limit pushdown into SRFs happened to work some day, it'd be a pity if the above query raised an error. Christoph -- c...@df7cb.de | http://www.df7cb.de/
Re: [HACKERS] Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.
I wrote: > Robert Haas writes: >> As for the patch itself, I'm having trouble grokking what it's trying >> to do. I think it might be worth having a comment defining precisely >> what we mean by "A blocks B". I would define "A blocks B" in general >> as either A holds a lock which conflicts with one sought by B >> (hard-blocked) or A awaits a lock which conflicts with one sought by B >> and precedes it in the wait queue (soft-blocked). > Yes, that is exactly what I implemented ... and it's something you can't > find out from pg_locks. I'm not sure how that view could be made to > expose wait-queue ordering. Here's an updated version of this patch, now with user-facing docs. I decided that "pg_blocking_pids()" is a better function name than "pg_blocker_pids()". The code's otherwise the same, although I revisited some of the comments. I also changed quite a few references to "transaction" into "process" in the discussion of pg_locks. The previous choice to conflate processes with transactions was never terribly wise in my view, and it's certainly completely broken by parallel query. regards, tom lane

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index d77e999..d3270e4 100644
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***
*** 8015,8030
The view pg_locks provides access to
! information about the locks held by open transactions within the
database server. See for more discussion of locking. pg_locks contains one row per active lockable
! object, requested lock mode, and relevant transaction. Thus, the same lockable object might
! appear many times, if multiple transactions are holding or waiting for
locks on it. However, an object that currently has no locks on it will not appear at all.
--- 8015,8030
The view pg_locks provides access to
! information about the locks held by active processes within the
database server. See for more discussion of locking. pg_locks contains one row per active lockable
! object, requested lock mode, and relevant process. Thus, the same lockable object might
! appear many times, if multiple processes are holding or waiting for
locks on it. However, an object that currently has no locks on it will not appear at all.
***
*** 8200,8210
granted is true in a row representing a lock
! held by the indicated transaction. False indicates that this transaction is
! currently waiting to acquire this lock, which implies that some other
! transaction is holding a conflicting lock mode on the same lockable object.
! The waiting transaction will sleep until the other lock is released (or a
! deadlock situation is detected). A single transaction can be waiting to acquire at most one lock at a time.
--- 8200,8210
granted is true in a row representing a lock
! held by the indicated process. False indicates that this process is
! currently waiting to acquire this lock, which implies that at least one other
! process is holding a conflicting lock mode on the same lockable object.
! The waiting process will sleep until the other lock is released (or a
! deadlock situation is detected). A single process can be waiting to acquire at most one lock at a time.
***
*** 8224,8230
Although tuples are a lockable type of object, information about row-level locks is stored on disk, not in memory, and therefore row-level locks normally do not appear in this view.
! If a transaction is waiting for a row-level lock, it will usually appear in the view as waiting for the permanent transaction ID of the current holder of that row lock.
--- 8224,8230
Although tuples are a lockable type of object, information about row-level locks is stored on disk, not in memory, and therefore row-level locks normally do not appear in this view.
! If a process is waiting for a row-level lock, it will usually appear in the view as waiting for the permanent transaction ID of the current holder of that row lock.
***
SELECT * FROM pg_locks pl LEFT JOIN pg_p
*** 8281,8286
--- 8281,8300
+
+ While it is possible to obtain information about which processes block
+ which other processes by joining pg_locks against
+ itself, this is very difficult to get right in detail. Such a query would
+ have to encode knowledge about which lock modes conflict with which
+ others. Worse, the pg_locks view does not expose
+ information about which processes are ahead of which others in lock wait
+ queues, nor information about which processes are parallel workers running
+ on behalf of which other client sessions. It is better to use
+
Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data
On 3 February 2016 at 10:46, Thomas Munro wrote: > On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote > wrote: >> There seems to be a copy-pasto there - shouldn't that be: >> >> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush) > > Indeed, thanks! New patch attached. I've given this a test drive, and it works exactly as described. But one thing which confuses me is when a standby, with causal_reads enabled, has just finished starting up. I can't connect to it because: FATAL: standby is not available for causal reads However, this is the same message when I'm successfully connected, but it's lagging, and the primary is still waiting for the standby to catch up: ERROR: standby is not available for causal reads What is the difference here? The problem being reported appears to be identical, but in the first case I can't connect, but in the second case I can (although I still can't issue queries). Thom
Re: [HACKERS] checkpointer continuous flushing - V18
Hi, On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote: > * CkptSortItem: > > I think that allocating 20 bytes per buffer in shared memory is a little on > the heavy side. Some compression can be achieved: sizeof(ForkNum) is 4 bytes > to hold 4 values, could be one byte or even 2 bits somewhere. Also, there > are very few tablespaces, they could be given a small number and this number > could be used instead of the Oid, so the space requirement could be reduced > to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping > 4 bytes alignment and also getting 8 byte alignment... If this is too > much, I have shown that it can work with only 4 bytes per buffer, as the > sorting is really just a performance optimisation and is not broken if some > stuff changes between sorting & writeback, but you did not like the idea. If > the amount of shared memory required is a significant concern, it could be > resurrected, though. This is less than 0.2 % of memory related to shared buffers. We have the same amount of memory allocated in CheckpointerShmemSize(), and nobody has complained so far. And sorry, going back to the previous approach isn't going to fly, and I've no desire to discuss that *again*. > ISTM that "progress" and "progress_slice" only depend on num_scanned and > per-tablespace num_to_scan and total num_to_scan, so they are somehow > redundant and the progress could be recomputed from the initial figures > when needed. They don't cause much space usage, and we access the values frequently. So why not store them? > If these fields are kept, I think that a comment should justify why float8 > precision is okay for the purpose. I think it is quite certainly fine in the > worst case with 32 bits buffer_ids, but it would not be if this size is > changed someday. That seems pretty much unrelated to having the fields - the question of accuracy plays a role regardless, no?
Given realistic amounts of memory the max potential "skew" seems fairly small with float8. If we ever flush one buffer "too much" for a tablespace it's pretty much harmless. > ISTM that nearly all of the collected data on the second sweep could be > collected on the first sweep, so that this second sweep could be avoided > altogether. The only missing data is the index of the first buffer in the > array, which can be computed by considering tablespaces only, sweeping over > buffers is not needed. That would suggest creating the heap or using a hash > in the initial buffer sweep to keep this information. This would also > provide a point where to number tablespaces for compressing the CkptSortItem > struct. Doesn't seem worth the complexity to me. > I'm wondering about calling CheckpointWriteDelay on each round, maybe > a minimum amount of write would make sense. Why? There's not really much benefit of doing more work than needed. I think we should sleep far shorter in many cases, but that's indeed a separate issue. > I see a binary_heap_allocate but no corresponding deallocation, this > looks like a memory leak... or is there some magic involved? Hm. I think we really should use a memory context for all of this - we could after all error out somewhere in the middle... > >I think this patch primarily needs: > >* Benchmarking on FreeBSD/OSX to see whether we should enable the > > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm > > inclined to leave it off till then. > > I do not have that. As "msync" seems available on Linux, it is possible to > force using it with a "ifdef 0" to skip sync_file_range and check whether it > does some good there. Unfortunately it doesn't work well on linux: * On many OSs msync() on a mmap'ed file triggers writeback. On linux * it only does so when MS_SYNC is specified, but then it does the * writeback synchronously. Luckily all common linux systems have * sync_file_range(). 
This is preferable over FADV_DONTNEED because * it doesn't flush out clean data. I've verified beforehand, with a simple demo program, that msync(MS_ASYNC) does something reasonable on FreeBSD... > Idem for the "posix_fadvise" stuff. I can try to do > that, but it takes time to do so; if someone can test on other OSes it would > be much better. I think that if it works it should be kept in, so it is just > a matter of testing it. I'm not arguing for ripping it out; what I mean is that we don't set a non-default value for the GUCs on platforms with just posix_fadvise available... Greetings, Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote: > >>In the discussion in the wal section, I'm not sure about the effect of > >>setting writebacks on SSD, [...] > > > >Yea, that paragraph needs some editing. I think we should basically > >remove that last sentence. > > Ok, fine with me. Does that mean that flushing has a significant positive > impact on SSD in your tests? Yes. The reason we need flushing is that the kernel amasses dirty pages, and then flushes them all at once. That hurts for both SSDs and rotational media. Sorting is the bigger question, but I've seen it have clearly beneficial performance impacts. I guess if you look at devices with an internal block size bigger than 8k, you'd see even larger differences. > >>Maybe the merging strategy could be more aggressive than just strict > >>neighbors? > > > >I don't think so. If you flush more than neighbouring writes you'll > >often end up flushing buffers dirtied by another backend, causing > >additional stalls. > > Ok. Maybe the neighbor definition could be relaxed just a little bit so > that small holes are overtaken, but not large holes? If there are only a few > pages in between, even if written by another process, then writing them > together should be better? Well, this can wait for a clear case, because > hopefully the OS will recoalesce them behind anyway. I'm against doing so without clear measurements of a benefit. > >Also because the infrastructure is used for more than checkpoint > >writes. There are absolutely no ordering guarantees there. > > Yep, but not much benefit to expect from a few dozen random pages either. Actually, there's kinda frequently a benefit observable. Even if few requests can be merged, doing IO requests in an order more likely doable within a few rotations is beneficial. Also, the cost is marginal, so why worry? > >>[...] 
I do think that this whole writeback logic really does make > >>sense *per table space*, > > > >Leads to less regular IO, because if your tablespaces are evenly sized > >(somewhat common) you'll sometimes end up issuing sync_file_range's > >shortly after each other. For latency outside checkpoints it's > >important to control the total amount of dirty buffers, and that's > >obviously independent of tablespaces. > > I do not understand/buy this argument. > > The underlying IO queue is per device, and tablespaces should be per device > as well (otherwise what's the point?), so you should want to coalesce and > "writeback" pages per device as well. Calls to sync_file_range on distinct > devices should probably be issued more or less randomly, and should not > interfere with one another. The kernel's dirty buffer accounting is global, not per block device. It's also actually rather common to have multiple tablespaces on a single block device, especially if SANs and such are involved, where you don't even know which partitions are on which disks. > If you use just one context, the more tablespaces the smaller the performance > gains, because there is less and less aggregation and thus fewer sequential writes per > device. > > So for me there should really be one context per tablespace. That would > suggest a hashtable or some other structure to keep and retrieve them, which > would not be that bad, and I think that it is what is needed. That'd be much easier to do by just keeping the context in the per-tablespace struct. But anyway, I'm really doubtful about going for that; I had it that way earlier, and observing IO showed it not being beneficial. > >>For the checkpointer, a key aspect is that the scheduling process goes > >>to sleep from time to time, and this sleep time looked like a great > >>opportunity to do this kind of flushing. You chose not to take advantage > >>of the behavior, why? 
> > > >Several reasons: Most importantly there's absolutely no guarantee that > >you'll ever end up sleeping, it's quite common to happen only seldomly. > > Well, that would be under a situation when pg is completely unresponsive. > More so, this behavior *makes* pg unresponsive. No. The checkpointer being bottlenecked on actual IO performance doesn't impact production that badly. It'll just sometimes block in sync_file_range(), but the IO queues will have enough space to frequently give way to other backends, particularly to synchronous reads (most pg reads) and synchronous writes (fdatasync()). So a single checkpoint will take a bit longer, but otherwise the system will mostly keep up the work in a regular manner. Without the sync_file_range() calls the kernel will amass dirty buffers until global dirty limits are reached, which then will bring the whole system to a standstill. It's pretty common that checkpoint_timeout is too short to be able to write all shared_buffers out, in that case it's much better to slow down the whole checkpoint, instead of being incredibly slow at the end. > >I also don't really believe it helps that much, although that's a complex > >argument to make. > > Yep. My
Re: [HACKERS] Re: Add generate_series(date, date) and generate_series(date, date, integer)
On 02/21/2016 07:56 PM, Corey Huinker wrote: > > Other than that, the only difference is the ::date part. Is it > really worth adding extra code just for that? I would say not. > > > I would argue it belongs for the sake of completeness. So would I. +1 for adding this missing function. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Relaxing SSL key permission checks
Christoph Berg writes: > The attached patch has successfully been included in the 9.6 Debian > package, passed the regression tests there, and I've also done some > chmod/chown tests on the filesystem to verify it indeed catches the > cases laid out by Tom. Please add this to the upcoming commitfest. We don't seem to have enough consensus to just apply it, but perhaps the review process will produce some agreement. regards, tom lane
Re: [HACKERS] Re: Add generate_series(date, date) and generate_series(date, date, integer)
> > > [step] is in days, but is not documented as such. > > It is in days, and is not documented as such, but since a day is the smallest unit of time for a date, I felt there was no other interpretation. > My understanding is you want to replace this > > SELECT d.dt::date as dt > FROM generate_series('2015-01-01'::date, > '2016-01-04'::date, > interval '1 day') AS d(dt); > > > with this > > SELECT d.dt > FROM generate_series('2015-01-01'::date, > '2016-01-04'::date, > 7) as d(dt); > > I'd also like to be able to join the values of d.dt without typecasting them. To me it's as awkward as (select 1::double + 1::double)::integer > > Personally, I think writing INTERVAL '7 days' to be clearer than just > typing 7. > Well, nearly all my use cases involve the step being 1 (and thus omitted) or -1. Maybe this example will appeal to you SELECT d.birth_date, COUNT(r.kiddo_name) FROM generate_series('2016-01-01'::date,'2016-01-10'::date) as d(birth_date) LEFT OUTER JOIN birth_records r ON r.birth_date = d.birth_date GROUP BY 1 ORDER BY 1; > Other than that, the only difference is the ::date part. Is it really > worth adding extra code just for that? I would say not. > I would argue it belongs for the sake of completeness. We added generate_series for numerics when generate_series for floats already existed. No comments on the patch itself, which seems to do the job, so apologies to > give this opinion on your work, I do hope it doesn't put you off further > contributions. > Thanks. I appreciate that.
Re: [HACKERS] [COMMITTERS] pgsql: Cosmetic improvements in new config_info code.
Stephen Frost writes: > * Joe Conway (m...@joeconway.com) wrote: >> FWIW, strcpy() was being used in src/bin/pg_config/pg_config.c that I >> started with -- does that mean we are not getting Coverity coverage of >> src/bin? > Coverity does run against src/bin also. It's possible this was > identified as an issue in pg_config.c, but, as Tom notes, it may not be > an actual bug and might have been marked as a non-bug in Coverity. It looks to me like the previous coding was

    static char mypath[MAXPGPATH];
    ...
    char path[MAXPGPATH];
    ...
    strcpy(path, mypath);

so Coverity probably didn't complain because it could see that the source was also a buffer of size MAXPGPATH. With the new arrangement it was probably using an assumption that get_configdata() could be called with any length of string. I am not sure how much cross-file analysis Coverity does --- it seems to do some, but in other cases it acts like it doesn't know anything about the call sites. It's possible that it did know that the existing callers all use MAXPGPATH-sized buffers, but whined anyway on the not-unreasonable grounds that global functions should be prepared for new callers. regards, tom lane
[HACKERS] Re: [COMMITTERS] pgsql: Cosmetic improvements in new config_info code.
Joe, all, * Joe Conway (m...@joeconway.com) wrote: > On 02/21/2016 08:38 AM, Tom Lane wrote: > > Coverity griped about use of unchecked strcpy() into a local variable. > > There's unlikely to be any actual bug there, since no caller would be > > passing a path longer than MAXPGPATH, but nonetheless use of strlcpy() > > seems preferable. > > FWIW, strcpy() was being used in src/bin/pg_config/pg_config.c that I > started with -- does that mean we are not getting Coverity coverage of > src/bin? Coverity does run against src/bin also. It's possible this was identified as an issue in pg_config.c, but, as Tom notes, it may not be an actual bug and might have been marked as a non-bug in Coverity. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] a raft of parallelism-related bug fixes
Robert Haaswrites: > On Mon, Feb 8, 2016 at 2:36 PM, Joshua D. Drake > wrote: >> I have no problem running any test cases you wish on a branch in a loop for >> the next week and reporting back any errors. > Well, what I've done is push into the buildfarm code that will allow > us to do *the most exhaustive* testing that I know how to do in an > automated fashion. Which is to create a file that says this: > force_parallel_mode=regress > max_parallel_degree=2 I did a few dozen runs of the core regression tests with those settings (using current HEAD plus my lockGroupLeaderIdentifier-ectomy patch). Roughly one time in ten, it fails in the stats test, with diffs as attached. I interpret this as meaning that parallel workers don't reliably transmit stats to the stats collector, though maybe there is something else happening. regards, tom lane *** /home/postgres/pgsql/src/test/regress/expected/stats.out Wed Mar 4 00:55:25 2015 --- /home/postgres/pgsql/src/test/regress/results/stats.out Sun Feb 21 12:59:27 2016 *** *** 148,158 WHERE relname like 'trunc_stats_test%' order by relname; relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup ---+---+---+---++ ! trunc_stats_test | 3 | 0 | 0 | 0 | 0 ! trunc_stats_test1 | 4 | 2 | 1 | 1 | 0 ! trunc_stats_test2 | 1 | 0 | 0 | 1 | 0 ! trunc_stats_test3 | 4 | 0 | 0 | 2 | 2 ! trunc_stats_test4 | 2 | 0 | 0 | 0 | 2 (5 rows) SELECT st.seq_scan >= pr.seq_scan + 1, --- 148,158 WHERE relname like 'trunc_stats_test%' order by relname; relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup ---+---+---+---++ ! trunc_stats_test | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test1 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test2 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test3 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test4 | 0 | 0 | 0 | 0 | 0 (5 rows) SELECT st.seq_scan >= pr.seq_scan + 1, == -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Introduce group locking to prevent parallel processes from deadl
I wrote: > Robert Haaswrites: >> Yeah, you're right. Attached is a draft patch that tries to clean up >> that and a bunch of other things that you raised. > I've been reviewing this patch, and I had a question: why do we need to > bother with a lockGroupLeaderIdentifier field? It seems totally redundant > with the leader's pid field, ie, why doesn't BecomeLockGroupMember simply > compare leader->pid with the PID it's passed? For some more safety, it > could also verify that leader->lockGroupLeader == leader; but I don't > see what the extra field is buying us. Here is a proposed patch that gets rid of lockGroupLeaderIdentifier. I concluded that the "leader->lockGroupLeader == leader" test is necessary, because we don't ever reset the pid field of a dead PGPROC until it's re-used. However, with that check, this seems isomorphic to the existing code because lockGroupLeader is reset to NULL at all the same places that lockGroupLeaderIdentifier is reset. So we do not need both of those fields. regards, tom lane diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README index c399618..56b0a12 100644 *** a/src/backend/storage/lmgr/README --- b/src/backend/storage/lmgr/README *** those cases so that they no longer use h *** 650,676 (which is not a crazy idea, given that such lock acquisitions are not expected to deadlock and that heavyweight lock acquisition is fairly slow anyway). ! Group locking adds four new members to each PGPROC: lockGroupLeaderIdentifier, ! lockGroupLeader, lockGroupMembers, and lockGroupLink. The first is simply a ! safety mechanism. A newly started parallel worker has to try to join the ! leader's lock group, but it has no guarantee that the group leader is still ! alive by the time it gets started. We try to ensure that the parallel leader ! dies after all workers in normal cases, but also that the system could survive ! relatively intact if that somehow fails to happen. This is one of the ! 
precautions against such a scenario: the leader relays its PGPROC and also its ! PID to the worker, and the worker fails to join the lock group unless the ! given PGPROC still has the same PID. We assume that PIDs are not recycled ! quickly enough for this interlock to fail. ! ! A PGPROC's lockGroupLeader is NULL for processes not involved in parallel ! query. When a process wants to cooperate with parallel workers, it becomes a ! lock group leader, which means setting this field to point to its own ! PGPROC. When a parallel worker starts up, it points this field at the leader, ! with the above-mentioned interlock. The lockGroupMembers field is only used in the leader; it is a list of the member PGPROCs of the lock group (the leader and all workers). The lockGroupLink field is the list link for this list. ! All four of these fields are considered to be protected by a lock manager partition lock. The partition lock that protects these fields within a given lock group is chosen by taking the leader's pgprocno modulo the number of lock manager partitions. This unusual arrangement has a major advantage: the --- 650,665 (which is not a crazy idea, given that such lock acquisitions are not expected to deadlock and that heavyweight lock acquisition is fairly slow anyway). ! Group locking adds three new members to each PGPROC: lockGroupLeader, ! lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for ! processes not involved in parallel query. When a process wants to cooperate ! with parallel workers, it becomes a lock group leader, which means setting ! this field to point to its own PGPROC. When a parallel worker starts up, it ! points this field at the leader. The lockGroupMembers field is only used in the leader; it is a list of the member PGPROCs of the lock group (the leader and all workers). The lockGroupLink field is the list link for this list. ! All three of these fields are considered to be protected by a lock manager partition lock. 
The partition lock that protects these fields within a given lock group is chosen by taking the leader's pgprocno modulo the number of lock manager partitions. This unusual arrangement has a major advantage: the *** change while the deadlock detector is ru *** 679,684 --- 668,685 all the lock manager locks. Also, holding this single lock allows safe manipulation of the lockGroupMembers list for the lock group. + We need an additional interlock when setting these fields, because a newly + started parallel worker has to try to join the leader's lock group, but it + has no guarantee that the group leader is still alive by the time it gets + started. We try to ensure that the parallel leader dies after all workers + in normal cases, but also that the system could survive relatively intact + if that somehow fails to happen. This is one of the precautions against + such a
Re: [HACKERS] proposal: enhancing slow query log, and autoexplain log about waiting on lock before query exec time
Hi >> What would be useful logging-wise is if the log line for the query itself >> could contain lock wait time, but that doesn't sound like what you're >> proposing? >> > > I hope so; that is why I propose this idea. First I wanted to talk about the idea; > the next step is to talk about the format. > The enhancement should have two parts: a) introduce a new function GetCurrentStatementStartExecutionTimestamp(). This function can be used to calculate the initial lock waiting duration. b) introduce a new GUC, log_min_lock_duration_statement. It is similar to log_min_duration_statement. When any statement waits longer than this number, the statement is logged like other slow statements. Example: log_min_duration_statement = 1000 .. log anything slower than 1 sec log_min_lock_duration_statement = 100 .. log when waiting is longer than 100 ms Entry in log: LOG: duration: 843 ms start lock duration: 156 ms statement: . This logic is simple - for more complicated models, we can introduce monitoring log hooks, and complex custom logging can be solved in an extension. When log_min_lock_duration_statement is -1, "start lock duration" isn't printed to the log. Comments, notes? Regards Pavel
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Sun, Feb 21, 2016 at 2:28 PM, Robert Haas wrote: > On Sun, Feb 21, 2016 at 12:24 PM, Amit Kapila > wrote: > >> On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas >> wrote: > >>> On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila >>> wrote: >>>
>>> Client_Count/Patch_Ver     1     64    128    256
>>> HEAD(481725c0)           963  28145  28593  26447
>>> Patch-1                  938  28152  31703  29402
>>>
We can see a 10~11% performance improvement, as observed previously. You might see a 0.02% performance difference with the patch as a regression, but that is just run-to-run variation. >>> >>> Don't the single-client numbers show about a 3% regression? Surely not >>> 0.02%. >>> >> >> >> Sorry, you are right, it is ~2.66%, but in read-write pgbench tests, I >> could see such fluctuation. Also, the patch doesn't change the single-client >> case. However, if you still feel that there could be an impact from the patch, >> I can re-run the single-client case once again with different combinations, >> like first with HEAD and then the patch, and vice versa. >> > > Are these results from a single run, or median-of-three? > > This was median-of-three, but the highest tps with the patch is 1119 and with HEAD it is 969, which shows a gain at the single client count. Sometimes I see such differences; they could be due to autovacuum getting triggered in some situations, which leads to such variations. However, if I try 2-3 times, the difference generally disappears unless there is some real regression, or if I just switch off autovacuum and do a manual vacuum after each run. This time, I haven't run the tests multiple times. > I mean, my basic feeling is that I would not accept a 2-3% regression in > the single client case to get a 10% speedup in the case where we have 128 > clients. > I understand your point. To verify whether it is run-to-run variation or an actual regression, I will re-run these tests on a single client multiple times and post the result. > A lot of people will not have 128 clients; quite a few will have a > single session, or just a few. 
Sometimes just making the code more complex > can hurt performance in subtle ways, e.g. by making it fit into the L1 > instruction cache less well. If the numbers you have here are accurate, > I'd vote to reject the patch. > > One point to note is that this patch, along with the first patch I posted in this thread to increase clog buffers, can significantly reduce contention on CLogControlLock. OTOH, I think introducing a regression at single-client is also not a sane thing to do, so let's first try to find out whether there is actually any regression, and if there is, whether we can mitigate it by writing the code with somewhat fewer instructions or in a slightly different way; then we can decide whether it is good to reject the patch or not. Does that sound reasonable to you? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] exposing pg_controldata and pg_config as functions
On Sat, Feb 20, 2016 at 12:12 PM, Joe Conway wrote: > On 01/17/2016 04:10 PM, Joe Conway wrote: >> On 01/16/2016 06:02 AM, Michael Paquier wrote: >>> On Wed, Dec 30, 2015 at 9:08 AM, Joe Conway wrote: 3) Adds new functions, more or less in line with previous discussions: * pg_checkpoint_state() * pg_controldata_state() * pg_recovery_state() * pg_init_state() >>> >>> Taking the opposite direction of Josh upthread, why is this split >>> actually necessary? Isn't the idea to provide a SQL interface to what >>> pg_controldata shows? If this split proves to be useful, shouldn't we >>> do it as well for pg_controldata? >> >> The last discussion moved strongly in the direction of separate >> functions, and that being different from pg_controldata was not a bad >> thing. That said, I'm still of the opinion that there are legitimate >> reasons to want the command-line pg_controldata and the SQL functions to >> produce equivalent, if not identical, results. I just wish we could get >> a clear consensus one way or the other. > > I've assumed that we are sticking with the separate functions. As such, > here is a rebased patch, with documentation and other fixes such as > Copyright year, Mkvcbuild support, and some cruft removal. Looking again at this thread I guess that this is the consensus, based on the proposal from Josh and seeing no other ideas around. Another idea would be to group all the fields into a single function pg_control_data(). > Is there general consensus that we want this feature, and that we want > it in this form? Any other comments? I had a look at this patch. 
+ +pg_checkpoint_state + + +pg_checkpoint_state returns a record containing +checkpoint_location, prior_location, redo_location, redo_wal_file, +timeline_id, prev_timeline_id, full_page_writes, next_xid, next_oid, +next_multixact_id, next_multi_offset, oldest_xid, oldest_xid_dbid, +oldest_active_xid, oldest_multi_xid, oldest_multi_dbid, +oldest_commit_ts_xid, newest_commit_ts_xid, and checkpoint_time. + This is bit unreadable. The only entry in the documentation that adopts a similar style is pg_stat_file, and with six fields that feels as being enough. I would suggest using a table instead with the type of the field and its name. Regarding the naming of the functions, I think that it would be good to get something consistent with the concept of those being "Control Data functions" by having them share the same prefix, say pg_control_ - pg_control_checkpoint - pg_control_init - pg_control_system - pg_control_recovery + snprintf (controldata_name, CONTROLDATANAME_LEN, + "%s:", controldata[i].name); Nitpick: extra space. +static const char *const controldata_names[] = +{ + gettext_noop("pg_control version number"), + gettext_noop("Catalog version number"), + gettext_noop("Database system identifier"), Is this complication really necessary? Those identifiers are used only in the frontend and the footprint of this patch on pg_controldata is really large. What I think we should do is have in src/common the following set of routines that work directly on ControlFileData: - checkControlFile, to perform basic sanity checks on the control file (CRC, see for example pg_rewind.c) - getControlFile(dataDir), that simply returns a palloc'd ControlFileData to the caller after looking at global/pg_control. pg_rewind could perhaps make use of the one to check the control file CRC, to fetch ControlFileData there is some parallel logic for the source server if it is either remote or local so it would be better to not use getControlFile in this case. 
-- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
On Fri, Feb 19, 2016 at 10:33 PM, Amit Kapila wrote: > On Fri, Feb 12, 2016 at 6:57 PM, Michael Paquier wrote: > You don't seem to have taken care of the typo below in your patch, as > pointed out by me earlier. > > + * to not rely on taking an exclusive lock an all the WAL insertion locks, > > /an all/on all Sorry about that. Hopefully fixed now. > Does this in any way take care of the case when there is activity > on unlogged tables? > I think avoiding checkpoints on an idle system was introduced > in commit 4d14fe0048cf80052a3ba2053560f8aab1bb1b22, whereas > unlogged relations were introduced by commit > 53dbc27c62d8e1b6c5253feba04a5094cb8fe046, which is much later. > Now, one might consider it okay to not do anything for > unlogged tables as it was not done previously, and this patch is > anyway improving the current situation, but I feel that if we agree > that checkpoints will get skipped when the only operations > happening in the system are on unlogged relations, then > it is better to note it in the comments as a possible improvement, even if we > don't want to address it as part of this patch. Nope, nothing is done, just to not complicate the patch more than necessary. I agree though that we had better not update the progress LSN when logging something related to INIT_FORKNUM; there is little point in running non-forced checkpoints in this case, as unlogged relations are wiped out on recovery.
If you > are proposing them for commit, then I think it is worth justifying the need > for them, and we should discuss what the appropriate log level is; > otherwise, I think you can have these in an additional patch just for > verification, as the main patch is now very close to being > ready for committer. I have never intended those logs to be part of the patch that should be committed, and putting them at LOG level avoided having a bunch of garbage in the logs during the tests (I got lazy about grepping for the entries in those tests). I am not against adding a comment; here is one I just came up with:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -448,6 +448,12 @@ typedef struct XLogwrtResult
  * hence reducing the impact of the activity lookup. This takes also
  * advantage to avoid 8-byte torn reads on some platforms by using the
  * fact that each insert lock is located on the same cache line.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */

> Also, I think it is worth taking the performance data once for > write tests (using a pgbench 30-minute run or some other way) with > the minimum checkpoint_timeout (i.e. 30s) to see if the additional locking > has any impact on performance. I think taking locks at intervals > of 15s or 30s should not matter much, but it is better to be safe. I don't think performance will be impacted, but there is no reason not to do any measurements either. I'll try to get some graphs tomorrow with runs on my laptop, mainly looking for any effects of this patch on TPS when checkpoints show up. For now, attached is an updated patch, with a second patch (b) including the logs for testing. 
-- Michael diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 05e49d7..3ab2bc9 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8373,8 +8373,12 @@ CreateCheckPoint(int flags) if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_FORCE)) == 0) { + elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X", + (uint32) (progress_lsn >> 32), (uint32) progress_lsn, + (uint32) (ControlFile->checkPoint >> 32), (uint32) ControlFile->checkPoint); if (progress_lsn == ControlFile->checkPoint) { + elog(LOG, "Checkpoint is skipped"); WALInsertLockRelease(); LWLockRelease(CheckpointLock); END_CRIT_SECTION(); @@ -8541,7 +8545,11 @@ CreateCheckPoint(int flags) * recovery we don't need to write running xact data. */ if (!shutdown && XLogStandbyInfoActive()) - LogStandbySnapshot(); + { + XLogRecPtr lsn = LogStandbySnapshot(); + elog(LOG, "snapshot taken by checkpoint %X/%X", + (uint32) (lsn >> 32), (uint32) lsn); + } START_CRIT_SECTION(); diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index
Re: [HACKERS] Relaxing SSL key permission checks
Re: To Tom Lane 2016-02-19 <20160219115334.gb26...@msg.df7cb.de> > Updated patch attached. *Blush* I thought I had compile-tested the patch, but without --enable-openssl it wasn't built :(. The attached patch has successfully been included in the 9.6 Debian package, passed the regression tests there, and I've also done some chmod/chown tests on the filesystem to verify it indeed catches the cases laid out by Tom. Christoph -- c...@df7cb.de | http://www.df7cb.de/ diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c new file mode 100644 index 1e3dfb6..a6c4ba0 *** a/src/backend/libpq/be-secure-openssl.c --- b/src/backend/libpq/be-secure-openssl.c *** be_tls_init(void) *** 207,213 ssl_key_file))); /* ! * Require no public access to key file. * * XXX temporarily suppress check when on Windows, because there may * not be proper support for Unix-y file permissions. Need to think --- 207,217 ssl_key_file))); /* ! * Require no public access to key file. If the file is owned by us, ! * require mode 0600 or less. If owned by root, require 0640 or less ! * to allow read access through our gid, or a supplementary gid that ! * allows to read system-wide certificates. Refuse to load files owned ! * by other users. * * XXX temporarily suppress check when on Windows, because there may * not be proper support for Unix-y file permissions. Need to think *** be_tls_init(void) *** 215,226 * directory permission check in postmaster.c) */ #if !defined(WIN32) && !defined(__CYGWIN__) ! if (!S_ISREG(buf.st_mode) || buf.st_mode & (S_IRWXG | S_IRWXO)) ereport(FATAL, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("private key file \"%s\" has group or world access", ssl_key_file), ! errdetail("Permissions should be u=rw (0600) or less."))); #endif if (SSL_CTX_use_PrivateKey_file(SSL_context, --- 219,233 * directory permission check in postmaster.c) */ #if !defined(WIN32) && !defined(__CYGWIN__) ! if (!S_ISREG(buf.st_mode) || !
(buf.st_uid == geteuid() && buf.st_mode & (S_IRWXG | S_IRWXO)) || ! (buf.st_uid == 0 && buf.st_mode & (S_IWGRP | S_IXGRP | S_IRWXO)) || ! (buf.st_uid != geteuid() && buf.st_uid != 0)) ereport(FATAL, (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("private key file \"%s\" has group or world access", ssl_key_file), ! errdetail("File must be owned by the database user and have permissions u=rw (0600) or less, or owned by root and have permissions u=rw,g=w (0640) or less."))); #endif if (SSL_CTX_use_PrivateKey_file(SSL_context, -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres,

Here is a review for the second patch.

> For 0002 I've recently changed:
>
> * Removed the sort timing information, we've proven sufficiently that
>   it doesn't take a lot of time.

I put it there initially to demonstrate that there was no cache
performance issue when sorting on just buffer indexes. As it is always
small, I agree that it is not needed. Well, it could still be in
seconds on a very large shared_buffers setting with a very large
checkpoint, but then the checkpoint would be tremendously huge...

> * Minor comment polishing.

Patch applies and checks on Linux.

* CkptSortItem:

I think that allocating 20 bytes per buffer in shared memory is a
little on the heavy side. Some compression can be achieved:
sizeof(ForkNumber) is 4 bytes to hold 4 values; it could be one byte,
or even 2 bits somewhere. Also, there are very few tablespaces; they
could be given a small number, and this number could be used instead of
the Oid, so the space requirement could be reduced to say 16 bytes per
buffer by combining space & fork in 2 shorts, keeping 4-byte alignment
and also getting 8-byte alignment...

If this is too much, I have shown that it can work with only 4 bytes
per buffer, as the sorting is really just a performance optimisation
and is not broken if some stuff changes between sorting & writeback,
but you did not like the idea. If the amount of shared memory required
is a significant concern, it could be resurrected, though.

* CkptTsStatus:

As I suggested in the other mail, I think that this structure should
also keep a per-tablespace WritebackContext so that coalescing is done
per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned,
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant, and the progress could be recomputed from the initial
figures when needed.

If these fields are kept, I think that a comment should justify why
float8 precision is okay for the purpose.
I think it is quite certainly fine in the worst case with 32-bit
buffer_ids, but it would not be if this size is changed someday.

* BufferSync:

After a first sweep to collect buffers to write, they are sorted, and
then those buffers are swept again to compute some per-tablespace data
and organise a heap. ISTM that nearly all of the data collected on the
second sweep could be collected on the first sweep, so that this second
sweep could be avoided altogether. The only missing data is the index
of the first buffer in the array, which can be computed by considering
tablespaces only; sweeping over buffers is not needed. That would
suggest creating the heap, or using a hash, in the initial buffer sweep
to keep this information. This would also provide a point where to
number tablespaces for compressing the CkptSortItem struct.

I'm wondering about calling CheckpointWriteDelay on each round; maybe a
minimum amount of writes would make sense. This remark is independent
of this patch. Probably it works fine because after a sleep the
checkpointer is behind enough that it will write a bunch of buffers
before sleeping again.

I see a binary_heap_allocate but no corresponding deallocation; this
looks like a memory leak... or is there some magic involved?

There is some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations
about sorting in the checkpointer.

> I think this patch primarily needs:
>
> * Benchmarking on FreeBSD/OSX to see whether we should enable the
>   mmap()/msync(MS_ASYNC) method by default. Unless somebody does so,
>   I'm inclined to leave it off till then.

I do not have that. As "msync" seems available on Linux, it is possible
to force using it with an "#if 0" to skip sync_file_range and check
whether it does some good there. Idem for the "posix_fadvise" stuff. I
can try to do that, but it takes time to do so; if someone can test on
other OSes, it would be much better.
I think that if it works it should be kept in, so it is just a matter
of testing it.

--
Fabien.
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Sun, Feb 21, 2016 at 12:24 PM, Amit Kapila wrote:
> On Sun, Feb 21, 2016 at 12:02 PM, Robert Haas wrote:
>> On Sun, Feb 21, 2016 at 10:27 AM, Amit Kapila wrote:
>>>
>>> Client_Count/Patch_Ver    1     64    128    256
>>> HEAD(481725c0)          963  28145  28593  26447
>>> Patch-1                 938  28152  31703  29402
>>>
>>> We can see 10~11% performance improvement as observed previously.
>>> You might see the 0.02% performance difference with the patch as a
>>> regression, but that is just run-to-run variation.
>>
>> Don't the single-client numbers show about a 3% regression?  Surely
>> not 0.02%.
>
> Sorry, you are right, it is ~2.66%, but in read-write pgbench tests, I
> could see such fluctuation.  Also the patch doesn't change the
> single-client case.  However, if you still feel that there could be an
> impact from the patch, I can re-run the single-client case once again
> with different combinations, like first with HEAD and then the patch
> and vice versa.

Are these results from a single run, or median-of-three?

I mean, my basic feeling is that I would not accept a 2-3% regression
in the single-client case to get a 10% speedup in the case where we
have 128 clients.  A lot of people will not have 128 clients; quite a
few will have a single session, or just a few.  Sometimes just making
the code more complex can hurt performance in subtle ways, e.g. by
making it fit into the L1 instruction cache less well.  If the numbers
you have here are accurate, I'd vote to reject the patch.

Note that we already have apparently regressed single-client
performance noticeably between 9.0 and 9.5:

http://bonesmoses.org/2016/01/08/pg-phriday-how-far-weve-come/

I bet that wasn't a single patch but a series of patches which made
things more complex to improve concurrency behavior, but in the
process each one made the single-client case a tiny bit slower.  In
the end, that adds up.  I think we need to put some effort into
figuring out if there is a way we can get some of that single-client
performance (and ideally more) back.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)
On Thu, Feb 18, 2016 at 4:52 PM, Ashutosh Bapat wrote:
> If the list in the joining relation changes (maybe because we appended
> parameterized conditions), we would be breaking links on all the upper
> relations in the join tree in an unpredictable manner. The problem may
> not show up now, but it's an avenue for unrecognizable bugs. So, it's
> safer to copy the lists in the state that we want them.

Agreed. The lists figure to be short, so copying them shouldn't be very
expensive, and it's better to do that in all cases than to leave
shared-substructure hazards around for future patch authors to worry
about.

Committed Ashutosh's version of the patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types
Hi

I am sending an updated version - the changes fix the review comments.

2016-02-19 10:41 GMT+01:00 Artur Zakirov:
> It seems all fixes are done. I tested the patch and regression tests
> passed.
>
> I think here Alvaro means that you should keep the original comment
> without the ROW. Like this:
>
> /* XXX perhaps allow REC here? */

I tried rewording this comment.

>>> By the way, these functions are misnamed after this patch. They are
>>> called "wordtype" and "cwordtype" originally because they accept
>>> "word%TYPE" and "compositeword%TYPE", but after the patch they not
>>> only accept TYPE at the right of the percent sign but also
>>> ELEMENTTYPE and ARRAYTYPE. Not sure that this is something we want
>>> to be too strict about.
>>
>> Understand - used name ***reftype instead of ***type
>
> I am not sure, but it seems that the new names are a little worse. I
> think the original names are good too. They accept a word and return
> the PLpgSQL_type structure.

The word "TYPE" in these names was related to the %TYPE syntax, and
because the new syntax allows more constructs, the renaming is correct,
I think. But choosing names is hard work. The new names show a little
more strongly the relation to working with referenced types.

> I noticed a little typo in the comment in derive_type():
>
> /* Return base_type, when it is a array already */
>
> should be:
>
> /* Return base_type, when it is an array already */

fixed

Regards

Pavel

> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 9786242..140c81f
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
*** SELECT merge_fields(t.*) FROM table1 t W
*** 710,715 ****
--- 710,792 ----
+
+    Array Types
+
+       variable%ARRAYTYPE
+
+    %ARRAYTYPE provides the array type from a variable or
+    table column.
+    %ARRAYTYPE is particularly valuable in polymorphic
+    functions, since the data types needed for internal variables can
+    change from one call to the next. Appropriate variables can be
+    created by applying %ARRAYTYPE to the function's
+    arguments or result placeholders:
+
+ CREATE OR REPLACE FUNCTION array_init(v anyelement, size integer)
+ RETURNS anyarray AS $$
+ DECLARE
+     result v%ARRAYTYPE DEFAULT '{}';
+ BEGIN
+     -- prefer builtin function array_fill
+     FOR i IN 1 .. size
+     LOOP
+         result := result || v;
+     END LOOP;
+     RETURN result;
+ END;
+ $$ LANGUAGE plpgsql;
+
+ SELECT array_init(0::numeric, 10);
+ SELECT array_init(''::varchar, 10);
+
+    Array Element Types
+
+       variable%ELEMENTTYPE
+
+    %ELEMENTTYPE provides the element type of a given
+    array. %ELEMENTTYPE is particularly valuable in
+    polymorphic functions, since the data types needed for internal
+    variables can change from one call to the next. Appropriate
+    variables can be created by applying %ELEMENTTYPE
+    to the function's arguments or result placeholders:
+
+ CREATE OR REPLACE FUNCTION bubble_sort(a anyarray)
+ RETURNS anyarray AS $$
+ DECLARE
+     aux a%ELEMENTTYPE;
+     repeat_again boolean DEFAULT true;
+ BEGIN
+     -- Don't use this code for large arrays!
+     -- use builtin sort
+     WHILE repeat_again
+     LOOP
+         repeat_again := false;
+         FOR i IN array_lower(a, 1) .. array_upper(a, 1)
+         LOOP
+             IF a[i] > a[i+1] THEN
+                 aux := a[i+1];
+                 a[i+1] := a[i];
+                 a[i] := aux;
+                 repeat_again := true;
+             END IF;
+         END LOOP;
+     END LOOP;
+     RETURN a;
+ END;
+ $$ LANGUAGE plpgsql;
+
     Collation of PL/pgSQL Variables

diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
new file mode 100644
index ebe152d..07569f7
*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
*** plpgsql_parse_tripword(char *word1, char
*** 1617,1632 ****
  		return false;
  	}

  /* --
! * plpgsql_parse_wordtype	The scanner found word%TYPE. word can be
! *				a variable name or a basetype.
  *
  * Returns datatype struct, or NULL if no match found for word.
  * --
  */
  PLpgSQL_type *
! plpgsql_parse_wordtype(char *ident)
  {
  	PLpgSQL_type *dtype;
  	PLpgSQL_nsitem *nse;
--- 1617,1683 ----
  		return false;
  	}

+ /*
+  * Derive type from any base type controlled by reftype.
+  *
+  * This routine allows taking an array type from an array variable. This
+  * behavior is not consistent with enforce_generic_type_consistency, where
+  * the same task fails with ERROR: 42704: could not find array type for
+  * data type xxx[].
+  */
+ static PLpgSQL_type *
+ derive_type(PLpgSQL_type *base_type, PLpgSQL_reftype reftype)
+ {
+ 	Oid			typoid;
+
+ 	switch (reftype)
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres,

[...]

>> I do think that this whole writeback logic really does make sense
>> *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's
> shortly after each other. For latency outside checkpoints it's
> important to control the total amount of dirty buffers, and that's
> obviously independent of tablespaces.

I do not understand/buy this argument.

The underlying IO queue is per device, and tablespaces should be per
device as well (otherwise what is the point?), so you should want to
coalesce and "writeback" pages per device as well. Calls to
sync_file_range on distinct devices should probably be issued more or
less randomly, and should not interfere with one another.

If you use just one context, the more tablespaces there are, the
smaller the performance gain, because there is less and less
aggregation, thus fewer sequential writes, per device.

So for me there should really be one context per tablespace. That would
suggest a hashtable or some other structure to keep and retrieve them,
which would not be that bad, and I think that it is what is needed.

Note: I think that an easy way to do that in the "checkpoint sort"
patch is simply to keep a WritebackContext in the CkptTsStatus
structure, which is per tablespace in the checkpointer.

For bgwriter & backends it can wait: there is little "writeback"
coalescing because the IO should be pretty random, so it does not
matter much.

--
Fabien.