Re: [HACKERS] Proposal: Implement failover on libpq connect level.
Hans-Jürgen Schönig wrote:
> in addition to that you have the “problem” of transactions. if you failover
> in the middle of a transaction, strange things might happen from the
> application point of view. the good thing, however, is that stupid middleware
> is sometimes not able to handle failed connections. however, overall i think
> it is more of a danger than a benefit.

Maybe I misunderstood the original proposal, but my impression was that the alternative servers would be tried only at the time the connection is established, and there would be no such problems as you describe. Those could only happen if libpq automatically tried to reconnect upon failure without the client noticing. So the stupid middleware would get an error message, but the reconnect would actually work.

Yours,
Laurenz Albe

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
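The connect-time-only semantics Laurenz describes can be pictured as a simple loop on the client side — a hedged sketch, not libpq's actual API; `connect_with_failover` and `fake_connect` are hypothetical names:

```python
# Sketch of connect-time-only failover: alternative servers are tried once,
# when the connection is established; a later failure surfaces to the caller
# instead of triggering a silent reconnect. The connect callable stands in
# for a real driver's connect function.

def connect_with_failover(hosts, connect):
    """Try each (host, port) in order; return the first live connection."""
    last_error = None
    for host, port in hosts:
        try:
            return connect(host, port)
        except ConnectionError as e:
            last_error = e  # server unreachable: try the next one
    raise last_error or ConnectionError("no hosts given")

def fake_connect(host, port):
    # Simulate the primary being down and the standby answering.
    if host == "main-server":
        raise ConnectionError("main-server is down")
    return (host, port)

conn = connect_with_failover(
    [("main-server", 5432), ("standby1", 5432)], fake_connect)
print(conn)  # ('standby1', 5432)
```

Note that once `conn` is handed back, no further host-switching happens; that is exactly why an in-transaction failover cannot occur under this scheme.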
Re: [HACKERS] Declarative partitioning
On 18 August 2015 at 11:30, Amit Langote langote_amit...@lab.ntt.co.jp wrote:

Hi,

I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.

Syntax
==

1. Creating a partitioned table

CREATE TABLE table_name PARTITION BY {RANGE|LIST} ON (column_list);

Where column_list consists of simple column names or expressions:

PARTITION BY LIST ON (name)
PARTITION BY RANGE ON (year, month)
PARTITION BY LIST ON ((lower(left(name, 2))))
PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))

Note: a LIST partition key supports only one column.

For each column, you can write an operator class name:

PARTITION BY LIST/RANGE ON (colname [USING] opclass_name)

If not specified, the default btree operator class based on the type of each key column is used. If none of the available btree operator classes is compatible with the partitioning strategy (list/range), an error is thrown. The built-in btree operator classes cover a good number of types for list and range partitioning in practical scenarios.

A table created using this form is of the proposed new relkind RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is created to store the partition key info.

Note: a table cannot be partitioned after-the-fact using ALTER TABLE.

Normal dependencies are created between the partitioned table and the operator classes and any objects referenced in partition expressions, such as functions.

2. Creating a partition of a partitioned table

CREATE TABLE table_name PARTITION OF partitioned_table_name FOR VALUES values_spec;

Where values_spec is:

listvalues: [IN] (val1, ...)

rangevalues: START (col1min, ... ) END (col1max, ... )
           | START (col1min, ... )
           | END (col1max, ... )

A table created using this form has the proposed pg_class.relispartition set to true. An entry in pg_partition (see below) is created to store the partition bound info.
The values_spec should match the partitioning strategy of the partitioned table. In the case of a range partition, the values in START and/or END should match the columns in the partition key.

Defining a list partition is fairly straightforward - just spell out the list of comma-separated values. An error is thrown if the list of values overlaps with one of the existing partitions' lists.

CREATE TABLE persons_by_state (name text, state text) PARTITION BY LIST ON (state);

CREATE TABLE persons_IL PARTITION OF persons_by_state FOR VALUES IN ('IL');

CREATE TABLE persons_fail PARTITION OF persons_by_state FOR VALUES IN ('IL');
ERROR: cannot create partition that overlaps with an existing one

For a range partition, there is more than one way:

Specify both START and END bounds: the resulting range should not overlap with the range(s) covered by existing partitions. An error is thrown otherwise. Although rare in practice, gaps between ranges are OK.

CREATE TABLE measurement(logdate date NOT NULL) PARTITION BY RANGE ON (logdate);

CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success

CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-15') END ('2006-03-01');
ERROR: cannot create partition that overlaps with an existing one

Specify only the START bound: adds the partition on the left of some range covered by existing partitions, provided no overlap occurs (also considering gaps between ranges, if any). If no such range exists, the new partition will cover the range [START, +INFINITY) and become the rightmost partition. An error is thrown if the specified START causes overlap.
CREATE TABLE measurement_y2006m01 PARTITION OF measurement FOR VALUES START ('2006-01-01'); --success

CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one

Specify only the END bound: adds the partition on the right of some range covered by existing partitions, provided no overlap occurs (also considering gaps between ranges, if any). If no such range exists, the new partition will cover the range (-INFINITY, END) and become the leftmost partition. An error is thrown if the specified END causes overlap.

CREATE TABLE measurement_y2006m03 PARTITION OF measurement FOR VALUES END ('2006-04-01'); --success

CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one

For each partition, the START and END bound values are stored in the catalog. Note that the lower bound is inclusive, whereas the upper bound is exclusive.

Note: at most one range partition can have a null min bound in which case it covers the range
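The inclusive-lower/exclusive-upper convention makes the overlap check that raises the errors above quite simple; here is a minimal sketch in Python (illustrative only, not the patch's actual C implementation), where `None` stands for an open -/+ infinity bound:

```python
# Overlap rule for range partitions: each partition covers [start, end),
# with None meaning an unbounded side. ISO date strings compare correctly
# as plain strings, so they serve as bound values here.

def overlaps(a, b):
    """True if half-open ranges a and b share any point."""
    a_start, a_end = a
    b_start, b_end = b
    starts_before_b_ends = b_end is None or a_start is None or a_start < b_end
    starts_before_a_ends = a_end is None or b_start is None or b_start < a_end
    return starts_before_b_ends and starts_before_a_ends

def can_create(new, existing):
    """Mimic the 'cannot create partition that overlaps' check."""
    return not any(overlaps(new, p) for p in existing)

parts = [("2006-02-01", "2006-03-01")]          # measurement_y2006m02
print(can_create(("2006-01-01", "2006-02-01"), parts))  # True: adjacent is OK
print(can_create(("2006-02-15", "2006-03-01"), parts))  # False: overlaps
```

Because the upper bound is exclusive, a partition ending at '2006-02-01' and one starting at '2006-02-01' are adjacent, not overlapping.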
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On 18 Aug 2015, at 11:19, Albe Laurenz laurenz.a...@wien.gv.at wrote:
> Hans-Jürgen Schönig wrote:
>> in addition to that you have the “problem” of transactions. if you failover
>> in the middle of a transaction, strange things might happen from the
>> application point of view. the good thing, however, is that stupid middleware
>> is sometimes not able to handle failed connections. however, overall i think
>> it is more of a danger than a benefit.
>
> Maybe I misunderstood the original proposal, but my impression was that the
> alternative servers would be tried only at the time the connection is
> established, and there would be no such problems as you describe.

it would still leave the problem of having a read-only node on the other side unless you are using BDR or so.

regards, hans

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On Mon, Aug 17, 2015 at 12:06:42PM +0200, Andres Freund wrote:
> On 2015-08-16 03:31:48 -0400, Noah Misch wrote:
>>> I'd love to make it a #warning instead of an error, but unfortunately
>>> that's not standard C :(
>>
>> Okay. Other than that benefit, making headers #error-on-FRONTEND mostly lets
>> us congratulate ourselves for having introduced the start of a header layer
>> distinction. I'd be for that if PostgreSQL were new, but I can't justify
>> doing it at the user cost already apparent. That would be bad business.
>
> To me that's basically saying that we'll never ever have any better
> separation between frontend/backend headers, since each incremental
> improvement won't be justifiable.

Exactly. This is one of those proposals that can never repay its costs. Header refactoring seduces hackers, but the benefits don't materialize.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb array-style subscripting
On Mon, Aug 17, 2015 at 11:26 PM, Peter Geoghegan p...@heroku.com wrote:
> On Mon, Aug 17, 2015 at 12:26 PM, Merlin Moncure mmonc...@gmail.com wrote:
>> ...is a good idea. postgres operators tend to return immutable copies of
>> the item they are referring to.
>
> This patch does not add an operator at all, actually. It feels like there
> ought to be an operator, but in fact there is not. The parser is hard-coded
> to recognize array-style subscripts, which this uses.
>
> While I'm certainly glad that Dmitry took the time to work on this, I think
> we will need an operator, too. Or, more accurately, there should probably be
> a way to make something like this use some available GIN index:
>
> postgres=# explain analyze select * from testjsonb where p['a'] = '[1]';
>                             QUERY PLAN
> -------------------------------------------------------------------
>  Seq Scan on testjsonb  (cost=0.00..27.00 rows=7 width=32) (actual time=0.022..0.023 rows=1 loops=1)
>    Filter: (p['a'] = '[1]'::jsonb)
>  Planning time: 0.070 ms
>  Execution time: 0.054 ms
> (4 rows)
>
> This doesn't really matter with arrays, but ISTM that it matters here. I
> have no strong feelings on how it should work, but certain things do seem to
> suggest themselves. For example, maybe the parser can be made to create a
> query tree that uses an indexable operator based on special-case logic.
> Although maybe that's a kludge too far, since I can imagine it breaking
> other legitimate things. My sense is that this will need to be discussed.

Peter, we are thinking about better indexing of subselects; let's first have the syntax sugar in core, which Dmitry implemented.

> --
> Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
Victor Wagner wrote:
> Rationale
> =========
>
> Since the introduction of WAL-based replication into PostgreSQL, it has been
> possible to create high-availability and load-balancing clusters. However,
> there is no support for failover in the client libraries. So the only way to
> provide failover that is transparent to the client application is IP address
> migration. This approach has some limitations, i.e. it requires that the
> master and backup servers reside in the same subnet, or it may not be
> feasible for other reasons.
>
> Commercial RDBMS, such as Oracle, employ a more flexible approach. They allow
> specifying multiple servers in the connect string, so if the primary server
> is not available, the client library tries to connect to the other ones.
> This approach allows geographically distributed failover clusters and is
> also a cheap way to implement load balancing (which is not possible with IP
> address migration).

I wonder how useful this is at the present time.

If the primary goes down and the client gets connected to the standby, it would have read-only access there. Most applications wouldn't cope well with that.

Once we have multi-master replication that can be used for fail-over, the picture will change. Then a feature like that would be very useful indeed.

> host=main-server host=standby1 host=standby2 port=5432 dbname=database

It seems a bit arbitrary to require that all servers use the same port. Maybe parameters like host2, port2, host3, port3 etc. might be better.

Yours,
Laurenz Albe

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
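The host2/port2 alternative Laurenz suggests could pair each host with its own port, unlike the single shared `port` in the proposed connect string. A hedged sketch of how such parameters might be collected (hypothetical parsing logic, not an actual libpq behavior):

```python
# Sketch of pairing hostN/portN connection parameters, per the
# "host2, port2, host3, port3" suggestion; purely illustrative.

def parse_targets(params, default_port=5432):
    """Collect (host, port) pairs from host, host2, host3, ... keys."""
    targets = []
    n = 1
    while True:
        hkey = "host" if n == 1 else "host%d" % n
        pkey = "port" if n == 1 else "port%d" % n
        if hkey not in params:
            break
        targets.append((params[hkey], int(params.get(pkey, default_port))))
        n += 1
    return targets

print(parse_targets({"host": "main-server", "host2": "standby1",
                     "port2": "5433", "dbname": "database"}))
# [('main-server', 5432), ('standby1', 5433)]
```

Each host gets its own port, with a per-host fallback to the default, which is exactly what the shared `port=5432` form cannot express.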
Re: [HACKERS] [Proposal] Table partition + join pushdown
Hello Kondo-san,

I briefly checked your patch. Let me put some comments about its design and implementation, even though I have no arguments towards its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown() called by try_hashjoin_path() constructs the RelOptInfo of the join-rel between the inner-rel and a subpath of the Append node. It is an entirely wrong implementation. I can understand we (may) have no RelOptInfo for the joinrel between tbl_child_0 and other_table, when the planner investigates a joinpath to process the join of the Append path with other_table.

HashJoin
  -> Append
       -> SeqScan on tbl_child_0
       -> SeqScan on tbl_child_1
       -> SeqScan on tbl_child_2
       -> SeqScan on tbl_child_3
  -> Hash
       -> SeqScan on other_table

How about these alternatives?

- Call make_join_rel() on the pair of tbl_child_X and other_table from try_hashjoin_pushdown() or somewhere. make_join_rel() internally constructs a RelOptInfo for the supplied pair of relations, so the relevant RelOptInfo shall be properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() towards all the join logic, so it makes it easier to support pushing down other join logic, including nested-loop or custom-join.
- It may be an idea to add an extra argument to make_join_rel() to inform expressions to be applied for tuple filtering on construction of the inner hash table.

* Why only SeqScan is supported

I think it is the role of the Hash node to filter out inner tuples obviously unrelated to the join (if the CHECK constraint of the outer relation gives the information), because this join-pushdown may be able to support multi-stacked pushdown. For example, what if the planner considers a path to join this Append path with another relation, and the join clause contains a reference to X?

Append
  -> HashJoin
       -> SeqScan on tbl_child_0
       -> Hash ... Filter: hash_func(X) % 4 = 0
            -> SeqScan on other_table
  -> HashJoin
       -> SeqScan on tbl_child_1
       -> Hash ... Filter: hash_func(X) % 4 = 1
            -> SeqScan on other_table
  -> HashJoin
       -> SeqScan on tbl_child_2
       -> Hash ... Filter: hash_func(X) % 4 = 2
            -> SeqScan on other_table
  -> HashJoin
       -> SeqScan on tbl_child_3
       -> Hash ... Filter: hash_func(X) % 4 = 3
            -> SeqScan on other_table

It may be a good challenge to consider additional join pushdown, even if the subpaths of the Append are HashJoins, not SeqScans, like:

Append
  -> HashJoin
       -> HashJoin
            -> SeqScan on tbl_child_0
            -> Hash ... Filter: hash_func(X) % 4 = 0
                 -> SeqScan on other_table
       -> Hash ... Filter: hash_func(X) % 4 = 0
            -> SeqScan on another_table
  -> HashJoin
       -> HashJoin
            -> SeqScan on tbl_child_1
            -> Hash ... Filter: hash_func(X) % 4 = 1
                 -> SeqScan on other_table
       -> Hash ... Filter: hash_func(X) % 4 = 1
            -> SeqScan on another_table
  -> HashJoin
       -> HashJoin
            -> SeqScan on tbl_child_2
            -> Hash ... Filter: hash_func(X) % 4 = 2
                 -> SeqScan on other_table
       -> Hash ... Filter: hash_func(X) % 4 = 2
            -> SeqScan on another_table
  -> HashJoin
       -> HashJoin
            -> SeqScan on tbl_child_3
            -> Hash ... Filter: hash_func(X) % 4 = 3
                 -> SeqScan on other_table
       -> Hash ... Filter: hash_func(X) % 4 = 3
            -> SeqScan on another_table

In this case, the underlying nodes are not always SeqScans, so only the Hash node can have the filter clauses.

* Way to handle expression nodes

All this patch supports is a CHECK() constraint with an equality operation on the INT4 data type. You can learn about various useful pieces of PostgreSQL infrastructure. For example:

- expression_tree_mutator() is useful to make a copy of an expression node with small modifications.
- pull_varnos() is useful to check which relations are referenced by an expression node.
- RestrictInfo->can_join is useful to check whether the clause is a binary operator or not.

Anyway, reuse of existing infrastructure is the best way to make a reliable implementation and to keep it simple.
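The effect of the per-partition Hash filter in the plan trees above can be sketched in a few lines — an illustrative Python model, not the patch's C code; `hash_func` and the modulus 4 are taken from the example plans:

```python
# For partition i (holding rows with hash_func(X) % 4 == i), the Hash node
# only loads the slice of other_table whose join key hashes to i, shrinking
# each inner hash table to roughly 1/4 of the full relation.

def hash_func(x):
    return hash(x)  # stand-in for the partitioning hash function

def pushed_down_hash_join(partitions, other_table, key, nparts=4):
    """Join each partition against only its matching slice of other_table."""
    results = []
    for i, part in enumerate(partitions):
        # Hash node with Filter: hash_func(X) % nparts = i
        hash_table = {}
        for row in other_table:
            if hash_func(row[key]) % nparts == i:
                hash_table.setdefault(row[key], []).append(row)
        for row in part:
            for match in hash_table.get(row[key], []):
                results.append((row, match))
    return results

parts = [[{"x": v} for v in range(100) if hash_func(v) % 4 == i]
         for i in range(4)]
other = [{"x": v} for v in range(0, 100, 10)]
joined = pushed_down_hash_join(parts, other, "x")
print(len(joined))  # 10 matches, same as an unfiltered join
```

The filter is safe because any `other_table` row whose key hashes to a different partition number can never find an equal key inside that partition, so discarding it early cannot lose matches.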
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com

> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Taiki Kondo
> Sent: Thursday, August 13, 2015 6:30 PM
> To: pgsql-hackers@postgresql.org
> Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
> Subject: [HACKERS] [Proposal] Table partition + join pushdown
>
> Hi all,
>
> I saw the email about the idea from KaiGai-san[1], and I worked to implement
> this idea. Now, I have implemented a part of this idea, so I want to propose
> this feature. The patch attached just shows my concept of this feature. It
> works fine for EXPLAIN, but it returns wrong results for other operations,
> sadly.
>
> Table partition + join pushdown
> ===============================
>
> Motivation
> ----------
> To make join logic work more effectively, it is important to
Re: [HACKERS] checkpointer continuous flushing
Hello Amit,

>> So the option is best kept as off for now, without further data, I'm fine
>> with that.
>
> One point to think about here is on what basis a user can decide to turn
> this option on; is it predictable in any way? I think one case could be when
> the data set fits in shared_buffers.

Yep.

> In general, providing an option is a good idea if the user can decide with
> ease when to use that option, or we can give some clear recommendation for
> it. Otherwise one has to recommend "test your workload with this option, and
> if it works then great, else don't use it", which might also be okay in some
> cases, but it is better to be clear.

My opinion, which is not backed by any data (anyone can feel free to provide a FreeBSD box for testing...), is that if you have a significant write load it would mostly be an improvement to have the flush option on when running on non-Linux systems which provide posix_fadvise. If you have a lot of reads and few writes, then postgresql currently works reasonably enough, which is why people do not complain too much about write stalls, and I expect that the situation would not be significantly degraded. Now there are competing positive and negative effects induced by using posix_fadvise, and moreover its implementation varies from OS to OS, so without running some experiments it is hard to be definite.

> One minor point: while glancing through the patch, I noticed that a couple
> of multiline comments are not written in the way usually used in the code
> (keep the first line empty).

Indeed. Please find attached a v10, where I have reviewed comments for style and contents, and also slightly extended the documentation about the flush option to hint that it is essentially useful for high write loads. Without further data, I think it is not obvious to give more definite advice.
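The buffer sorting that the attached patch enables via `checkpoint_sort` can be pictured with a short sketch — illustrative Python under assumed names (`relfile`, `block`, `checkpoint_write_order`), not the actual C implementation:

```python
# Illustrative sketch of checkpoint buffer sorting: writing dirty buffers
# ordered by (relation file, block number) turns the essentially random
# buffer-pool order into mostly-sequential writes, which helps HDDs most.

from random import shuffle

def checkpoint_write_order(dirty_buffers, sort=True):
    """Return the order in which dirty buffers would be written."""
    if sort:
        return sorted(dirty_buffers, key=lambda b: (b["relfile"], b["block"]))
    return list(dirty_buffers)

buffers = [{"relfile": r, "block": b} for r in range(2) for b in range(4)]
shuffle(buffers)  # buffer pool order is essentially random

ordered = checkpoint_write_order(buffers)
print(ordered[0])  # {'relfile': 0, 'block': 0}
```

On an SSD the write order matters far less, which is why the discussion above treats the option as mainly a HDD (and possibly wear-leveling) benefit.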
--
Fabien.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..1cec243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+      <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+      <indexterm>
+       <primary><varname>checkpoint_sort</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to sort buffers before writing them out to disk on checkpoint.
+        For HDD storage, this setting allows grouping together
+        neighboring pages written to disk, thus improving performance by
+        reducing random write activity.
+        This sorting should have limited performance effects on SSD backends
+        as such storage has good random write performance, but it may
+        help with wear-leveling, so it may be worth keeping anyway.
+        The default is <literal>on</>.
+        This parameter can only be set in the <filename>postgresql.conf</>
+        file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
       <term><varname>checkpoint_warning</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
   </para>
 
   <para>
+    When hard-disk drives (HDD) are used for terminal data storage,
+    <xref linkend="guc-checkpoint-sort"> allows sorting pages
+    so that neighboring pages on disk will be flushed together by
+    checkpoints, reducing the random write load and improving performance.
+    If solid-state drives (SSD) are used, sorting pages induces no benefit
+    as their random write I/O performance is good: this feature could then
+    be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+    It is possible that sorting may help with SSD wear leveling, so it may
+    be kept on that account.
+   </para>
+
+   <para>
     The number of WAL segment files in <filename>pg_xlog</> directory depends on
     <varname>min_wal_size</>, <varname>max_wal_size</> and
     the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
 				sync_secs,
 				total_secs,
 				longest_secs,
+				sort_secs,
 				average_secs;
 	int			write_usecs,
 				sync_usecs,
 				total_usecs,
 				longest_usecs,
+				sort_usecs,
 				average_usecs;
 	uint64		average_sync_time;
 
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
 						CheckpointStats.ckpt_end_t,
 						&total_secs, &total_usecs);
 
+	TimestampDifference(CheckpointStats.ckpt_sort_t,
+						CheckpointStats.ckpt_sort_end_t,
+						&sort_secs, &sort_usecs);
+
 	/*
 	 * Timing values returned from CheckpointStats are in
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On 18 Aug 2015, at 10:32, Albe Laurenz laurenz.a...@wien.gv.at wrote:
> Victor Wagner wrote:
>> Rationale
>> =========
>>
>> Since the introduction of WAL-based replication into PostgreSQL, it has been
>> possible to create high-availability and load-balancing clusters. However,
>> there is no support for failover in the client libraries. So the only way to
>> provide failover that is transparent to the client application is IP address
>> migration. This approach has some limitations, i.e. it requires that the
>> master and backup servers reside in the same subnet, or it may not be
>> feasible for other reasons.
>>
>> Commercial RDBMS, such as Oracle, employ a more flexible approach. They
>> allow specifying multiple servers in the connect string, so if the primary
>> server is not available, the client library tries to connect to the other
>> ones. This approach allows geographically distributed failover clusters and
>> is also a cheap way to implement load balancing (which is not possible with
>> IP address migration).
>
> I wonder how useful this is at the present time.
>
> If the primary goes down and the client gets connected to the standby, it
> would have read-only access there. Most applications wouldn't cope well with
> that.
>
> Once we have multi-master replication that can be used for fail-over, the
> picture will change. Then a feature like that would be very useful indeed.
>
>> host=main-server host=standby1 host=standby2 port=5432 dbname=database
>
> It seems a bit arbitrary to require that all servers use the same port.
> Maybe parameters like host2, port2, host3, port3 etc. might be better.
>
> Yours,
> Laurenz Albe

i totally agree with laurenz. in addition to that you have the “problem” of transactions. if you failover in the middle of a transaction, strange things might happen from the application point of view. the good thing, however, is that stupid middleware is sometimes not able to handle failed connections. however, overall i think it is more of a danger than a benefit.

regards, hans

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] allowing wal_level change at run time
On Tue, Aug 18, 2015 at 7:59 AM, Peter Eisentraut pete...@gmx.net wrote:
> How would we handle decreases at run time? We can prevent >= archive ->
> minimal if archiving is running or there are physical replication slots, and
> we can prevent logical -> something less if there are logical replication
> slots, but AFAICT, we don't have a way to check whether anyone currently
> needs level hot_standby.

What do you mean by "prevent"? If the user edits postgresql.conf and reduces the setting, and then reloads the configuration file, they have a right to expect that the changes got applied.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] allowing wal_level change at run time
In [1], it was discussed to make wal_level changeable at run time (PGC_SIGHUP), as part of an effort to make replication easier to set up and/or provide better defaults. I was wondering what it would take to do that.

It looks like increasing the setting is doable, as long as we WAL-log the change using existing facilities. I don't understand the hot_standby -> logical transition, so maybe something else is necessary there.

How would we handle decreases at run time? We can prevent >= archive -> minimal if archiving is running or there are physical replication slots, and we can prevent logical -> something less if there are logical replication slots, but AFAICT, we don't have a way to check whether anyone currently needs level hot_standby. (Thread [1] originally proposed to get rid of the archive/hot_standby distinction, which would help here.)

Or we could just let users make their own mess if they want to.

Comments?

[1] http://www.postgresql.org/message-id/CA+TgmoZLji0tAjoVOPFAPJ8d2e8Q=zm4zdtqxska3j+z-pn...@mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
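The decrease rules being discussed can be summarized in a small validity check — a hedged sketch with hypothetical predicate arguments, not actual PostgreSQL code:

```python
# Sketch of the proposed run-time decrease checks for wal_level. The level
# ordering and the two preventable transitions come from the discussion;
# the boolean/count arguments are hypothetical stand-ins for server state.

LEVELS = ["minimal", "archive", "hot_standby", "logical"]

def can_decrease(current, new, archiving_active, physical_slots, logical_slots):
    """Check whether lowering wal_level from current to new is safe."""
    cur, tgt = LEVELS.index(current), LEVELS.index(new)
    if tgt >= cur:
        return True  # not a decrease; increases are handled separately
    if new == "minimal" and (archiving_active or physical_slots):
        return False  # >= archive -> minimal: archiver/physical slots in use
    if current == "logical" and logical_slots:
        return False  # logical -> something less: logical slots in use
    # no way to tell whether anyone still needs level hot_standby
    return True

print(can_decrease("logical", "hot_standby", False, 0, 2))  # False
```

The unchecked fall-through at the end mirrors the open problem in the email: there is no server-side state that says whether level hot_standby is still needed.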
Re: [HACKERS] Test code is worth the space
On Tue, Aug 18, 2015 at 02:03:19PM +0100, Greg Stark wrote:
> On Tue, Aug 18, 2015 at 6:57 AM, Noah Misch n...@leadboat.com wrote:
>> I suspect any effort to significantly improve Postgres test coverage is
>> doomed until there's an alternative to pg_regress.
>
> There is the src/test/perl/TestLib.pm harness.
>
> Sadly I think the test suite is only half the battle. The coding style of
> Postgres predates modern test suite systems and makes it hard to test. Most
> functions require a specific environment set up that would be laborious and
> difficult to do in any sane way. Even something as self-contained as
> tuplesort would be difficult to test without the whole operator class and
> syscache mechanisms initialized and populated. And that's an easy case,
> imagine trying to test individual functions in the planner without doing a
> complete planner run on a query.

I'm given to understand that this tight coupling is necessary for performance. Are you saying that it could be unwound, or that testing strategies mostly need to take it into account, or...?

Cheers,
David.
--
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning
Hi Thom,

On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown t...@linux.com wrote:
> Wow, didn't expect to see that email this morning.
>
> A very quick test:
>
> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item text)
>   PARTITION BY RANGE ON ((extract(year from purchase_time)), (extract(month from purchase_time)));
> ERROR: referenced relation "purchases" is not a table or foreign table

Thanks for the quick test. Damn, I somehow missed adding the new relkind to a check in process_owned_by(). Will fix this and look for any such oversights.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] allowing wal_level change at run time
On Tue, Aug 18, 2015 at 1:59 PM, Peter Eisentraut pete...@gmx.net wrote:
> In [1], it was discussed to make wal_level changeable at run time
> (PGC_SIGHUP), as part of an effort to make replication easier to set up
> and/or provide better defaults. I was wondering what it would take to do
> that.
>
> It looks like increasing the setting is doable, as long as we WAL-log the
> change using existing facilities. I don't understand the hot_standby ->
> logical transition, so maybe something else is necessary there.
>
> How would we handle decreases at run time? We can prevent >= archive ->
> minimal if archiving is running or there are physical replication slots, and
> we can prevent logical -> something less if there are logical replication
> slots, but AFAICT, we don't have a way to check whether anyone currently
> needs level hot_standby. (Thread [1] originally proposed to get rid of the
> archive/hot_standby distinction, which would help here.)
>
> Or we could just let users make their own mess if they want to.

I still think we should get rid of the difference between archive/hot_standby. It creates more trouble than I've ever seen it save. I think we can safely label that as something we added because we didn't know if it was going to be needed (because back in 9.0 nobody knew what the impact *really* would be), but now we know it's not necessary, so we can just get rid of it. Especially as it makes something like this easier. Removing the complexity of such important parts of the code is a feature in itself!

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] Test code is worth the space
On Tue, Aug 18, 2015 at 6:57 AM, Noah Misch n...@leadboat.com wrote:
> My own position is based on having maintained a pg_regress suite an order of
> magnitude larger than that. I don't know why that outcome was so different.

Comparing the size of test suites by these numbers is impossible because people put more or fewer tests in each schedule file. I would be more interested in the run-time as a metric, but even that is fallible, as I've seen individual tests that took 30min to run. And does your pg_regress test suite actually find many bugs, or does it mainly detect when functionality has changed and require updating expected results to match?

> I suspect any effort to significantly improve Postgres test coverage is
> doomed until there's an alternative to pg_regress.

There is the src/test/perl/TestLib.pm harness.

Sadly I think the test suite is only half the battle. The coding style of Postgres predates modern test suite systems and makes it hard to test. Most functions require a specific environment set up that would be laborious and difficult to do in any sane way. Even something as self-contained as tuplesort would be difficult to test without the whole operator class and syscache mechanisms initialized and populated. And that's an easy case, imagine trying to test individual functions in the planner without doing a complete planner run on a query.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Aug 10, 2015 at 11:05 AM, Michael Paquier michael.paqu...@gmail.com wrote:
> On Mon, Aug 10, 2015 at 12:39 AM, Robert Haas robertmh...@gmail.com wrote:
>> On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby jim.na...@bluetreble.com wrote:
>>> They also provide a level of control over what is and isn't installed in
>>> a cluster. Personally, I'd prefer that most users not even be aware of the
>>> existence of things like pageinspect.
>>
>> +1. [...] Extensions are a useful packaging mechanism for functionality
>> that is useful but not required, and debugging facilities are definitely
>> very useful but should not be required.
>
> +1.

Sorry to come to the discussion late. I have encountered many cases where pg_stat_statements and pgstattuple are required in production, so I basically agree with moving such extensions into core. But IMO the diagnostic tools for the visibility map, heap (pageinspect) and so on are a kind of debugging tool.

Attached are the latest v11 patches, which are separated into 2 patches: a frozen-bit patch and a diagnostic-function patch. Moving the diagnostic functions into core is still under discussion, but this patch puts such functions into core because the diagnostic function for the visibility map needs to be in core to run the regression tests, at least.

Regards,
--
Masahiko Sawada

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	 * If the page has only visible tuples, then we can find out the free
 	 * space from the FSM and move on.
*/ - if (visibilitymap_test(rel, blkno, &vmbuffer)) + if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE)) { freespace = GetRecordedFreeSpace(rel, blkno); stat->tuple_len += BLCKSZ - freespace; diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 3701d8e..dabd632 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, CheckForSerializableConflictIn(relation, NULL, InvalidBuffer); /* - * Find buffer to insert this tuple into. If the page is all visible, - * this will also pin the requisite visibility map page. + * Find buffer to insert this tuple into. If the page is all visible + * or all frozen, this will also pin the requisite visibility map and + * frozen map page. */ buffer = RelationGetBufferForTuple(relation, heaptup->t_len, InvalidBuffer, options, bistate, @@ -2192,7 +2193,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, if (PageIsAllVisible(BufferGetPage(buffer))) { all_visible_cleared = true; + + /* all-frozen information is also cleared at the same time */ PageClearAllVisible(BufferGetPage(buffer)); + PageClearAllFrozen(BufferGetPage(buffer)); + visibilitymap_clear(relation, ItemPointerGetBlockNumber(&(heaptup->t_self)), vmbuffer); @@ -2493,7 +2498,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, if (PageIsAllVisible(page)) { all_visible_cleared = true; + + /* all-frozen information is also cleared at the same time */ PageClearAllVisible(page); + PageClearAllFrozen(page); + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer); @@ -2776,9 +2785,9 @@ heap_delete(Relation relation, ItemPointer tid, /* * If we didn't pin the visibility map page and the page has become all - * visible while we were busy locking the buffer, we'll have to unlock and - * re-lock, to avoid holding the buffer lock across an I/O.
That's a bit - * unfortunate, but hopefully shouldn't happen often. + * visible or all frozen while we were busy locking the buffer, we'll + * have to unlock and re-lock, to avoid holding the buffer lock across an + * I/O. That's a bit unfortunate, but hopefully shouldn't happen often. */ if (vmbuffer == InvalidBuffer && PageIsAllVisible(page)) { @@ -2972,10 +2981,15 @@ l1: */ PageSetPrunable(page, xid); + /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */ if (PageIsAllVisible(page)) { all_visible_cleared = true; + + /* all-frozen information is also cleared at the same time */ PageClearAllVisible(page); + PageClearAllFrozen(page); + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer); } @@ -3254,7 +3268,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, * in the middle of changing this, so we'll need to recheck after we have * the lock. */ - if (PageIsAllVisible(page)) + if (PageIsAllVisible(page) || PageIsAllFrozen(page)) visibilitymap_pin(relation, block, &vmbuffer); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); @@ -3850,14 +3864,22 @@ l2: if (PageIsAllVisible(BufferGetPage(buffer))) { all_visible_cleared = true; + +
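The clearing logic repeated throughout the patch above maintains one invariant: a page can only be marked all-frozen if it is also all-visible, so any operation that clears visibility must clear both flags together. A minimal standalone sketch of that invariant (the flag values here are illustrative stand-ins, not PostgreSQL's actual page-header bits):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the PD_ALL_VISIBLE / PD_ALL_FROZEN page flags. */
#define ALL_VISIBLE 0x01u
#define ALL_FROZEN  0x02u

/* Any operation that clears visibility (insert, update, delete) clears
 * both bits, mirroring PageClearAllVisible + PageClearAllFrozen above. */
static uint8_t page_clear_visibility(uint8_t flags)
{
    return flags & (uint8_t) ~(ALL_VISIBLE | ALL_FROZEN);
}

/* Marking a page all-frozen implies it is all-visible. */
static uint8_t page_set_all_frozen(uint8_t flags)
{
    return flags | ALL_VISIBLE | ALL_FROZEN;
}
```

Keeping the two transitions paired like this is what lets readers trust that an all-frozen bit never survives on a page with new (unfrozen) tuples.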
Re: [HACKERS] missing documentation for partial WAL files
* Robert Haas (robertmh...@gmail.com) wrote: On Mon, Aug 17, 2015 at 2:50 PM, Peter Eisentraut pete...@gmx.net wrote: The commit message for de76884 contains some important information about the purpose and use of the new .partial WAL files. But I don't see anything about this in the documentation or another user-visible place. We should probably add something. +1. Heikki talked about this at his PGCon presentation, and I thought that was pretty helpful stuff, but not everyone was there (or will remember what he said when the time comes that they need to know). Hopefully the audio or video of it will be posted eventually.. I was hoping to review it myself and to discuss the points he made in depth with David, to make sure we cover all of them (pretty sure we do, but good to verify). The link to the slides, at least, is here: http://www.pgcon.org/2015/schedule/attachments/379_PGCon2015-Warm-standby-done-right.pdf Thanks! Stephen
Re: [HACKERS] Error message with plpgsql CONTINUE
Hi 2015-08-17 23:46 GMT+02:00 Tom Lane t...@sss.pgh.pa.us: Jim Nasby jim.na...@bluetreble.com writes: On 8/17/15 9:48 AM, Tom Lane wrote: I'm inclined to think that if we wanted to make this better, the way to improve it would be to detect the error *at compile time*, and get rid of this hack in plpgsql_exec_function altogether. So split PLPGSQL_NSTYPE_LABEL into PLPGSQL_NSTYPE_BLOCK_LABEL and PLPGSQL_NSTYPE_LOOP_LABEL, and split opt_block_label and opt_label the same way? I think using two NSTYPE codes would probably be a pain because there are numerous places that don't care about the distinction; it'd be better to have a secondary attribute distinguishing these cases. (It looks like you could perhaps reuse the itemno field for the purpose, since that seems to be going unused in LABEL items.) You likely do need to split opt_block_label into two productions, since that will be the easiest way to pass forward the knowledge of whether it's being called from a loop or non-loop construct. When I implemented this check in plpgsql_check I found another minor issue in the CONTINUE statement - the typename is wrong. Regards Pavel regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c new file mode 100644 index 7b26970..7603441 *** a/src/pl/plpgsql/src/pl_funcs.c --- b/src/pl/plpgsql/src/pl_funcs.c *** plpgsql_stmt_typename(PLpgSQL_stmt *stmt *** 235,241 case PLPGSQL_STMT_FOREACH_A: return _("FOREACH over array"); case PLPGSQL_STMT_EXIT: ! return "EXIT"; case PLPGSQL_STMT_RETURN: return "RETURN"; case PLPGSQL_STMT_RETURN_NEXT: --- 235,241 case PLPGSQL_STMT_FOREACH_A: return _("FOREACH over array"); case PLPGSQL_STMT_EXIT: ! return ((PLpgSQL_stmt_exit *) stmt)->is_exit ? "EXIT" : "CONTINUE"; case PLPGSQL_STMT_RETURN: return "RETURN"; case PLPGSQL_STMT_RETURN_NEXT: -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning
great :-) 2. Creating a partition of a partitioned table CREATE TABLE table_name PARTITION OF partitioned_table_name FOR VALUES values_spec; Where values_spec is: listvalues: [IN] (val1, ...) Would it make sense to allow one complementary partition to the listvalues? listvalues: [[NOT] IN] (val1, ...) I've thought a few times about moving data with some of the most common values to dedicated partitions and keeping the rest in a separate one... best regards, Marc Mamin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
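The complementary partition idea amounts to a routing rule: a row goes to the partition whose IN list contains its key value, and to the single [NOT IN] partition otherwise. A hypothetical sketch of such a tuple router (one value per list partition for brevity; names are illustrative, not PostgreSQL internals):

```c
#include <assert.h>

/* Hypothetical tuple router: returns the index of the list partition
 * whose value matches v, or complement_idx (the "[NOT IN]" partition)
 * when no list partition claims the value. */
static int route_list_value(int v, const int *part_values, int nparts,
                            int complement_idx)
{
    for (int i = 0; i < nparts; i++)
        if (part_values[i] == v)
            return i;
    return complement_idx;
}
```

With a complement partition the router never fails to find a destination, which is exactly the appeal for "most common values get their own partition, the rest share one".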
Re: [HACKERS] missing documentation for partial WAL files
* Stephen Frost (sfr...@snowman.net) wrote: * Robert Haas (robertmh...@gmail.com) wrote: On Mon, Aug 17, 2015 at 2:50 PM, Peter Eisentraut pete...@gmx.net wrote: The commit message for de76884 contains some important information about the purpose and use of the new .partial WAL files. But I don't see anything about this in the documentation or another user-visible place. We should probably add something. +1. Heikki talked about this at his PGCon presentation, and I thought that was pretty helpful stuff, but not everyone was there (or will remember what he said when the time comes that they need to know). Hopefully the audio or video of it will be posted eventually.. I was hoping to review it myself and to discuss the points he made in depth with David, to make sure we cover all of them (pretty sure we do, but good to verify). The link to the slides, at least, is here: http://www.pgcon.org/2015/schedule/attachments/379_PGCon2015-Warm-standby-done-right.pdf Dan corrected me, it's been posted already, here: https://www.youtube.com/watch?v=mlQ90MntJwM&list=PLWW0CjV-TafZo4lBWuzw7OYJY7Y4SW76B&index=17 Thanks! Stephen
Re: [HACKERS] Autonomous Transaction is back
On Tue, Aug 18, 2015 at 8:17 AM, Robert Haas robertmh...@gmail.com wrote: On Sat, Aug 15, 2015 at 6:47 PM, Noah Misch n...@leadboat.com wrote: CREATE TABLE t (c) AS SELECT 1; BEGIN; UPDATE t SET c = 2 WHERE c = 1; BEGIN_AUTONOMOUS; UPDATE t SET c = 3 WHERE c = 1; UPDATE t SET c = 4 WHERE c = 2; COMMIT_AUTONOMOUS; ROLLBACK; If you replace the autonomous transaction with a savepoint, the c=3 update finds no rows, and the c=4 update changes one row. When the outer transaction aborts, only the original c=1 row remains live. If you replace the autonomous transaction with a dblink/pg_background call, the c=3 update waits indefinitely for c=2 to commit or abort, an undetected deadlock. Suppose you make the autonomous transaction see tuples like in the savepoint case. The c=3 update finds no rows to update, and the c=4 update changes one row. When the outer transaction aborts, you have two live rows (c=1 and c=4). Suppose you instead make the autonomous transaction see tuples like in the dblink case, yet let c=3 ignore the lock and change a row. If both the autonomous transaction and the outer transaction were to commit, then you get two live rows (c=2 and c=3). Neither of those outcomes is acceptable, of course. In today's tuple update rules, c=3 must deadlock[1]. Other credible tuple update rules may not have this problem, but nothing jumps to mind. [1] That's not to say it must use the shmem lock structures and deadlock detector. This footnote goes to my point. It seems clear to me that having the autonomous transaction see the effects of the outer uncommitted transaction is a recipe for trouble. If the autonomous transaction updates a row and commits, and the outer transaction later aborts, the resulting state is inconsistent with any serial history. I'm fairly certain that's going to leave us in an unhappy place. Even more obviously, ending up with two committed row versions that are both updates of a single ancestor version is no good. 
So, I agree that this scenario should be an error. What I don't agree with is the idea that it should be the deadlock detector's job to throw that error. Rather, I think that when we examine the xmax of the tuple we can see - which is the original one, not the one updated by the outer transaction - we should check whether that XID belongs to an outer transaction. Hm: do you mean 'an' outer transaction (meaning, basically, any in progress transaction) or the outer transaction of the AT. I think you mean outer transaction of the AT, which makes a lot of sense and should be easy and fast to test. It's like an implied NOWAIT if the locker is the AT and the lockee is the parent. Can you get away with only looking at tuples though? For example, what about advisory locks? Table locks? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
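Robert's suggestion - checking whether the tuple's locker is an outer transaction of the autonomous transaction (AT), and throwing an error immediately rather than leaving the case to the deadlock detector - can be sketched like this (names are hypothetical, not actual PostgreSQL internals):

```c
#include <assert.h>

typedef enum { TUPLE_WAIT, TUPLE_ERROR } TupleLockAction;

/* If the xmax we see belongs to one of our own (suspended) outer
 * transactions, waiting can never succeed: the outer transaction cannot
 * commit or abort until the AT finishes.  So raise an error instead of
 * waiting - an implied NOWAIT when the locker is our parent. */
static TupleLockAction decide_tuple_lock(unsigned locker_xid,
                                         const unsigned *outer_xids,
                                         int n_outer)
{
    for (int i = 0; i < n_outer; i++)
        if (outer_xids[i] == locker_xid)
            return TUPLE_ERROR;
    return TUPLE_WAIT;   /* unrelated transaction: wait normally */
}
```

As Merlin notes, the open question is whether a tuple-level check like this suffices, or whether table locks and advisory locks need the same parent-vs-child test.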
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On Tue, Aug 18, 2015 at 12:53 PM, Jaime Casanova jaime.casan...@2ndquadrant.com wrote: This is not completely true, you can always use something like pgbouncer or other middleware to change the server to which clients connect. you still need to solve the fact that you will have a read-only server at the other side. something like repmgr + pgbouncer will work fine. Sure, but pgbouncer is an extra hop, and has its own foibles. There's real appeal to doing this in the client. i agree that once/if we ever have multimaster included then this could be a good idea I think it has a lot of appeal *now*. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning
Amit, I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP. First of all, wow! Really happy to see this. Syntax == 1. Creating a partitioned table CREATE TABLE table_name PARTITION BY {RANGE|LIST} ON (column_list); Where column_list consists of simple column names or expressions: PARTITION BY LIST ON (name) PARTITION BY RANGE ON (year, month) PARTITION BY LIST ON ((lower(left(name, 2)))) PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) So far so good. Have you given any thought as to how a user will determine which partition corresponds to which values (for purposes of dropping/maintaining a partition)? Also, won't doing things like extract() for range partitions make it much harder for you to develop the planner parts of this solution? What about defining an interval instead, such as: PARTITION BY RANGE USING ( interval ) ON ( column ); i.e. PARTITION BY RANGE USING ( INTERVAL '1 month' ) ON ( submitted_date ); PARTITION BY RANGE USING ( 10 ) ON ( user_id ); This would make it easy for you to construct range type values defining the range of each partition, which would then make the planner work much easier than calling a function would, no? Or am I misunderstanding how you're using ranges here? It kind of seems like you're still leaving specific range definitions up to the user, which is (from my perspective) unsatisfactory (see below). I'm assuming that all range partitions will be [ closed, open ) ranges. 2. Creating a partition of a partitioned table CREATE TABLE table_name PARTITION OF partitioned_table_name FOR VALUES values_spec; Where values_spec is: listvalues: [IN] (val1, ...) rangevalues: START (col1min, ... ) END (col1max, ... ) | START (col1min, ... ) | END (col1max, ...
) So, one thing I missed in here is anything about automated partitioning of tables; that is, creating new partitions based on incoming data or a simple statement which doesn't require knowledge of the partitioning scheme. It's possible (and entirely acceptable) that you're considering automated partition creation outside of the scope of this patch. However, for range partitions, it would be *really* useful to have this syntax: CREATE NEXT PARTITION ON parent_table; Which would just create the next partition based on whatever the range partitioning scheme is, instead of requiring the user to calculate start and end values which might or might not match the parent partitioning scheme, and might leave gaps. Also this would be useful for range partitions: CREATE PARTITION ON parent_table USING ( start_value ); ... where start_value is the start range of the new partition. Again, easier for users to get correct. Both of these require the idea of regular intervals for range partitions, that is, on a table partitioned by month on a timestamptz column, each partition will have the range [ month:1, nextmonth:1 ). This is the most common use-case for range partitions (like, 95% of all partitioning cases I've seen), so a new partitioning scheme ought to address it. While there are certainly users who desire the ability to define arbitrary ranges for each range partition, these are by far the minority and could be accommodated by a different path with more complex syntax. Further, I'd wager that most users who want to define arbitrary ranges for range partitions aren't going to be satisfied with the other restrictions on declarative partitioning (e.g. same constraints, columns for all partitions) and are going to use inheritance partitioning anyway. 5. Detach partition ALTER TABLE partitioned_table DETACH PARTITION partition_name [USING table_name] This removes partition_name as a partition of partitioned_table.
The table continues to exist with the same name or 'table_name', if specified. pg_class.relispartition is set to false for the table, so it behaves like a normal table. What about DROPping partitions? Do they need to be detached first? Creating index on parent is not allowed. They should be defined on (leaf) partitions. Because of this limitation, primary keys are not allowed on a partitioned table. Perhaps, we should be able to just create a dummy entry somewhere to represent an index on parent (which every partition then copies.) This would be preferable, yes. Making users remember to manually create indexes on each partition is undesirable. What should TRUNCATE on partitioned table do? On the master table? Truncate all individual partitions. Do not drop the partitions. On a partition? Truncate just that partition. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
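The CREATE NEXT PARTITION idea above relies on regular intervals: given the partitioning width and the END bound of the last partition, the next [start, end) range is fully determined, so the user never computes bounds and no gaps can appear. A sketch with integer keys, as in the hypothetical `USING ( 10 ) ON ( user_id )` example:

```c
#include <assert.h>

typedef struct { long start; long end; } PartBounds;   /* [start, end) */

/* Hypothetical helper behind "CREATE NEXT PARTITION": the new partition
 * starts exactly where the previous one ended, so consecutive partitions
 * tile the key space with no gaps and no overlaps. */
static PartBounds next_partition_bounds(long last_end, long width)
{
    PartBounds b;
    b.start = last_end;
    b.end = last_end + width;
    return b;
}
```

The same calculation works for date/timestamp keys with an interval width; the point is that only the parent's stored interval, not user arithmetic, drives the bounds.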
Re: [HACKERS] allowing wal_level change at run time
On 2015-08-18 13:24:54 -0400, Peter Eisentraut wrote: On 8/18/15 12:35 PM, Robert Haas wrote: If archive_mode=on or max_wal_senders > 0, then we need at least wal_level=archive. Otherwise wal_level=minimal is enough. Totally forgot about max_wal_senders. However, the thread I linked to earlier aimed for a different master plan (or if not, I'm aiming for it now). There is camp 1, which wants to keep all the defaults the same, for performance or something like that. And there is camp 2, which wants to have a replication-friendly setup by default. Instead of fighting over this, your idea was to be able to switch between 1 and 2 easily (which means in particular without restarts). I don't think not requiring restarts is sufficient; having to twiddle a bunch of parameters manually still is a lot more effort than people see as necessary. The only reason I think changing the default is a good approach is that it's doable within a relatively short amount of time. But if we tie the effective wal_level to archive_mode or max_wal_senders, both of which are restart-only, then we haven't gained anything. (We would have removed half a GUC parameter, effectively. Not bad, but not very exciting.) ISTM that it's not too hard to a) make archive_mode PGC_SIGHUP b) make wal_level PGC_SIGHUP c) automatically increase wal_level to logical whenever a logical replication slot is defined it seems considerably harder to d) make max_wal_senders PGC_SIGHUP e) make max_replication_slots PGC_SIGHUP because they need shmem, locks, and everything. Therefore I propose something slightly different: We change the default of max_wal_senders, max_replication_slots, to some reasonably high setting and make wal_level an internal automagically determined variable. archive_mode = on automatically increases the level to what's now hot-standby. To deal with streaming replication, we automatically create slots for replicas, but, by default, *without* having them reserve WAL.
The slot name would be determined in some automatic fashion (uuid or something) and stored in a new file in the data directory. That allows us to increase the internal wal_level to hot_standby/archive whenever a replica has connected (and thus a physical slot exists), and to logical whenever a logical slot exists. Now, that'll mean that the wal_level doesn't decrease automatically until a slot has been dropped. But that seems relatively harmless if it's not reserving WAL. Too crazy? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
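Andres's proposal makes wal_level a derived value rather than a user-set GUC: archive_mode and any existing physical slot push it up to (what is now) hot_standby, and any logical slot pushes it to logical, dropping back only once the slots are gone. The ratchet could look roughly like this sketch (not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { WAL_MINIMAL, WAL_ARCHIVE, WAL_HOT_STANDBY, WAL_LOGICAL } WalLevel;

/* Internally determined wal_level: never set directly, only ratcheted
 * up by the features that need it, and recomputed when slots change. */
static WalLevel effective_wal_level(bool archive_mode,
                                    int n_physical_slots,
                                    int n_logical_slots)
{
    WalLevel level = WAL_MINIMAL;
    if (archive_mode || n_physical_slots > 0)
        level = WAL_HOT_STANDBY;
    if (n_logical_slots > 0)
        level = WAL_LOGICAL;
    return level;
}
```

The auto-created, non-WAL-reserving slots exist precisely so that `n_physical_slots` stays above zero for known replicas, keeping the derived level high without retaining WAL.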
Re: [HACKERS] Declarative partitioning
On Tue, Aug 18, 2015 at 6:30 AM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: Hi, I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP. Syntax == 1. Creating a partitioned table CREATE TABLE table_name PARTITION BY {RANGE|LIST} ON (column_list); Where column_list consists of simple column names or expressions: PARTITION BY LIST ON (name) PARTITION BY RANGE ON (year, month) PARTITION BY LIST ON ((lower(left(name, 2)))) PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) Note: LIST partition key supports only one column. For each column, you could write operator class name: PARTITION BY LIST/RANGE ON (colname [USING] opclass_name), If not specified, the default btree operator class based on type of each key column is used. If none of the available btree operator classes are compatible with the partitioning strategy (list/range), an error is thrown. Built-in btree operator classes cover a good number of types for list and range partitioning in practical scenarios. A table created using this form is of proposed new relkind RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is created to store partition key info. Note: A table cannot be partitioned after-the-fact using ALTER TABLE. Normal dependencies are created between the partitioned table and operator classes, objects in partition expressions like functions. 2. Creating a partition of a partitioned table CREATE TABLE table_name PARTITION OF partitioned_table_name FOR VALUES values_spec; Where values_spec is: listvalues: [IN] (val1, ...) rangevalues: START (col1min, ... ) END (col1max, ... ) | START (col1min, ... ) | END (col1max, ... ) A table created using this form has proposed pg_class.relispartition set to true. An entry in pg_partition (see below) is created to store the partition bound info.
The values_spec should match the partitioning strategy of the partitioned table. In case of a range partition, the values in START and/or END should match columns in the partition key. Defining a list partition is fairly straightforward - just spell out the list of comma-separated values. An error is thrown if the list of values overlaps with one of the existing partitions' list. CREATE TABLE persons_by_state (name text, state text) PARTITION BY LIST ON (state); CREATE TABLE persons_IL PARTITION OF persons_by_state FOR VALUES IN ('IL'); CREATE TABLE persons_fail PARTITION OF persons_by_state FOR VALUES IN ('IL'); ERROR: cannot create partition that overlaps with an existing one For a range partition, there is more than one way: Specify both START and END bounds: the resulting range should not overlap with the range(s) covered by existing partitions. An error is thrown otherwise. Although rare in practice, gaps between ranges are OK. CREATE TABLE measurement(logdate date NOT NULL) PARTITION BY RANGE ON (logdate); CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-15') END ('2006-03-01'); ERROR: cannot create partition that overlaps with an existing one Specify only the START bound: add the partition on the left of some range covered by existing partitions provided no overlap occurs (also considering gaps between ranges, if any). If no such range exists, the new partition will cover the range [START, +INFINITY) and become the rightmost partition. An error is thrown if the specified START causes overlap.
CREATE TABLE measurement_y2006m01 PARTITION OF measurement FOR VALUES START ('2006-01-01'); --success CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02 ERROR: cannot create partition that overlaps with an existing one Specify only the END bound: add the partition on the right of some range covered by existing partitions provided no overlap occurs (also considering gaps between ranges, if any). If no such range exists, the new partition would cover the range (-INFINITY, END) and become the leftmost partition. Error is thrown if the specified END causes overlap. CREATE TABLE measurement_y2006m03 PARTITION OF measurement FOR VALUES END ('2006-04-01'); --success CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02 ERROR: cannot create partition that overlaps with an existing one For each partition, START and END bound values are stored in the catalog. Note that the lower bound is inclusive, whereas the upper bound is exclusive. Note: At most one range partition can have null min bound in which case it covers the range
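Since every range partition is a half-open [START, END) interval, with a missing bound standing for -infinity or +infinity, the overlap check in all three cases above reduces to a single comparison. A sketch using sentinel values for the open bounds (illustrative, not the catalog representation):

```c
#include <assert.h>
#include <stdbool.h>
#include <limits.h>

/* [start, end) with LONG_MIN / LONG_MAX standing in for a missing
 * START (-infinity) or END (+infinity) bound. */
typedef struct { long start; long end; } PartRange;

/* Two half-open ranges overlap iff each starts before the other ends;
 * this is what "cannot create partition that overlaps" must detect. */
static bool ranges_overlap(PartRange a, PartRange b)
{
    return a.start < b.end && b.start < a.end;
}
```

Encoding dates as integers (e.g. 20060201 for 2006-02-01), the failing example above - [2006-02-15, 2006-03-01) against the existing [2006-02-01, 2006-03-01) - overlaps, while adjacent months do not, since a shared boundary value belongs only to the right-hand partition.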
Re: [HACKERS] WIP: SCRAM authentication
* Robert Haas (robertmh...@gmail.com) wrote: On Tue, Aug 18, 2015 at 2:07 PM, Stephen Frost sfr...@snowman.net wrote: SCRAM itself, as has been discussed, supports multiple password verifiers- that's a specific case all by itself, and it's done specifically to address the issue that one or another of the algorithms used is compromised, or that a new algorithm becomes available which is better. AD and Kerberos support multiple password verifiers because of this and that it allows you to migrate from one to the next without having to do wholesale replacment across all systems involved. I bring them up as examples of the advances in password-based authentication which we've missed and because they are what users expect from current password-based authentication systems, not because we support them and therefore should just push everyone to them. OK, that's an interesting argument. If SCRAM supports multiple password verifiers, and we support SCRAM, then I guess we should probably do that, too. I still don't like it all that much, though. I think it's absolutely inevitable that people are going to end up with an account with 3 or more different passwords that can all be used to log into it, and that won't be good. How do other systems avoid this pitfall? They provide: a) ability to configure which algorithms are used at change-password MIT Kerberos, kdc.conf: supported_enctypes and kadmin: enctype-salttype b) ability to configure which algorithms are allowed to be used at all MIT Kerberos, libdefaults: default_tgs_enctypes, default_tkt_enctypes, and permitted_enctypes, along with allow_weak_crypto which removes all those considered 'weak' from the other sets if set to false (the default). c) password aging (to get users to change their password regularly, so you know after, say, 90 days of the change to the set configured in 'a' that all users will have either expired passwords or have been updated to the latest set). 
Note that we don't need quite so many options; these were done over time for Kerberos and also address the different types of tickets involved in that system. Today, they discourage explicitly setting default_tgs_enctypes and default_tkt_enctypes. We really just need to allow configuration of which should be stored at change-password time and the set of allowed types. I wouldn't want those to be the same because you'd want to remove a compromised type as soon as possible from the change-password set while still allowing those users to connect, to get them to change their password using self-service (doing it all through the administrators of the system would be very painful in large organizations). We also don't strictly need password aging, but it certainly helps. We definitely do need to provide administrators with easy interfaces through functions and tools to manage the password verifiers used. Thanks! Stephen
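The two Kerberos-style sets Stephen describes - one controlling which verifier types are written at password change, a second controlling which are still accepted at login - can be modeled as simple bitmask policy checks. A sketch (flag values hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verifier-type flags. */
#define VER_MD5    0x1u
#define VER_SCRAM  0x2u

/* Login succeeds as long as the account holds at least one verifier of
 * a still-permitted type.  Dropping a compromised type from the
 * change-password set first (while leaving it in permitted_set) lets
 * affected users log in once more and self-service a password change,
 * instead of being locked out immediately. */
static bool login_possible(unsigned account_verifiers, unsigned permitted_set)
{
    return (account_verifiers & permitted_set) != 0;
}
```

Only after a grace period (or password aging forces a change) would the compromised type be removed from the permitted set as well, at which point stale accounts fail closed.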
Re: [HACKERS] allowing wal_level change at run time
On 8/18/15 12:35 PM, Robert Haas wrote: If archive_mode=on or max_wal_senders > 0, then we need at least wal_level=archive. Otherwise wal_level=minimal is enough. Totally forgot about max_wal_senders. However, the thread I linked to earlier aimed for a different master plan (or if not, I'm aiming for it now). There is camp 1, which wants to keep all the defaults the same, for performance or something like that. And there is camp 2, which wants to have a replication-friendly setup by default. Instead of fighting over this, your idea was to be able to switch between 1 and 2 easily (which means in particular without restarts). But if we tie the effective wal_level to archive_mode or max_wal_senders, both of which are restart-only, then we haven't gained anything. (We would have removed half a GUC parameter, effectively. Not bad, but not very exciting.) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: SCRAM authentication
On Tue, Aug 18, 2015 at 2:07 PM, Stephen Frost sfr...@snowman.net wrote: I would expect there to be people who would run into pg_upgrade complaining, that's why there would be the check. That's actually a much better situation than what happened around standard_conforming_strings. Further, users would be able to continue with their existing version until they're ready to move or it goes out of support, by which time, if their connector isn't updated, they should be moving off of it also. We hear about people running 8.4 and older because of some application which was never maintained or updated, and that sucks, but it doesn't prevent us from making the changes we need to make to move the project forward for the users who properly manage their systems and use supported connectors. Sorry, that's a completely bogus argument. We do not need to prevent people from upgrading if they haven't moved off of MD5 authentication; that's just an arbitrary - and IMHO extremely user-hostile - policy decision. We have a legitimate need, to move the project forward, to introduce a better system for password authentication. Ripping out the old one is not a real need. I'm sure at some point it will seem like antiquated cruft that nobody uses any more, but that will not be in a year or two or anything close to that. SCRAM itself, as has been discussed, supports multiple password verifiers - that's a specific case all by itself, and it's done specifically to address the issue that one or another of the algorithms used is compromised, or that a new algorithm becomes available which is better. AD and Kerberos support multiple password verifiers because of this and that it allows you to migrate from one to the next without having to do wholesale replacement across all systems involved.
I bring them up as examples of the advances in password-based authentication which we've missed and because they are what users expect from current password-based authentication systems, not because we support them and therefore should just push everyone to them. OK, that's an interesting argument. If SCRAM supports multiple password verifiers, and we support SCRAM, then I guess we should probably do that, too. I still don't like it all that much, though. I think it's absolutely inevitable that people are going to end up with an account with 3 or more different passwords that can all be used to log into it, and that won't be good. How do other systems avoid this pitfall? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [patch] psql tab completion for grant execute
On Mon, Aug 17, 2015 at 5:07 PM, Daniel Verite dan...@manitou-mail.org wrote: When tab-completing after GRANT EXECUTE, currently psql injects PROCEDURE, rather than the expected ON. The code for completing with ON is there, but it's not reached due to falling earlier into another branch, one that handles CREATE TRIGGER. A trivial patch is attached. It adds the condition that if EXECUTE is preceded by GRANT itself preceded by nothing, then that completion with PROCEDURE is skipped. Thanks, I committed this. I don't think we usually back-patch tab completion fixes, but I back-patched this one to 9.5 anyway, so that it gets out into the wild a little sooner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: SCRAM authentication
Robert, * Robert Haas (robertmh...@gmail.com) wrote: On Tue, Aug 18, 2015 at 10:06 AM, Stephen Frost sfr...@snowman.net wrote: That was the impetus for my earlier suggestion that in a release or two, we actively make pg_upgrade error (or perhaps warn first, then error, across two releases) if any of the old verifiers exist. I think there's basically no chance of that being acceptable. The time at which it's possible to get rid of MD5 is going to vary widely between installations. People who are using only libpq or libpq-derived connectors will be able to get rid of it almost immediately, if they want, though some won't. People who are using obscure connectors that are poorly maintained may not even have a version that supports SCRAM for 5 years. Think about how long it took us to roll out the standard_conforming_strings changes, and there were still people who got bitten. I would expect there to be people who would run into pg_upgrade complaining, that's why there would be the check. That's actually a much better situation than what happened around standard_conforming_strings. Further, users would be able to continue with their existing version until they're ready to move or it goes out of support, by which time, if their connector isn't updated, they should be moving off of it also. We hear about people running 8.4 and older because of some application which was never maintained or updated, and that sucks, but it doesn't prevent us from making the changes we need to make to move the project forward for the users who properly manage their systems and use supported connectors. The other concern with a single password verifier is that we're locking ourselves into a one-verifier-per-role solution when most of the serious solutions in use today (Active Directory, Kerberos, certificate based systems) allow for more than one. So what? If you want to delegate authentication to AD or Kerberos, we already support that. 
That's not a reason to invent the same functionality inside the server. If you've got a tangible plan, other than SCRAM, that would require us to support multiple verifiers, then please say what it is. If not, the mere fact that some other people support it is not a reason why we should. SCRAM itself, as has been discussed, supports multiple password verifiers - that's a specific case all by itself, and it's done specifically to address the issue that one or another of the algorithms used is compromised, or that a new algorithm becomes available which is better. AD and Kerberos support multiple password verifiers because of this, and because it allows you to migrate from one to the next without having to do wholesale replacement across all systems involved. I bring them up as examples of the advances in password-based authentication which we've missed and because they are what users expect from current password-based authentication systems, not because we support them and therefore should just push everyone to them. That said, if we aren't going to move password-based authentication in PG beyond the stone age it's currently in, then perhaps we should just drop it completely and force users to pick a solution which is well written, uses standard protocols, supports multiple ciphers and hashes (in case one or more become a problem) and is updated to keep up with the field. Obviously, that's a strawman and I appreciate that you are not arguing to keep the status quo, but I'm trying to make the point that multiple password verifiers are part of modern authentication technology, used by multiple standard and broadly installed solutions. In fact, we generally have taken the approach that needs which are already handled adequately by other tools do not need to also be handled inside the database. That's not a perfect approach and we always argue about it around the edges, but generally, I think it's served us pretty well. 
Unless we take my strawman literally, which I don't think we can, we can't simply punt on pushing all password-based authentication out of the server. If we agree that we need to support password-based authentication then we should be working to maintain it as well as we do the WAL and other subsystems. I fully agree that we should leverage existing systems, technologies, and protocols where possible, to avoid having to add a lot of extra code which we have to maintain. I believe that this proposal already meets that goal and we should move forward with full support for SCRAM. That, in my view at least, means that we also need to provide for multiple password verifiers. Thanks! Stephen signature.asc Description: Digital signature
[HACKERS] a few doubts around check_for_isn_and_int8_passing_mismatch
pg_upgrade's check.c contains a function called check_for_isn_and_int8_passing_mismatch. If the float8 pass-by-value values differ between the old and new clusters, which is likely to happen on Windows 64 because of cf376a4ad, and if the old cluster contains any databases which contain functions from '$libdir/isn', then it fails the upgrade. The comments say this: * contrib/isn relies on data type int8, and in 8.4 int8 can now be passed * by value. The schema dumps the CREATE TYPE PASSEDBYVALUE setting so * it must match for the old and new servers. There are a couple of things that bother me here: 1. The check is restricted to isn, but the issue isn't. Presumably, any datatype that is passed either by value or by reference depending on whether float8 is passed-by-value has the same issue. isn is the only module in core that has this issue, but there could be third-party modules with the same issue. I'm not sure there's anything we can do about that, but it's troublesome. 2. If pg_dump knew that we were trying to dump a type of this sort, it could easily emit an incantation that would work everywhere. Commit 3f936aacc introduced a LIKE = typename syntax that suits this problem perfectly, and changed the scripts for contrib/isn to use it. But pg_dump can't use that syntax, because there's no catalog state to record whether we want new typbyval = old typbyval or whether we instead want new typbyval = new cluster's float8-pass-by-value. Is it worth trying to fix this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Extension upgrade and GUCs
Hi hackers, I've been wrestling with this one for a while and gone down a couple blind alleys, so time to ask the experts. PostGIS has a couple things different from the extensions that live in contrib, - it has some GUCs - it has a versioned loadable library (postgis-2.1.so, postgis-2.2.so, etc) We've found that, when we run the following SQL,

CREATE EXTENSION postgis VERSION '2.1.9';
ALTER EXTENSION postgis UPDATE TO '2.2.0';

the update fails, because of a collision in the GUC. When the extension is CREATEd, postgis-2.1.so is loaded, _PG_init() is called, and that in turn calls DefineCustomStringVariable() to create our GUC. When the ALTER is called, the first time a C function definition is called, the new library, postgis-2.2.so, is loaded, the _PG_init() of *that* library is called, and DefineCustomStringVariable() is called, but this time it runs into the GUC definition from the first library load, and the EXTENSION update process stops as an ERROR is thrown. My initial attempt at avoiding this problem involved looking at GetConfigOption() before running DefineCustomStringVariable() to see if the GUC was already defined. This did not work, as it's possible to define a GUC before loading the library. So an otherwise innocent sequence of commands like:

SET my_guc = 'foo'; -- no library loaded yet
SELECT my_library_function(); -- causes library load and _PG_init() to fire

would now fail, as it hit my test for a pre-existing GUC. Unfortunately, the GUC we are using is not one where we simply read a value now and again. It switches the backend geometry library that various functions use, so performance is a big deal: instead of reading the GUC value, the code expects that GUC changes will flip a global variable, using the GUC assign callback. 
So I need a way to either (a) notice when I already have an (old) copy of the library loaded and avoid trying to set up the GUC in that case or (b) set up the GUC in a somewhat less brittle way than DefineCustomStringVariable() allows, something that can overwrite things instead of just erroring out. The ugly code in question is here https://github.com/postgis/postgis/blob/svn-trunk/postgis/lwgeom_backend_api.c#L105 Discussion is here https://trac.osgeo.org/postgis/ticket/2382 Thanks, P
Re: [HACKERS] WIP: SCRAM authentication
On Tue, Aug 18, 2015 at 7:19 PM, Robert Haas robertmh...@gmail.com wrote: Sorry, that's a completely bogus argument. We do not need to prevent people from upgrading if they haven't moved off of MD5 authentication; that's just an arbitrary - and IMHO extremely user-hostile - policy decision. We have a legitimate need, to move the project forward, to introduce a better system for password authentication. Ripping out the old one is not a real need. I'm sure at some point it will seem like antiquated cruft that nobody uses any more, but that will not be in a year or two or anything close to that. I would imagine a GUC like enable_legacy_authentication=true which we would just start defaulting to false after a few years and perhaps consider removing five years after that. pg_upgrade could check if there are any users which require it to be set to true and warn users that they must enable it but should check the documentation so they understand the impact. OK, that's an interesting argument. If SCRAM supports multiple password verifiers, and we support SCRAM, then I guess we should probably do that, too. I still don't like it all that much, though. I think it's absolutely inevitable that people are going to end up with an account with 3 or more different passwords that can all be used to log into it, and that won't be good. How do other systems avoid this pitfall? Fwiw having multiple passwords would make automated credential rotations *so* much easier. Heroku has a really baroque solution to this problem in Postgres involving creating new child roles and swapping them around. My team in Google wasted many man hours dealing with fallout from the quarterly password rotations. (I wish we could just drop the account management and authentication system entirely and dump the whole work on a system designed for this particular problem. 
It's a hard problem that's far outside our core features and Postgres is never going to be a good system for anyone let alone everyone when there are many different types of users.) -- greg
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
Kouhei Kaigai kai...@ak.jp.nec.com wrote:

	long		lbuckets;
	lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
	Assert(nbuckets > 0);

In my case, the hash_table_bytes was 101017630802, and bucket_size was 48. So, my_log2(hash_table_bytes / bucket_size) = 31, then lbuckets will have a negative number because both 1 and my_log2() are int32. So, Min(lbuckets, max_pointers) chooses 0x80000000, then it was set on the nbuckets and triggers the Assert(). Attached patch fixes the problem. So you changed the literal of 1 to 1U, but doesn't that just double the threshold for failure? Wouldn't 1L (to match the definition of lbuckets) be better? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Add support for RADIUS passwords longer than 16 octets
On 2015-08-15 17:55, I wrote: The attached patch adds support for RADIUS passwords longer than 16 octets. Improved the coding and comments a bit, new version attached. .m

*** a/src/backend/libpq/auth.c
--- b/src/backend/libpq/auth.c
***************
*** 2168,2173 **** CheckCertAuth(Port *port)
--- 2168,2174 ----
  
  #define RADIUS_VECTOR_LENGTH 16
  #define RADIUS_HEADER_LENGTH 20
+ #define RADIUS_MAX_PASSWORD_LENGTH 128
  
  typedef struct
  {
***************
*** 2241,2247 **** CheckRADIUSAuth(Port *port)
  	radius_packet *receivepacket = (radius_packet *) receive_buffer;
  	int32		service = htonl(RADIUS_AUTHENTICATE_ONLY);
  	uint8	   *cryptvector;
! 	uint8		encryptedpassword[RADIUS_VECTOR_LENGTH];
  	int			packetlength;
  	pgsocket	sock;
--- 2242,2250 ----
  	radius_packet *receivepacket = (radius_packet *) receive_buffer;
  	int32		service = htonl(RADIUS_AUTHENTICATE_ONLY);
  	uint8	   *cryptvector;
! 	int			encryptedpasswordlen;
! 	uint8		encryptedpassword[RADIUS_MAX_PASSWORD_LENGTH];
! 	uint8	   *md5trailer;
  	int			packetlength;
  	pgsocket	sock;
***************
*** 2259,2264 **** CheckRADIUSAuth(Port *port)
--- 2262,2268 ----
  	fd_set		fdset;
  	struct timeval endtime;
  	int			i,
+ 				j,
  				r;
  
  	/* Make sure struct alignment is correct */
***************
*** 2316,2325 **** CheckRADIUSAuth(Port *port)
  		return STATUS_ERROR;
  	}
  
! 	if (strlen(passwd) > RADIUS_VECTOR_LENGTH)
  	{
  		ereport(LOG,
! 				(errmsg("RADIUS authentication does not support passwords longer than 16 characters")));
  		return STATUS_ERROR;
  	}
--- 2320,2329 ----
  		return STATUS_ERROR;
  	}
  
! 	if (strlen(passwd) > RADIUS_MAX_PASSWORD_LENGTH)
  	{
  		ereport(LOG,
! 				(errmsg("RADIUS authentication does not support passwords longer than %d characters", RADIUS_MAX_PASSWORD_LENGTH)));
  		return STATUS_ERROR;
  	}
***************
*** 2344,2371 **** CheckRADIUSAuth(Port *port)
  	radius_add_attribute(packet, RADIUS_NAS_IDENTIFIER, (unsigned char *) identifier, strlen(identifier));
  
  	/*
! 	 * RADIUS password attributes are calculated as: e[0] = p[0] XOR
! 	 * MD5(secret + vector)
  	 */
! 	cryptvector = palloc(RADIUS_VECTOR_LENGTH + strlen(port->hba->radiussecret));
  	memcpy(cryptvector, port->hba->radiussecret, strlen(port->hba->radiussecret));
! 	memcpy(cryptvector + strlen(port->hba->radiussecret), packet->vector, RADIUS_VECTOR_LENGTH);
! 	if (!pg_md5_binary(cryptvector, RADIUS_VECTOR_LENGTH + strlen(port->hba->radiussecret), encryptedpassword))
  	{
! 		ereport(LOG,
! 				(errmsg("could not perform MD5 encryption of password")));
! 		pfree(cryptvector);
! 		return STATUS_ERROR;
  	}
  	pfree(cryptvector);
  
! 	for (i = 0; i < RADIUS_VECTOR_LENGTH; i++)
! 	{
! 		if (i < strlen(passwd))
! 			encryptedpassword[i] = passwd[i] ^ encryptedpassword[i];
! 		else
! 			encryptedpassword[i] = '\0' ^ encryptedpassword[i];
! 	}
! 	radius_add_attribute(packet, RADIUS_PASSWORD, encryptedpassword, RADIUS_VECTOR_LENGTH);
  
  	/* Length need to be in network order on the wire */
  	packetlength = packet->length;
--- 2348,2390 ----
  	radius_add_attribute(packet, RADIUS_NAS_IDENTIFIER, (unsigned char *) identifier, strlen(identifier));
  
  	/*
! 	 * RADIUS password attributes are calculated as:
! 	 *   e[0] = p[0] XOR MD5(secret + Request Authenticator)
! 	 * for the first group of 16 octets, and then:
! 	 *   e[i] = p[i] XOR MD5(secret + e[i-1])
! 	 * for the following ones (if necessary)
  	 */
! 	encryptedpasswordlen = ((strlen(passwd) + RADIUS_VECTOR_LENGTH - 1) / RADIUS_VECTOR_LENGTH) * RADIUS_VECTOR_LENGTH;
! 	cryptvector = palloc(strlen(port->hba->radiussecret) + RADIUS_VECTOR_LENGTH);
  	memcpy(cryptvector, port->hba->radiussecret, strlen(port->hba->radiussecret));
! 
! 	/* for the first iteration, we use the Request Authenticator vector */
! 	md5trailer = packet->vector;
! 	for (i = 0; i < encryptedpasswordlen; i += RADIUS_VECTOR_LENGTH)
  	{
! 		memcpy(cryptvector + strlen(port->hba->radiussecret), md5trailer, RADIUS_VECTOR_LENGTH);
! 		/* .. and for subsequent iterations the result of the previous XOR (calculated below) */
! 		md5trailer = encryptedpassword + i;
! 
! 		if (!pg_md5_binary(cryptvector, strlen(port->hba->radiussecret) +
Re: [HACKERS] Test code is worth the space
On Tue, Aug 18, 2015 at 04:54:07PM +0100, Greg Stark wrote: On Tue, Aug 18, 2015 at 2:16 PM, David Fetter da...@fetter.org wrote: I'm given to understand that this tight coupling is necessary for performance. Are you saying that it could be unwound, or that testing strategies mostly need to take it into account, or...? I'm just saying that we shouldn't expect to find a magic bullet test framework that solves all these problems. Without restructuring code, which I don't think is really feasible, we won't be able to have good unit test coverage for most existing code. It might be more practical to start using such a new tool for new code only. Then the new code could be structured in ways that allow the environment to be mocked more easily and the results observed more easily. Great! Do we have examples of such tools and code bases structured to accommodate them that we'd like to use for reference, or at least for inspiration? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] More WITH
Craig Ringer cr...@2ndquadrant.com writes: On 18 August 2015 at 01:18, David Fetter da...@fetter.org wrote: FETCH [in WITH] I'd be a huge fan of this one. I'd love to see FETCH in subqueries, too. Currently doing anything like this requires an ugly PL/PgSQL wrapper. The cursor would have to be known at plan-time so it could be interrogated for its types. That's barely the tip of the iceberg of the problems with this idea. How many rows would be fetched from the cursor? What row would it be left on? Whatever answer you give will be wrong from some perspective, but particularly that of giving the planner any freedom-of-action to optimize such a query. More generally, what would you hope to accomplish with such a construct that wouldn't be better done by writing the cursor's underlying query directly in the WITH clause? regards, tom lane
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
I wonder how the extended protocol is handled by this proposal. Suppose load balancing mode is enabled. PQprepare is executed on standby1. Then PQexecPrepared gets called. This may be executed on standby2, which will fail because there's no prepared statement created by the former PQprepare call. Even if the simple protocol is used, the same thing can be said of SQL PREPARE/EXECUTE/DEALLOCATE. SQL BEGIN/COMMIT/ROLLBACK would be a more interesting example in load balancing mode. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Re: [HACKERS] [DESIGN] ParallelAppend
On Thu, Aug 13, 2015 at 5:26 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Sure, that is what we should do, however the tricky part would be when the path for doing a local scan is extremely cheaper than the path for a parallel scan for one of the child nodes. For such cases, pulling up the Funnel node can incur more cost. I think some of the other possible ways to make this work could be to extend Funnel so that it is capable of executing both parallel and non-parallel nodes, or to have a new Funnel-like node which has such a capability. I think it is the job of a (more intelligent) planner, but not in the first version. If subplans of Append are a mixture of nodes which do or do not have worth of parallel execution, we will be able to arrange the original form:

Append
 + Scan on rel1 (large)
 + Scan on rel2 (large)
 + Scan on rel3 (middle)
 + Scan on rel4 (tiny)
 + Scan on rel5 (tiny)

to the Funnel-aware form, but partially:

Append
 + Funnel
 |  + Scan on rel1 (large)
 |  + Scan on rel2 (large)
 |  + Scan on rel3 (large)
 + Scan on rel4 (tiny)
 + Scan on rel5 (tiny)

This is exactly what I have in mind. Here is one other issue I found. Existing code assumes a TOC segment has only one contents per node type, so it uses a pre-defined key (like PARALLEL_KEY_SCAN) per node type; however, it is problematic if we put multiple PlannedStmt or PartialSeqScan nodes on a TOC segment. We have a few keys in the parallel-seq-scan patch (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for which multiple structures are shared between master and worker backends. Check if something similar can work for your use case. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
I reported a similar issue before. * Re: DBT-3 with SF=20 got failed http://www.postgresql.org/message-id/557a19d1.9050...@2ndquadrant.com I didn't get a server crash at that time, however, palloc() complained about request size >= 1GB. So, we may need a couple of overhauls around HashJoin to support large sizes of data, not only nbuckets around 0x80000000. Also, we may need to pay attention to the reliability of scale estimation by the planner. Even though the plan below says that the Join generates 60521928028 rows, it actually generates 776157676 rows (1.28%).

tpcds100=# EXPLAIN ANALYZE select ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2 from web_sales ws1,web_sales ws2 where ws1.ws_order_number = ws2.ws_order_number and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
                              QUERY PLAN
 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028 width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=73252.300..79017.420 rows=72001237 loops=1)
         Sort Key: ws1.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237 loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=65095.655..128885.811 rows=904010978 loops=1)
         Sort Key: ws2.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237 loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)

So, even if we allow nodeHash.c to allocate hash buckets larger than 1GB, its initial size may be determined carefully. Probably, 1GB is a good starting point even if expanded later. 
Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com

-----Original Message----- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of David Rowley Sent: Wednesday, August 19, 2015 10:07 AM To: Tom Lane Cc: Kevin Grittner; Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

On 19 August 2015 at 12:38, Tom Lane t...@sss.pgh.pa.us wrote: David Rowley david.row...@2ndquadrant.com writes: david=# set work_mem = '94GB'; ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151) Apparently you're testing on a 32-bit server. 64-bit servers allow work_mem to go up to INT_MAX kilobytes. hmm, no, but it does appear that sizeof(long) is 4 bytes for me, despite a 64 bit OS. I guess this accounts for my misunderstanding that work_mem was limited to 2GB even on 64 bit machines. From guc.h:

#if SIZEOF_SIZE_T > 4 && SIZEOF_LONG > 4
#define MAX_KILOBYTES INT_MAX
#else
#define MAX_KILOBYTES (INT_MAX / 1024)
#endif

Apologies for the noise.
Re: [HACKERS] checkpointer continuous flushing
On Tue, Aug 18, 2015 at 12:38 PM, Fabien COELHO coe...@cri.ensmp.fr wrote: Hello Amit, So the option is best kept as off for now, without further data, I'm fine with that. One point to think about here is on what basis a user can decide to turn this option on; is it predictable in any way? I think one case could be when the data set fits in shared_buffers. Yep. In general, providing an option is a good idea if the user can decide with ease when to use that option, or we can give some clear recommendation for the same; otherwise one has to recommend "test your workload with this option and if it works then great, else don't use it", which might also be okay in some cases, but it is better to be clear. My opinion, which is not backed by any data (anyone can feel free to provide a FreeBSD box for testing...) is that it would mostly be an improvement, if you have a significant write load, to have the flush option on when running on non-Linux systems which provide posix_fadvise. If you have a lot of reads and few writes, then postgresql currently works reasonably enough, which is why people do not complain too much about write stalls, and I expect that the situation would not be significantly degraded. Now there are competing positive and negative effects induced by using posix_fadvise, and moreover its implementation varies from OS to OS, so without running some experiments it is hard to be definite. Sure, I think what can help here is a testcase or two (in the form of a script file or some other form, to test this behaviour of the patch) which you can write and post here, so that others can use that to get the data and share it. Of course, that is not mandatory to proceed with this patch, but it can still help you to prove your point, as you might not have access to different kinds of systems to run the tests. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On Tue, Aug 18, 2015 at 9:48 AM, Victor Wagner vi...@wagner.pp.ru wrote: Behavior: If PQconnectdb encounters a connect string with multiple hosts specified, it attempts to establish a connection with all these hosts simultaneously, and begins to work with the server which responds first, unless the loadbalancing parameter is true. I think here you are mixing the behaviour for a load balancing solution and a failover solution. It seems to me that for a client-side failover solution (which is also known as Transparent Application Failover), the connection attempt to the second server should be done after the first connection is broken, as that provides more flexibility. If the loadbalancing parameter is true, it tries servers sequentially in random order. If the parameter readonly is false, after authenticating with the server it executes show transaction_read_only, to find out whether the current connection is to the master or to the hot standby, and the connection is considered successful only if the server allows read-write transactions. Although both ideas (load balancing and failover) seem worth discussing, they are separate features and can be worked on separately. It will be easier to sort out the details as well that way. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] More WITH
On 18 August 2015 at 01:18, David Fetter da...@fetter.org wrote: FETCH [in WITH] I'd be a huge fan of this one. I'd love to see FETCH in subqueries, too. Currently doing anything like this requires an ugly PL/PgSQL wrapper. The cursor would have to be known at plan-time so it could be interrogated for its types. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

On 18 August 2015 at 01:18, David Fetter da...@fetter.org wrote: Folks, In the interest of consistency, which is to say, of not hitting barriers that are essentially implementation details, I'd like to propose that we allow the rest of the row-returning commands inside WITH clauses. We currently have: SELECT VALUES INSERT/UPDATE/DELETE ... RETURNING We don't yet have: EXPLAIN [ANALYZE] SHOW FETCH A little further out there, although this would be an API change, we might consider allowing the results of VACUUM and ANALYZE as row sets, which would also be good to wrap in WITH. Is there a good reason, or more than one, why we shouldn't have all the row-returning commands in WITH? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] How to compile, link and use a C++ extension
On 15 August 2015 at 00:51, Andres Freund and...@anarazel.de wrote: I started my tests by cloning the contrib/worker_spi code, and when transforming the code into C++, I could only note that C++ is not supported in the provided Makefiles. Yes, that doesn't surprise me. Postgres itself doesn't need to care about looking for a usable C++ compiler et al. I think it'd be necessary to add C++ support to PGXS to do this nicely. Because PostgreSQL's headers have C++ extern "C" { wrappers in them, you could potentially compile the whole extension with g++, using extern "C" to expose unmangled symbols. You'll run into some issues due to incompatible compiler flags, though:

$ PATH=$HOME/pg/95/bin:$PATH make USE_PGXS=1 CC=g++
g++ -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -fpic -I. -I./ -I/home/craig/pg/95/include/postgresql/server -I/home/craig/pg/95/include/postgresql/internal -D_GNU_SOURCE -c -o worker_spi.o worker_spi.c
cc1plus: warning: command line option ‘-Wmissing-prototypes’ is valid for C/ObjC but not for C++
cc1plus: warning: command line option ‘-Wdeclaration-after-statement’ is valid for C/ObjC but not for C++
cc1plus: sorry, unimplemented: -fexcess-precision=standard for C++
<builtin>: recipe for target 'worker_spi.o' failed
make: *** [worker_spi.o] Error 1

How exactly do you need to use the C++ code? The easiest probably would be to have a separate object file, built using your own makefile rule, that contains the C++ and provides a C interface, and then use that from a background worker purely written in C. This is what I would advise too. It helps to separate the PostgreSQL bits from the C++ bits, providing a clean interface across which exceptions are not thrown. 
It isn't safe to access PostgreSQL API from within your C++ code anyway, so the C++ code can be compiled without any PostgreSQL includes, exposing a pure C interface that your extension can then link to as just another library via PGXS's SHLIB_LINK option. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Make HeapTupleSatisfiesMVCC more concurrent
When we check a tuple for MVCC, it has to pass checks that the inserting transaction has committed, and that it committed before our snapshot began. And similarly that the deleting transaction hasn't committed, or did so after our snapshot. XidInMVCCSnapshot is (or can be) very much cheaper than TransactionIdIsInProgress, because the former touches only local memory while the latter takes a highly contended lock and inspects shared memory. We do the slow one first, but we could do the fast one first and sometimes short-circuit the slow one. If the transaction is in our snapshot, it doesn't matter if it is still in progress or not. This was discussed back in 2013 ( http://www.postgresql.org/message-id/CAMkU=1yy-YEQVvqj2xJitT1EFkyuFk7uTV_hrOMGyGMxpU=n...@mail.gmail.com), and I wanted to revive it. The recent lwlock atomic changes haven't made the problem irrelevant. This patch swaps the order of the checks under some conditions. So that hackers can readily do testing without juggling binaries, I've added an experimental guc which controls the behavior. JJ_SNAP=0 is the original (git HEAD) behavior, JJ_SNAP=1 turns on the new behavior. I've added some flag variables to record if XidInMVCCSnapshot was already called. XidInMVCCSnapshot is cheap, but not so cheap that we want to call it twice if we can avoid it. Those would probably stay in some form or another when the experimental guc goes away. We might be able to rearrange the series of if-tests to get rid of the flag variables, but I didn't want to touch the HEAP_MOVED_OFF and HEAP_MOVED_IN parts of the code, as those must get about zero regression testing. The situation where the performance of this really shows up is when there are tuples that remain in an unresolved state while highly concurrent processes keep stumbling over them. 
I set that up by using the pgbench tables with scale factor of 1, and running a custom query at high concurrency which seq_scans the accounts table:

pgbench -f <(echo 'select sum(abalance) from pgbench_accounts') -T 30 \
  -n -c32 -j32 --startup='set JJ_SNAP=1'

While the test is contrived, it reproduces complaints I've seen on several forums. To create the burden of unresolved tuples, I open psql and run:

begin; update pgbench_accounts set abalance = 1 - abalance;

...and leave it uncommitted for a while. Representative numbers for test runs of the above custom query on an 8-CPU machine:

tps = 542 regardless of JJ_SNAP, when no in-progress tuples
tps = 30  JJ_SNAP=0 with uncommitted bulk update
tps = 364 JJ_SNAP=1 with uncommitted bulk update

A side effect of making this change would be that a query which finds a tuple inserted or deleted by a transaction still in the query's snapshot never checks to see if that transaction committed, and so it doesn't set the hint bit if it did commit or abort. Some future query with a newer snapshot will have to do that. It is at least theoretically possible that this could mean that many hint bits could fail to get set while the buffer is still dirty in shared_buffers, which means it needs to get dirtied again once set. I doubt this would be significant, but if anyone has a test case which they think could show up a problem in this area, please try it out or describe it.

There are other places in tqual.c which could probably use similar re-ordering tricks, but this is the one for which I have a reproducible test case.

Cheers, Jeff

SatisfiesMVCC_reorder_v001.patch Description: Binary data
Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent
Jeff Janes jeff.ja...@gmail.com writes: When we check a tuple for MVCC, it has to pass checks that the inserting transaction has committed, and that it committed before our snapshot began. And similarly that the deleting transaction hasn't committed, or did so after our snapshot. XidInMVCCSnapshot is (or can be) very much cheaper than TransactionIdIsInProgress, because the former touches only local memory while the latter takes a highly contended lock and inspects shared memory. We do the slow one first, but we could do the fast one first and sometimes short-circuit the slow one. If the transaction is in our snapshot, it doesn't matter if it is still in progress or not. This was discussed back in 2013 ( http://www.postgresql.org/message-id/CAMkU=1yy-YEQVvqj2xJitT1EFkyuFk7uTV_hrOMGyGMxpU=n...@mail.gmail.com), and I wanted to revive it. The recent lwlock atomic changes haven't made the problem irrelevant. This patch swaps the order of the checks under some conditions. Just thinking about this ... I wonder why we need to call TransactionIdIsInProgress() at all rather than believing the answer from the snapshot? Under what circumstances could TransactionIdIsInProgress() return true where XidInMVCCSnapshot() had not? I'm thinking maybe TransactionIdIsInProgress is only needed for non-MVCC snapshot types. regards, tom lane
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
Kouhei Kaigai kai...@ak.jp.nec.com wrote:

long lbuckets;
lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
Assert(nbuckets > 0);

In my case, the hash_table_bytes was 101017630802, and bucket_size was 48. So, my_log2(hash_table_bytes / bucket_size) = 31, then lbuckets will have a negative number because both 1 and my_log2() are int32. So, Min(lbuckets, max_pointers) chooses 0x80000000, then it was set on the nbuckets and triggers the Assert(). Attached patch fixes the problem.

So you changed the literal of 1 to 1U, but doesn't that just double the threshold for failure? Wouldn't 1L (to match the definition of lbuckets) be better?

Ah, yes, it is my careless mistake. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
On 19 August 2015 at 08:54, Kevin Grittner kgri...@ymail.com wrote: Kouhei Kaigai kai...@ak.jp.nec.com wrote:

long lbuckets;
lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
Assert(nbuckets > 0);

In my case, the hash_table_bytes was 101017630802, and bucket_size was 48. So, my_log2(hash_table_bytes / bucket_size) = 31, then lbuckets will have a negative number because both 1 and my_log2() are int32. So, Min(lbuckets, max_pointers) chooses 0x80000000, then it was set on the nbuckets and triggers the Assert(). Attached patch fixes the problem.

So you changed the literal of 1 to 1U, but doesn't that just double the threshold for failure? Wouldn't 1L (to match the definition of lbuckets) be better?

I agree, but I can only imagine this is happening because the maximum setting of work_mem has been modified with the code you're running. hash_table_bytes is set based on work_mem:

hash_table_bytes = work_mem * 1024L;

The size of your hash table is 101017630802 bytes, which is:

david=# select pg_size_pretty(101017630802);
 pg_size_pretty
 94 GB
(1 row)

david=# set work_mem = '94GB';
ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151)

So I think the only way the following could cause an error, is if bucket_size was 1, which it can't be.

lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);

I think one day soon we'll need to allow larger work_mem sizes, but I think there's lots more to do than this change. Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
On 19 August 2015 at 12:23, Kouhei Kaigai kai...@ak.jp.nec.com wrote: -Original Message- From: David Rowley [mailto:david.row...@2ndquadrant.com] Sent: Wednesday, August 19, 2015 9:00 AM

The size of your hash table is 101017630802 bytes, which is:

david=# select pg_size_pretty(101017630802);
 pg_size_pretty
 94 GB
(1 row)

david=# set work_mem = '94GB';
ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151)

Hmm. Why could I set work_mem = '96GB' without error? It was described in the postgresql.conf.

postgres=# SHOW work_mem;
 work_mem
----------
 96GB
(1 row)

So I think the only way the following could cause an error, is if bucket_size was 1, which it can't be.

lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);

I think one day soon we'll need to allow larger work_mem sizes, but I think there's lots more to do than this change.

I overlooked this limitation, but why can I bypass the GUC limit check?

I'm unable to get the server to start if I set work_mem that big. I also tried starting the server with 1GB work_mem then doing a pg_ctl reload. With each of these I get the same error message that I would have gotten if I had done; set work_mem = '96GB'; Which version are you running? Are you sure there are no changes in guc.c for work_mem's range? Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent
I wrote: Just thinking about this ... I wonder why we need to call TransactionIdIsInProgress() at all rather than believing the answer from the snapshot? Under what circumstances could TransactionIdIsInProgress() return true where XidInMVCCSnapshot() had not? I experimented with the attached patch, which replaces HeapTupleSatisfiesMVCC's calls of TransactionIdIsInProgress with XidInMVCCSnapshot, and then as a cross-check has all the return false exits from XidInMVCCSnapshot assert !TransactionIdIsInProgress(). The asserts did not fire in the standard regression tests nor in a pgbench run, which is surely not proof of anything but it suggests that I'm not totally nuts. I wouldn't commit the changes in XidInMVCCSnapshot for real, but otherwise this is a possibly committable patch. regards, tom lane diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c index de7b3fc..c8f59f7 100644 *** a/src/backend/utils/time/tqual.c --- b/src/backend/utils/time/tqual.c *** *** 10,21 * the passed-in buffer. The caller must hold not only a pin, but at least * shared buffer content lock on the buffer containing the tuple. * ! * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array) * before TransactionIdDidCommit/TransactionIdDidAbort (which look in * pg_clog). Otherwise we have a race condition: we might decide that a * just-committed transaction crashed, because none of the tests succeed. * xact.c is careful to record commit/abort in pg_clog before it unsets ! * MyPgXact-xid in PGXACT array. That fixes that problem, but it also * means there is a window where TransactionIdIsInProgress and * TransactionIdDidCommit will both return true. If we check only * TransactionIdDidCommit, we could consider a tuple committed when a --- 10,22 * the passed-in buffer. The caller must hold not only a pin, but at least * shared buffer content lock on the buffer containing the tuple. * ! * NOTE: When using a non-MVCC snapshot, we must check ! 
* TransactionIdIsInProgress (which looks in the PGXACT array) * before TransactionIdDidCommit/TransactionIdDidAbort (which look in * pg_clog). Otherwise we have a race condition: we might decide that a * just-committed transaction crashed, because none of the tests succeed. * xact.c is careful to record commit/abort in pg_clog before it unsets ! * MyPgXact-xid in the PGXACT array. That fixes that problem, but it also * means there is a window where TransactionIdIsInProgress and * TransactionIdDidCommit will both return true. If we check only * TransactionIdDidCommit, we could consider a tuple committed when a *** *** 26,31 --- 27,37 * subtransactions of our own main transaction and so there can't be any * race condition. * + * When using an MVCC snapshot, we rely on XidInMVCCSnapshot rather than + * TransactionIdIsInProgress, but the logic is otherwise the same: do not + * check pg_clog until after deciding that the xact is no longer in progress. + * + * * Summary of visibility functions: * * HeapTupleSatisfiesMVCC() *** HeapTupleSatisfiesMVCC(HeapTuple htup, S *** 961,967 if (TransactionIdIsCurrentTransactionId(xvac)) return false; ! if (!TransactionIdIsInProgress(xvac)) { if (TransactionIdDidCommit(xvac)) { --- 967,973 if (TransactionIdIsCurrentTransactionId(xvac)) return false; ! if (!XidInMVCCSnapshot(xvac, snapshot)) { if (TransactionIdDidCommit(xvac)) { *** HeapTupleSatisfiesMVCC(HeapTuple htup, S *** 980,986 if (!TransactionIdIsCurrentTransactionId(xvac)) { ! if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, --- 986,992 if (!TransactionIdIsCurrentTransactionId(xvac)) { ! if (XidInMVCCSnapshot(xvac, snapshot)) return false; if (TransactionIdDidCommit(xvac)) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, *** HeapTupleSatisfiesMVCC(HeapTuple htup, S *** 1035,1041 else return false; /* deleted before scan started */ } ! 
else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple))) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple))) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, --- 1041,1047 else return false; /* deleted before scan started */ } ! else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot)) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple))) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, *** HeapTupleSatisfiesMVCC(HeapTuple htup, S *** 1048,1061 return false; } } ! /* ! * By here, the inserting transaction has
Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
On Mon, Aug 17, 2015 at 6:40 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Here is one other thing I could learn from the TPC-DS benchmark. The attached query is Q4 of TPC-DS, and its result was towards SF=100. It took a long time to complete (about 30min), please see the attached EXPLAIN ANALYZE output.

Look at this:

-> CTE Scan on year_total t_s_firstyear (cost=0.00..13120715.27 rows=3976 width=52) (actual time=0.020..5425.980 rows=1816438 loops=1)
   Filter: ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 2001))
   Rows Removed by Filter: 19879897
-> CTE Scan on year_total t_s_secyear (cost=0.00..11927922.98 rows=11928 width=164) (actual time=0.007..45.249 rows=46636 loops=1)
   Filter: ((sale_type = 's'::text) AND (dyear = 2002))
   Rows Removed by Filter: 185596

CTE expansion shall help here as we can push the filter down. I did a quick patch to demonstrate the idea, following Tom's proposal (38448.1430519...@sss.pgh.pa.us). I see an obvious performance boost:

Turn off NLJ:
 original: Planning time: 4.391 ms Execution time: 77113.721 ms
 patched: Planning time: 8.429 ms Execution time: 18572.663 ms
+ work_mem to 1G
 original: Planning time: 4.487 ms Execution time: 29249.466 ms
 patched: Planning time: 11.148 ms Execution time: 7309.586 ms

Attached please find the WIP patch and also the ANALYZE results. Notes: the patch may not apply directly to head; due to some network issue here, my Linux box can't talk to the git server.

Thanks for your patch. Let me test and report the result in my environment. BTW, did you register the patch on the upcoming commit-fest? I think it may be a helpful feature, if we can add an alternative subquery-path towards cte-scan on set_cte_pathlist() and choose them according to the cost estimation. Best regards, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] allowing wal_level change at run time
On Wed, Aug 19, 2015 at 2:46 AM, Andres Freund and...@anarazel.de wrote: On 2015-08-18 13:24:54 -0400, Peter Eisentraut wrote: But if we tie the effective wal_level to archive_mode or max_wal_senders, both of which are restart-only, then we haven't gained anything. (We would have removed half a GUC parameter, effectively. Not bad, but not very exciting.) ISTM that it's not too hard to a) make archive_mode PGC_SIGHUP b) make wal_level PGC_SIGHUP c) automatically increase wal_level to logical whenever a logical replication slot is defined Switching archive_mode and wal_level to PGC_SIGHUP would be nice. We can already fake that by setting archive_command to '/usr/bin/true' for the first one or similar, but that's not really the same as switching it completely to off. it seems considerably harder to d) make max_wal_senders PGC_SIGHUP e) make max_replication_slots PGC_SIGHUP because they need shmem, locks, and everything. Yes. Those have effects on the shared memory size allocated at postmaster startup. Therefore I propose something slightly different: We change the default of max_wal_senders, max_replication_slots, to some reasonably high setting and make wal_level an internal automagically determined variable. archive_mode = on automatically increases the level to what's now hot-standby. Or surely max_wal_senders > 0. To deal with streaming replication, we automatically create slots for replicas, but, by default, *without* having them reserve WAL. The slot name would be determined in some automatic fashion (uuid or something) and stored in a new file in the data directory. That allows us to increase the internal wal_level to hot_standby/archive whenever a replica has connected (and thus a physical slot exists), and to logical whenever a logical slot exists. So, wal_level is automatically bumped to hot_standby when the physical slot is created (or logical for a logical slot) even if WAL is not reserved, right? When would those slots be created?
Now, that'll mean that the wal_level doesn't decrease automatically until a slot has been dropped. But that seems relatively harmless if it's not reserving WAL. Too crazy? Perhaps, craziest ideas are usually worth it :) At least, we have a couple of things we can consider: - Make archive_mode SIGHUP - Remove wal_level and set it as follows: -- archive/hot_standby if max_wal_senders > 0 (depends surely on restart) or archive_mode = on (gets more complicated if archive_mode switches to SIGHUP) at startup. -- logical should a logical slot be created. -- If max_wal_senders = 0 and archive_mode = off, switch wal_level to hot_standby once a physical slot is created. In short, switching archive_mode to SIGHUP has as a requirement to get rid of wal_level first. Regards, -- Michael
Re: [HACKERS] allowing wal_level change at run time
On 8/18/15 1:46 PM, Andres Freund wrote: I don't think not requiring restarts is sufficient, having to twiddle a bunch of parameters manually still is a lot more effort than people see as necessary. I agree that we want both. But requiring a restart is a hard stop, whereas making configuration easier is a soft feature. ISTM that it's not too hard to a) make archive_mode PGC_SIGHUP That has been contentious in the past, but I would agree that it seems that it should be doable. (I consider archiving a legacy feature at this point, so for this purpose I don't really care whether it's possible.) b) make wal_level PGC_SIGHUP c) automatically increase wal_level to logical whenever a logical replication slot is defined it seems considerably harder to d) make max_wal_senders PGC_SIGHUP e) make max_replication_slots PGC_SIGHUP because they need shmem, locks, and everything. check Therefore I propose something slightly different: We change the default of max_wal_senders, max_replication_slots, to some reasonably high setting and make wal_level an internal automagically determined variable. archive_mode = on automatically increases the level to what's now hot-standby. check To deal with streaming replication, we automatically create slots for replicas, but, by default, *without* having them reserve WAL. The slot name would be determined in some automatic fashion (uuid or something) and stored in a new file in the data directory. That allows us to increase the internal wal_level to hot_standby/archive whenever a replica has connected (and thus a physical slot exists), and to logical whenever a logical slot exists. That seems kind of weird. So every time a replication client connects, we create a new replication slot? Won't they accumulate? Do we need the replication slot, or could we just trigger this on the existence of a replication client? 
Also note that this scheme and any similar one requires merging the archive and hot_standby levels, which was the core of your original proposal [1], which was then objected to, which subsequently led to Robert's proposal to make wal_level SIGHUP, which led to my message, which led to your message, which is similar to your initial one. Somehow we have to find a way to break out of this circle. ;-) [1] http://www.postgresql.org/message-id/20150203124317.ga24...@awork2.anarazel.de
Re: [HACKERS] Error message with plpgsql CONTINUE
Pavel Stehule pavel.steh...@gmail.com writes: when I implemented this check in plpgsql_check I found another minor issue in CONTINUE statement - the typename is wrong Hmmm ... a bit of nosing around says that fetch/move and get diagnostics are similarly sloppy. Will fix. regards, tom lane
Re: [HACKERS] pgbench - allow backslash-continuations in custom scripts
Hi, all. I don't think we actually want backslash-continuations. The feature we want is to allow SQL statements to span multiple lines, and using the psql lexer solves that. We don't need the backslash-continuations when we have that. Sure. The feature *I* initially wanted was to have multi-line meta-commands. For this feature ISTM that continuations are, alas, the solution. Indeed there are plenty of links already which are generated by makefiles (see src/bin/pg_xlogdump/*), and probably a copy is made on windows. There should be no file duplication within the source tree. Yeah, following the example of pg_xlogdump and others is the way to go. Docs need updating, and there's probably some cleanup to do before this is ready for committing, but overall I think this is definitely the right direction. I've created an entry for the next commitfest, and put the status to waiting on author. I complained upthread that this makes it impossible to use multi-statements in pgbench, as they would be split into separate statements, but looking at psqlscan.l there is actually a syntax for that in psql already. You escape the semicolon as \;, e.g. SELECT 1 \; SELECT 2;, and then both queries will be sent to the server as one. So even that's OK. Good! Hmm. psqlscan.l handles multistatement naturally. I worked on that and the attached patch set does:

- backslash continuation for pgbench metacommands.
  \set variable \
    some value
- SQL statement natural continuation lines.
  SELECT :foo
  FROM :bar;
- SQL multi-statement.
  SELECT 1; SELECT 2;

The work to be left is eliminating the double-format of the Command struct. regards, -- Kyotaro Horiguchi NTT Open Source Software Center From 274bc1cd6de4fb5806e308b002b086b1dfdf7479 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp Date: Thu, 23 Jul 2015 20:44:37 +0900 Subject: [PATCH 1/3] Prepare for share psqlscan with pgbench.
Eliminate direct usage of pset variables and enable parts unnecessary for other than psql to be disabled by defining OUTSIDE_PSQL. --- src/bin/psql/mainloop.c | 6 ++-- src/bin/psql/psqlscan.h | 14 + src/bin/psql/psqlscan.l | 79 - src/bin/psql/startup.c | 4 +-- 4 files changed, 67 insertions(+), 36 deletions(-) diff --git a/src/bin/psql/mainloop.c b/src/bin/psql/mainloop.c index b6cef94..e98cb94 100644 --- a/src/bin/psql/mainloop.c +++ b/src/bin/psql/mainloop.c @@ -233,7 +233,8 @@ MainLoop(FILE *source) /* * Parse line, looking for command separators. */ - psql_scan_setup(scan_state, line, strlen(line)); + psql_scan_setup(scan_state, line, strlen(line), + pset.db, pset.vars, pset.encoding); success = true; line_saved_in_history = false; @@ -373,7 +374,8 @@ MainLoop(FILE *source) resetPQExpBuffer(query_buf); /* reset parsing state since we are rescanning whole line */ psql_scan_reset(scan_state); - psql_scan_setup(scan_state, line, strlen(line)); + psql_scan_setup(scan_state, line, strlen(line), + pset.db, pset.vars, pset.encoding); line_saved_in_history = false; prompt_status = PROMPT_READY; } diff --git a/src/bin/psql/psqlscan.h b/src/bin/psql/psqlscan.h index 55070ca..4bf8dcb 100644 --- a/src/bin/psql/psqlscan.h +++ b/src/bin/psql/psqlscan.h @@ -11,7 +11,11 @@ #include pqexpbuffer.h #include prompt.h - +#if !defined OUTSIDE_PSQL +#include variables.h +#else +typedef int * VariableSpace; +#endif /* Abstract type for lexer's internal state */ typedef struct PsqlScanStateData *PsqlScanState; @@ -36,12 +40,11 @@ enum slash_option_type OT_NO_EVAL /* no expansion of backticks or variables */ }; - extern PsqlScanState psql_scan_create(void); extern void psql_scan_destroy(PsqlScanState state); -extern void psql_scan_setup(PsqlScanState state, -const char *line, int line_len); +extern void psql_scan_setup(PsqlScanState state, const char *line, int line_len, + PGconn *db, VariableSpace vars, int encoding); extern void psql_scan_finish(PsqlScanState state); extern 
PsqlScanResult psql_scan(PsqlScanState state, @@ -52,6 +55,7 @@ extern void psql_scan_reset(PsqlScanState state); extern bool psql_scan_in_quote(PsqlScanState state); +#if !defined OUTSIDE_PSQL extern char *psql_scan_slash_command(PsqlScanState state); extern char *psql_scan_slash_option(PsqlScanState state, @@ -60,5 +64,5 @@ extern char *psql_scan_slash_option(PsqlScanState state, bool semicolon); extern void psql_scan_slash_command_end(PsqlScanState state); - +#endif /* if !defined OUTSIDE_PSQL */ #endif /* PSQLSCAN_H */ diff --git a/src/bin/psql/psqlscan.l b/src/bin/psql/psqlscan.l index be059ab..f9a19cd 100644 --- a/src/bin/psql/psqlscan.l +++ b/src/bin/psql/psqlscan.l @@ -43,11 +43,6 @@ #include ctype.h -#include common.h -#include settings.h -#include variables.h - - /* * We use a stack of
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
-Original Message- From: David Rowley [mailto:david.row...@2ndquadrant.com] Sent: Wednesday, August 19, 2015 9:00 AM To: Kevin Grittner Cc: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows On 19 August 2015 at 08:54, Kevin Grittner kgri...@ymail.com wrote: Kouhei Kaigai kai...@ak.jp.nec.com wrote:

long lbuckets;
lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
Assert(nbuckets > 0);

In my case, the hash_table_bytes was 101017630802, and bucket_size was 48. So, my_log2(hash_table_bytes / bucket_size) = 31, then lbuckets will have a negative number because both 1 and my_log2() are int32. So, Min(lbuckets, max_pointers) chooses 0x80000000, then it was set on the nbuckets and triggers the Assert(). Attached patch fixes the problem.

So you changed the literal of 1 to 1U, but doesn't that just double the threshold for failure? Wouldn't 1L (to match the definition of lbuckets) be better?

I agree, but I can only imagine this is happening because the maximum setting of work_mem has been modified with the code you're running. hash_table_bytes is set based on work_mem:

hash_table_bytes = work_mem * 1024L;

The size of your hash table is 101017630802 bytes, which is:

david=# select pg_size_pretty(101017630802);
 pg_size_pretty
 94 GB
(1 row)

david=# set work_mem = '94GB';
ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151)

Hmm. Why could I set work_mem = '96GB' without error? It was described in the postgresql.conf.

postgres=# SHOW work_mem;
 work_mem
----------
 96GB
(1 row)

So I think the only way the following could cause an error, is if bucket_size was 1, which it can't be.

lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);

I think one day soon we'll need to allow larger work_mem sizes, but I think there's lots more to do than this change.

I overlooked this limitation, but why can I bypass the GUC limit check?
-- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
David Rowley david.row...@2ndquadrant.com writes: david=# set work_mem = '94GB'; ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151) Apparently you're testing on a 32-bit server. 64-bit servers allow work_mem to go up to INT_MAX kilobytes. regards, tom lane
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
On 19 August 2015 at 12:38, Tom Lane t...@sss.pgh.pa.us wrote: David Rowley david.row...@2ndquadrant.com writes: david=# set work_mem = '94GB'; ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151) Apparently you're testing on a 32-bit server. 64-bit servers allow work_mem to go up to INT_MAX kilobytes. hmm, no, but it does appear that sizeof(long) is 4 bytes for me, despite a 64 bit OS. I guess this accounts for my misunderstanding that work_mem was limited to 2GB even on 64 bit machines. From guc.h:

#if SIZEOF_SIZE_T > 4 && SIZEOF_LONG > 4
#define MAX_KILOBYTES INT_MAX
#else
#define MAX_KILOBYTES (INT_MAX / 1024)
#endif

Apologies for the noise.
Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
On Mon, Aug 17, 2015 at 6:40 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Here is one other thing I could learn from the TPC-DS benchmark. The attached query is Q4 of TPC-DS, and its result was towards SF=100. It took a long time to complete (about 30min), please see the attached EXPLAIN ANALYZE output.

Look at this:

-> CTE Scan on year_total t_s_firstyear (cost=0.00..13120715.27 rows=3976 width=52) (actual time=0.020..5425.980 rows=1816438 loops=1)
   Filter: ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 2001))
   Rows Removed by Filter: 19879897
-> CTE Scan on year_total t_s_secyear (cost=0.00..11927922.98 rows=11928 width=164) (actual time=0.007..45.249 rows=46636 loops=1)
   Filter: ((sale_type = 's'::text) AND (dyear = 2002))
   Rows Removed by Filter: 185596

CTE expansion shall help here as we can push the filter down. I did a quick patch to demonstrate the idea, following Tom's proposal (38448.1430519...@sss.pgh.pa.us). I see an obvious performance boost:

Turn off NLJ:
 original: Planning time: 4.391 ms Execution time: 77113.721 ms
 patched: Planning time: 8.429 ms Execution time: 18572.663 ms
+ work_mem to 1G
 original: Planning time: 4.487 ms Execution time: 29249.466 ms
 patched: Planning time: 11.148 ms Execution time: 7309.586 ms

Attached please find the WIP patch and also the ANALYZE results. Notes: the patch may not apply directly to head; due to some network issue here, my Linux box can't talk to the git server. Regards, Qingqing

ctes.patch Description: Binary data
result.out Description: Binary data
Re: [HACKERS] WIP: SCRAM authentication
On Wed, Aug 19, 2015 at 5:30 AM, Greg Stark st...@mit.edu wrote: On Tue, Aug 18, 2015 at 7:19 PM, Robert Haas robertmh...@gmail.com wrote: Sorry, that's a completely bogus argument. We do not need to prevent people from upgrading if they haven't moved off of MD5 authentication; that's just an arbitrary - and IMHO extremely user-hostile - policy decision. We have a legitimate need, to move the project forward, to introduce a better system for password authentication. Ripping out the old one is not a real need. I'm sure at some point it will seem like antiquated cruft that nobody uses any more, but that will not be in a year or two or anything close to that. I would imagine a GUC like enable_legacy_authentication=true which we would just start defaulting to false after a few years and perhaps consider removing five years after that. pg_upgrade could check if there are any users which require it to be set to true and warn users that they must enable it but should check the documentation so they understand the impact. Yep, that's one of the ideas mentioned upthread. This parameter should actually be filled with a list of verifiers that we consider out-of-support, deprecated, well, things that users should be warned about. One solution may be to log warnings when parsing pg_hba.conf should a deprecated method be used. OK, that's an interesting argument. If SCRAM supports multiple password verifiers, and we support SCRAM, then I guess we should probably do that, too. I still don't like it all that much, though. I think it's absolutely inevitable that people are going to end up with an account with 3 or more different passwords that can all be used to log into it, and that won't be good. How do other systems avoid this pitfall? Fwiw having multiple passwords would make automated credential rotations *so* much easier. Heroku has a really baroque solution to this problem in Postgres involving creating new child roles and swapping them around.
My team at Google wasted many man hours dealing with fallout from the quarterly password rotations. (I wish we could just drop the account management and authentication system entirely and dump the whole work on a system designed for this particular problem. It's a hard problem that's far outside our core features and Postgres is never going to be a good system for anyone let alone everyone when there are many different types of users.) This makes me think that we may actually need a finer grammar than what the patch is proposing: ADD PASSWORD VERIFIERS (method = 'value' [, ...] ) [ OPTIONS (validUntil = 'timestamp') ] DROP PASSWORD VERIFIERS (method [, ...]) PASSWORD VERIFIERS (method = 'value' [, ...]) [ OPTIONS (validUntil = 'timestamp') ] -- replace all the existing ones This maps to the following catalog, pg_auth_verifiers: Oid roleoid; char method; text value; timestamp validUntil; I am not sure if we would want to be able to have multiple verifier values for the same method, but feel free to comment. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
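Michael's proposed grammar and pg_auth_verifiers catalog can be modeled in miniature. The following Python sketch is an illustration only (the class, method names, and semantics are assumptions, not the patch's actual behavior); it shows how one verifier per method makes credential rotation possible without a login outage:

```python
from datetime import datetime

class Role:
    """Toy model of a role holding one password verifier per method."""
    def __init__(self, name):
        self.name = name
        self.verifiers = {}  # method -> (value, valid_until)

    def add_verifiers(self, valid_until=None, **pairs):
        # ADD PASSWORD VERIFIERS (method = 'value', ...) [ OPTIONS (validUntil = ...) ]
        for method, value in pairs.items():
            self.verifiers[method] = (value, valid_until)

    def drop_verifiers(self, *methods):
        # DROP PASSWORD VERIFIERS (method, ...)
        for m in methods:
            self.verifiers.pop(m, None)

    def usable(self, method, now):
        v = self.verifiers.get(method)
        if v is None:
            return False
        _value, valid_until = v
        return valid_until is None or now < valid_until

# Rotation: add a SCRAM verifier alongside the legacy md5 one, then
# drop md5 once all clients have migrated -- no login downtime.
r = Role("alice")
r.add_verifiers(md5="md5abc...", scram="scramdef...")
now = datetime(2015, 8, 19)
assert r.usable("md5", now) and r.usable("scram", now)
r.drop_verifiers("md5")
assert not r.usable("md5", now) and r.usable("scram", now)
```

The overlap window between add and drop is exactly what the single-verifier-per-role model cannot express.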
[HACKERS] proposal: function parse_ident
Hi, I miss functionality that helps with parsing any identifier into its three basic parts - database, schema, objectname. We have this function internally, but it is not available at the SQL layer. FUNCTION parse_ident(IN ident text, OUT dbname text, OUT schemaname text, OUT objectname text) Examples: SELECT parse_ident('some schema.tablename'); Comments, ideas, notes? Regards Pavel
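A rough illustration of what parse_ident() would do. This Python sketch is a simplification (the real rules live in the server's scanner): it splits a qualified name on dots outside double quotes, folds unquoted parts to lower case, handles doubled quotes inside quoted names, and pads missing parts on the left so objectname is always last:

```python
def parse_ident(ident):
    """Split a qualified identifier into up to three parts
    (dbname, schemaname, objectname), honoring double quotes.
    A simplified sketch of the proposed parse_ident()."""
    parts, buf, i, quoted = [], [], 0, False
    while i < len(ident):
        c = ident[i]
        if c == '"':
            if quoted and i + 1 < len(ident) and ident[i + 1] == '"':
                buf.append('"')        # escaped "" inside a quoted name
                i += 1
            else:
                quoted = not quoted    # enter/leave a quoted name
        elif c == '.' and not quoted:
            parts.append(''.join(buf)); buf = []
        else:
            # unquoted identifiers are folded to lower case
            buf.append(c if quoted else c.lower())
        i += 1
    parts.append(''.join(buf))
    if len(parts) > 3:
        raise ValueError("improper qualified name")
    padded = [None] * (3 - len(parts)) + parts
    return tuple(padded)               # (dbname, schemaname, objectname)

assert parse_ident('db.s.o') == ('db', 's', 'o')
assert parse_ident('MySchema.Tbl') == (None, 'myschema', 'tbl')
assert parse_ident('"My Schema".t') == (None, 'My Schema', 't')
```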
Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
On Mon, Aug 17, 2015 at 9:40 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I think the SortSupport logic provides a reasonable way to solve this kind of problem. For example, btint4sortsupport() provides a function pointer to the fast version of the comparator (btint4fastcmp), which takes two Datum arguments without indirect memory references. This mechanism would also make sense for the HashAggregate logic, to reduce the cost of function invocations. Please comment on this idea. It's possible that this can work, but it might be a good idea to run 'perf' on this query and find out where the CPU time is actually going. That might give you a clearer picture of why the HashAggregate is slow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning
On Tue, Aug 18, 2015 at 07:30:20PM +0900, Amit Langote wrote: Hi, I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP. Thanks for pushing this forward! We've needed this done for at least a decade. 4. (yet unimplemented) Attach partition (from existing table) ALTER TABLE partitioned_table ATTACH PARTITION partition_name FOR VALUES values_spec USING [TABLE] table_name; ALTER TABLE table_name SET VALID PARTITION OF parent; The first of the above pair of commands would attach table_name as a (yet) 'invalid' partition of partitioned_table (after confirming that it matches the schema and does not overlap with other partitions per the FOR VALUES spec). It would also record the FOR VALUES part in the partition catalog and set pg_class.relispartition to true for table_name. After the first command is done, the second command would take an exclusive lock on table_name, scan the table to check if it contains any values outside the boundaries defined by the FOR VALUES clause defined previously, throw an error if so, and mark it a valid partition of the parent if not. One small change to make this part more efficient: 1. Take the access exclusive lock on table_name. 2. Check for a matching constraint on it. 3. If it's there, mark it as a valid partition. 4. If not, check for values outside the boundaries as above. Should that be a *valid* constraint? Perhaps that should be parameterized, as I'm not yet seeing a compelling argument in either direction. I'm picturing something like: ALTER TABLE table_name SET VALID PARTITION OF parent [TRUST] where TRUST would mean that an existing constraint need not be VALID. Does that make sense? Yep. 5. Detach partition ALTER TABLE partitioned_table DETACH PARTITION partition_name [USING table_name] This removes partition_name as a partition of partitioned_table.
The table continues to exist with the same name or 'table_name', if specified. pg_class.relispartition is set to false for the table, so it behaves like a normal table. Could this take anything short of an access exclusive lock on the parent? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] allowing wal_level change at run time
Peter Eisentraut pete...@gmx.net writes: On 8/18/15 8:48 AM, Robert Haas wrote: What do you mean by prevent? If the user edits postgresql.conf and reduces the setting, and then reloads the configuration file, they have a right to expect that the changes got applied. We have certain checks in place that require a minimum wal_level before other things are allowed. For example, turning on archiving requires wal_level = archive. The issue is then, if you have archiving on and then turn wal_level to minimal at run time, we need to prevent that to preserve the integrity of the original check. IIRC, the reason for having a wal_level parameter separate from those other ones was precisely that only wal_level had to be PGC_POSTMASTER, and you could change the others if you wanted. If we are going to fix the mechanisms to allow dynamic changing in wal log verbosity, maybe we could simply get rid of wal_level as a user-settable parameter, and have its effective value be inferred from the active settings of the other parameters. IOW: let's simplify, not complicate even further. Try to get rid of the interdependencies between settable parameters. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb array-style subscripting
On 08/18/2015 01:11 AM, Kaare Rasmussen wrote: On 2015-08-17 22:33, Josh Berkus wrote: So, both perl and python do not allow deep nesting of assignments. For example: d = { 'a' : { } } d['a']['a1']['a2'] = 42 Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'a1' Not sure I understand what you mean. In Perl you'd do $ perl -e '%d = (a => {}); $d{a}{a1}{a2} = 42; print $d{a}{a1}{a2}' 42 which looks pretty much like what's proposed. Indeed, I mentioned recently that perl auto-vivifies intermediate paths like this. But we don't do that in jsonb_set, as was discussed back in May, and as JS doesn't either I think that decision is sound. cheers andrew
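The behavioral difference is easy to demonstrate in Python itself: a plain dict rejects deep assignment, matching the decision made for jsonb_set, while a recursive defaultdict reproduces Perl's auto-vivification:

```python
from collections import defaultdict

# Plain dict: deep assignment fails because the intermediate
# key 'a1' does not exist -- the behavior jsonb_set also chose.
d = {'a': {}}
try:
    d['a']['a1']['a2'] = 42
    raised = False
except KeyError:
    raised = True
assert raised

# Recursive defaultdict: intermediate paths spring into existence,
# matching Perl's auto-vivification.
def autoviv():
    return defaultdict(autoviv)

p = autoviv()
p['a']['a1']['a2'] = 42
assert p['a']['a1']['a2'] == 42
```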
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On Mon, Aug 10, 2015 at 12:36 AM, Rahila Syed rahilasye...@gmail.com wrote: Hello, Say, 6 bigint counters, 6 float8 counters, and 3 strings up to 80 characters each. So we have a fixed-size chunk of shared memory per backend, and each backend that wants to expose progress information can fill in those fields however it likes, and we expose the results. This would be sorta like the way pg_statistic works: the same columns can be used for different purposes depending on what estimator will be used to access them. After thinking more on this suggestion, I came up with following generic structure which can be used to store progress of any command per backend in shared memory. Struct PgBackendProgress { int32 *counter[COMMAND_NUM_SLOTS]; float8 *counter_float[COMMAND_NUM_SLOTS]; char *progress_message[COMMAND_NUM_SLOTS]; } This can't actually work, because we don't have a dynamic allocator for shared memory. What you need to do is something like this: struct PgBackendProgress { uint64 progress_integer[N_PROGRESS_INTEGER]; float8 progress_float[N_PROGRESS_FLOAT]; char progress_string[PROGRESS_STRING_LENGTH][N_PROGRESS_STRING]; }; You probably want to protect this with the st_changecount protocol, or just put the fields in PgBackendStatus. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Configure with thread sanitizer fails the thread test
On Mon, Aug 17, 2015 at 3:02 PM, Andres Freund and...@anarazel.de wrote: On 2015-08-17 14:31:24 -0300, Alvaro Herrera wrote: The postmaster process in particular runs in a rather unusual arrangement, where most of the interesting stuff does happen in signal handlers. FWIW, I think it might be worthwhile to convert postmaster into a loop over a process local latch, with that latch being set in signal handlers. My feeling is that that'd simplify the code rather significantly. I'm not 100% it's worth the code churn, but it'd definitely be easier to understand. Thread sanitizer isn't the first analysis tool that has problems coping with forks in signal handlers btw, valgrind on amd64 for a long while had misaligned stacks in the children afterwards leading to very odd crashes. Yeah, I'm a little worried about whether we'd destabilize things by changing them in that way, but if we could avoid that pitfall I suspect we'd end up better off. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Warnings around booleans
On Mon, Aug 17, 2015 at 8:01 AM, Andres Freund and...@anarazel.de wrote: I'd rather see those split into separate commits. Doing polishing and active bugs in one commit imo isn't a good idea if the polishing goes beyond a line or two. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] allowing wal_level change at run time
On 8/18/15 8:48 AM, Robert Haas wrote: On Tue, Aug 18, 2015 at 7:59 AM, Peter Eisentraut pete...@gmx.net wrote: How would we handle decreases at run time? We can prevent =archive - minimal if archiving is running or there are physical replication slots, and we can prevent logical - something less if there are logical replication slots, but AFAICT, we don't have a way to check whether anyone currently needs level hot_standby. What do you mean by prevent? If the user edits postgresql.conf and reduces the setting, and then reloads the configuration file, they have a right to expect that the changes got applied. We have certain checks in place that require a minimum wal_level before other things are allowed. For example, turning on archiving requires wal_level = archive. The issue is then, if you have archiving on and then turn wal_level to minimal at run time, we need to prevent that to preserve the integrity of the original check. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: SCRAM authentication
Josh, * Josh Berkus (j...@agliodbs.com) wrote: I don't feel like you've correctly assessed the risk inherent in the md5 auth method, which is that, having captured an md5auth string by whatever means, an attacker can reuse that md5 string on other databases in the network *without* cracking it. That's the biggest risk as long as md5 is present. Robert and I, I believe, are both aware of that risk and had moved on to discussing the risk to the cleartext password by keeping an md5 verifier present, even if it can't be used for login. We will want to make it simple for admins to enable/disable md5 (eg: pg_hba), but we also have to consider how to make it simple for admins to remove the old md5 password verifier or there will be risk of the cleartext password being compromised from backups, etc. I fully agree that it's a serious risk that, today, what's stored in the database is directly usable for logging into the running system (or any system where the user has the same username and password). Aside from code complexity, the user security concern with a multiple-verifier-per-role approach is that the DBAs would never remember to completely disable md5auth, and attackers would capture md5 hashes either in flight or from backups. This approach can be used to capture an md5 hash from a non-critical database which is poorly secured, and then re-use it against an important database. The 'in flight' case is at least a bit less of an issue, as we don't ship the password verifier directly over the network. The backup concern is certainly correct though. I fully agree that we need a way to make sure users don't end up having the old password verifiers around longer than necessary. That was the impetus for my earlier suggestion that in a release or two, we actively make pg_upgrade error (or perhaps warn first, then error, across two releases) if any of the old verifiers exist.
Now, the counter-argument to this is that a DBA is just as likely to remember to remove md5 verifiers as she is to remember to remove roles with md5auth. Indeed, that's a serious concern also. The other concern with a single password verifier is that we're locking ourselves into a one-verifier-per-role solution when most of the serious solutions in use today (Active Directory, Kerberos, certificate based systems) allow for more than one. Regardless of the approach we take, encouraging users to migrate is going to be more of a matter of documentation, publicity, and administrative tools than one of multiple verifiers vs. multiple roles. That is, giving DBAs the ability to see and log who's using what kind of verifier, and what account has what verifier(s) available, will make more of a difference. Fully agreed. Thanks! Stephen
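The pass-the-hash risk discussed in this thread is concrete because PostgreSQL's md5 verifier is md5(password || username), and the wire response is just another hash of that verifier plus the session salt — so the stored string is itself a usable credential. A minimal sketch of the exchange (simplified; real messages carry protocol framing):

```python
import hashlib, os

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def stored_verifier(user: str, password: str) -> str:
    # What pg_authid stores for md5 auth: "md5" || md5(password || username)
    return "md5" + md5_hex(password.encode() + user.encode())

def challenge_response(inner_hex: str, salt: bytes) -> str:
    # What the client sends: "md5" || md5(md5(password || username) || salt)
    return "md5" + md5_hex(inner_hex.encode() + salt)

user, password = "alice", "s3cret"
salt = os.urandom(4)

# An honest client derives the response from the password...
honest = challenge_response(stored_verifier(user, password)[3:], salt)
# ...but an attacker holding only the stored verifier (e.g. pulled from
# a backup of a poorly secured database) computes the same response.
stolen = stored_verifier(user, password)
attacker = challenge_response(stolen[3:], salt)
assert honest == attacker  # the stored hash *is* the credential
```

This is exactly why a captured verifier can be replayed against any database where the same role/password pair exists, without ever cracking the hash.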
Re: [HACKERS] jsonb array-style subscripting
On 08/18/2015 01:32 AM, Pavel Stehule wrote: Hi 2015-08-17 21:12 GMT+02:00 Jim Nasby jim.na...@bluetreble.com mailto:jim.na...@bluetreble.com: On 8/17/15 12:57 PM, Dmitry Dolgov wrote: * is it interesting for the community? We definitely need better ways to manipulate JSON. * is that a good idea to extend the `ArrayRef` for jsonb? If it's appropriate, probably we can rename it to `ArrayJsonbRef` of something. * what can be improved in the code at the top level (function placement, probably, functionality duplication, etc.)? * are there any special cases, that I should take care of in this implementation? How would this work when you have a JSON array? Postgres array syntax suddenly becoming key/value syntax for JSON seems like a pretty bad idea to me. Could a different syntax (maybe {}) be used instead? I don't understand why '{}' should be better than '[]' ? The lot of modern languages doesn't different between arrays and hash. What is more, it would be a lot more intrusive to the parser, I suspect. We currently allow json path expressions to contain both object key and array index values, and I don't see any reason to do this differently. This is the syntax that was discussed at pgcon. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Autonomous Transaction is back
On Sat, Aug 15, 2015 at 6:47 PM, Noah Misch n...@leadboat.com wrote: CREATE TABLE t (c) AS SELECT 1; BEGIN; UPDATE t SET c = 2 WHERE c = 1; BEGIN_AUTONOMOUS; UPDATE t SET c = 3 WHERE c = 1; UPDATE t SET c = 4 WHERE c = 2; COMMIT_AUTONOMOUS; ROLLBACK; If you replace the autonomous transaction with a savepoint, the c=3 update finds no rows, and the c=4 update changes one row. When the outer transaction aborts, only the original c=1 row remains live. If you replace the autonomous transaction with a dblink/pg_background call, the c=3 update waits indefinitely for c=2 to commit or abort, an undetected deadlock. Suppose you make the autonomous transaction see tuples like in the savepoint case. The c=3 update finds no rows to update, and the c=4 update changes one row. When the outer transaction aborts, you have two live rows (c=1 and c=4). Suppose you instead make the autonomous transaction see tuples like in the dblink case, yet let c=3 ignore the lock and change a row. If both the autonomous transaction and the outer transaction were to commit, then you get two live rows (c=2 and c=3). Neither of those outcomes is acceptable, of course. In today's tuple update rules, c=3 must deadlock[1]. Other credible tuple update rules may not have this problem, but nothing jumps to mind. [1] That's not to say it must use the shmem lock structures and deadlock detector. This footnote goes to my point. It seems clear to me that having the autonomous transaction see the effects of the outer uncommitted transaction is a recipe for trouble. If the autonomous transaction updates a row and commits, and the outer transaction later aborts, the resulting state is inconsistent with any serial history. I'm fairly certain that's going to leave us in an unhappy place. Even more obviously, ending up with two committed row versions that are both updates of a single ancestor version is no good. So, I agree that this scenario should be an error. 
What I don't agree with is the idea that it should be the deadlock detector's job to throw that error. Rather, I think that when we examine the xmax of the tuple we can see - which is the original one, not the one updated by the outer transaction - we should check whether that XID belongs to an outer transaction. If it does, we should throw an error instead of trying to lock it. That way (1) the error message will be clear and specific to the situation and (2) we don't need a separate PGPROC for each autonomous transaction. The first of those benefits is agreeable; the second one is, in my opinion, a key design goal for this feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
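Robert's proposed check — error out as soon as the tuple's xmax is found to belong to one of the autonomous transaction's own outer transactions, since no amount of waiting can resolve that — can be sketched as follows (a toy model; the xid set and wait callback are stand-ins for the real locking machinery):

```python
class AutonomousDeadlockError(Exception):
    pass

def lock_tuple(tuple_xmax, my_outer_xids, wait):
    """Sketch of the proposed rule: if the tuple's xmax belongs to one
    of our (uncommitted) outer transactions, no wait can ever succeed,
    so raise a specific error instead of blocking."""
    if tuple_xmax in my_outer_xids:
        raise AutonomousDeadlockError(
            "tuple locked by an outer transaction of this "
            "autonomous transaction")
    wait(tuple_xmax)  # otherwise, wait for the locker as usual

# The c=3 update from Noah's example: xmax 100 is the outer transaction.
outer_xids = {100}
try:
    lock_tuple(100, outer_xids, wait=lambda xid: None)
    errored = False
except AutonomousDeadlockError:
    errored = True
assert errored
lock_tuple(200, outer_xids, wait=lambda xid: None)  # unrelated locker: waits
```

The point of doing the test at lock time is that no deadlock detector pass (and no extra PGPROC) is needed to reach the error.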
Re: [HACKERS] DTrace build dependency rules
On Sat, Aug 15, 2015 at 6:45 PM, Mark Johnston ma...@freebsd.org wrote: There seems to be a bug in the make rules when DTrace is enabled. It causes dtrace -G to be invoked twice when building PostgreSQL as a FreeBSD port: once during the build itself, and once during installation. For a long time this has been worked around on FreeBSD with a change to libdtrace itself, but this workaround is proving problematic and I'd like to fix the problem properly. I'm not sure whether the problem has been observed on other operating systems that support DTrace. The bug is in src/backend/Makefile. probes.o, the dtrace(1)-generated object file, depends on the objfiles.txt for each of the backend subdirs. These files depend in turn on the object files themselves; if objfiles.txt is out of date with respect to one of its object files, the mtime of objfiles.txt is updated with touch (see backend/common.mk). The problem is that dtrace -G, which runs at the end of the build, modifies a number of object files (it overwrites their probe sites with NOPs), thus making their corresponding objfiles.txt out of date. Then, when make install traverses the backend subdirs, it updates objfiles.txt, which causes probes.o to be rebuilt, resulting in an error from dtrace(1). Gosh, that's pretty ugly. I would have thought it would be a real no-no to update the .o file once it got generated. If nothing else, a modification to the .c file concurrent with a make invocation might lead to the .o not getting rebuilt the next time make is run. The attached patch fixes the problem by having probes.o depend on object files directly, rather than on objfiles.txt. I've tested it with PostgreSQL 9.0-9.4 on FreeBSD CURRENT. I don't see a particular reason not to make this change, though. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] allowing wal_level change at run time
On Tue, Aug 18, 2015 at 9:50 AM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: On 8/18/15 8:48 AM, Robert Haas wrote: What do you mean by prevent? If the user edits postgresql.conf and reduces the setting, and then reloads the configuration file, they have a right to expect that the changes got applied. We have certain checks in place that require a minimum wal_level before other things are allowed. For example, turning on archiving requires wal_level = archive. The issue is then, if you have archiving on and then turn wal_level to minimal at run time, we need to prevent that to preserve the integrity of the original check. IIRC, the reason for having a wal_level parameter separate from those other ones was precisely that only wal_level had to be PGC_POSTMASTER, and you could change the others if you wanted. If we are going to fix the mechanisms to allow dynamic changing in wal log verbosity, maybe we could simply get rid of wal_level as a user-settable parameter, and have its effective value be inferred from the active settings of the other parameters. Mmm, I like that idea. If we can do it, it seems much cleaner that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
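Tom's suggestion — derive the effective wal_level from the features that need it, instead of keeping it as a separate settable GUC — might look roughly like this (the parameter names below are illustrative assumptions, not an actual proposal):

```python
# Ordered from least to most WAL detail.
LEVELS = ["minimal", "archive", "hot_standby", "logical"]

def effective_wal_level(archive_mode=False, max_wal_senders=0,
                        logical_slots=0):
    """Infer the WAL verbosity required by the active settings:
    each feature bumps the level to at least what it needs."""
    need = "minimal"
    if archive_mode:
        need = max(need, "archive", key=LEVELS.index)
    if max_wal_senders > 0:
        need = max(need, "hot_standby", key=LEVELS.index)
    if logical_slots > 0:
        need = max(need, "logical", key=LEVELS.index)
    return need

assert effective_wal_level() == "minimal"
assert effective_wal_level(archive_mode=True) == "archive"
assert effective_wal_level(archive_mode=True, logical_slots=1) == "logical"
```

With this shape, the "can't drop below what archiving needs" problem disappears by construction: there is no independent knob to set inconsistently.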
[HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
Hello, I noticed ExecChooseHashTableSize() in nodeHash.c fails with Assert(nbuckets > 0) when a crazy number of rows is expected. BACKTRACE: #0 0x003f79432625 in raise () from /lib64/libc.so.6 #1 0x003f79433e05 in abort () from /lib64/libc.so.6 #2 0x0092600a in ExceptionalCondition (conditionName=0xac1ea0 "!(nbuckets > 0)", errorType=0xac1d88 "FailedAssertion", fileName=0xac1d40 "nodeHash.c", lineNumber=545) at assert.c:54 #3 0x006851ff in ExecChooseHashTableSize (ntuples=60521928028, tupwidth=8, useskew=1 '\001', numbuckets=0x7fff146bff04, numbatches=0x7fff146bff00, num_skew_mcvs=0x7fff146bfefc) at nodeHash.c:545 #4 0x00701735 in initial_cost_hashjoin (root=0x253a318, workspace=0x7fff146bffc0, jointype=JOIN_SEMI, hashclauses=0x257e4f0, outer_path=0x2569a40, inner_path=0x2569908, sjinfo=0x2566f40, semifactors=0x7fff146c0168) at costsize.c:2592 #5 0x0070e02a in try_hashjoin_path (root=0x253a318, joinrel=0x257d940, outer_path=0x2569a40, inner_path=0x2569908, hashclauses=0x257e4f0, jointype=JOIN_SEMI, extra=0x7fff146c0150) at joinpath.c:543 See the following EXPLAIN output, taken from a build without --enable-cassert. The planner expects 60.5B rows from the self-join of a relation with 72M rows. (Probably, this estimate is far too high.)
[kaigai@ayu ~]$ (echo EXPLAIN; cat ~/tpcds/query95.sql) | psql tpcds100 QUERY PLAN Limit (cost=9168667273.07..9168667273.08 rows=1 width=20) CTE ws_wh -> Custom Scan (GpuJoin) (cost=3342534.49..654642911.88 rows=60521928028 width=24) Bulkload: On (density: 100.00%) Depth 1: Logic: GpuHashJoin, HashKeys: (ws_order_number), JoinQual: ((ws_warehouse_sk <> ws_warehouse_sk) AND (ws_order_number = ws_order_number)), nrows (ratio: 84056.77%) -> Custom Scan (BulkScan) on web_sales ws1_1 (cost=0.00..3290612.48 rows=72001248 width=16) -> Seq Scan on web_sales ws2 (cost=0.00..3290612.48 rows=72001248 width=16) -> Sort (cost=8514024361.19..8514024361.20 rows=1 width=20) Sort Key: (count(DISTINCT ws1.ws_order_number)) : This crash was triggered by Assert(nbuckets > 0), and nbuckets is calculated as follows. /* * If there's not enough space to store the projected number of tuples and * the required bucket headers, we will need multiple batches. */ if (inner_rel_bytes + bucket_bytes > hash_table_bytes) { /* We'll need multiple batches */ long lbuckets; double dbatch; int minbatch; long bucket_size; /* * Estimate the number of buckets we'll want to have when work_mem is * entirely full. Each bucket will contain a bucket pointer plus * NTUP_PER_BUCKET tuples, whose projected size already includes * overhead for the hash code, pointer to the next tuple, etc. */ bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple)); lbuckets = 1 << my_log2(hash_table_bytes / bucket_size); lbuckets = Min(lbuckets, max_pointers); nbuckets = (int) lbuckets; bucket_bytes = nbuckets * sizeof(HashJoinTuple); : : } Assert(nbuckets > 0); Assert(nbatch > 0); In my case, hash_table_bytes was 101017630802 and bucket_size was 48. So my_log2(hash_table_bytes / bucket_size) = 31, and lbuckets gets a negative number because both the constant 1 and the result of my_log2() are int32, so the shift is done in 32 bits. Min(lbuckets, max_pointers) therefore chooses 0x80000000, which is then assigned to nbuckets and triggers the Assert(). The attached patch fixes the problem.
Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com pgsql-fix-hash-nbuckets.patch Description: pgsql-fix-hash-nbuckets.patch
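The arithmetic of the overflow above can be replayed outside C by emulating 32-bit signed wraparound (the MAX_POINTERS value below is illustrative only, not the server's actual cap):

```python
def to_int32(x):
    """Interpret x as a 32-bit signed C int (two's complement)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

log2 = 31  # my_log2(hash_table_bytes / bucket_size) from the report

# Buggy: the shift happens in 32-bit int arithmetic, so 1 << 31 wraps
# to INT_MIN and lbuckets ends up negative.
lbuckets_buggy = to_int32(1 << log2)
assert lbuckets_buggy == -2147483648   # 0x80000000, trips Assert(nbuckets > 0)

# Fixed (in the spirit of the attached patch): do the shift in a wider
# type, e.g. 1L << ..., and clamp before narrowing to int.
MAX_POINTERS = 1 << 28                 # illustrative cap only
lbuckets_fixed = min(1 << log2, MAX_POINTERS)  # Python ints are unbounded
assert lbuckets_fixed > 0
```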
Re: [HACKERS] Freeze avoidance of very large table.
On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada sawada.m...@gmail.com wrote: I have encountered many cases where pg_stat_statements and pgstattuple are required in production, so I basically agree with moving such extensions into core. But IMO, the diagnostic tools for the visibility map, heap (pageinspect) and so on are a kind of debugging tool. Just because something might be required in production isn't a sufficient reason to put it in core. Debugging tools, or anything else, can be required in production, too. Attached are the latest v11 patches, separated into 2 patches: a frozen bit patch and a diagnostic function patch. Moving the diagnostic function into core is still under discussion, but this patch puts such a function into core because the diagnostic function for the visibility map needs to be in core at least to run the regression tests. As has been discussed recently, there are other ways to handle that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On Tue, Aug 18, 2015 at 6:07 AM, PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at wrote: On 18 Aug 2015, at 11:19, Albe Laurenz laurenz.a...@wien.gv.at wrote: Hans-Jürgen Schönig wrote: in addition to that you have the “problem” of transactions. if you failover in the middle of a transaction, strange things might happen from the application point of view. the good thing, however, is that stupid middleware is sometimes not able to handle failed connections. however, overall i think it is more of a danger than a benefit. Maybe I misunderstood the original proposal, but my impression was that the alternative servers would be tried only at the time the connection is established, and there would be no such problems as you describe. it would still leave the problem of having a read-only server on the other side unless you are using BDR or so. That doesn't make this a bad idea. Some people are using replication solutions that can cope with this already (EDB has a proprietary product, and I'm sure there are people using BDR, too) and, as the solutions get better and more widely deployed, more people will want to do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: SCRAM authentication
On Tue, Aug 18, 2015 at 10:06 AM, Stephen Frost sfr...@snowman.net wrote: That was the impetus for my earlier suggestion that in a release or two, we actively make pg_upgrade error (or perhaps warn first, then error, across two releases) if any of the old verifiers exist. I think there's basically no chance of that being acceptable. The time at which it's possible to get rid of MD5 is going to vary widely between installations. People who are using only libpq or libpq-derived connectors will be able to get rid of it almost immediately, if they want, though some won't. People who are using obscure connectors that are poorly maintained may not even have a version that supports SCRAM for 5 years. Think about how long it took us to roll out the standard_conforming_strings changes, and there were still people who got bitten. The other concern with a single password verifier is that we're locking ourselves into a one-verifier-per-role solution when most of the serious solutions in use today (Active Directory, Kerberos, certificate based systems) allow for more than one. So what? If you want to delegate authentication to AD or Kerberos, we already support that. That's not a reason to invent the same functionality inside the server. If you've got a tangible plan, other than SCRAM, that would require us to support multiple verifiers, then please say what it is. If not, the mere fact that some other people support it is not a reason why we should. In fact, we generally have taken the approach that needs which are already handled adequately by other tools do not need to also be handled inside the database. That's not a perfect approach and we always argue about it around the edges, but generally, I think it's served us pretty well.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On 17 August 2015 at 23:18, Victor Wagner vi...@wagner.pp.ru wrote: Rationale = Since introduction of the WAL-based replication into the PostgreSQL, it is possible to create high-availability and load-balancing clusters. However, there is no support for failover in the client libraries. So, only way to provide transparent for client application failover is IP address migration. This approach has some limitation, i.e. it requires that master and backup servers reside in the same subnet or may not be feasible for other reasons. This is not completely true, you can always use something like pgbouncer or other middleware to change the server to which clients connect. you still need to solve the fact that you will have a read-only server at the other side. something like repmgr + pgbouncer will work fine. i agree that once/if we ever have multimaster included then this could be a good idea -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
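Connect-time-only failover, as discussed in this thread, can be sketched with a host list and an optional writability probe. Both callbacks below are hypothetical stand-ins for illustration (libpq would do the equivalent of connecting and then checking something like SELECT pg_is_in_recovery()); this is not the proposal's actual API:

```python
def connect_failover(hosts, connect, check_writable=None):
    """Try each host in turn at connection time only (no
    mid-transaction reconnect).  'connect' and 'check_writable'
    are hypothetical callbacks standing in for libpq internals."""
    last_err = None
    for host in hosts:
        try:
            conn = connect(host)
        except OSError as e:
            last_err = e
            continue
        # Optionally skip standbys so we don't hand back a
        # read-only server when the application needs to write.
        if check_writable and not check_writable(conn):
            last_err = OSError("%s is read-only (standby)" % host)
            continue
        return conn
    raise last_err or OSError("no hosts given")

# Stub demo: first host down, second is a standby, third is a primary.
state = {"a": None, "b": "standby", "c": "primary"}
def connect(h):
    if state[h] is None:
        raise OSError("connection refused")
    return h

conn = connect_failover(["a", "b", "c"], connect,
                        check_writable=lambda c: state[c] == "primary")
assert conn == "c"
```

Because failover happens only while establishing the connection, the mid-transaction concerns raised earlier in the thread do not arise: an established connection that later fails simply returns an error to the client.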
Re: [HACKERS] Potential GIN vacuum bug
On Mon, Aug 17, 2015 at 5:41 PM, Jeff Janes jeff.ja...@gmail.com wrote: User backends attempt to take the lock conditionally, because otherwise they would cause an autovacuum already holding the lock to cancel itself, which seems quite bad. Note that this is a substantial behavior change, in that with this code the user backends which find the list already being cleaned will just add to the end of the pending list and go about their business. So if things are added to the list faster than they can be cleaned up, the size of the pending list can increase without bound. Under the existing code each concurrent user backend will try to clean the pending list at the same time. The work doesn't parallelize, so doing this just burns CPU (and possibly consumes up to maintenance_work_mem for *each* backend), but it does serve to throttle the insertion rate and so keep the list from growing without bound. This is just a proof-of-concept patch, because I don't know if this approach is the right approach. I'm not sure if this is the right approach, but I'm a little wary of involving the heavyweight lock manager in this. If pending list cleanups are frequent, this could involve a lot of additional lock manager traffic, which could be bad for performance. Even if they are infrequent, it seems like it would be more natural to handle this with some regime of locks and pins and buffer cleanup locks on the buffers that are storing the pending list, rather than a heavyweight lock on the whole relation. But I am just waving my hands wildly here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
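The conditional-lock behavior described above (one backend cleans, everyone else just appends and moves on) is a generic pattern. A toy sketch with invented names, using a non-blocking lock acquire in place of the heavyweight lock manager and a plain list in place of the GIN pending list:

```python
import threading

pending = []                       # stands in for the GIN pending list
pending_lock = threading.Lock()    # guards the list contents
cleanup_lock = threading.Lock()    # stands in for the heavyweight lock

def insert(item):
    """Append to the pending list, then try a *conditional* cleanup:
    never block, and never force whoever already holds the lock (e.g.
    an autovacuum) to give it up -- just skip cleanup and return."""
    with pending_lock:
        pending.append(item)
    if cleanup_lock.acquire(blocking=False):
        try:
            clean_pending_list()
        finally:
            cleanup_lock.release()

def clean_pending_list():
    # In GIN this would move pending entries into the main index
    # structure; here we simply drain the list.
    with pending_lock:
        pending.clear()
```

This also makes the unbounded-growth concern concrete: while the cleanup lock is held, inserters keep appending without throttling, which is exactly the trade-off Jeff points out.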
Re: [HACKERS] DTrace build dependency rules
Robert Haas wrote: On Sat, Aug 15, 2015 at 6:45 PM, Mark Johnston ma...@freebsd.org wrote: The bug is in src/backend/Makefile. probes.o, the dtrace(1)-generated object file, depends on the objfiles.txt for each of the backend subdirs. These files depend in turn on the object files themselves; if objfiles.txt is out of date with respect to one of its object files, the mtime of objfiles.txt is updated with touch (see backend/common.mk). The problem is that dtrace -G, which runs at the end of the build, modifies a number of object files (it overwrites their probe sites with NOPs), thus making their corresponding objfiles.txt out of date. Then, when make install traverses the backend subdirs, it updates objfiles.txt, which causes probes.o to be rebuilt, resulting in an error from dtrace(1). Gosh, that's pretty ugly. I would have thought it would be a real no-no to update the .o file once it got generated. If nothing else, a modification to the .c file concurrent with a make invocation might lead to the .o not getting rebuilt the next time make is run. I had the same thought, and wondered for a bit whether we should instead have the compilation rules produce some intermediate file (prior to dtrace fumbling), then emit the .o from dtrace -G. OTOH this might be more trouble than it's worth for a feature that doesn't see a lot of use. The attached patch fixes the problem by having probes.o depend on object files directly, rather than on objfiles.txt. I've tested it with PostgreSQL 9.0-9.4 on FreeBSD CURRENT. I don't see a particular reason not to make this change, though. I wanted to test it on Linux yesterday but didn't get any further than installing a couple of packages. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
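The shape of the fix can be sketched as a make rule. This is an illustration only, not the actual src/backend/Makefile: `ALL_BACKEND_OBJS` is a hypothetical variable standing for the expanded per-subdirectory object list, and the dtrace invocation is the conventional `-G` form.

```make
# Sketch only (names are illustrative): make probes.o depend on the
# object files themselves rather than on the objfiles.txt stamp files.
# The stamps get re-touched after dtrace -G has already rewritten the
# objects, which is what spuriously re-triggered probes.o at install.
probes.o: utils/probes.d $(ALL_BACKEND_OBJS)
	$(DTRACE) $(DTRACEFLAGS) -G -s $< -o $@ $(ALL_BACKEND_OBJS)
```

Because the .o files' mtimes are only bumped by dtrace -G itself (which runs as part of building probes.o), depending on them directly breaks the touch-stamp feedback loop.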
Re: [HACKERS] allowing wal_level change at run time
On Tue, Aug 18, 2015 at 12:27 PM, Peter Eisentraut pete...@gmx.net wrote: On 8/18/15 9:50 AM, Tom Lane wrote: IIRC, the reason for having a wal_level parameter separate from those other ones was precisely that only wal_level had to be PGC_POSTMASTER, and you could change the others if you wanted. I think you are thinking of having split archive_mode/archive_command, so you can change archive_command at run time. If we are going to fix the mechanisms to allow dynamic changing in wal log verbosity, maybe we could simply get rid of wal_level as a user-settable parameter, and have its effective value be inferred from the active settings of the other parameters. That would be nice, but those other parameters aren't really things that we can easily look at. In the old days, you could say that archive_mode = on was a pretty sure sign that you'd need wal_level = archive, but nowadays you can run replication without archiving. We could dial wal_level up and down every time someone connects to stream WAL, but that would surely lead to complications. Also, we have no way of knowing whether someone needs wal_level hot_standby until the WAL is on the standby and the standby decides to set hot_standby = on. The permutations of what could possibly influence this setting are quite enormous, if you consider archiving, streaming, hot standby, hot standby feedback, replication slots, etc., and synchronizing all of that sounds like a mess. If archive_mode=on or max_wal_senders > 0, then we need at least wal_level=archive. Otherwise wal_level=minimal is enough. If we eliminate the distinction between wal_level=archive and wal_level=hot_standby, then we just need a way to distinguish between hot_standby and logical. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
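The inference rule at the end of the message can be written down directly. A toy sketch, not server code: the parameter names are real GUCs, but the function and the folding of archive into hot_standby are assumptions taken from the message itself.

```python
def effective_wal_level(archive_mode, max_wal_senders, logical_decoding):
    """Infer the minimum wal_level from related settings, per the rule
    in the message: archiving or any WAL sender needs at least
    'archive' (here merged with 'hot_standby', as proposed), logical
    decoding needs 'logical', and everything else can run 'minimal'."""
    if logical_decoding:
        return "logical"
    if archive_mode or max_wal_senders > 0:
        return "hot_standby"  # assumes archive/hot_standby are merged
    return "minimal"
```

The open problem in the thread is precisely that some of these inputs (a standby's hot_standby setting, future streaming clients) are not visible to the primary at the time the decision must be made.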
Re: [HACKERS] Colon Omitted From pgSQL Assignments
Hi 2015-08-18 17:41 GMT+02:00 Charles Sheridan cesh...@swbell.net: Hi, I was looking at the PL/pgSQL documentation and realized that, contrary to the spec, I've been omitting the colon ':' from assignments, e.g. writing 'x = 5' rather than the correct 'x := 5'. I don't see any error messages about this. This is an undocumented feature that should not be removed, due to possible compatibility issues. See the comments from gram.y: * Ada-based PL/SQL uses := for assignment and variable defaults, while * the SQL standard uses equals for these cases and for GET * DIAGNOSTICS, so we support both. FOR and OPEN only support :=. Personally I think this design is unfortunate, but in the end it is a small problem, not important enough to solve. Regards Pavel I am not aware of any problems due to this. I suppose that if a condition precedes this syntax, the code could be interpreted as an equality test. Should I immediately update all assignments? Are there known scenarios where this error becomes a problem? Regards, Charles
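Both spellings accepted by gram.y can be seen side by side in a short anonymous block. This is an untested sketch that needs a PostgreSQL server to run:

```sql
DO $$
DECLARE
    x integer;
    y integer;
BEGIN
    x := 5;   -- the documented, Ada-style assignment
    y = 6;    -- also accepted, per the gram.y comment quoted above
    RAISE NOTICE 'x = %, y = %', x, y;
END
$$;
```

Note that FOR bounds and OPEN ... FOR cursors accept only the `:=` form, so code using bare `=` is not uniformly portable within PL/pgSQL itself.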
Re: [HACKERS] missing documentation for partial WAL files
On Mon, Aug 17, 2015 at 2:50 PM, Peter Eisentraut pete...@gmx.net wrote: The commit message for de76884 contains some important information about the purpose and use of the new .partial WAL files. But I don't see anything about this in the documentation or another user-visible place. We should probably add something. +1. Heikki talked about this at his PGCon presentation, and I thought that was pretty helpful stuff, but not everyone was there (or will remember what he said when the time comes that they need to know). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Test code is worth the space
On Tue, Aug 18, 2015 at 2:16 PM, David Fetter da...@fetter.org wrote: I'm given to understand that this tight coupling is necessary for performance. Are you saying that it could be unwound, or that testing strategies mostly need to take it into account, or...? I'm just saying that we shouldn't expect to find a magic bullet test framework that solves all these problems. Without restructuring code, which I don't think is really feasible, we won't be able to have good unit test coverage for most existing code. It might be more practical to start using such a new tool for new code only. Then the new code could be structured in ways that allow the environment to be mocked more easily and the results observed more easily. -- greg