Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
On Sat, Nov 12, 2016 at 10:12 PM, Pavan Deolasee wrote:
>
> On Tue, Nov 8, 2016 at 9:13 AM, Haribabu Kommi wrote:
>>
>> Thanks for the patch. This shows a very good performance improvement.
>>
> Thank you. Can you please share the benchmark you ran, results and
> observations?

I just ran a performance test on my laptop with minimal configuration; it
didn't show much improvement. Currently I don't have access to a big
machine to test the performance.

>> I started reviewing the patch, and during this process I ran the
>> regression test on the WARM patch. I observed a failure in the
>> create_index test. This may be a bug in the code, or expected output
>> that needs to be corrected.
>
> Can you please share the diff? I ran regression after applying the patch
> on the current master and did not find any change. Does it happen
> consistently?

Yes, it is happening consistently. I ran make installcheck. Attached the
regression.diffs file with the failed test. I applied the previous WARM
patch on this commit - e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15

Regards,
Hari Babu
Fujitsu Australia

regression.diffs
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress
Hello,

At Mon, 14 Nov 2016 16:53:35 +0530, Amit Kapila wrote in
> On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier wrote:
>>> Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
>>> This function is called not only in LogStandbySnapshot(), but during
>>> DDL operations as well where lockmode >= AccessExclusiveLock.
>>
>> This does not remove any record from WAL. So theoretically any
>> kind of record can be NO_PROGRESS, but practically only as long as
>> checkpoints are not unreasonably suppressed. Any explicit
>> database operation must be accompanied by at least a commit
>> record that triggers a checkpoint. NO_PROGRESSing there doesn't
>> seem to me to harm database durability for this reason.
>>
> By this theory, you can even mark the insert record as no progress,
> which is not good.

Of course. So we carefully choose the kinds of records to be so. If we
marked all xlog records as SKIP_PROGRESS, archive_timeout would become
useless, and as a result vacuum might leave a certain number of records
unremoved for a possibly problematic length of time.

>>> The objective of this patch is skipping WALs in the completely-idle
>>> state, and the NO_PROGRESSing is necessary for it to do its work. Of
>>> course we can distinguish exclusive locks with PROGRESS and
>>> without PROGRESS, but that is unnecessary complexity.
>>
>> The point that applies here is that logging the exclusive lock
>> information is necessary for the *standby* recovery conflicts, not the
>> primary, which is why it should not influence the checkpoint activity
>> that is happening on the primary. So marking this record with
>> NO_PROGRESS is actually fine to me.
>
> The progress parameter is used not only for checkpoint activity but by
> bgwriter as well for logging standby snapshots. If you want to keep
> this record under the no_progress category (which I don't endorse), then
> it might be better to add a comment, so that it will be easier for the
> readers of this code to understand the reason.

I rather agree with that. But how detailed should it be?

> LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
> {
> ...
>     XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
>     /* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot */
>     XLogSetFlags(XLOG_SKIP_PROGRESS);

or

>     /*
>      * Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot.
>      * See the comment for LogCurrentRunningXact for the details.
>      */

or more detailed?

The term "WAL activity" is used in the comment for GetProgressRecPtr, but
its meaning is not well defined. It might need a somewhat more detailed
explanation, or a definition of "WAL activity tracking".

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Let me try to be more clear. I will not commit this patch if it is not
> properly commented. I doubt that anyone else will, either.
>
> The fact that those code changes already exist in 9.4+ is not a reason to
> back-port them to earlier releases without a proper explanation of why we
> are doing it. Very possibly, we should also improve the comments in newer
> branches so that future authors don't reintroduce whatever bugs were fixed
> by these changes. But whether we do that or not, I am not going to commit
> uncommented patches to complex code in order to fix obscure bugs in
> 3+-year-old branches. I think that is a non-starter.

OK. Although I'm not perfectly sure what to add as a comment, I added an
example scenario as a comment, because I thought a concrete situation
would help in understanding the existing two paragraphs. Is this good?

Regards
Takayuki Tsunakawa

cascading_standby_stuck_v2.patch
Description: cascading_standby_stuck_v2.patch
Re: [HACKERS] Adding the optional clause 'AS' in CREATE TRIGGER
"Okano, Naoki"writes: > I would like to add the following support for a trigger. > This implementation enables to create a trigger efficiently > in single command. > It had been discussed before. The link is as shown below. > https://www.postgresql.org/message-id/CAA-aLv4m%3Df9cc1zcUzM49pE8%2B2NpytUDraTgfBmkTOkMN_wO2w%40mail.gmail.com I think people pretty much lost interest in that proposal, which is why it never got finished. Aside from the definitional issues discussed in that thread, I can think of a few other problems: 1. The impression I have is that most people write trigger functions so that they can be shared across multiple uses. That'd be impossible with anonymous trigger function blocks. So you'd be trading off merging two commands into one, versus having to write out the whole body of the trigger multiple times, which wouldn't be much of a win. 2. I am concerned that every time we add any syntax to CREATE FUNCTION, we'll have to think about whether we need to add it to CREATE TRIGGER. (Or more likely, we'll forget and then users will complain.) 3. There's a lot of infrastructure that's built up around CREATE FUNCTION, such as psql's \ef and \sf commands. We'd soon find ourselves having to duplicate all that for triggers. So personally I feel that the value-for-work-expended ratio here is pretty low. But in any case it would be a serious mistake to do this without first implementing CREATE OR REPLACE TRIGGER. I think that's an entirely separate proposal and you would be well advised to treat it as such. It's far from trivial because of locking considerations. You probably have to take an exclusive lock on the owning relation in order to change trigger properties. (An advantage of the separate-function-definition approach is that you can replace the function body with a much weaker lock.) > Supporting 'AS' clause in CREATE TRIGGER syntax will enable the option of > defining the trigger in single command. > As a bonus, it will be compatible with oracle. 
It certainly would not match Oracle's syntax for trigger bodies. It might slightly reduce the conversion pain, but only slightly. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
> It looks like the code in 9.3 and later versions uses recptr as the
> target segment location (targetSegmentPtr), whereas 9.2 uses recptr as
> the beginning of the segment (readOff = 0;). If the above understanding
> is right, then it will set different values for latestPagePtr in 9.2
> versus the 9.3-onwards code.

In 9.2, the relevant variable is not recptr but recaddr. recaddr in 9.2
and recptr in later releases point to the beginning of the page just
read, which is not always the beginning of the segment
(targetSegmentPtr).

Regards
Takayuki Tsunakawa
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On Mon, Nov 14, 2016 at 10:21:53AM -0800, Peter Geoghegan wrote:
> BTW, I recently noticed that the latest version of Valgrind, 3.12,
> added this new feature:
>
> * Memcheck:
>
>   - Added meta mempool support for describing a custom allocator which:
>     - Auto-frees all chunks, assuming that destroying a pool destroys
>       all objects in the pool
>     - Uses itself to allocate other memory blocks
>
> It occurred to me that we might be able to make good use of this.

PostgreSQL memory contexts don't acquire blocks from other contexts; they
all draw from malloc() directly. Therefore, I don't see this feature
giving us something that the older Memcheck MEMPOOL support does not give
us.
Re: [HACKERS] Remove the comment on the countereffectiveness of large shared_buffers on Windows
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
> Okay, not a problem. However, I am not sure the results in this thread
> are sufficient proof: for the read-only tests there is no noticeable win
> from increasing shared buffers, and the read-write tests seem to be
> quite short (60 seconds) to rely on.

I think the reason increasing shared_buffers didn't give better
performance for the read-only tests than you expect is that the relation
files are cached in the filesystem cache. The purpose of this
verification is to show that the effective upper limit is not 512MB
(which is too small now), and I think that purpose is achieved. There may
be another threshold, say 32GB or 128GB, over which performance degrades
due to the PostgreSQL implementation, but that's another topic, which
also applies to other OSes.

How about 3 minutes for read-write tests? How long do you typically run?

Regards
Takayuki Tsunakawa
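For reference, a 3-minute read-write run can be expressed directly with pgbench's duration option; the scale factor and client counts below are arbitrary illustrations, not values taken from this thread:

```shell
# one-time initialization at an arbitrary scale factor
pgbench -i -s 300 bench

# TPC-B-like read-write run: 32 clients, 8 threads, 180 seconds
pgbench -c 32 -j 8 -T 180 bench
```

Longer runs like this smooth over checkpoint and autovacuum noise that can dominate a 60-second measurement.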
[HACKERS] Adding the optional clause 'AS' in CREATE TRIGGER
Hi,

I would like to add the following support for a trigger. This
implementation enables creating a trigger efficiently in a single
command. It had been discussed before. The link is as shown below.
https://www.postgresql.org/message-id/CAA-aLv4m%3Df9cc1zcUzM49pE8%2B2NpytUDraTgfBmkTOkMN_wO2w%40mail.gmail.com

Currently, PostgreSQL requires two steps to create a trigger:
1. create a function;
2. define a trigger with the action specified via the already created
   function.

Supporting an 'AS' clause in the CREATE TRIGGER syntax will enable the
option of defining the trigger in a single command. As a bonus, it will
be compatible with Oracle.

Also, the optional clause 'OR REPLACE' is required, as below.
https://www.postgresql.org/message-id/CAA-aLv6KYgVt2CwaRdcnptzWVngEm72Cp4mUFnF-MfeH0gS91g%40mail.gmail.com

Currently, to change the definition of a trigger, the trigger needs to be
dropped first before creating it again with the new definition. To change
the definition of a function used in CREATE TRIGGER, the trigger likewise
needs to be dropped and re-created. So we need to add the optional clause
'OR REPLACE'.

Adding the optional clauses 'AS' and 'OR REPLACE' to the CREATE TRIGGER
syntax gives the comfort of defining the trigger, or redefining the
trigger definition including the contained function definition, in a
single command.

Here is the syntax based on the previous discussion:

CREATE [ OR REPLACE ] [ CONSTRAINT ] TRIGGER name
    { BEFORE | AFTER | INSTEAD OF } { event [ OR ... ] }
    ON table_name
    [ FROM referenced_table_name ]
    [ NOT DEFERRABLE | [ DEFERRABLE ] { INITIALLY IMMEDIATE | INITIALLY DEFERRED } ]
    [ FOR [ EACH ] { ROW | STATEMENT } ]
    [ WHEN ( condition ) ]
    { EXECUTE PROCEDURE function_name ( arguments )
    | AS 'trigger function definition' [ LANGUAGE lang_name ]
      [ SET configuration_parameter { TO value | = value | FROM CURRENT } ] }

If you have an opinion on this concept, please let me know.
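As a purely hypothetical illustration of how the proposed syntax might be used (this is not accepted PostgreSQL syntax; the table, column, and trigger names are invented, and PL/pgSQL is just one possible language):

```sql
-- Invented example of the proposed single-command form.
CREATE OR REPLACE TRIGGER audit_trg
    AFTER UPDATE ON accounts
    FOR EACH ROW
    AS $$
BEGIN
    INSERT INTO audit_log(account_id, changed_at) VALUES (NEW.id, now());
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```

Under the proposal, re-running the command with a changed body would replace both the trigger and its anonymous function in one step, instead of requiring DROP TRIGGER plus CREATE FUNCTION plus CREATE TRIGGER.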
Regards,
Okano Naoki
Fujitsu
Re: [HACKERS] Patch: Implement failover on libpq connect level.
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Mithun Cy
> Thanks. My concern is: suppose you have 3 servers in a cluster, A (new
> version), B (new version), C (old version). If we implement it as above,
> only the new servers will send the ParameterStatus message indicating
> what type of server we are connected to. Server C will not send the
> same. So we will not be able to use the new feature "failover to new
> master" for such a cluster.

No, streaming replication requires the same major release for all member
servers, so there's no concern about a mixed-version cluster.

Sorry, pmState can only be used in postmaster. In our context, postgres
can use RecoveryInProgress(). Anyway, in addition to the reduced round
trip, the libpq code would be much simpler.

Regards
Takayuki Tsunakawa
Re: [HACKERS] 9.6 TAP tests and extensions
On 14 November 2016 at 14:55, Craig Ringer wrote:
> On 27 October 2016 at 00:42, Robert Haas wrote:
>> On Wed, Oct 26, 2016 at 7:17 AM, Andres Freund wrote:
>>> On 2016-09-23 16:04:32 -0400, Tom Lane wrote:
>>>> Looking back over the thread, I see that you also proposed installing
>>>> isolationtester and pg_isolation_regress for the benefit of
>>>> extensions. I'm very much less excited about that idea. It'd be
>>>> substantially more dead weight in typical installations, and I'm not
>>>> sure that it'd be useful to common extensions, and I'm not eager to
>>>> treat isolationtester's API and behavior as something we need to hold
>>>> stable for extension use.
>>>
>>> FWIW, I'd be quite happy if it were installed. Running isolationtester
>>> when compiling extensions against distribution postgres packages would
>>> be quite useful.
>>
>> +1.
>
> Patch attached.

As Andres pointed out elsewhere, this is only half the picture. It should
really add PGXS support for ISOLATION_TESTS and bring up the test harness
too.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] PATCH: two slab-like memory allocators
Attached is v6 of the patch series, fixing most of the points:

* common bits (valgrind/randomization/wipe) moved to memdebug.h/c

Instead of introducing a new header file, I've added the prototypes to
memdebug.h (which was already used for the valgrind stuff anyway), and
the implementations to a new memdebug.c file. Not sure what you meant by
"static inlines" though.

So the patch series now has three parts - 0001 with memdebug stuff, 0002
with slab and 0003 with gen (still a poor name).

* removing AllocSet references from both new memory contexts

* using FLEXIBLE_ARRAY_ELEMENT in SlabContext

* using dlist instead of the custom linked list

I've however kept SlabContext->freelist as an array, because there may be
many blocks with the same number of free chunks, in which case moving the
block in the list would be expensive. This way it's simply dlist_delete +
dlist_push.

* use StandardChunkHeader instead of the common fields

* removing pointer to context from block header for both contexts

* fix format in FreeInfo/AllocInfo (including for AllocSet)

* improved a bunch of comments (bitmap size, chunksPerBlock formula)

* did a pgindent run on the patch

* implement the missing methods in Gen (Stats/Check)

* fix a few minor bugs in both contexts

I haven't done anything with hiding pointers behind typedefs, because I
don't know what's so wrong about that. I also haven't done anything with
the bitmap access in SlabAlloc - I haven't found any reasonable case when
it would be measurable, and I don't expect this to be even measurable in
practice.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

slab-allocators-v6.tgz
Description: application/compressed-tar
[HACKERS] DECLARE STATEMENT setting up a connection in ECPG
Hi,

To make developing ECPG applications more efficient and to improve
portability, I'd like to suggest DECLARE STATEMENT support in ECPG.

DECLARE STATEMENT is a statement that lets users declare an identifier
pointing to a connection. This identifier will be used in other embedded
dynamic SQL statements such as PREPARE, EXECUTE, DECLARE CURSOR and so
on. (Oracle implements this.)
https://docs.oracle.com/cd/B10501_01/appdev.920/a42525/apf.htm#declare_stmt

Under the current system, a user must use the AT clause in every SQL
statement when executing dynamic SQL on a non-default connection.
https://www.postgresql.org/docs/current/static/ecpg-connect.html

When a user needs a non-default connection, the AT clause can be used in
DECLARE STATEMENT once and need not appear in every dynamic SQL
statement. This helps a user write ECPG applications easily and
efficiently, without explicitly designating a connection for each SQL
statement. Moreover, writing code without designating the connection
explicitly improves portability.

[Use-case]
It is very useful when the data needed for a report or business decision
is spread across several data sources and one application needs to
connect to multiple database servers. Especially these days, multiple
database servers are easily set up without much time and cost, because
virtualization technologies such as Docker and microservices are in
fashion. This trend leads to the growing importance of this handy
connection-switching function.

[Interface]
The syntax for DECLARE STATEMENT in ECPG is as follows:

EXEC SQL [ AT connection-name ] DECLARE statement-name STATEMENT

, where "statement-name" is an SQL identifier and "connection-name"
points to the connection which will be used to execute the dynamic SQL
statements.

[Example]
EXEC SQL AT con1 DECLARE sql_stmt STATEMENT
EXEC SQL DECLARE cursor_name CURSOR FOR sql_stmt
EXEC SQL PREPARE sql_stmt FROM :dyn_string

[System Design Plan]
To support the above functionality, the ecpg precompiler should support:
- understanding the DECLARE STATEMENT syntax
- translating the DECLARE STATEMENT into a new function with parameters.
  These parameters carry information such as connection_name and
  statement_name.
- The function itself is a new function defined in the ECPG library.

The following functions are going to be modified:
- ECPGprepare
- ECPGdeallocate
- ECPGdescribe
- ECPGdo

But I think there is room for discussing the modification of ECPGdo,
because it is a very common function that handles many SQL statements,
including SELECT, INSERT, EXECUTE, cursors and so on.

It seems to me there has been no discussion on this topic. But if there
has, would you let me know?

Regards.
Ideriha Takeshi, Fujitsu (Fujitsu Enterprise Postgres)
Re: [HACKERS] Do we need use more meaningful variables to replace 0 in catalog head files?
Andrew Dunstan writes:
> On 11/13/2016 11:11 AM, Tom Lane wrote:
>> 1. Are we going to try to keep these things in the .h files, or split
>> them out? I'd like to get them out, as that eliminates both the need
>> to keep the things looking like macro calls, and the need for the data
>> within the macro call to be at least minimally parsable as C.

> That would work fine for pg_proc.h, less so for pg_type.h where we have
> a whole lot of
> #define FOOOID nn
> directives in among the data lines. Moving these somewhere remote from
> the catalog lines they relate to seems like quite a bad idea.

We certainly don't want multiple files to be sources of truth for that.
What I was anticipating is that those #define's would also be generated
from the same input files, much as fmgroids.h is handled today. We could
imagine driving the creation of a macro off an additional, optional field
in the data entries, say "macro=FOOOID", if we want only selected entries
to have #defines. Or we could do like we do with pg_proc.h and generate
macros for everything according to some fixed naming rule. I could see
approaching pg_type that way, but am less excited about pg_operator,
pg_opclass, etc, where we only need macros for a small fraction of the
entries.

>> 2. Andrew's example above implies some sort of mapping between the
>> keywords and the actual column names (or at least column positions).
>> Where and how is that specified?

> There are several possibilities. The one I was leaning towards was to
> parse out the Anum_pg_foo_* definitions.

I'm okay with that if the field labels used in the data entries are to be
exactly the same as the column names. Your example showed abbreviated
names (omitting "pro"), which is something I doubt we want to try to
hard-wire a rule for. Also, if we are going to abbreviate at all, I think
it might be useful to abbreviate *a lot*, say like "v" for "provolatile",
and that would be something that ought to be set up with some explicit
manually-provided declarations.

>> 3. Also where are we going to provide the per-column default values?
>> How does the converter script know which columns to convert to type
>> oids, proc oids, etc? Is it going to do any data validation beyond
>> that, and if so on what basis?

> a) something like DATA_DEFAULTS( foo=bar );
> b) something like DATA_TYPECONV ( rettype argtypes allargtypes );

I'm thinking a bit about per-column declarations in the input file, along
the lines of this for provolatile:

declare v col=15 type=char default='v'

Some of those items could be gotten out of pg_proc.h, but not all. I
guess a second alternative would be to add the missing info to pg_proc.h
and have the conversion script parse it out of there.

>> I think we want to do them all. pg_proc.h is actually one of the
>> easier catalogs to work on presently, IMO, because the only kind of
>> cross-references it has are type OIDs. Things like pg_amop are a mess.
>> And I really don't want to be dealing with multiple notations for
>> catalog data. Also I think this will be subject to Polya's paradox:
>> designing a general solution will be easier and cleaner than a hack
>> that works only for one catalog.

> I don't know that we need to handle everything at once, as long as the
> solution is sufficiently general.

Well, we could convert the catalogs one at a time if that seems useful,
but I don't want to be rewriting the bki-generation script repeatedly.

regards, tom lane
Re: [HACKERS] [PATCH] Allow TAP tests to be run individually
On Tue, Nov 15, 2016 at 12:02 AM, Peter Eisentraut wrote:
> On 11/14/16 3:52 AM, Michael Paquier wrote:
>> I don't mind. This patch uses the following pattern:
>>     $(or $(PROVE_TESTS),t/*.pl)
>> While something more widespread in the Postgres source would be:
>>     $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
>> It seems to me that we'd prefer that for consistency, but I see no
>> reason not to keep your patch as well. I am marking that as ready for
>> committer.
>
> $(or ...) is a newer feature of GNU make, so we have avoided it so
> far. I have committed your v2 with $(if ...).

Thanks, I am just going to use it...
--
Michael
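With this committed, an individual TAP test can be selected through the PROVE_TESTS variable; a typical invocation might look like the following (the directory and test script shown are just one example from the tree):

```shell
# run only one TAP test script instead of everything under t/
make -C src/test/recovery check PROVE_TESTS='t/001_stream_rep.pl'
```

When PROVE_TESTS is unset, the $(if ...) fallback expands to t/*.pl and the whole suite runs as before.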
Re: [HACKERS] Something is broken about connection startup
I wrote:
> Robert Haas writes:
>> I assume you're going to back-patch this, and the consequences of
>> failing to reset it before going idle could easily be vastly worse
>> than the problem you're trying to fix. So I'd rather not make
>> assumptions like "the client will probably never sleep between Bind
>> and Execute". I bet that's actually false, and I certainly wouldn't
>> want to bet the farm on it being true.

> No, I'm not at all proposing to assume that. But I may be willing to
> assume that "we don't hold a CatalogSnapshot between Bind and Execute
> unless we're also holding a transaction snapshot". I need to do a bit
> more research to see if that's actually true, though.

Turns out it's not true. I still don't much like having the main loop in
PostgresMain know about this hack, so I experimented with putting the
InvalidateCatalogSnapshot() calls into places in postgres.c that were
already dealing with transaction commit/abort or snapshot management. I
ended up needing five such calls (as in the attached patch), which is
just about equally ugly. So at this point I'm inclined to hold my nose
and stick a call into step (1) of the main loop instead.

Also, wherever we end up putting those calls, is it worth providing a
variant invalidation function that only kills the catalog snapshot when
it's the only one outstanding? (If it isn't, the transaction snapshot
should be older, so there's no chance of advancing our xmin by killing
it.) In principle this would save some catalog snapshot rebuilds for
inside-a-transaction-block cases, but I'm not sure it's worth sweating
that when we're doing client message exchange anyway.

Lastly, I find myself disliking the separate CatalogSnapshotStale flag
variable. The other special snapshots in snapmgr.c are managed by
setting the pointer to NULL when it's not valid, so I wonder why
CatalogSnapshot wasn't done that way. Since this patch is touching
almost every use of that flag already, it wouldn't take much to switch
it over.
Comments? regards, tom lane diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 599874e..c9da8ed 100644 *** a/src/backend/tcop/postgres.c --- b/src/backend/tcop/postgres.c *** exec_parse_message(const char *query_str *** 1352,1357 --- 1352,1358 /* Done with the snapshot used for parsing */ if (snapshot_set) PopActiveSnapshot(); + InvalidateCatalogSnapshot(); } else { *** exec_bind_message(StringInfo input_messa *** 1778,1783 --- 1779,1785 /* Done with the snapshot used for parameter I/O and parsing/planning */ if (snapshot_set) PopActiveSnapshot(); + InvalidateCatalogSnapshot(); /* * And we're ready to start portal execution. *** exec_execute_message(const char *portal_ *** 2000,2005 --- 2002,2012 * those that start or end a transaction block. */ CommandCounterIncrement(); + + /* + * Flush catalog snapshot after every query, too. + */ + InvalidateCatalogSnapshot(); } /* Send appropriate CommandComplete to client */ *** finish_xact_command(void) *** 2456,2461 --- 2463,2471 CommitTransactionCommand(); + /* Flush catalog snapshot if any */ + InvalidateCatalogSnapshot(); + #ifdef MEMORY_CONTEXT_CHECKING /* Check all memory contexts that weren't freed during commit */ /* (those that were, were checked before being deleted) */ *** PostgresMain(int argc, char *argv[], *** 3871,3876 --- 3881,3888 */ AbortCurrentTransaction(); + InvalidateCatalogSnapshot(); + if (am_walsender) WalSndErrorCleanup(); diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c index 1ec9f70..842e135 100644 *** a/src/backend/utils/time/snapmgr.c --- b/src/backend/utils/time/snapmgr.c *** GetTransactionSnapshot(void) *** 316,321 --- 316,327 /* First call in transaction? */ if (!FirstSnapshotSet) { + /* + * Don't allow catalog snapshot to be older than xact snapshot. Must + * do this first to allow the empty-heap Assert to succeed. 
+ */ + InvalidateCatalogSnapshot(); + Assert(pairingheap_is_empty()); Assert(FirstXactSnapshot == NULL); *** GetTransactionSnapshot(void) *** 347,355 else CurrentSnapshot = GetSnapshotData(); - /* Don't allow catalog snapshot to be older than xact snapshot. */ - CatalogSnapshotStale = true; - FirstSnapshotSet = true; return CurrentSnapshot; } --- 353,358 *** GetTransactionSnapshot(void) *** 358,364 return CurrentSnapshot; /* Don't allow catalog snapshot to be older than xact snapshot. */ ! CatalogSnapshotStale = true; CurrentSnapshot = GetSnapshotData(); --- 361,367 return CurrentSnapshot; /* Don't allow catalog snapshot to be older than
Re: [HACKERS] Physical append-only tables
On Sun, Nov 13, 2016 at 3:45 PM, Magnus Hagander wrote:
> For a scenario like this, would it make sense to have an option that
> could be set on an individual table, making it physically append-only?
> Basically VACUUM would run as normal and clean up the old space when
> rows are deleted back in history, but when new space is needed for a
> row, the system would never look at the old blocks, and only append to
> the end.

I don't think "appending" is the right way to think about this. It
happens to address the problem, but only accidentally and only
partially. More generally, what you have is two different kinds of data
with two different access patterns and storage requirements in the same
table. They're logically similar but have different practical
requirements. If there were some way to teach the database that your
table is made of two different types of data, and how to distinguish the
two types, then when the update occurs it could move the row to the
right section of storage... This might be something the new partitioning
could handle, or it might need something more low-level and implicit.

That said, I don't think the "maintain clustering a bit better using
BRIN" idea is a bad one. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

--
greg
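To make the "two kinds of data in one table" idea concrete, here is a hypothetical sketch using declarative partitioning. Note this relies on features added after this thread (declarative partitioning in PostgreSQL 10; UPDATE row movement across partitions in 11), and all names are invented:

```sql
-- Hot rows and cold rows live in separate partitions, so the cold
-- partition stays densely packed and effectively append-only.
CREATE TABLE events (
    id       bigint NOT NULL,
    archived boolean NOT NULL DEFAULT false,
    payload  text
) PARTITION BY LIST (archived);

CREATE TABLE events_active   PARTITION OF events FOR VALUES IN (false);
CREATE TABLE events_archived PARTITION OF events FOR VALUES IN (true);

-- The "update-once" transition relocates the row into cold storage:
UPDATE events SET archived = true WHERE id = 42;
```

The hot partition stays small enough to keep in cache, while the cold one accumulates in roughly insertion order, which is also the layout that makes a BRIN index on it effective.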
Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.
Andres Freund wrote:
> On 2016-11-12 11:30:42 -0500, Tom Lane wrote:
>> Andres Freund writes:
>>> Committed after simplifying the Makefile.
>>
>> While I have no particular objection to adding these tests, the
>> commit log's claim that this will improve buildfarm testing is
>> quite wrong. The buildfarm runs "make installcheck", not
>> "make check", in contrib.
>
> Gah. It was more aimed at the coverage stuff, but that might work the
> same. Alvaro?

Already replied on IM, but for the record, coverage.postgresql.org
currently runs this:

make -j4 >> $LOG 2>&1
make check-world >> $LOG 2>&1
make -C src/test/ssl check >> $LOG 2>&1

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.
On 2016-11-14 12:14:10 -0800, Andres Freund wrote: > On 2016-11-12 11:42:12 -0500, Tom Lane wrote: > > Andres Freund writes: > > > On 2016-11-12 11:30:42 -0500, Tom Lane wrote: > > >> which is a rather blatant waste of cycles. I would suggest an explicit > > >> do-nothing installcheck rule rather than the hack you came up with here. > > > > > I had that at first, but that generates a warning about overwriting the > > > makefile target - which afaics cannot be fixed. > > > > Hm. What about inventing an additional macro NO_INSTALLCHECK that > > prevents pgxs.mk from generating an installcheck rule? It's not > > like we don't have similar issues elsewhere, eg contrib/sepgsql. > > Well, that one seems a bit different. Seems easy enough to do. Do we > want to make that macro that prevents installcheck from being defined, > or one that forces it to be empty? The former has the disadvantage that > one has to be careful to define a target, to avoid breaking recursion > from the upper levels. Oh, that disadvantage doesn't actually exist, because installcheck is a .PHONY target... I've for now not added a variant that prevents plain 'make check' from doing anything. I guess it could be useful when a contrib module wants to run tests via something else than pg_regress? Patch attached. Andres >From 906051a23831c5d337556e56d19962f2e857 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 14 Nov 2016 12:29:30 -0800 Subject: [PATCH] Provide NO_INSTALLCHECK for pgxs. This allows us to avoid running the regression tests in contrib modules like pg_stat_statement in a less ugly manner.
Discussion: <22432.1478968...@sss.pgh.pa.us> --- contrib/pg_stat_statements/Makefile | 7 +++ doc/src/sgml/extend.sgml| 9 + src/makefiles/pgxs.mk | 4 3 files changed, 16 insertions(+), 4 deletions(-) diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index f1a45eb..298951a 100644 --- a/contrib/pg_stat_statements/Makefile +++ b/contrib/pg_stat_statements/Makefile @@ -13,6 +13,9 @@ LDFLAGS_SL += $(filter -lm, $(LIBS)) REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf REGRESS = pg_stat_statements +# Disabled because these tests require "shared_preload_libraries=pg_stat_statements", +# which typical installcheck users do not have (e.g. buildfarm clients). +NO_INSTALLCHECK = 1 ifdef USE_PGXS PG_CONFIG = pg_config @@ -24,7 +27,3 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif - -# Disabled because these tests require "shared_preload_libraries=pg_stat_statements", -# which typical installcheck users do not have (e.g. buildfarm clients). -installcheck: REGRESS= diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml index e19c657..f9d91a3 100644 --- a/doc/src/sgml/extend.sgml +++ b/doc/src/sgml/extend.sgml @@ -1194,6 +1194,15 @@ include $(PGXS) + NO_INSTALLCHECK + + +don't define an installcheck target, useful e.g. if tests require special configuration, or don't use pg_regress + + + + + EXTRA_CLEAN diff --git a/src/makefiles/pgxs.mk b/src/makefiles/pgxs.mk index 2b4d684..c27004e 100644 --- a/src/makefiles/pgxs.mk +++ b/src/makefiles/pgxs.mk @@ -40,6 +40,8 @@ # which need to be built first # REGRESS -- list of regression test cases (without suffix) # REGRESS_OPTS -- additional switches to pass to pg_regress +# NO_INSTALLCHECK -- don't define an installcheck target, useful e.g. 
if +# tests require special configuration, or don't use pg_regress # EXTRA_CLEAN -- extra files to remove in 'make clean' # PG_CPPFLAGS -- will be added to CPPFLAGS # PG_LIBS -- will be added to PROGRAM link line @@ -268,8 +270,10 @@ ifndef PGXS endif # against installed postmaster +ifndef NO_INSTALLCHECK installcheck: submake $(REGRESS_PREP) $(pg_regress_installcheck) $(REGRESS_OPTS) $(REGRESS) +endif ifdef PGXS check: -- 2.10.2
Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.
On 2016-11-12 11:42:12 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2016-11-12 11:30:42 -0500, Tom Lane wrote: > >> which is a rather blatant waste of cycles. I would suggest an explicit > >> do-nothing installcheck rule rather than the hack you came up with here. > > > I had that at first, but that generates a warning about overwriting the > > makefile target - which afaics cannot be fixed. > > Hm. What about inventing an additional macro NO_INSTALLCHECK that > prevents pgxs.mk from generating an installcheck rule? It's not > like we don't have similar issues elsewhere, eg contrib/sepgsql. Well, that one seems a bit different. Seems easy enough to do. Do we want to make that macro that prevents installcheck from being defined, or one that forces it to be empty? The former has the disadvantage that one has to be careful to define a target, to avoid breaking recursion from the upper levels. Greetings, Andres Freund
Re: [HACKERS] Do we need use more meaningful variables to replace 0 in catalog head files?
On 11/13/2016 11:11 AM, Tom Lane wrote: Andrew Dunstan writes: I'm not convinced the line prefix part is necessary, though. What I'm thinking of is something like this: PROCDATA( oid=1242 name=boolin isstrict=t volatile=i parallel=s nargs=1 rettype=bool argtypes="cstring" src=boolin ); We could go in that direction too, but the apparent flexibility to split entries into multiple lines is an illusion, at least if you try to go beyond a few lines; you'd end up with duplicated line sequences in different entries and thus ambiguity for patch(1). I don't have any big objection to the above, but it's not obviously better either. Yeah, I looked and there are too many cases where the name would be outside the normal 3 lines of context. Some things we should try to resolve before settling definitively on a data representation: 1. Are we going to try to keep these things in the .h files, or split them out? I'd like to get them out, as that eliminates both the need to keep the things looking like macro calls, and the need for the data within the macro call to be at least minimally parsable as C. That would work fine for pg_proc.h, less so for pg_type.h where we have a whole lot of #define FOOOID nn directives in among the data lines. Moving these somewhere remote from the catalog lines they relate to seems like quite a bad idea. 2. Andrew's example above implies some sort of mapping between the keywords and the actual column names (or at least column positions). Where and how is that specified? There are several possibilities. The one I was leaning towards was to parse out the Anum_pg_foo_* definitions. 3. Also where are we going to provide the per-column default values? How does the converter script know which columns to convert to type oids, proc oids, etc? Is it going to do any data validation beyond that, and if so on what basis?
a) something like DATA_DEFAULTS( foo=bar ); b) something like DATA_TYPECONV ( rettype argtypes allargtypes ); Hadn't thought about procoids, but something similar. 4. What will we do about the #define's that some of the .h files provide for (some of) their object OIDs? I assume that we want to move in the direction of autogenerating those macros a la fmgroids.h, but this needs a concrete spec as well. If we don't want this change to result in a big hit to the source code, we're probably going to need to be able to specify the macro names to generate in the data files. Yeah, as I noted above it's a bit messy, 5. One of the requirements that was mentioned in previous discussions was to make it easier to add new columns to catalogs. This format does that only to the extent that you don't have to touch entries that can use the default value for such a column. Is that good enough, and if not, what might we be able to do to make it better? I think it is good enough, at least for a first cut. I'd actually like to roll up the DESCR lines in pg_proc.h into this too, they strike me as a bit of a wart. But I'm flexible on that. +1, if we can come up with a better syntax. This together with the OID-macro issue suggests that there will be items in each data entry that correspond to something other than columns of the target catalog. But that seems fine. If we can generalize this to other catalogs, then that will be good, but my inclination is to handle the elephant in the room (pg_proc.h) and worry about the gnats later. I think we want to do them all. pg_proc.h is actually one of the easier catalogs to work on presently, IMO, because the only kind of cross-references it has are type OIDs. Things like pg_amop are a mess. And I really don't want to be dealing with multiple notations for catalog data. Also I think this will be subject to Polya's paradox: designing a general solution will be easier and cleaner than a hack that works only for one catalog. 
I don't know that we need to handle everything at once, as long as the solution is sufficiently general. cheers andrew
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On Mon, Nov 14, 2016 at 10:17 AM, Andres Freund wrote: > I think so, yes. IIRC I discussed it with Noah and Peter G. at a > conference recently. We'd basically mark the content of shared buffers > inaccessible at backend startup, and mark it accessible whenever a > PinBuffer() happens, and then inaccessible during unpinning. We probably > have to exclude the page header though, as we intentionally access them > unpinned in some cases IIRC. BTW, I recently noticed that the latest version of Valgrind, 3.12, added this new feature: * Memcheck: - Added meta mempool support for describing a custom allocator which: - Auto-frees all chunks assuming that destroying a pool destroys all objects in the pool - Uses itself to allocate other memory blocks It occurred to me that we might be able to make good use of this. To be clear, I don't think that there is reason to tie it to adding the PinBuffer() stuff, which we've been talking about for years now. It just caught my eye. -- Peter Geoghegan
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On 2016-11-14 13:12:28 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2016-11-14 12:32:53 -0500, Tom Lane wrote: > >> Basically my concern is that this restriction isn't documented anywhere > >> and I'm not entirely certain it's been adhered to everywhere. I'd feel > >> much better about it if there were some way we could verify that. > > > Would support for valgrind complaining about access to unpinned buffers > > suffice? > > I don't think it directly addresses the issue, but certainly it'd help. Well, it detects situations where removed pins cause "unprotected access", but of course that doesn't protect against cases where independent pins hide that issue. > Do you think that's easily doable? I think so, yes. IIRC I discussed it with Noah and Peter G. at a conference recently. We'd basically mark the content of shared buffers inaccessible at backend startup, and mark it accessible whenever a PinBuffer() happens, and then inaccessible during unpinning. We probably have to exclude the page header though, as we intentionally access them unpinned in some cases IIRC. - Andres
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
Andres Freund writes: > On 2016-11-14 12:32:53 -0500, Tom Lane wrote: >> Basically my concern is that this restriction isn't documented anywhere >> and I'm not entirely certain it's been adhered to everywhere. I'd feel >> much better about it if there were some way we could verify that. > Would support for valgrind complaining about access to unpinned buffers > suffice? I don't think it directly addresses the issue, but certainly it'd help. Do you think that's easily doable? regards, tom lane
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On 2016-11-14 12:32:53 -0500, Tom Lane wrote: > Heikki Linnakangas writes: > > On 11/14/2016 06:18 PM, Tom Lane wrote: > >> You're implicitly assuming that a scan always returns its results in the > >> same slot, and that no other slot could contain a copy of that data, but > >> there is no guarantee of either. See bug #14344 and d8589946d for a > >> pretty recent example where that failed to be true --- admittedly, for > >> a tuplesort scan not a table scan. > > > It's the other way round. ExecProcNode might not always return its > > result in the same slot, but all the callers must assume that it might. > > Basically my concern is that this restriction isn't documented anywhere > and I'm not entirely certain it's been adhered to everywhere. I'd feel > much better about it if there were some way we could verify that. Would support for valgrind complaining about access to unpinned buffers suffice? Andres
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On Mon, Nov 14, 2016 at 9:22 AM, Heikki Linnakangas wrote: > I think that difference in the API is exactly what caught Peter by surprise > and led to bug #14344. And I didn't see it either, until you two debugged > it. That is accurate, of course. -- Peter Geoghegan
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
Heikki Linnakangas writes: > On 11/14/2016 06:18 PM, Tom Lane wrote: >> You're implicitly assuming that a scan always returns its results in the >> same slot, and that no other slot could contain a copy of that data, but >> there is no guarantee of either. See bug #14344 and d8589946d for a >> pretty recent example where that failed to be true --- admittedly, for >> a tuplesort scan not a table scan. > It's the other way round. ExecProcNode might not always return its > result in the same slot, but all the callers must assume that it might. Basically my concern is that this restriction isn't documented anywhere and I'm not entirely certain it's been adhered to everywhere. I'd feel much better about it if there were some way we could verify that. regards, tom lane
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On 11/14/2016 06:18 PM, Tom Lane wrote: Robert Haas writes: On Mon, Nov 14, 2016 at 10:23 AM, Tom Lane wrote: I don't believe Andres' claim anyway. There are certainly cases where an allegedly-valid slot could be pointing at garbage, but table scans aren't one of them, precisely because of the pin held by the slot. It would take a fairly wide-ranging code review to convince me that it's okay to lose that mechanism. I don't understand your objection. It seems to me that the TupleTableSlot is holding a pin, and the scan is also holding a pin, so one of them is redundant. You speculated that the slot could continue to point at the tuple after the scan has moved on, but how could such a thing actually happen? You're implicitly assuming that a scan always returns its results in the same slot, and that no other slot could contain a copy of that data, but there is no guarantee of either. See bug #14344 and d8589946d for a pretty recent example where that failed to be true --- admittedly, for a tuplesort scan not a table scan. It's the other way round. ExecProcNode might not always return its result in the same slot, but all the callers must assume that it might. The tuplesort interface is slightly different: the caller passes a slot to tuplesort_gettupleslot() as argument, and tuplesort_gettupleslot() places the next tuple in that slot. With executor nodes, a node returns a slot that it allocated itself, or it got from a child node. After you call ExecProcNode(), you can *not* assume that the old tuple in the slot is still valid. The child node can, and in most cases will, reuse the same slot, overwriting its contents. I think that difference in the API is exactly what caught Peter by surprise and led to bug #14344. And I didn't see it either, until you two debugged it. Also, there might well be places that are relying on the ability of a slot to hold a pin for slots that are not simply the return slot of a plan node.
We could perhaps remove the *use* of this slot feature in the normal table-scan case, without removing the feature itself. I didn't see any such use. - Heikki
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On Mon, Nov 14, 2016 at 11:18 AM, Tom Lane wrote: > Robert Haas writes: >> On Mon, Nov 14, 2016 at 10:23 AM, Tom Lane wrote: >>> I don't believe Andres' claim anyway. There are certainly cases where an >>> allegedly-valid slot could be pointing at garbage, but table scans aren't >>> one of them, precisely because of the pin held by the slot. It would take >>> a fairly wide-ranging code review to convince me that it's okay to lose >>> that mechanism. > >> I don't understand your objection. It seems to me that the >> TupleTableSlot is holding a pin, and the scan is also holding a pin, >> so one of them is redundant. You speculated that the slot could >> continue to point at the tuple after the scan has moved on, but how >> could such a thing actually happen? > > You're implicitly assuming that a scan always returns its results in the > same slot, and that no other slot could contain a copy of that data, but > there is no guarantee of either. See bug #14344 and d8589946d for a > pretty recent example where that failed to be true --- admittedly, for > a tuplesort scan not a table scan. > > We could certainly set up some global convention that would make this > universally true, but I think we'd have to do a lot of code-reading > to ensure that everything is following that convention. > > Also, there might well be places that are relying on the ability of a > slot to hold a pin for slots that are not simply the return slot of > a plan node. We could perhaps remove the *use* of this slot feature in > the normal table-scan case, without removing the feature itself. I'm still not getting it. We don't have to guess who is using this feature; we can find it out by looking at every place that calls ExecStoreTuple and looking at whether they are passing InvalidBuffer. As Heikki's original patch upthread demonstrates, most of them are. The exceptions are in BitmapHeapNext, IndexNext, ExecOnConflictUpdate, SampleNext, SeqNext, and TidNext.
If all of those places are or can be made to hold a buffer pin for as long as the slot would have held the same pin, we're done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
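For reference, the distinction at issue looks roughly like this at the call sites. This is an abbreviated, pseudocode-style sketch of the ~9.4–9.6-era ExecStoreTuple convention, not a complete excerpt from any of the nodes named above:

```c
/*
 * Scan node returning a tuple that still lives in a shared buffer:
 * passing the buffer makes the slot take its own pin on it, which is
 * the (possibly redundant) pin under discussion.
 */
ExecStoreTuple(tuple, slot, scandesc->rs_cbuf, false);

/*
 * Caller with a palloc'd (or otherwise buffer-independent) tuple:
 * no buffer is passed, so the slot holds no pin.
 */
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
```

The code-audit Robert describes amounts to checking, for each call of the first form, whether the underlying scan's own pin already covers the slot's lifetime.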
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
Robert Haas writes: > On Mon, Nov 14, 2016 at 10:23 AM, Tom Lane wrote: >> I don't believe Andres' claim anyway. There are certainly cases where an >> allegedly-valid slot could be pointing at garbage, but table scans aren't >> one of them, precisely because of the pin held by the slot. It would take >> a fairly wide-ranging code review to convince me that it's okay to lose >> that mechanism. > I don't understand your objection. It seems to me that the > TupleTableSlot is holding a pin, and the scan is also holding a pin, > so one of them is redundant. You speculated that the slot could > continue to point at the tuple after the scan has moved on, but how > could such a thing actually happen? You're implicitly assuming that a scan always returns its results in the same slot, and that no other slot could contain a copy of that data, but there is no guarantee of either. See bug #14344 and d8589946d for a pretty recent example where that failed to be true --- admittedly, for a tuplesort scan not a table scan. We could certainly set up some global convention that would make this universally true, but I think we'd have to do a lot of code-reading to ensure that everything is following that convention. Also, there might well be places that are relying on the ability of a slot to hold a pin for slots that are not simply the return slot of a plan node. We could perhaps remove the *use* of this slot feature in the normal table-scan case, without removing the feature itself. regards, tom lane
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Mon, Nov 14, 2016 at 9:30 PM, Robert Haas wrote: > Couldn't we just change the current memory context before calling > heap_getnext()? And then change back? Right, seems like it will not have any problem. > > Also, what if we abandoned the idea of pushing qual evaluation all the > way down into the heap and just tried to do HeapKeyTest in SeqNext > itself? Would that be almost as fast, or would it give up most of the > benefits? This we can definitely test. I will test and post the data. Thanks for the suggestion. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Sun, Nov 13, 2016 at 12:16 AM, Dilip Kumar wrote: > Problem1: As Andres has mentioned, HeapKeyTest uses heap_getattr, > whereas ExecQual use slot_getattr(). So we can have worst case > performance problem when very less tuple are getting filter out and we > have table with many columns with qual on most of the columns. > > Problem2. In HeapKeyTest we are under per_query_ctx, whereas in > ExecQual we are under per_tuple_ctx , so in former we can not afford > to have any palloc. Couldn't we just change the current memory context before calling heap_getnext()? And then change back? Also, what if we abandoned the idea of pushing qual evaluation all the way down into the heap and just tried to do HeapKeyTest in SeqNext itself? Would that be almost as fast, or would it give up most of the benefits? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
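Robert's suggestion corresponds to the backend's usual short-lived-allocation pattern. In pseudocode-sketch form (not the actual patch; the variable names are illustrative), the scan node would do something like:

```c
/*
 * Run the pushed-down key tests inside the per-tuple context, which
 * the executor resets for every row, so any palloc() that happens
 * inside heap_getnext() is cleaned up cheaply instead of accumulating
 * in the long-lived per-query context.
 */
MemoryContext oldcxt;

oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
tuple = heap_getnext(scandesc, direction);
MemoryContextSwitchTo(oldcxt);
```

This addresses Dilip's Problem2 (allocations in the wrong context) without changing where the qual is evaluated.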
Re: [HACKERS] Proposal for changes to recovery.conf API
On Sat, Nov 12, 2016 at 11:09 AM, Andres Freund wrote: > I'm very tempted to rename this during the move to GUCs ... > Slightly less so, but still tempted to also rename these. They're not > actually necessarily pointing towards a primary, and namespace-wise > they're not grouped with recovery_*, which has become more important now > that recovery.conf isn't a separate namespace anymore. -1 for renaming these. I don't think the current names are particularly bad, and I think trying to agree on what would be better could easily sink the whole patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal for changes to recovery.conf API
On Fri, Nov 4, 2016 at 10:17 AM, Abhijit Menon-Sen wrote: > At 2016-11-04 10:35:05 +, si...@2ndquadrant.com wrote: >> Since the "lots of parameters" approach is almost exactly what we have >> now, I think we should do that first, get this patch committed and >> then sit back and discuss an improved API and see what downsides it >> introduces. > > Thanks, I agree strongly with this. > > Someone (Michael?) earlier mentioned the potential for introducing bugs > with this patch (the idea of merging recovery.conf into postgresql.conf > at all, not this particular incarnation). > > I think the current proposed approach with > > recovery_target=xid > recovery_target_xid=xxx > > is preferable because (a) it doesn't introduce much new code, e.g., new > parsing functions, nor (b) need many changes in the documentation—all we > need is to say that of the various recovery_target_xxx parameters, the > one that has priority is the one named by recovery_target. > > If I were to introduce recovery_target='xid xxx', I would at a minimum > need to factor out (or duplicate) parsing and error handling code, make > a type/value union-in-struct to store in the GUC *extra, then make sure > that we handle the older values in a backwards-compatible way, and move > a bunch of documentation around. Just to be clear, the sum total of the additional "parsing" we are talking about is looking for the first sequence of 1 or more spaces in the input string and separating the stuff before them from the stuff after them. I agree that there's some work there, but I'm surprised to hear that you think it's a lot of work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On Mon, Nov 14, 2016 at 10:23 AM, Tom Lane wrote: > Robert Haas writes: >> On Sat, Nov 12, 2016 at 10:28 AM, Andres Freund wrote: >>> On 2016-08-30 07:38:10 -0400, Tom Lane wrote: I think this is probably wrong, or at least very dangerous to remove. The reason for the feature is that the slot may continue to point at the tuple after the scan has moved on. > >>> FWIW, that's not safe to assume in upper layers *anyway*. If you want to >>> do that, the slot has to be materialized, and that'd make a local >>> copy. If you don't materialize tts_values/isnull can point into random >>> old memory (common e.g. for projections and virtual tuples in general). > >> So, I think you are arguing in favor of proceeding with this patch? > > I don't believe Andres' claim anyway. There are certainly cases where an > allegedly-valid slot could be pointing at garbage, but table scans aren't > one of them, precisely because of the pin held by the slot. It would take > a fairly wide-ranging code review to convince me that it's okay to lose > that mechanism. I don't understand your objection. It seems to me that the TupleTableSlot is holding a pin, and the scan is also holding a pin, so one of them is redundant. You speculated that the slot could continue to point at the tuple after the scan has moved on, but how could such a thing actually happen? Once the scan finds a tuple, it's going to store the tuple in the slot and return. It won't get control back to advance the scan until the next time it's asked for a tuple, and then it has to update the slot before returning. So I don't see the problem. If in the future somebody wants to write an executor node that does random extra work - like advancing the scan - before returning the tuple the scan already found, they'll have to take special precautions, but why should anybody want to do that? I'm kind of puzzled, here. Perhaps I am missing something obvious.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On 2016-11-14 10:09:02 -0500, Robert Haas wrote: > On Sat, Nov 12, 2016 at 10:28 AM, Andres Freund wrote: > > On 2016-08-30 07:38:10 -0400, Tom Lane wrote: > >> Heikki Linnakangas writes: > >> > While profiling some queries and looking at executor overhead, I > >> > realized that we're not making much use of TupleTableSlot's ability to > >> > hold a buffer pin. In a SeqScan, the buffer is held pinned by the > >> > underlying heap-scan anyway. Same with an IndexScan, and the SampleScan. > >> > >> I think this is probably wrong, or at least very dangerous to remove. > >> The reason for the feature is that the slot may continue to point at > >> the tuple after the scan has moved on. > > > > FWIW, that's not safe to assume in upper layers *anyway*. If you want to > > do that, the slot has to be materialized, and that'd make a local > > copy. If you don't materialize tts_values/isnull can point into random > > old memory (common e.g. for projections and virtual tuples in general). > > So, I think you are arguing in favor of proceeding with this patch? Not really, no. I don't buy the argument here against it. I do think the overhead is quite noticeable. But I also think it has quite the potential for subtle bugs. I think I'd feel better if we had some form of instrumentation trapping buffer accesses without pins present. We've previously discussed doing that with valgrind... Andres
Re: [HACKERS] Something is broken about connection startup
Robert Haas writes: > I assume you're going to back-patch this, and the consequences of > failing to reset it before going idle could easily be vastly worse > than the problem you're trying to fix. So I'd rather not make > assumptions like "the client will probably never sleep between Bind > and Execute". I bet that's actually false, and I certainly wouldn't > want to bet the farm on it being true. No, I'm not at all proposing to assume that. But I may be willing to assume that "we don't hold a CatalogSnapshot between Bind and Execute unless we're also holding a transaction snapshot". I need to do a bit more research to see if that's actually true, though. regards, tom lane
Re: [HACKERS] [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly
On Thu, Nov 10, 2016 at 12:13 AM, Tsunakawa, Takayuki wrote: > From: pgsql-hackers-ow...@postgresql.org >> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas >> OK. I agree that's a problem. However, your patch adds zero new comment >> text while removing some existing comments, so I can't easily tell how it >> solves that problem or whether it does so correctly. Even if I were smart >> enough to figure it out, I wouldn't want to rely on the next person also >> being that smart. This is obviously a subtle problem in tricky code, so >> a clear explanation of the fix seems like a very good idea. > > The comment describes what the code is trying to achieve. Actually, I just > imitated the code and comment of later major releases. The only difference > between later releases and my patch (for 9.2) is whether the state is stored > in XLogReaderStruct or as global variables. Below is the comment from 9.6, > where the second paragraph describes what the two nested if conditions mean. > The removed comment lines are what became irrelevant, which is also not > present in later major releases. Let me try to be more clear. I will not commit this patch if it is not properly commented. I doubt that anyone else will, either. The fact that those code changes already exist in 9.4+ is not a reason to back-port them to earlier releases without a proper explanation of why we are doing it. Very possibly, we should also improve the comments in newer branches so that future authors don't reintroduce whatever bugs were fixed by these changes. But whether we do that or not, I am not going to commit uncommented patches to complex code in order to fix obscure bugs in 3+-year-old branches. I think that is a non-starter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Something is broken about connection startup
On Mon, Nov 14, 2016 at 10:17 AM, Tom Lane wrote: >> I think the really important thing is that we don't leave xmin set >> when the backend is idle. > > Agreed. I did some stats-gathering on this over the weekend, seeing how > often the various situations occur during the core regression tests. > It's not as big a problem as I first thought, because we hold a snapshot > while doing planning, but the case does arise for various utility commands > (particularly LOCK TABLE) so it's not negligible. > > What is looking like the best answer to me is to treat CatalogSnapshot > as registered, but forcibly unregister it before going idle. We don't > need to unregister it more often than that, because we'd hold a > transaction snapshot of some description throughout any long-running > command, and the CatalogSnapshot would be no older than that anyway. Makes sense. > "Before going idle" could be implemented with an InvalidateCatalogSnapshot > call either in postgres.c's finish_xact_command(), or right in the main > message-receipt loop. The latter seems like a wart, but it would cover > cases like a client going to sleep between Bind and Execute phases of an > extended query. On the other hand, I think we might be holding a > transaction snapshot at that point anyway. It's at least arguable that > we should be going through finish_xact_command() at any point where it's > likely that we have a snapshot to release. I assume you're going to back-patch this, and the consequences of failing to reset it before going idle could easily be vastly worse than the problem you're trying to fix. So I'd rather not make assumptions like "the client will probably never sleep between Bind and Execute". I bet that's actually false, and I certainly wouldn't want to bet the farm on it being true. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
Robert Haas writes: > On Sat, Nov 12, 2016 at 10:28 AM, Andres Freund wrote: >> On 2016-08-30 07:38:10 -0400, Tom Lane wrote: >>> I think this is probably wrong, or at least very dangerous to remove. >>> The reason for the feature is that the slot may continue to point at >>> the tuple after the scan has moved on. >> FWIW, that's not safe to assume in upper layers *anyway*. If you want to >> do that, the slot has to be materialized, and that'd make a local >> copy. If you don't materialize tts_values/isnull can point into random >> old memory (common e.g. for projections and virtual tuples in general). > So, I think you are arguing in favor of proceeding with this patch? I don't believe Andres' claim anyway. There are certainly cases where an allegedly-valid slot could be pointing at garbage, but table scans aren't one of them, precisely because of the pin held by the slot. It would take a fairly wide-ranging code review to convince me that it's okay to lose that mechanism. regards, tom lane
Re: [HACKERS] Something is broken about connection startup
Robert Haas writes: > On Fri, Nov 11, 2016 at 3:15 PM, Tom Lane wrote: >> The basic problem here, therefore, is that SnapshotResetXmin isn't aware >> that GetCatalogSnapshot is keeping a possibly-unregistered snapshot in >> its hip pocket. That has to change. We could either treat the saved >> CatalogSnapshot as always being registered, or we could add some logic >> to force invalidating the CatalogSnapshot whenever we clear MyPgXact->xmin >> or advance it past that snapshot's xmin. >> >> Neither of those is very nice from a performance standpoint. With the >> first option we risk delaying global cleanup by holding back >> MyPgXact->xmin to protect a CatalogSnapshot we might not actually use >> anytime soon. With the second option we will likely end up doing many >> more GetSnapshotData calls than we do now, because in a transaction >> that hasn't yet set a regular snapshot, we will need to get a new >> CatalogSnapshot for every catalog access (since we'll go back to >> MyPgXact->xmin = 0 after every access). And the parser, in particular, >> tends to do a lot of catalog accesses before we ever get a regular >> transaction snapshot. >> >> Ideally, perhaps, we'd treat the saved CatalogSnapshot as registered >> but automatically kill it "every so often". I'm not sure how to drive >> that though. > I think the really important thing is that we don't leave xmin set > when the backend is idle. Agreed. I did some stats-gathering on this over the weekend, seeing how often the various situations occur during the core regression tests. It's not as big a problem as I first thought, because we hold a snapshot while doing planning, but the case does arise for various utility commands (particularly LOCK TABLE) so it's not negligible. What is looking like the best answer to me is to treat CatalogSnapshot as registered, but forcibly unregister it before going idle. 
We don't need to unregister it more often than that, because we'd hold a transaction snapshot of some description throughout any long-running command, and the CatalogSnapshot would be no older than that anyway. "Before going idle" could be implemented with an InvalidateCatalogSnapshot call either in postgres.c's finish_xact_command(), or right in the main message-receipt loop. The latter seems like a wart, but it would cover cases like a client going to sleep between Bind and Execute phases of an extended query. On the other hand, I think we might be holding a transaction snapshot at that point anyway. It's at least arguable that we should be going through finish_xact_command() at any point where it's likely that we have a snapshot to release. Thoughts? regards, tom lane
Re: [HACKERS] Pinning a buffer in TupleTableSlot is unnecessary
On Sat, Nov 12, 2016 at 10:28 AM, Andres Freund wrote: > On 2016-08-30 07:38:10 -0400, Tom Lane wrote: >> Heikki Linnakangas writes: >> > While profiling some queries and looking at executor overhead, I >> > realized that we're not making much use of TupleTableSlot's ability to >> > hold a buffer pin. In a SeqScan, the buffer is held pinned by the >> > underlying heap-scan anyway. Same with an IndexScan, and the SampleScan. >> >> I think this is probably wrong, or at least very dangerous to remove. >> The reason for the feature is that the slot may continue to point at >> the tuple after the scan has moved on. > > FWIW, that's not safe to assume in upper layers *anyway*. If you want to > do that, the slot has to be materialized, and that'd make a local > copy. If you don't materialize tts_values/isnull can point into random > old memory (common e.g. for projections and virtual tuples in general). So, I think you are arguing in favor of proceeding with this patch? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Something is broken about connection startup
On Fri, Nov 11, 2016 at 3:15 PM, Tom Lane wrote: > So this has pretty much been broken since we put in MVCC snapshots for > catalog searches. The problem would be masked when inside a transaction > that has already got a transaction snapshot, but whenever we don't have > one already, our catalog lookups are basically unprotected against > premature row removal. I don't see any reason to think that this is > specific to connection startup. Yeah. I was focused on the case where the backend is already in a transaction, and obviously didn't think hard enough about the scenario where that isn't the case. > The basic problem here, therefore, is that SnapshotResetXmin isn't aware > that GetCatalogSnapshot is keeping a possibly-unregistered snapshot in > its hip pocket. That has to change. We could either treat the saved > CatalogSnapshot as always being registered, or we could add some logic > to force invalidating the CatalogSnapshot whenever we clear MyPgXact->xmin > or advance it past that snapshot's xmin. > > Neither of those is very nice from a performance standpoint. With the > first option we risk delaying global cleanup by holding back > MyPgXact->xmin to protect a CatalogSnapshot we might not actually use > anytime soon. With the second option we will likely end up doing many > more GetSnapshotData calls than we do now, because in a transaction > that hasn't yet set a regular snapshot, we will need to get a new > CatalogSnapshot for every catalog access (since we'll go back to > MyPgXact->xmin = 0 after every access). And the parser, in particular, > tends to do a lot of catalog accesses before we ever get a regular > transaction snapshot. > > Ideally, perhaps, we'd treat the saved CatalogSnapshot as registered > but automatically kill it "every so often". I'm not sure how to drive > that though. I think the really important thing is that we don't leave xmin set when the backend is idle. 
If we set MyProc->xmin during startup or in some other out-of-transaction activity and never updated it until we went idle, I think that wouldn't hurt much because we're not going to spend tons of time in that state anyway. But if we start leaving xmin set for idle backends, our users will be howling in agony. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Logical Replication WIP
On 13/11/16 10:21, Andres Freund wrote: > On 2016-11-13 00:40:12 -0500, Peter Eisentraut wrote: >> On 11/12/16 2:18 PM, Andres Freund wrote: > I also wonder if we want an easier to >>> extend form of pubinsert/update/delete (say to add pubddl, pubtruncate, >>> pub ... without changing the schema). >>> > > So like, text array that's then parsed everywhere (I am not doing > bitmask/int definitely)? >>> Yes, that sounds good to me. Then convert it to individual booleans or a >>> bitmask when loading the publications into the in-memory form (which you >>> already do). >> >> I'm not sure why that would be better. Adding catalog columns in future >> versions is not a problem. > > It can be extended from what core provides, for extended versions of > replication solutions, for one. I presume publications/subscriptions > aren't only going to be used by built-in code. > I understand the desire here (especially as an author of such out of the core tools), but I am not sure if this is a good place where to start having pluggable catalogs given that we have no generic idea for those. Currently, plugins writing arbitrary data to catalogs will cause things to break when those plugins get uninstalled (and we don't have good mechanism for cleaning that up when that happens). And that won't change if we convert this into array. Besides, shouldn't the code then anyway check that we only have expected data in that array otherwise we might miss corruption? So if the main reason for turning this into array is extendability for other providers then I am -1 on the idea. IMHO this is for completely different path that adds user catalogs with proper syscache-like interface and everything but has nothing to do with publications. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Mon, Nov 14, 2016 at 9:57 AM, Tom Lane wrote: > Robert Haas writes: >> On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat >> wrote: >>> Costing PartitionJoinPath needs more thought so that we don't end up >>> with bad overall plans. Here's an idea. Partition-wise joins are >>> better compared to the unpartitioned ones, because of the smaller >>> sizes of partitions. If we think of join as O(MN) operation where M >>> and N are sizes of unpartitioned tables being joined, partition-wise >>> join computes P joins each with average O(M/P * N/P) order where P is >>> the number of partitions, which is still O(MN) with constant factor >>> reduced by P times. I think, we need to apply similar logic to >>> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N) >>> where S and R are setup cost and joining cost (for M, N rows) resp. >>> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P) >>> + P * R(M/P, N/P). Each of the join methods will have different S and >>> R functions and may not be linear on the number of rows. So, >>> PartitionJoinPath costs are obtained from corresponding regular path >>> costs subjected to above transformation. This way, we will be >>> protected from choosing a PartitionJoinPath when it's not optimal. > >> I'm not sure that I really understand the stuff with big-O notation >> and M, N, and P. But I think what you are saying is that we could >> cost a PartitionJoinPath by costing some of the partitions (it might >> be a good idea to choose the biggest ones) and assuming the cost for >> the remaining ones will be roughly proportional. That does seem like >> a reasonable strategy to me. > > I'm not sure to what extent the above argument depends on the assumption > that join is O(MN), but I will point out that in no case of practical > interest for large tables is it actually O(MN). That would be true > only for the stupidest possible nested-loop join method. 
It would be > wise to convince ourselves that the argument holds for more realistic > big-O costs, eg hash join is more like O(M+N) if all goes well. Yeah, I agree. To recap briefly, the problem we're trying to solve here is how to build a path for a partitionwise join without an explosion in the amount of memory the planner uses or the number of paths created. In the initial design, if there are N partitions per relation, the total number of paths generated by the planner increases by a factor of N+1, which gets ugly if, say, N = 1000, or even N = 100. To rein that in, we want to do a rough cut at costing the partitionwise join that will be good enough to let us throw away obviously inferior paths, and then work out the exact paths we're going to use only for partitionwise joins that are actually selected. I think costing one or a few of the larger sub-joins and assuming those costs are representative is probably a reasonable approach to that problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
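Ashutosh's cost transformation and Tom's caveat are easy to check with a toy model. A Python sketch (the setup/run cost functions below are hypothetical stand-ins for illustration, not planner code):

```python
# Toy model of the costing idea from the thread: a partitionwise join of P
# partition pairs is costed as P * J(M/P, N/P), where J(m, n) = S(m, n) + R(m, n).
# The lambda cost functions passed in are hypothetical, not planner formulas.

def partitionwise_cost(setup, run, M, N, P):
    """Total cost of P sub-joins, each over M/P and N/P rows."""
    m, n = M / P, N / P
    return P * (setup(m, n) + run(m, n))

# Nested-loop-shaped joining cost, R(m, n) = m * n: the O(MN) constant
# factor really does shrink by P, as the original argument assumes.
nestloop = partitionwise_cost(lambda m, n: 0, lambda m, n: m * n,
                              1000, 1000, 10)   # P * (M/P * N/P) = M*N/P

# Hash-join-shaped joining cost, R(m, n) = m + n: the joining term is
# unchanged (P * (M/P + N/P) = M + N); only the per-partition setup
# term can make partitionwise better or worse.
hashjoin = partitionwise_cost(lambda m, n: 0, lambda m, n: m + n,
                              1000, 1000, 10)

print(nestloop, hashjoin)
```

With the nested-loop-shaped cost the total drops by a factor of P, while with the hash-join-shaped cost it does not, which is why the argument has to be re-verified per join method rather than assumed from the O(MN) case.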
Re: [HACKERS] [PATCH] Allow TAP tests to be run individually
On 11/14/16 3:52 AM, Michael Paquier wrote: > I don't mind. This patch uses the following pattern: > $(or $(PROVE_TESTS),t/*.pl) > While something more spread in Postgres source would be something like that: > $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) > It seems to me that we'd prefer that for consistency, but I see no > reason to not keep your patch as well. I am marking that as ready for > committer. $(or ...) is a newer feature of GNU make, so we have avoided that so far. I have committed your v2 with $(if ...). -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
Robert Haas writes: > On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat > wrote: >> Costing PartitionJoinPath needs more thought so that we don't end up >> with bad overall plans. Here's an idea. Partition-wise joins are >> better compared to the unpartitioned ones, because of the smaller >> sizes of partitions. If we think of join as O(MN) operation where M >> and N are sizes of unpartitioned tables being joined, partition-wise >> join computes P joins each with average O(M/P * N/P) order where P is >> the number of partitions, which is still O(MN) with constant factor >> reduced by P times. I think, we need to apply similar logic to >> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N) >> where S and R are setup cost and joining cost (for M, N rows) resp. >> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P) >> + P * R(M/P, N/P). Each of the join methods will have different S and >> R functions and may not be linear on the number of rows. So, >> PartitionJoinPath costs are obtained from corresponding regular path >> costs subjected to above transformation. This way, we will be >> protected from choosing a PartitionJoinPath when it's not optimal. > I'm not sure that I really understand the stuff with big-O notation > and M, N, and P. But I think what you are saying is that we could > cost a PartitionJoinPath by costing some of the partitions (it might > be a good idea to choose the biggest ones) and assuming the cost for > the remaining ones will be roughly proportional. That does seem like > a reasonable strategy to me. I'm not sure to what extent the above argument depends on the assumption that join is O(MN), but I will point out that in no case of practical interest for large tables is it actually O(MN). That would be true only for the stupidest possible nested-loop join method. It would be wise to convince ourselves that the argument holds for more realistic big-O costs, eg hash join is more like O(M+N) if all goes well. 
regards, tom lane
Re: [HACKERS] Minor improvement to delete.sgml
On Sun, Nov 13, 2016 at 10:55 PM, Etsuro Fujita wrote: > On 2016/10/19 2:51, Robert Haas wrote: >> >> On Fri, Oct 14, 2016 at 12:05 AM, Etsuro Fujita >> wrote: >>> >>> I think it's better to mention that an alias is needed for the target >>> table >>> specified in the USING clause of a DELETE statement, to set up a >>> self-join, >>> as the documentation on the from_list parameter of UPDATE does. Please >>> find >>> attached a patch. > >> The statement you are proposing to add to the documentation isn't true. > > Consider a counterexample of DELETE doing a self-join of a target table: > > postgres=# create table t1 (c1 int); > CREATE TABLE > postgres=# insert into t1 values (1); > INSERT 0 1 > postgres=# delete from t1 using t1 where t1.c1 = t1.c1; > ERROR: table name "t1" specified more than once > > Giving an alias to the target table t1 in the USING clause, > > postgres=# delete from t1 using t1 r1 where t1.c1 = r1.c1; > DELETE 1 > > Am I missing something? Well, you could also alias the target table, like this: delete from t1 q1 using t1 where q1.c1 = t1.c1; The point is that, while it's true that you can't have the same table alias twice at the same query level, you can fix that in more than one way. Your suggestion of adding an alias to the appearance in the using list is one approach, but not the only one. I don't think there's any real need for a documentation change here. The fact that repeating a table alias doesn't work is not unique to DELETE, nor is it unique to self-joins. The documentation here just needs to explain that repeating the table name will set up a self-join; it doesn't need to describe every SQL mistake that you could make while trying to do so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Physical append-only tables
On Mon, Nov 14, 2016 at 3:39 PM, Tom Lane wrote: > Magnus Hagander writes: > > What I'm talking about is something that would be a lot simpler than > > auto-clustering. I'm not even talking about trying to detect if the row > was > > about to go into the right place -- this might be expensive and certainly > > more complicated. I'm only talking about a simple case where we *never* > put > > anything anywhere other than at the end of the table, period. That should > > make the check both cheap and simple. > > It also makes it so much of a corner case that even a cheap check could be > a net performance degradation, especially for people whose usage pattern > doesn't match this. > > I agree that it definitely solves just one problem. But it seems to be a fairly common problem, particularly for users of BRIN. Full auto-clustering would cover many more use cases, but would also be a lot more expensive to maintain. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat wrote: > Costing PartitionJoinPath needs more thought so that we don't end up > with bad overall plans. Here's an idea. Partition-wise joins are > better compared to the unpartitioned ones, because of the smaller > sizes of partitions. If we think of join as O(MN) operation where M > and N are sizes of unpartitioned tables being joined, partition-wise > join computes P joins each with average O(M/P * N/P) order where P is > the number of partitions, which is still O(MN) with constant factor > reduced by P times. I think, we need to apply similar logic to > costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N) > where S and R are setup cost and joining cost (for M, N rows) resp. > Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P) > + P * R(M/P, N/P). Each of the join methods will have different S and > R functions and may not be linear on the number of rows. So, > PartitionJoinPath costs are obtained from corresponding regular path > costs subjected to above transformation. This way, we will be > protected from choosing a PartitionJoinPath when it's not optimal. I'm not sure that I really understand the stuff with big-O notation and M, N, and P. But I think what you are saying is that we could cost a PartitionJoinPath by costing some of the partitions (it might be a good idea to choose the biggest ones) and assuming the cost for the remaining ones will be roughly proportional. That does seem like a reasonable strategy to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Physical append-only tables
Magnus Hagander writes: > What I'm talking about is something that would be a lot simpler than > auto-clustering. I'm not even talking about trying to detect if the row was > about to go into the right place -- this might be expensive and certainly > more complicated. I'm only talking about a simple case where we *never* put > anything anywhere other than at the end of the table, period. That should > make the check both cheap and simple. It also makes it so much of a corner case that even a cheap check could be a net performance degradation, especially for people whose usage pattern doesn't match this. regards, tom lane
Re: [HACKERS] Physical append-only tables
On Mon, Nov 14, 2016 at 2:35 AM, Alvaro Herrera wrote: > Magnus Hagander wrote: > > > But then consider the same table. Except rows are typically updated once > or > > twice when they are new, and *then* go read only. And we also have a > > process that at some point deletes *some* old rows (but not all - in > fact, > > only a small portion). > > > > In this case, the next INSERT once VACUUM has run is likely to stick a > > "new" row somewhere very "far back" in the table, since there is now free > > space there. This more or less completely ruins the BRIN index usability, > > as the "old" blocks will now contain a single row from a "new" series. > > Yeah. When we initially discussed BRIN, there was a mention of allowing > a BRIN index to guide new tuple location -- something like > auto-clustering, if you will. We haven't discussed the exact details > but I think something along those lines is worth considering. > What I'm talking about is something that would be a lot simpler than auto-clustering. I'm not even talking about trying to detect if the row was about to go into the right place -- this might be expensive and certainly more complicated. I'm only talking about a simple case where we *never* put anything anywhere other than at the end of the table, period. That should make the check both cheap and simple. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Transaction user id through logical decoding
Hi Craig, Thanks for your answer, I'll need to dig deep in the solutions you suggested although I was looking for something less convoluted. Valerio
Re: [HACKERS] Danger of automatic connection reset in psql
On Fri, Nov 11, 2016 at 5:37 AM, Pavel Stehule wrote: > > 2016-11-11 5:14 GMT+01:00 Ashutosh Bapat > : > >> > >> > How about, instead of all this, adding an option to psql to suppress >> > the automatic reconnect behavior? When enabled, psql just exits >> > instead of trying to reconnect. >> > >> +1. But, existing users may not notice addition of the new option and >> still continue to face problem. If we add the option and make it >> default not to reconnect, they will notice it and use option to get >> older behaviour, but that will break applications relying on the >> current behaviour. Either way, users will have at least something to >> control the connection reset. >> > > The reconnect in not interactive mode is a bad idea - and there should be > disabled everywhere (it cannot to break any application - the behave of > script when a server is restarted must be undefined (100% if ON_ERROR_STOP > is active). In interactive mode it can be controlled by some psql session > variables like AUTOCOMMIT. > Yes, I've never suggested it should affect non-interactive mode. Automatic connection reset is a nice feature for server development, IMO. Whether it's really useful for anything else is a good question. At least an option to control that behavior seems like a good idea, maybe even set it to 'no reconnect' by default, so that people who really use it can make a conscious choice about enabling it in their .psqlrc or elsewhere. -- Alex
Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress
On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier wrote: > On Mon, Nov 14, 2016 at 12:49 PM, Kyotaro HORIGUCHI > wrote: >> At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila >> wrote in >>> I think it is good to check the performance impact of this patch on >>> many core m/c. Is it possible for you to once check with Alexander >>> Korotkov to see if he can provide you access to his powerful m/c which >>> has 70 cores (if I remember correctly)? > > I heard about a number like that, and there is no reason to not do > tests to be sure. With that many cores we are more likely going to see > the limitation of the number of XLOG insert slots popping up as a > bottleneck, but that's just an assumption without any numbers. > Alexander (added in CC), could it be possible to get an access to this > machine? > >>> @@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks, >>> xl_standby_lock *locks) >>> XLogBeginInsert(); >>> XLogRegisterData((char *) , offsetof(xl_standby_locks, locks)); >>> XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock)); >>> + XLogSetFlags(XLOG_NO_PROGRESS); >>> >>> >>> Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks? >>> This function is called not only in LogStandbySnapshot(), but during >>> DDL operations as well where lockmode >= AccessExclusiveLock. >> >> This does not remove any record from WAL. So theoretically any >> kind of record can be NO_PROGRESS, but practically as long as >> checkpoints are not unreasonably suppressed. Any explicit >> database operation must be accompanied with at least commit >> record that triggers checkpoint. NO_PROGRESSing there doesn't >> seem to me to harm database durability for this reason. >> By this theory, you can even mark the insert record as no progress which is not good. >> The objective of this patch is skipping WALs on completely-idle >> state and the NO_PROGRESSing is necessary to do its work. 
Of >> course we can distinguish exclusive lock with PROGRESS and >> without PROGRESS but it is unnecessary complexity. > > The point that applies here is that logging the exclusive lock > information is necessary for the *standby* recovery conflicts, not the > primary which is why it should not influence the checkpoint activity > that is happening on the primary. So marking this record with > NO_PROGRESS is actually fine to me. > The progress parameter is used not only for checkpoint activity but by bgwriter as well for logging standby snapshot. If you want to keep this record under no_progress category (which I don't endorse), then it might be better to add a comment, so that it will be easier for the readers of this code to understand the reason. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: Implement failover on libpq connect level.
On Mon, Nov 14, 2016 at 1:37 PM, Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > No, there's no concern about compatibility. Please look at this: > https://www.postgresql.org/docs/devel/static/protocol-flow.html#PROTOCOL-ASYNC Thanks, my concern is: suppose you have 3 servers in a cluster, A (new version), B (new version), C (old version). If we implement it as above, only new servers will send the ParameterStatus message to indicate what type of server we are connected to. Server C will not send the same. So we will not be able to use the new feature "failover to new master" for such a cluster. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
> What way to deal with it is in your mind? The problem hides > behind operators. To fix it a user should rewrite a expression > using more primitive operators. For example, (line_a # point_a) > should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in > more primitive way. I regarded this that the operator # just > become useless. Simple equations like this works well with the current algorithm: > select '(0.1,0.1)'::point @ '(0,0),(1,1)'::line; The operator does what you expect from it. Users can use something like you described to get fuzzy behaviour with an epsilon they choose. > Regarding optimization, at least gcc generates seemingly not so > different code for the two. The both generally generates extended > code directly calling isnan() and so. Have you measured the > performance of the two implement (with -O2, without > --enable-cassert)? This kind of hand-optimization gets > legitimacy when we see a significant difference, according to > the convention here.. I suppose. I tested it with this program: > int main() > { >double i, >j; >int result = 0; > >for (i = 0.1; i < 1.0; i += 1.0) >for (j = 0.1; j < 1.0; j += 1.0) >if (float8_lt(i, j)) >result = (result + 1) % 10; > >return result; > } The one calling cmp() was noticeably slower. ./test1 0.74s user 0.00s system 99% cpu 0.748 total ./test2 0.89s user 0.00s system 99% cpu 0.897 total This would probably not be very noticeable when calling SQL functions that do only a few comparisons, but it may be necessary to do many more comparisons in some places. I don't find the optimised versions less clear than calling the cmp(). I can change it the other way, if you find it more clear. > At least the comment you dropped by the patch, > > int > float4_cmp_internal(float4 a, float4 b) > { > - /* > -* We consider all NANs to be equal and larger than any non-NAN. This > is > -* somewhat arbitrary; the important thing is to have a consistent > sort 
> -  */
>
> seems very significant and should be kept anywhere relevant.

I will add it back in the next version.

> I searched the pgsql-general ML but couldn't find a complaint about
> the current behavior. Even though I'm not familiar with PostGIS, I
> found that it uses exactly the same EPSILON method as
> PostgreSQL. Is it?

I understood from Paul Ramsey's comment on this thread [1] that they don't.

> If we had an apparent plan to use them for other than
> earth-scale(?) geometric usage, we could design what they should
> be. But without such a plan it is just a breakage of the current
> usage.

We give no promises about the geometric types being useful on an earth
scale.

> What kind of (needless) complication are you referring to? The
> fuzziness seems to me essential for geometric comparisons to work
> practically. In addition to that, I don't think that we're allowed
> to change the behavior in such an area of released versions
> time after time.

Even when it is a total mess?

> I don't think index scans and tolerant comparison are
> contradictory. Could you give me a description of the
> indexing capabilities and the inconsistencies?

The first problem is that some operators are not using the epsilon.
This requires special treatment while developing index support for
operators. I have tried to support point for BRIN using the box
operators, and failed because of that.

Comparing differences with epsilon is not applicable the same way to
every operator. Even with simple operators like "point in box" it
covers different distances outside the box depending on where the point
is. For example, "point <-> box < EPSILON" wouldn't be equivalent to
"point <@ box" when the point is outside a corner of the box. Things
get more complicated with lines. Because of these, we are easily
violating basic expectations of the operators:

> regression=# select '{1000,0.01,0}'::line ?|| '{9,0.9,0}'::line;
>  ?column?
> ----------
>  f
> (1 row)
>
> regression=# select '{9,0.9,0}'::line ?|| '{1000,0.01,0}'::line;
>  ?column?
> ----------
>  t
> (1 row)

Another problem is the lack of hash and btree operator classes. In my
experience, the point datatype is by far the most used one. People
often try to use it in DISTINCT, GROUP BY, or ORDER BY clauses and
complain when it doesn't work. There are many complaints like this in
the archives. If we get rid of the epsilon, we can easily add those
operator classes.

[1] https://www.postgresql.org/message-id/CACowWR0DBEjCfBscKKumdRLJUkObjB7D%3Diw7-0_ZwSFJM9_gpw%40mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
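[Editor's note] The non-transitivity that blocks hash and btree support can be reproduced outside the server. This is an illustrative Python sketch, not backend code; the 1.0e-6 value mirrors the EPSILON constant in PostgreSQL's geo_decls.h, and fp_eq is a stand-in for the FPeq() macro:

```python
EPSILON = 1.0e-6  # same value as the EPSILON constant in geo_decls.h

def fp_eq(a, b):
    """Fuzzy equality in the style of the backend's FPeq() macro."""
    return abs(a - b) <= EPSILON

# Fuzzy equality is not an equivalence relation: it is not transitive.
a, b, c = 0.0, 0.9e-6, 1.8e-6
assert fp_eq(a, b) and fp_eq(b, c)   # a ~ b and b ~ c ...
assert not fp_eq(a, c)               # ... but not a ~ c

# This is what prevents hash and btree operator classes: "equal" values
# can hash differently, and a sort order built on fp_eq() is inconsistent.
```

This is also why the thread leans toward exact comparison plus user-chosen tolerances, e.g. (point_a <-> lineseg_a) < my_epsilon, instead of a baked-in fuzz factor.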
Re: [HACKERS] Remove the comment on the countereffectiveness of large shared_buffers on Windows
On Sun, Nov 13, 2016 at 1:23 AM, Jeff Janeswrote: > On Mon, Nov 7, 2016 at 7:03 PM, Amit Kapila wrote: >> >> On Tue, Nov 8, 2016 at 5:12 AM, Jeff Janes wrote: >> > On Mon, Nov 7, 2016 at 5:55 AM, Amit Kapila >> > wrote: >> >> >> >> >> >> Isn't it somewhat strange that writes are showing big win whereas >> >> reads doesn't show much win? >> > >> > >> > I don't find that unusual, and have seen the same thing on Linux. >> > >> > With small shared_buffers, you are constantly throwing dirty buffers at >> > the >> > kernel in no particular order, and the kernel does a poor job of >> > predicting >> > when the same buffer will be dirtied repeatedly and only needs the final >> > version of the data actually written to disk. >> > >> >> Okay and I think partially it might be because we don't have writeback >> optimization (done in 9.6) for Windows. > > > Is the writeback optimization the introduction of checkpoint_flush_after, or > is it something else? > Yes. > If it is checkpoint_flush_after, then I don't think that that is related. > In fact, they operate in opposite directions. The writeback optimization > forces the kernel to be more eager about writing out dirty data, so it > doesn't have a giant pile of it when the fsyncs comes at the end of the > checkpoint. Using a large shared_buffers forces it to be less eager, by not > turning the dirty buffers over to the kernel as often. > Okay, I got your point. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
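[Editor's note] Jeff's explanation of why writes benefit from a larger shared_buffers can be illustrated with a toy cache model. This is hypothetical Python, not PostgreSQL code: a pool too small for the hot set hands the same repeatedly-dirtied page to the kernel over and over, while a pool that holds the working set writes each page once at checkpoint time.

```python
from collections import OrderedDict

def writes_to_kernel(dirty_events, pool_size):
    """Count pages pushed to the kernel: LRU evictions plus the final
    checkpoint flush of whatever is still dirty in the pool."""
    pool = OrderedDict()  # page id -> dirty flag, kept in LRU order
    writes = 0
    for page in dirty_events:
        if page in pool:
            pool.move_to_end(page)        # re-dirtied in place, no write
        else:
            if len(pool) >= pool_size:
                pool.popitem(last=False)  # evict LRU page -> write it out
                writes += 1
            pool[page] = True
    return writes + len(pool)             # checkpoint flushes the rest

# Two hot pages dirtied alternately, plus one cold page at the end.
events = [0, 1] * 50 + [2]
small = writes_to_kernel(events, pool_size=1)  # thrashing: 101 writes
large = writes_to_kernel(events, pool_size=4)  # coalesced: 3 writes
assert small > large
```

The checkpoint_flush_after work pulls in the opposite direction, as Jeff says: it makes the kernel write sooner, while a big pool defers the hand-off entirely.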
Re: [HACKERS] Gather Merge
On Sat, Nov 12, 2016 at 1:56 AM, Rushabh Lathiawrote: > On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro > wrote: >> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group >> + * Portions Copyright (c) 1994, Regents of the University of California >> >> Shouldn't this say just "(c) 2016, PostgreSQL Global Development >> Group"? > > Fixed. The year also needs updating to 2016 in nodeGatherMerge.h. >> + /* Per-tuple heap maintenance cost */ >> + run_cost += path->path.rows * comparison_cost * 2.0 * logN; >> >> Why multiply by two? The comment above this code says "about log2(N) >> comparisons to delete the top heap entry and another log2(N) >> comparisons to insert its successor". In fact gather_merge_getnext >> calls binaryheap_replace_first, which replaces the top element without >> any comparisons at all and then performs a sift-down in log2(N) >> comparisons to find its new position. There is no per-tuple "delete" >> involved. We "replace" the top element with the value it already had, >> just to trigger the sift-down, because we know that our comparator >> function might have a new opinion of the sort order of this element. >> Very clever! The comment and the 2.0 factor in cost_gather_merge seem >> to be wrong though -- or am I misreading the code? >> > See cost_merge_append. That just got tweaked in commit 34ca0905. > Looking at the plan I realize that this is happening because wrong costing > for Gather Merge. Here in the plan we can see the row estimated by > Gather Merge is wrong. This is because earlier patch GM was considering > rows = subpath->rows, which is not true as subpath is partial path. So > we need to multiple it with number of worker. Attached patch also fixed > this issues. I also run the TPC-H benchmark with the patch and results > are same as earlier. In create_grouping_paths: + double total_groups = gmpath->rows * gmpath->parallel_workers; This hides a variable of the same name in the enclosing scope. Maybe confusing? 
In some other places like create_ordered_paths: + double total_groups = path->rows * path->parallel_workers; Though it probably made sense to use this variable name in create_grouping_paths, wouldn't total_rows be better here? It feels weird to be working back to a total row count estimate from the partial one by simply multiplying by path->parallel_workers. Gather Merge will underestimate the total rows when parallel_workers < 4, if using partial row estimates ultimately from cost_seqscan which assume some leader contribution. I don't have a better idea though. Reversing cost_seqscan's logic certainly doesn't seem right. I don't know how to make them agree on the leader's contribution AND give principled answers, since there seems to be some kind of cyclic dependency in the costing logic (cost_seqscan really needs to be given a leader contribution estimate from its superpath which knows whether it will allow the leader to pull tuples greedily/fairly or not, but that superpath hasn't been created yet; cost_gather_merge needs the row count from its subpath). Or maybe I'm just confused. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
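[Editor's note] Thomas's observation about binaryheap_replace_first can be checked empirically. The rough Python measurement below uses heapq as a stand-in for the backend's binaryheap, so the constants differ, but the shape of the result is the same: replacing the top entry costs one sift-down, while a separate delete plus insert pays for two heap operations.

```python
import heapq
import random

comparisons = 0

class Key:
    """Wrapper that counts every comparison the heap performs."""
    __slots__ = ("v",)
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        global comparisons
        comparisons += 1
        return self.v < other.v

def replace_top(heap, v):
    heapq.heapreplace(heap, Key(v))   # one sift-down from the root

def pop_then_push(heap, v):
    heapq.heappop(heap)               # sift-down for the delete ...
    heapq.heappush(heap, Key(v))      # ... plus a sift-up for the insert

def avg_comparisons(op, n=1024, rounds=2000):
    global comparisons
    random.seed(42)
    heap = [Key(random.random()) for _ in range(n)]
    heapq.heapify(heap)
    comparisons = 0                   # don't count heapify itself
    for _ in range(rounds):
        op(heap, random.random())
    return comparisons / rounds

c_replace = avg_comparisons(replace_top)
c_pop_push = avg_comparisons(pop_then_push)
assert c_replace < c_pop_push         # replace-top is strictly cheaper
```

Which supports at least rewording the "delete the top heap entry and insert its successor" comment, whatever factor cost_gather_merge ends up using.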
Re: [HACKERS] Remove the comment on the countereffectiveness of large shared_buffers on Windows
On Sat, Nov 12, 2016 at 11:00 PM, Magnus Haganderwrote: > On Sat, Nov 12, 2016 at 4:03 AM, Amit Kapila > wrote: >> >> On Fri, Nov 11, 2016 at 3:01 PM, Magnus Hagander >> wrote: >> >> > Based on this optimization we might want to keep the text that says >> >> > large >> >> > shared buffers on Windows aren't as effective perhaps, >> >> Sounds sensible or may add a line to say why it isn't as effective as on >> Linux. > > > Do we actually know *why*? > No, I have never investigated it for Windows. I am just telling based on results reported in this thread. I have seen that there is a noticeable difference of read-only performance when data fits in shared buffers as compared to when it doesn't fit on Linux systems. >> >> Right, but for other platforms, the recommendation seems to be 25% of >> RAM, can we safely say that for Windows as well? As per test results >> in this thread, it seems the read-write performance degrades when >> shared buffers have increased from 12.5 to 25%. I think as the test >> is done for a short duration so that degradation could be just a run >> to run to run variation, that's why I suggested doing few more tests. > > > We talk about 25%, but only up to a certain size. It's suggested as a > starting point. The 25% value us probably good as a starting point, as it's > recommended, but not as a "recommended setting". I'm fine with doing > something similar for Windows -- say "10-15% as a starting point, but you > have to check with your workload" kind of statements. > Okay, not a problem. However, I am not sure the results in this thread are sufficient proof as for read-only tests, there is no noticeable win by increasing shared buffers and read-write tests seems to be quite short (60 seconds) to rely on it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
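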
Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?
On Mon, Nov 14, 2016 at 5:16 AM, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Ashutosh Bapat
>> I have changed some comments around this block. See attached patch.
>> Let me know if that looks good.
>
> Thanks, it looks good.

Thanks. The patch 02_close_listen... closes the listen sockets
explicitly when it's known that postmaster is going to stop all the
children and then die. I have tried to see if there's a possibility
that it closes the listen sockets but does not actually die, thus
causing a server which doesn't accept any connections and doesn't die.
But I have not found such a possibility.

I do not have any further comments about either patch. No one else has
volunteered to review the patches so far, so I am marking the entry as
ready for committer.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
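[Editor's note] The property being reviewed, closing the listen sockets so that no new connections arrive while work already in progress continues, can be sketched with plain sockets. This is a Python illustration of the behaviour, not the postmaster code:

```python
import socket

# A listening socket, as postmaster would hold.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
addr = listener.getsockname()

# An already-established connection (a running backend, by analogy).
client = socket.create_connection(addr)
backend, _ = listener.accept()

# Closing the listener stops *new* connections ...
listener.close()
refused = False
try:
    socket.create_connection(addr, timeout=1)
except OSError:
    refused = True
assert refused

# ... but the established connection keeps working until shutdown finishes.
backend.sendall(b"shutting down")
assert client.recv(64) == b"shutting down"
client.close()
backend.close()
```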
Re: [HACKERS] [PATCH] Allow TAP tests to be run individually
On 14 November 2016 at 16:52, Michael Paquierwrote: > On Mon, Nov 14, 2016 at 3:45 PM, Craig Ringer wrote: >> On 11 November 2016 at 18:13, Michael Paquier >> wrote: >>> On Fri, Nov 11, 2016 at 6:10 PM, Craig Ringer wrote: Please backpatch to at least 9.6 since it's trivial and we seem to be doing that for TAP. 9.5 and 9.4 would be nice too :) >>> >>> Yes please! >> >> No immediate takers, so adding to CF. >> >> I've taken the liberty of adding you as a reviewer based on your >> response and the simplicity of the patch. if you get the chance to >> test and verify please set ready for committer. > > I don't mind. This patch uses the following pattern: > $(or $(PROVE_TESTS),t/*.pl) > While something more spread in Postgres source would be something like that: > $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) > It seems to me that we'd prefer that for consistency, but I see no > reason to not keep your patch as well. I am marking that as ready for > committer. Thanks. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress
On Mon, Nov 14, 2016 at 6:10 PM, Kyotaro HORIGUCHI wrote:
> It applies the master and compiled cleanly and no error by
> regtest. (I didn't confirmed that the problem is still fixed but
> seemingly no problem)

Thanks for double-checking.

> If I'm not missing something, at the worst we have a checkpoint
> after a checkpointer restart that should have been supressed. Is
> it worth picking it up for the complexity?

I think so. It's not that much code, as there is already a routine,
used by checkpointer.c, to get the timestamp of the last switched
segment.

--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress
Hello, It applies the master and compiled cleanly and no error by regtest. (I didn't confirmed that the problem is still fixed but seemingly no problem) At Mon, 14 Nov 2016 15:09:09 +0900, Michael Paquierwrote in > On Sat, Nov 12, 2016 at 9:01 PM, Andres Freund wrote: > > On 2016-11-11 16:42:43 +0900, Michael Paquier wrote: > > > >> + * This takes also > >> + * advantage to avoid 8-byte torn reads on some platforms by using the > >> + * fact that each insert lock is located on the same cache line. > > > > Something residing on the same cache line doens't provide that guarantee > > on all platforms. > > OK. Let's remove it then. > > >> + * XXX: There is still room for more improvements here, particularly > >> + * WAL operations related to unlogged relations (INIT_FORKNUM) should not > >> + * update the progress LSN as those relations are reset during crash > >> + * recovery so enforcing buffers of such relations to be flushed for > >> + * example in the case of a load only on unlogged relations is a waste > >> + * of disk write. > > > > Commit records still have to be written, everything else doesn't write > > WAL. So I'm doubtful this matters much? > > Hm, okay. In most cases this may not matter... Let's rip that off. > > >> @@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr > >> fpw_lsn) > >> inserted = true; > >> } > >> > >> + /* > >> + * Update the LSN progress positions. At least one WAL insertion lock > >> + * is already taken appropriately before doing that, and it is > >> simpler > >> + * to do that here when the WAL record data and type are at hand. > > > > But we don't use the "WAL record data and type"? > > Yes, at some point this patch did so... > > >> + * GetProgressRecPtr -- Returns the newest WAL activity position, or in > >> + * other words any activity not referring to standby logging or segment > >> + * switches. 
Finding the last activity position is done by scanning each > >> + * WAL insertion lock by taking directly the light-weight lock associated > >> + * to it. > >> + */ > > > > I'd prefer not to list the specific records here - that's just > > guaranteed to get out of date. Why not say something "any activity not > > requiring a checkpoint to be triggered" or such? > > OK. Makes sense to minimize maintenance. > > >> + * If this isn't a shutdown or forced checkpoint, and if there has > >> been no > >> + * WAL activity, skip the checkpoint. The idea here is to avoid > >> inserting > >> + * duplicate checkpoints when the system is idle. That wastes log > >> space, > >> + * and more importantly it exposes us to possible loss of both > >> current and > >> + * previous checkpoint records if the machine crashes just as we're > >> writing > >> + * the update. > > > > Shouldn't this mention archiving and also that we also ignore some forms > > of WAL activity? > > I have reworded that as: > "If this isn't a shutdown or forced checkpoint, and if there has been > no WAL activity requiring a checkpoint, skip it." > > >> -/* Should the in-progress insertion log the origin? */ > >> -static bool include_origin = false; > >> +/* status flags of the in-progress insertion */ > >> +static uint8 status_flags = 0; > > > > that seems a bit too generic a name. 'curinsert_flags'? > > OK. > > >> /* > >> - * only log if enough time has passed and some xlog > >> record has > >> - * been inserted. > >> + * Only log if enough time has passed, that some WAL > >> activity > >> + * has happened since last checkpoint, and that some > >> new WAL > >> + * records have been inserted since the last time we > >> came here. > > > > I think that sentence needs some polish. > > Let's do this better: > /* > -* only log if enough time has passed and some xlog record has > -* been inserted. 
> +* Only log if one of the following conditions is satisfied since > +* the last time we came here:: > +* - timeout has been reached. > +* - WAL activity has happened since last checkpoint. > +* - New WAL records have been inserted. > */ > > >>*/ > >> if (now >= timeout && > >> - last_snapshot_lsn != GetXLogInsertRecPtr()) > >> + GetLastCheckpointRecPtr() < > >> current_progress_lsn && > >> + last_progress_lsn < current_progress_lsn) > >> { > > > > Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here? > > Don't we need to do the comparisons here (and when doing the checkpoint > > itself) with the REDO pointer
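[Editor's note] To summarize the mechanism under review, here is a toy model, illustrative Python rather than the xlog.c code: records flagged no-progress (standby snapshots, and the checkpoint's own record) advance the insert position but not the progress position, and the checkpointer skips its run when no progress has been made since the previous checkpoint.

```python
class WalModel:
    def __init__(self):
        self.insert_lsn = 0
        self.progress_lsn = 0
        self.ckpt_progress_lsn = 0   # progress LSN at the last checkpoint
        self.checkpoints = 0

    def insert_record(self, size, no_progress=False):
        self.insert_lsn += size
        if not no_progress:
            self.progress_lsn = self.insert_lsn   # real activity

    def maybe_checkpoint(self, force=False):
        # Skip when idle: nothing but no-progress records since last time.
        if not force and self.progress_lsn == self.ckpt_progress_lsn:
            return False
        self.checkpoints += 1
        self.ckpt_progress_lsn = self.progress_lsn
        # The checkpoint's own record must not count as progress, or it
        # would trigger the next checkpoint all by itself.
        self.insert_record(64, no_progress=True)
        return True

wal = WalModel()
wal.insert_record(100)                   # e.g. a commit record
assert wal.maybe_checkpoint()            # real activity: checkpoint runs
wal.insert_record(32, no_progress=True)  # bgwriter logs a standby snapshot
assert not wal.maybe_checkpoint()        # idle: checkpoint skipped
```

The open question in the thread is which record types may safely be flagged no-progress without breaking archive_timeout or vacuum's expectations.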
Re: [HACKERS] pg_dump, pg_dumpall and data durability
On Mon, Nov 14, 2016 at 6:04 PM, Albe Laurenz wrote:
> Michael Paquier wrote:
>> Meh. I forgot docs and --help output updates.
>
> Looks good, except that you left the "N" option in the getopt_long
> call for pg_dumpall. I fixed that in the attached patch.

No, v5 has removed it, but it does not matter much now...

> I'll mark the patch "ready for committer".

Thanks!

--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
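[Editor's note] For context, the durability pattern this patch adds to pg_dump looks roughly like the sketch below. This is hedged Python, not the C implementation: write the dump, then fsync both the file and its containing directory unless syncing is disabled (the no-sync switch whose "N" short option is mentioned above).

```python
import os
import tempfile

def write_dump(path, data, do_sync=True):
    """Write a dump and, unless syncing is disabled, fsync the file and
    its directory so the dump survives an OS crash that happens right
    after the tool reports success."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        if do_sync:
            os.fsync(f.fileno())
    if do_sync:
        # Also sync the directory entry, so the new file name is durable.
        dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)

tmpdir = tempfile.mkdtemp()
out = os.path.join(tmpdir, "dump.sql")
write_dump(out, b"-- PostgreSQL database dump\n")
```

Skipping the sync (do_sync=False) matters for tests and for output piped to another consumer, which is why an opt-out flag was part of the patch.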
Re: [HACKERS] pg_dump, pg_dumpall and data durability
Michael Paquier wrote: > Meh. I forgot docs and --help output updates. Looks good, except that you left the "N" option in the getopt_long call for pg_dumpall. I fixed that in the attached patch. I'll mark the patch "ready for committer". Yours, Laurenz Albe pgdump-sync-v5.patch Description: pgdump-sync-v5.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Allow TAP tests to be run individually
On Mon, Nov 14, 2016 at 3:45 PM, Craig Ringerwrote: > On 11 November 2016 at 18:13, Michael Paquier > wrote: >> On Fri, Nov 11, 2016 at 6:10 PM, Craig Ringer wrote: >>> Please backpatch to at least 9.6 since it's trivial and we seem to be >>> doing that for TAP. 9.5 and 9.4 would be nice too :) >> >> Yes please! > > No immediate takers, so adding to CF. > > I've taken the liberty of adding you as a reviewer based on your > response and the simplicity of the patch. if you get the chance to > test and verify please set ready for committer. I don't mind. This patch uses the following pattern: $(or $(PROVE_TESTS),t/*.pl) While something more spread in Postgres source would be something like that: $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) It seems to me that we'd prefer that for consistency, but I see no reason to not keep your patch as well. I am marking that as ready for committer. -- Michael diff --git a/src/Makefile.global.in b/src/Makefile.global.in index ea61eb5..aa1fa65 100644 --- a/src/Makefile.global.in +++ b/src/Makefile.global.in @@ -354,12 +354,12 @@ ifeq ($(enable_tap_tests),yes) define prove_installcheck rm -rf $(CURDIR)/tmp_check/log -cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl +cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(bindir):$$PATH" PGPORT='6$(DEF_PGPORT)' top_builddir='$(CURDIR)/$(top_builddir)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) endef define prove_check rm -rf $(CURDIR)/tmp_check/log -cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl +cd $(srcdir) && TESTDIR='$(CURDIR)' $(with_temp_install) 
PGPORT='6$(DEF_PGPORT)' PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) endef else diff --git a/src/test/perl/README b/src/test/perl/README index 710a0d8..cfb45a1 100644 --- a/src/test/perl/README +++ b/src/test/perl/README @@ -6,6 +6,12 @@ across the source tree, particularly tests in src/bin and src/test. It's used to drive tests for backup and restore, replication, etc - anything that can't really be expressed using pg_regress or the isolation test framework. +The tests are invoked via perl's 'prove' command, wrapped in PostgreSQL +makefiles to handle instance setup etc. See the $(prove_check) and +$(prove_installcheck) targets in Makefile.global. By default every test in the +t/ subdirectory is run. Individual test(s) can be run instead by passing +something like PROVE_TESTS="t/001_testname.pl t/002_othertestname.pl" to make. + You should prefer to write tests using pg_regress in src/test/regress, or isolation tester specs in src/test/isolation, if possible. If not, check to see if your new tests make sense under an existing tree in src/test, like -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Tue, Nov 8, 2016 at 10:12 PM, Michael Paquierwrote: > On Tue, Nov 8, 2016 at 12:25 AM, Masahiko Sawada > wrote: >> On Tue, Oct 25, 2016 at 10:35 PM, Michael Paquier >> wrote: >>> + if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY) >>> + return SyncRepGetSyncStandbysPriority(am_sync); >>> + else /* SYNC_REP_QUORUM */ >>> + return SyncRepGetSyncStandbysQuorum(am_sync); >>> Both routines share the same logic to detect if a WAL sender can be >>> selected as a candidate for sync evaluation or not, still per the >>> selection they do I agree that it is better to keep them as separate. >>> >>> + /* In quroum method, all sync standby priorities are always 1 */ >>> + if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM) >>> + priority = 1; >>> Honestly I don't understand why you are enforcing that. Priority can >>> be important for users willing to switch from ANY to FIRST to have a >>> look immediately at what are the standbys that would become sync or >>> potential. >> >> I thought that since all standbys appearing in s_s_names list are >> treated equally in quorum method, these standbys should have same >> priority. >> If these standby have different sync_priority, it looks like that >> master server replicates to standby server based on priority. > > No actually, because we know that they are a quorum set, and that they > work in the same set. The concept of priorities has no real meaning > for quorum as there is no ordering of the elements. Another, perhaps > cleaner idea may be to mark the field as NULL actually. We know that but I'm concerned it might confuse the user. If these priorities are the same, it can obviously imply that all of the standby listed in s_s_names are handled equally. >>> It is not possible to guess from how many standbys this needs to wait >>> for. One idea would be to mark the sync_state not as "quorum", but >>> "quorum-N", or just add a new column to indicate how many in the set >>> need to give a commit confirmation. 
>> >> As Simon suggested before, we could support another feature that >> allows the client to control the quorum number. >> Considering adding that feature, I thought it's better to have and >> control that information as a GUC parameter. >> Thought? > > Similarly that would be a SIGHUP parameter? Why not. Perhaps my worry > is not that much legitimate, users could just look at s_s_names to > guess how many in hte set a commit needs to wait for. It would be PGC_USRSET similar to synchronous_commit. The user can specify it in statement level. > + > +FIRST and ANY are case-insensitive word > +and the standby name having these words are must be double-quoted. > + > s/word/words/. > > +FIRST and ANY specify the method of > +that how master server controls the standby servers. > A little bit hard to understand, I would suggest: > FIRST and ANY specify the method used by the master to control the > standby servers. > > +The keyword FIRST, coupled with an integer > +number N higher-priority standbys and makes transaction commit > +when their WAL records are received. > This is unclear to me. Here is a correction: > The keyword FIRST, coupled with an integer N, makes transaction commit > wait until WAL records are received fron the N standbys with higher > priority number. > > +synchronous_standby_names. For example, a setting > +of ANY 3 (s1, s2, s3, s4) makes transaction commits > +wait until receiving receipts from at least any three standbys > +of four listed servers s1, s2, > s3, > This could just mention WAL records instead of "receipts". > > Instead of saying "an integer number N", we could use num_sync. > > + If FIRST or ANY are not specified, > this parameter > + behaves as ANY. Note that this grammar is > incompatible with > + PostgresSQL 9.6, where no keyword specified > is equivalent > + as if FIRST was specified. 
In short, there is no > real need to > + specify num_sync as > this > + behavior does not have changed, as well as it is not > necessary to mention > + pre-9.6 versions are the multi-sync grammar has been added in 9.6. > This paragraph could be reworked, say: > if FIRST or ANY are not specified this parameter behaves as if ANY is > used. Note that this grammar is incompatible with PostgreSQL 9.6 which > is the first version supporting multiple standbys with synchronous > replication, where no such keyword FIRST or ANY can be used. Note that > the grammar behaves as if FIRST is used, which is incompatible with > the post-9.6 behavior. > > + Synchronous state of this standby server. quorum-N > + , where N is the number of synchronous standbys that transactions > + need to wait for replies from, when standby is considered as a > + candidate of quorum commit. >
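[Editor's note] To make the FIRST/ANY distinction in this patch concrete, here is a toy decision rule in Python. The real logic lives in syncrep.c and works with WAL flush positions reported by the walsenders; this is only a model.

```python
def commit_is_acknowledged(commit_lsn, flush_lsns, num_sync, method):
    """flush_lsns is ordered as the standbys appear in
    synchronous_standby_names.  FIRST waits for the num_sync
    highest-priority standbys; ANY waits for any num_sync of them."""
    if method == "FIRST":
        return all(lsn >= commit_lsn for lsn in flush_lsns[:num_sync])
    if method == "ANY":
        return sum(lsn >= commit_lsn for lsn in flush_lsns) >= num_sync
    raise ValueError("unknown method: %s" % method)

# Standbys s1, s2, s3 with these flush positions, in s_s_names order:
flush = [100, 50, 200]

# 'FIRST 2 (s1, s2, s3)': s1 and s2 form the sync set; s2 lags at 50.
assert not commit_is_acknowledged(90, flush, 2, "FIRST")

# 'ANY 2 (s1, s2, s3)': s1 and s3 are already past 90, so commit returns.
assert commit_is_acknowledged(90, flush, 2, "ANY")
```

The model also shows why per-standby priority numbers carry no information under ANY: only the count of acknowledging standbys matters, which is the point debated above for pg_stat_replication's sync_priority column.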
Re: [HACKERS] WIP: Covering + unique indexes
Anastasia, et al,

This is a review of including_columns_9.7_v5.patch. I looked through the
commit fest list and this patch was interesting and I wanted to try it.
I have used include columns on Microsoft SQL Server; DB2 for Linux,
Unix, Windows; and DB2 for z/OS.

After reading the e-mail discussions, there are still points that I am
not clear on. Given "create table foo (a int, b int, c int, d int)" and
"create unique index foo_a_b on foo (a, b) including (c)":

                                               index only?  heap tuple needed?
select a, b, c from foo where a = 1            yes          no
select a, b, d from foo where a = 1            no           yes
select a, b from foo where a = 1 and c = 1     ?            ?

It was clear from the discussion that the index scan/access method would
not resolve the "c = 1" predicate, but it was not clear if "c = 1" could
be resolved without accessing the heap tuple.

Are included columns counted against the 32 column and 2712 byte index
limits? I did not see either explicitly mentioned in the discussion or
the documentation. I only ask because in SQL Server the limits are
different for include columns.

1. syntax - on 2016-08-14, Andrey Borodin wrote "I think MS SQL syntax
INCLUDE instead of INCLUDING would be better". I would go further than
that. This feature is already supported by 2 of the top 5 SQL databases
and they both use INCLUDE. Using different syntax because of an internal
implementation detail seems short sighted.

2. The patch does not seem to apply cleanly anymore.

> git checkout master
Already on 'master'
> git pull
From http://git.postgresql.org/git/postgresql
   d49cc58..06f5fd2  master -> origin/master
Updating d49cc58..06f5fd2
Fast-forward
 src/include/port.h | 2 +-
 src/port/dirmod.c  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
> patch -p1 < including_columns_9.7_v5.patch
patching file contrib/dblink/dblink.c
...
patching file src/backend/parser/gram.y
...
Hunk #2 FAILED at 375.
...
1 out of 12 hunks FAILED -- saving rejects to file src/backend/parser/gram.y.rej
...
patching file src/bin/pg_dump/pg_dump.c ... Hunk #8 FAILED at 6399. Hunk #9 FAILED at 6426. ... 2 out of 13 hunks FAILED -- saving rejects to file src/bin/pg_dump/pg_dump.c.rej ... 3. After "fixing" compilation errors (best guess based on similar change in other location), "make check" failed. > make check ... parallel group (3 tests): index_including create_view create_index create_index ... ok create_view ... ok index_including ... FAILED ... Looking at regression.diffs, it looks like the output format of \d tbl changed (lines 288,300) from the expected "Column | Type | Modifiers" to "Column | Type | Collation | Nullable | Default". 4. documentation - minor items (these are not actual diffs) create_index.sgml -[ INCLUDING ( column_name )] +[ INCLUDING ( column_name [, ...] )] An optional INCLUDING clause allows a list of columns to be -specified which will be included in the index, in the non-key portion of the index. +specified which will be included in the non-key portion of the index. The whole paragraph on INCLUDING does not include many of the points raised in the feature discussions. create_table.sgml (for both UNIQUE and PRIMARY KEY constraints) (similar change could be made to nbtree/README) -Optional clause INCLUDING allows to add into the index -a portion of columns on which the uniqueness is not enforced upon. -Note, that althogh constraint is not enforced on included columns, it still -depends on them. Consequently, some operations on these columns (e.g. DROP COLUMN) -can cause cascade constraint and index deletion. +An optional INCLUDING clause allows a list of columns +to be specified which will be included in the non-key portion of the index. +Although uniqueness is not enforced on the included columns, the constraint +still depends on them. Consequently, some operations on the included columns +(e.g. DROP COLUMN) can cause cascading constraint and index deletion. 
indexcmds.c -* are key columns, and which are just part of the INCLUDING list by check +* are key columns, and which are just part of the INCLUDING list by checking ruleutils.c - * meaningless there, so do not include them into the message. + * meaningless there, so do not include them in the message. pg_dump.c (does "if (fout->remoteVersion >= 90600)" also need to change?) - * In 9.6 we add
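[Editor's note] The semantics the reviewer's table asks about can be modelled in a few lines. This is a Python toy assuming the behaviour described in the thread: uniqueness is enforced on the key columns only, and an index-only scan is possible whenever every referenced column is stored in the index.

```python
class CoveringUniqueIndex:
    """Toy model of: create unique index foo_a_b on foo (a, b) including (c)."""
    def __init__(self, key_cols, included_cols):
        self.key_cols = list(key_cols)
        self.stored_cols = self.key_cols + list(included_cols)
        self.entries = {}

    def insert(self, row):
        key = tuple(row[c] for c in self.key_cols)
        if key in self.entries:  # included columns do NOT weaken uniqueness
            raise ValueError("duplicate key value violates unique constraint")
        self.entries[key] = {c: row[c] for c in self.stored_cols}

    def index_only(self, referenced_cols):
        """True when a query touching these columns needs no heap tuple."""
        return set(referenced_cols) <= set(self.stored_cols)

idx = CoveringUniqueIndex(["a", "b"], ["c"])
idx.insert({"a": 1, "b": 1, "c": 7, "d": 0})

assert idx.index_only(["a", "b", "c"])      # select a, b, c ...: index only
assert not idx.index_only(["a", "b", "d"])  # select a, b, d ...: needs heap

try:  # same (a, b) with a different c is still a duplicate
    idx.insert({"a": 1, "b": 1, "c": 8, "d": 0})
    raise AssertionError("unique violation expected")
except ValueError:
    pass
```

In this model the third query in the table (filtering on c) is answerable from the index alone, since c is stored in the index tuple; whether the planner and executor actually do that is exactly the reviewer's open question.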
Re: [HACKERS] Patch: Implement failover on libpq connect level.
From: pgsql-hackers-ow...@postgresql.org > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Mithun Cy > If you are suggesting me to change in protocol messages, I think that would > not be backward compatible to older version servers. I also think such level > of protocol changes will not be allowed. with connection status > CONNECTION_SETENV used for protocol version 2.0 setup, we sent some query > like "select pg_catalog.pg_client_encoding()" for same. So I think using > "SELECT pg_is_in_recovery()" should be fine. No, there's no concern about compatibility. Please look at this: https://www.postgresql.org/docs/devel/static/protocol-flow.html#PROTOCOL-ASYNC [Excerpt] ParameterStatus messages will be generated whenever the active value changes for any of the parameters the backend believes the frontend should know about. Most commonly this occurs in response to a SET SQL command executed by the frontend, and this case is effectively synchronous — but it is also possible for parameter status changes to occur because the administrator changed a configuration file and then sent the SIGHUP signal to the server. Also, if a SET command is rolled back, an appropriate ParameterStatus message will be generated to report the current effective value. At present there is a hard-wired set of parameters for which ParameterStatus will be generated: they are server_version, server_encoding, client_encoding, application_name, is_superuser, session_authorization, DateStyle, IntervalStyle, TimeZone, integer_datetimes, and standard_conforming_strings. (server_encoding, TimeZone, and integer_datetimes were not reported by releases before 8.0; standard_conforming_strings was not reported by releases before 8.1; IntervalStyle was not reported by releases before 8.4; application_name was not reported by releases before 9.0.) Note that server_version, server_encoding and integer_datetimes are pseudo-parameters that cannot change after startup. 
This set might change in the future, or even become configurable. Accordingly, a frontend should simply ignore ParameterStatus for parameters that it does not understand or care about.

Regards,
Takayuki Tsunakawa
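To make the asynchronous ParameterStatus flow above concrete: per the protocol documentation, a ParameterStatus message is byte 'S', an Int32 length, then two NUL-terminated strings (parameter name, current value). The helper below is a minimal illustrative sketch of parsing such a message body — it is not libpq code, and the function name is hypothetical:

```c
#include <assert.h>
#include <string.h>

/*
 * Parse the body of a ParameterStatus ('S') message: two NUL-terminated
 * strings, the parameter name followed by its current value.  `len` is the
 * body length (message length field minus its own 4 bytes).  Returns 0 on
 * success, -1 if the body is malformed.  Illustrative sketch only.
 */
static int
parse_parameter_status(const char *body, size_t len,
					   const char **name, const char **value)
{
	const char *name_end = memchr(body, '\0', len);

	if (name_end == NULL)
		return -1;				/* name not terminated */

	{
		size_t		name_len = (size_t) (name_end - body) + 1;
		const char *value_end = memchr(body + name_len, '\0', len - name_len);

		if (value_end == NULL)
			return -1;			/* value not terminated */

		*name = body;
		*value = body + name_len;
	}
	return 0;
}
```

A frontend following the excerpt's advice would look the parsed name up in its table of known parameters and silently skip any it does not recognize.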
Re: [HACKERS] Floating point comparison inconsistencies of the geometric types
Hello,

At Sun, 13 Nov 2016 22:41:01 +0100, Emre Hasegeli wrote in
> > We can remove the fuzz factor altogether but I think we also should provide a means usable to do similar things. At least "is a point on a line" might be useless for most cases without any fuzzing feature. (Nevertheless, it is a problem only when it is being used to do that:) If we don't find reasonable policy on fuzzing operations, it would be the proof that we shouldn't change the behavior.
>
> It was my initial idea to keep the fuzzy comparison behaviour on some places, but the more I get into I realised that it is almost impossible to get this right. Instead, I re-implemented some operators to keep precision as much as possible. The previous "is a point on a line" operator would *never* give the true result without the fuzzy comparison. The new implementation would return true, when precision is not lost.

There is no way to store numbers accurately, other than certain special values, in a floating point format, where mantissa precision is limited. Even 0.1 is not stored accurately with a binary mantissa. (For most cases this is concealed by machine rounding.) The same holds for numeric: it cannot store 1/3 accurately, and it does not even conceal the inaccuracy. And even if these types could store numbers accurately, a constant such as pi still cannot be represented with finite precision. As a result, we have a tradition that the equality operator on real numbers should generally be avoided. Numerics are provided mainly for financial use, where finite-but-high-precision decimal arithmetic is performed.

> I think this is a behaviour people, who are working with floating points, are prepared to deal with.

What way of dealing with it do you have in mind? The problem hides behind operators. To fix it, a user should rewrite an expression using more primitive operators. For example, (line_a # point_a) should be rewritten as ((point_a <-> lineseg_a) < EPSILON), or in a more primitive way.
I regard this as meaning that the operator # simply becomes useless.

> By the way, "is a point on a line" operator is quite wrong with the fuzzy comparison at the moment [1].

That bug is a problem separate from the epsilon behavior. It of course should be fixed, but in an "appropriate" way that may be defined in this discussion.

> > The 0001 patch adds many FP comparison functions individually considering NaN. As the result the sort order logic involving NaN is scattered around into the functions, then, you implement generic comparison function using them. It seems inside-out to me. Defining ordering at one place, then comparison using it seems to be reasonable.
>
> I agree that it would be simpler to use the comparison function for implementing other operators. I have done it other way around to make them more optimised. They are called very often. I don't think checking exit code of the comparison function would be optimised the same way. I could leave the comparison functions as they are, but re-implemented them using the others to keep documentation of NaN comparison in the single place.

Regarding optimization, at least gcc seems to generate not-so-different code for the two. Both generally generate inline code directly calling isnan() and so on. Have you measured the performance of the two implementations (with -O2, without --enable-cassert)? This kind of hand-optimization gains legitimacy when we see a significant difference, according to the convention here, I suppose.

At least the comment you dropped in the patch,

int
float4_cmp_internal(float4 a, float4 b)
{
-	/*
-	 * We consider all NANs to be equal and larger than any non-NAN. This is
-	 * somewhat arbitrary; the important thing is to have a consistent sort
-	 * order.
-	 */

seems very significant and should be kept anywhere relevant.

> > If the center somehow goes extremely near to the origin, it could result in a false error.
> >> =# select @@ box'(-8e-324, -8e-324), (4.9e-324, 4.9e-324)';
> >> ERROR: value out of range: underflow
> >
> > I don't think this underflow is an error, and actually it is a change of the current behavior without a reasonable reason. A more significant (and maybe unacceptable) side-effect is that it changes the behavior of ordinary operators. I don't think this is acceptable. More consideration is needed.
> >
> >> =# select ('-8e-324'::float8 + '4.9e-324'::float8) / 2.0;
> >> ERROR: value out of range: underflow
>
> This is the current behaviour of float datatype. My patch doesn't change that.

Very sorry for the mistake! I somehow saw float8pl on master and float8_div with your patch. Please forget it.

> This problem would probably also apply to multiplying very small values. I agree that this is not the ideal behaviour. Though I am not sure, if we should go to a