Re: [HACKERS] [PATCH] Remove some duplicate if conditions
On Fri, Jan 3, 2014 at 6:59 AM, Tom Lane wrote: > David Rowley writes: > > I've attached a simple patch which removes some duplicate if conditions > > that seemed to have found their way into the code. > > > These are per PVS-Studio's warnings. > > -1. If PVS-Studio is complaining about this type of coding, to hell with > it; it should just optimize away the extra tests and be quiet about it, > not opinionate about coding style. What you propose would cause logically > independent pieces of code to become tied together, and I don't find that > to be an improvement. It's the compiler's job to micro-optimize where > possible, and most compilers that I've seen will do that rather than tell > the programmer he ought to do it. > > I had imagined that PVS-Studio highlighted these because it thinks they may be possible bugs, rather than as a chance for optimisation. Here are some real-world examples of bugs it has found: http://www.viva64.com/en/examples/v581/ I understand that the compiler will likely optimise some of these cases, most likely the cases that test simple local variables. I've not tested any compiler, but I imagined that the duplicate check in fe-protocol3.c would be less likely to be optimized, as it may be hard for the compiler to determine that conn->verbosity can't be changed from within, say, appendPQExpBuffer. I sent the patch in because, after looking at the fe-protocol3.c case, I just found it weird and had to spend a bit of time determining that the code was not meant to check conn->verbosity against some other constant. A duplicate if condition there just seems plain weird to me. I don't care much for the other case in the patch either; I only added it since I had already written the fix for the fe-protocol3.c one. Regards David Rowley
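For context, here is a minimal, hypothetical illustration (not PostgreSQL code) of the pattern behind PVS-Studio's V581 warning: a duplicated condition that was probably meant to test a different constant, leaving the second branch unreachable.

```c
#include <assert.h>

/* Hypothetical enum standing in for PGVerbosity; not the real libpq type. */
enum verbosity { VERBOSITY_TERSE, VERBOSITY_DEFAULT, VERBOSITY_VERBOSE };

/* The duplicated test makes the second branch dead code. */
static int
classify(enum verbosity v)
{
    if (v == VERBOSITY_TERSE)
        return 1;
    if (v == VERBOSITY_TERSE)   /* likely intended: v == VERBOSITY_VERBOSE */
        return 2;               /* unreachable */
    return 0;
}
```

A compiler is free to fold the two tests into one, but the human reader is left wondering whether the second constant was mistyped, which is exactly the confusion described above.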
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, Jan 3, 2014 at 12:23 PM, Erik Rijkers wrote: > On Fri, January 3, 2014 00:09, Erik Rijkers wrote: > > > > connection to server was lost > > > > So, to repeat, this runs fine on a server compiled for speed. > > > > I forgot to append the log messages: > > 2014-01-03 00:19:17.073 CET 14054 LOG: database system is ready to accept > connections > TRAP: FailedAssertion("!(!((bool) ((invtransfn_oid) != ((Oid) 0", > File: "parse_agg.c", Line: 1255) > 2014-01-03 00:19:29.605 CET 14054 LOG: server process (PID 14143) was > terminated by signal 6: Aborted > 2014-01-03 00:19:29.605 CET 14054 DETAIL: Failed process was running: > SELECT depname, empno, salary, sum(salary) OVER > (PARTITION BY depname) FROM empsalary ORDER BY depname, salary; > 2014-01-03 00:19:29.605 CET 14054 LOG: terminating any other active > server processes > 2014-01-03 00:19:29.607 CET 14054 LOG: all server processes terminated; > reinitializing > etc. etc. > > > hmm, yeah, compiling and testing a build with asserts enabled... That's a good idea! I probably should have tried that :) I've attached another patch which should fix this problem. The single failing SUM(numeric) regression test is still there; it is a known failure due to the extra trailing zeros that the patched version can now produce, which would not be present in an unpatched version. It's there purely in the hope of generating some discussion about whether we can use inverse transitions for sum(numeric) or not. Regards David Rowley inverse_transition_functions_v1.9.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ERROR: missing chunk number 0 for toast value
On Fri, Jan 3, 2014 at 12:51 AM, Heikki Linnakangas wrote: > On 01/02/2014 02:24 PM, Rushabh Lathia wrote: >> >> Hi All, >> >> Test case: >> >> drop table if exists t; >> create table t(c text); >> insert into t values ('x'), (repeat(md5('abcdefghijklmnop'), 1)); >> select pg_column_size(c), pg_column_size(c || '') FROM t; >> >> CREATE OR REPLACE FUNCTION copy_toast_out() RETURNS VOID AS $$ >> declare >> v text; >> BEGIN >> SELECT c INTO v FROM t WHERE c <> 'x'; >> Select 1/0; >> Exception >> When Others Then >> PERFORM pg_sleep(30); -- go run "TRUNCATE t" in a 2nd session >> >> >> raise notice 'length :%', length(v || ''); -- force detoast >> >> >> END; >> $$ language plpgsql; >> >> postgres=# select copy_toast_out(); >> ERROR: missing chunk number 0 for toast value 16390 in pg_toast_16384 >> CONTEXT: PL/pgSQL function copy_toast_out() line 10 at RAISE >> >> Analysis: >> >> The basic problem here is that if the lock is released on table before >> extracting toasted value, and in meantime someone truncates the table, >> this error can occur. Here error coming with PL block contains an >> Exception >> block (as incase there is an exception block, it calls >> RollbackAndReleaseCurrentSubTransaction). > > > This is another variant of the bug discussed here: > http://www.postgresql.org/message-id/0c41674c-fa02-4768-9e1b-548e56887...@quarantainenet.nl. > > >> Do you think we should detoast the local variable before >> RollbackAndReleaseCurrentSubTransaction ? Or any other options ? > > > Hmm, that would fix this particular test case, but not the other case where > you DROP or TRUNCATE the table in the same transaction. > > The simplest fix would be to just detoast everything on assignment but that > was rejected on performance grounds in that previous thread. I don't see any > other realistic way to fix this, however, so maybe we should just bite the > bullet and do it anyway. 
> For simple variables like, in your test case, it's a good bet to detoast the value immediately; it'll be detoasted as soon as you try to do anything with it anyway. But it's not a good bet for record or row variables, because you often fetch the whole row into a variable but only access a field or two. Yeah, this is exactly what came to my mind as well the first time I saw this problem: for row and record variables it can be a penalty the user might not expect, as he might not be using toasted values. However, is it possible that we do detoasting on assignment only when the variable or function is declared with some specific construct? For example, we already detoast at commit time for holdable portals (see the code below): /* * Change the destination to output to the tuplestore. Note we tell * the tuplestore receiver to detoast all data passed through it. */ queryDesc->dest = CreateDestReceiver(DestTuplestore); SetTuplestoreDestReceiverParams(..); When the Hold option is specified with a cursor, we perform the detoast on commit, so on similar lines, if the specific variable or function is declared with some particular construct, then we detoast on assignment. Another option is that we give a more meaningful error with a hint suggesting the possible reason for the error. This option could be used together with the above one in case the variable/function is not declared with the particular construct. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [bug fix] connection service file doesn't take effect with ECPG apps
From: "Michael Meskes" However, I'd prefer to solve the problem slightly differently by not creating an empty host variable instead of checking for it after the fact. But I take it you don't mind that. Fixed in HEAD and all back branches. Thanks for the report. Thank you for committing the fix. I'm comfortable with your change. Regards MauMau
Re: [HACKERS] [PATCH] Make various variables read-only (const)
By an odd coincidence, I also decided to try to const-ify libpq recently, and noticed this thread as I was cleaning up my patch for submission. For what it's worth, I've attached my patch to this message. It doesn't move as much data into the text segment as Oskari Saarenmaa's patch does, but it is less intrusive (simply adding const modifiers here and there). I just went after the low-hanging fruit in libpq. As an aside, it might make sense to make pg_encname_tbl and pg_encname_tbl_sz static, since as far as I can tell those symbols are never used outside of encnames.c, nor are they likely to be. I didn't make that change in this patch, though. On Mon, Dec 23, 2013, Robert Haas wrote: > And how much does this really affect data sharing? Doesn't copy-on-write do > the same thing for writable data? It can have a surprisingly large effect if read-only data gets intermixed on pages with actual read-write data and gets COW'd unnecessarily. My motivation, though, was more about code correctness than memory sharing, although sharing is a nice benefit. I was examining unexpected symbols in .data/.bss in case they represented a thread-safety problem for my program linked against libpq. (They didn't, FWIW, except for the known and documented issue with PQoidStatus(). Not that I really expected to find a problem in libpq, but marking those structures const makes it nice and clear that they're not mutable.)
diff --git a/src/backend/utils/mb/encnames.c b/src/backend/utils/mb/encnames.c
index 772d4a5..3f8c592 100644
--- a/src/backend/utils/mb/encnames.c
+++ b/src/backend/utils/mb/encnames.c
@@ -29,7 +29,7 @@
  * Karel Zak, Aug 2001
  * --
  */
-pg_encname pg_encname_tbl[] =
+const pg_encname pg_encname_tbl[] =
 {
 	{
 		"abc", PG_WIN1258
@@ -291,7 +291,7 @@ pg_encname pg_encname_tbl[] =
 	}	/* last */
 };
-unsigned int pg_encname_tbl_sz = \
+const unsigned int pg_encname_tbl_sz = \
 sizeof(pg_encname_tbl) / sizeof(pg_encname_tbl[0]) - 1;
 /* --
@@ -304,7 +304,7 @@ sizeof(pg_encname_tbl) / sizeof(pg_encname_tbl[0]) - 1;
 #else
 #define DEF_ENC2NAME(name, codepage) { #name, PG_##name, codepage }
 #endif
-pg_enc2name pg_enc2name_tbl[] =
+const pg_enc2name pg_enc2name_tbl[] =
 {
 	DEF_ENC2NAME(SQL_ASCII, 0),
 	DEF_ENC2NAME(EUC_JP, 20932),
@@ -356,7 +356,7 @@ pg_enc2name pg_enc2name_tbl[] =
  * This covers all encodings except MULE_INTERNAL, which is alien to gettext.
  * --
  */
-pg_enc2gettext pg_enc2gettext_tbl[] =
+const pg_enc2gettext pg_enc2gettext_tbl[] =
 {
 	{PG_SQL_ASCII, "US-ASCII"},
 	{PG_UTF8, "UTF-8"},
@@ -469,11 +469,11 @@ clean_encoding_name(const char *key, char *newkey)
  * Search encoding by encoding name
  * --
  */
-pg_encname *
+const pg_encname *
 pg_char_to_encname_struct(const char *name)
 {
 	unsigned int nel = pg_encname_tbl_sz;
-	pg_encname *base = pg_encname_tbl,
+	const pg_encname *base = pg_encname_tbl,
 			   *last = base + nel - 1,
 			   *position;
 	int			result;
@@ -521,7 +521,7 @@ pg_char_to_encname_struct(const char *name)
 int
 pg_char_to_encoding(const char *name)
 {
-	pg_encname *p;
+	const pg_encname *p;
 	if (!name)
 		return -1;
@@ -545,7 +545,7 @@ pg_encoding_to_char(int encoding)
 {
 	if (PG_VALID_ENCODING(encoding))
 	{
-		pg_enc2name *p = &pg_enc2name_tbl[encoding];
+		const pg_enc2name *p = &pg_enc2name_tbl[encoding];
 		Assert(encoding == p->encoding);
 		return p->name;
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 6d1cd8e..08440e9 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -55,9 +55,9 @@ static FmgrInfo *ToClientConvProc = NULL;
 /*
  * These variables track the currently-selected encodings.
  */
-static pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
-static pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
-static pg_enc2name *MessageEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
+static const pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
+static const pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
+static const pg_enc2name *MessageEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
 /*
  * During backend startup we can't set client encoding because we (a)
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index 45bc3c1..6d03a10 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -1720,7 +1720,7 @@ pg_eucjp_increment(unsigned char *charptr, int length)
 * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
 *---
 */
-pg_wchar_tbl pg_wchar_table[] = {
+const pg_wchar_tbl pg_wchar_
Re: [HACKERS] truncating pg_multixact/members
On Mon, Dec 30, 2013 at 10:59 PM, Alvaro Herrera wrote: > One problem I see is length of time before freezing multis: they live > for far too long, causing the SLRU files to eat way too much disk space. > I ran burnmulti in a loop, creating multis of 3 members each, with a min > freeze age of 50 million, and this leads to ~770 files in > pg_multixact/offsets and ~2900 files in pg_multixact/members. Each file > is 32 pages long. 256kB apiece. Probably enough to be bothersome. > > I think for computing the freezing point for multis, we should slash > min_freeze_age by 10 or something like that. Or just set a hardcoded > one million. Yeah. Since we expect mxids to be composed at a much lower rate than xids, we can keep pg_multixact small without needing to increase the rate of full table scans. However, it seems to me that we ought to have GUCs for mxid_freeze_table_age and mxid_freeze_min_age. There's no principled way to derive those values from the corresponding values for XIDs, and I can't see any reason to suppose that we know how to auto-tune brand new values better than we know how to auto-tune their XID equivalents that we've had for years. One million is probably a reasonable default for mxid_freeze_min_age, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
Mark Dilger writes: > The mechanism that occurs to me (and I'm not wedded to > this idea) is: > typedef uint8 T_HOFF_TYPE; > typedef struct xl_heap_header > { > uint16 t_infomask2; > uint16 t_infomask; > T_HOFF_TYPE t_hoff; > } xl_heap_header; > #define SizeOfHeapHeader (offsetof(xl_heap_header, t_hoff) + > sizeof(T_HOFF_TYPE)) Meh. That does nothing for the "add a field in the wrong place" risk. Yes, it would prevent people from changing the type of t_hoff without updating the macro --- but I'm not convinced that defending against that alone is worth any notational pain. If you're changing t_hoff's type without looking fairly closely at every reference to t_hoff, you're practicing unsafe programming to begin with. I wonder though if we could invent a macro that produces the sizeof a struct field, and then use that in macros like this. Something like #define field_sizeof(typename, fieldname) \ sizeof(((typename *) NULL)->fieldname) Compare the default definition of offsetof ... regards, tom lane
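A sketch of how the field_sizeof macro proposed above could be used, modeled on the classic definition of offsetof (standard C types stand in for PostgreSQL's uint16/uint8):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* field_sizeof: take sizeof a struct member through a NULL pointer.
 * The operand of sizeof is never evaluated, so this is well-defined. */
#define field_sizeof(typename, fieldname) \
    sizeof(((typename *) NULL)->fieldname)

typedef struct xl_heap_header
{
    uint16_t    t_infomask2;
    uint16_t    t_infomask;
    uint8_t     t_hoff;
} xl_heap_header;

/* With field_sizeof, the size term automatically tracks t_hoff's real
 * type instead of hard-coding sizeof(uint8). */
#define SizeOfHeapHeader \
    (offsetof(xl_heap_header, t_hoff) + field_sizeof(xl_heap_header, t_hoff))
```

If t_hoff's type later changes, SizeOfHeapHeader stays correct without anyone remembering to update a separate sizeof expression.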
Re: [HACKERS] [PATCH] Store Extension Options
On Thu, Jan 2, 2014 at 4:19 AM, Andres Freund wrote: > On 2013-12-31 13:37:59 +0100, Pavel Stehule wrote: >> > We use the namespace "ext" to the internal code >> > (src/backend/access/common/reloptions.c) skip some validations and store >> > the custom GUC. >> > >> > Do you think we don't need to use the "ext" namespace? >> > >> >> yes - there be same mechanism as we use for GUC > > There is no existing mechanism to handle conflicts for GUCs. The > difference is that for GUCs nearly no "namespaced" GUCs exist (plperl, > plpgsql have some), but postgres defines at least autovacuum. and > toast. namespaces for relation options. I continue to think that the case for having this feature at all has not been well-made. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
The mechanism that occurs to me (and I'm not wedded to this idea) is: typedef uint8 T_HOFF_TYPE; typedef struct xl_heap_header { uint16 t_infomask2; uint16 t_infomask; T_HOFF_TYPE t_hoff; } xl_heap_header; #define SizeOfHeapHeader (offsetof(xl_heap_header, t_hoff) + sizeof(T_HOFF_TYPE)) On Thursday, January 2, 2014 3:39 PM, Tom Lane wrote: Mark Dilger writes: > I still don't understand why this case in src/include/pgstat.h > is different from cases elsewhere in the code. The reason why I'm exercised about it is that (a) somebody actually made a mistake of this type, and (b) it wasn't caught by any automated testing. The catalog and WAL-related examples you cite would probably crash and burn in fairly obvious ways if somebody broke them --- for instance, the most likely way to break SizeOfHeapHeader would be by adding another field after t_hoff, but we'd notice that before long because said field would be corrupted on arrival at a replication slave. In contrast, messing up the pgstats message sizes would have no consequences worse than a hard-to-detect, and probably platform-specific, performance penalty for stats transmission. So unless we think that's of absolutely zero concern, adding a mechanism to make such bugs more apparent seems useful. I'm not against adding more assertions elsewhere, but it's a bit hard to see what those asserts should test. I don't see any practical way to assert that field X is the last one in its struct, for instance. regards, tom lane
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Jan 2, 2014 at 11:58 AM, Peter Geoghegan wrote: > My executive summary is that the exclusion patch performs about the > same on lower client counts, presumably due to not having the > additional window of btree lock contention. By 8 clients, the > exclusion patch does noticeably better, but it's a fairly modest > improvement. I forgot to mention that synchronous_commit was turned off, so as to eliminate noise that might have been added by commit latency, while still obligating btree to WAL log everything with an exclusive buffer lock held. -- Peter Geoghegan
Re: [HACKERS] Streaming replication bug in 9.3.2, "WAL contains references to invalid pages"
From: "Christophe Pettus" We've had two clients experience a crash on the secondary of a streaming replication pair, running PostgreSQL 9.3.2. In both cases, the messages were close to this example: 2013-12-30 18:08:00.464 PST,,,23869,,52ab4839.5d3d,16,,2013-12-13 09:47:37 PST,1/0,0,WARNING,01000,"page 45785 of relation base/236971/365951 is uninitialized","xlog redo vacuum: rel 1663/236971/365951; blk 45794, lastBlockVacuumed 45784""" 2013-12-30 18:08:00.465 PST,,,23869,,52ab4839.5d3d,17,,2013-12-13 09:47:37 PST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages","xlog redo vacuum: rel 1663/236971/365951; blk 45794, lastBlockVacuumed 45784""" 2013-12-30 18:08:00.950 PST,,,23866,,52ab4838.5d3a,8,,2013-12-13 09:47:36 PST,,0,LOG,0,"startup process (PID 23869) was terminated by signal 6: Aborted","" In both cases, the indicated relation was a primary key index. In one case, rebuilding the primary key index caused the problem to go away permanently (to date). In the second case, the problem returned even after a full dump / restore of the master database (that is, after a dump / restore of the master, and reimaging the secondary, the problem returned at the same primary key index, although of course with a different OID value). It looks like this has been experienced on 9.2.6, as well: I've experienced this problem with 9.2.4 once at the end of last year, too. The messages were the same except the relation and page numbers. In addition, I encountered a similar (possibly the same) problem with 9.1.6 about a year ago. At that time, I found in the pgsql-* MLs several people report similar problems in the past several years, but those were not solved. There seems to be a big dangerous bug hiding somewhere. Regards MauMau
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
I completely understand the padding issues that you are dealing with. I was mostly curious why Tom wanted to use asserts to double-check the code in one place, while happily not doing so in what seemed the same kind of situation elsewhere. He has since made the reason for that clear. On Thursday, January 2, 2014 3:27 PM, Andres Freund wrote: On 2014-01-02 15:15:58 -0800, Mark Dilger wrote: > I still don't understand why this case in src/include/pgstat.h > is different from cases elsewhere in the code. Taken from > src/include/access/heapam_xlog.h: > > > typedef struct xl_heap_header > { > uint16 t_infomask2; > uint16 t_infomask; > uint8 t_hoff; > } xl_heap_header; > > #define SizeOfHeapHeader (offsetof(xl_heap_header, t_hoff) + sizeof(uint8)) > > > > Now, if somebody changed t_hoff to be a uint16, that SizeOfHeapHeader > macro would be wrong. Should we put a static assert in the code for that? The reason the various SizeOfHeapHeader are written that way is that we do not want to uselessly store trailing padding in the WAL. E.g. sizeof(xl_heap_header) will be 6bytes, but SizeOfHeapHeader will be 5. I don't see an easy way to guarantee this with asserts and I think you'd notice pretty fast if you got things wrong there because WAL replay will just have incomplete data. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
Mark Dilger writes: > I still don't understand why this case in src/include/pgstat.h > is different from cases elsewhere in the code. The reason why I'm exercised about it is that (a) somebody actually made a mistake of this type, and (b) it wasn't caught by any automated testing. The catalog and WAL-related examples you cite would probably crash and burn in fairly obvious ways if somebody broke them --- for instance, the most likely way to break SizeOfHeapHeader would be by adding another field after t_hoff, but we'd notice that before long because said field would be corrupted on arrival at a replication slave. In contrast, messing up the pgstats message sizes would have no consequences worse than a hard-to-detect, and probably platform-specific, performance penalty for stats transmission. So unless we think that's of absolutely zero concern, adding a mechanism to make such bugs more apparent seems useful. I'm not against adding more assertions elsewhere, but it's a bit hard to see what those asserts should test. I don't see any practical way to assert that field X is the last one in its struct, for instance. regards, tom lane
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
On 2014-01-02 15:15:58 -0800, Mark Dilger wrote: > I still don't understand why this case in src/include/pgstat.h > is different from cases elsewhere in the code. Taken from > src/include/access/heapam_xlog.h: > > > typedef struct xl_heap_header > { > uint16 t_infomask2; > uint16 t_infomask; > uint8 t_hoff; > } xl_heap_header; > > #define SizeOfHeapHeader (offsetof(xl_heap_header, t_hoff) + sizeof(uint8)) > > > > Now, if somebody changed t_hoff to be a uint16, that SizeOfHeapHeader > macro would be wrong. Should we put a static assert in the code for that? The reason the various SizeOfHeapHeader are written that way is that we do not want to uselessly store trailing padding in the WAL. E.g. sizeof(xl_heap_header) will be 6 bytes, but SizeOfHeapHeader will be 5. I don't see an easy way to guarantee this with asserts, and I think you'd notice pretty fast if you got things wrong there, because WAL replay will just have incomplete data. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
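The trailing-padding point can be seen concretely in a self-contained sketch (standard C types replace PostgreSQL's uint16/uint8; the 6-vs-5 numbers assume a typical ABI with 2-byte alignment for uint16_t):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard C stand-in for the xl_heap_header struct quoted above. */
typedef struct xl_heap_header
{
    uint16_t    t_infomask2;
    uint16_t    t_infomask;
    uint8_t     t_hoff;
} xl_heap_header;

/* Offset of the last field plus that field's size: this deliberately
 * excludes the trailing pad byte the compiler adds to round the struct
 * up to its alignment, so the pad byte is never written to WAL. */
#define SizeOfHeapHeader (offsetof(xl_heap_header, t_hoff) + sizeof(uint8_t))
```

SizeOfHeapHeader evaluates to 5, while sizeof(xl_heap_header) is 6 on such an ABI, which is exactly the byte of padding the WAL format avoids storing.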
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, January 3, 2014 00:09, Erik Rijkers wrote:
> connection to server was lost
>
> So, to repeat, this runs fine on a server compiled for speed.
>

I forgot to append the log messages:

2014-01-03 00:19:17.073 CET 14054 LOG: database system is ready to accept connections
TRAP: FailedAssertion("!(!((bool) ((invtransfn_oid) != ((Oid) 0", File: "parse_agg.c", Line: 1255)
2014-01-03 00:19:29.605 CET 14054 LOG: server process (PID 14143) was terminated by signal 6: Aborted
2014-01-03 00:19:29.605 CET 14054 DETAIL: Failed process was running: SELECT depname, empno, salary, sum(salary) OVER (PARTITION BY depname) FROM empsalary ORDER BY depname, salary;
2014-01-03 00:19:29.605 CET 14054 LOG: terminating any other active server processes
2014-01-03 00:19:29.607 CET 14054 LOG: all server processes terminated; reinitializing

etc. etc.
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
I still don't understand why this case in src/include/pgstat.h is different from cases elsewhere in the code. Taken from src/include/access/heapam_xlog.h: typedef struct xl_heap_header { uint16 t_infomask2; uint16 t_infomask; uint8 t_hoff; } xl_heap_header; #define SizeOfHeapHeader (offsetof(xl_heap_header, t_hoff) + sizeof(uint8)) Now, if somebody changed t_hoff to be a uint16, that SizeOfHeapHeader macro would be wrong. Should we put a static assert in the code for that? I have no objection, but it seems you like the static assert in one place and not the other, and (perhaps due to some incredible ignorance on my part) I cannot see why. I tried looking for an assert of this kind already in the code. The use of this macro is in src/backend/access/heap/heapam.c, but I didn't see any asserts for it, though there are lots of asserts for other stuff. Maybe I just didn't recognize it? mark On Thursday, January 2, 2014 2:05 PM, Tom Lane wrote: I wrote: > It occurs to me that, rather than trying to improve the struct definition > methodology, maybe we should just add static asserts to catch any > inconsistency here. It wouldn't be that hard: > #define PGSTAT_MAX_MSG_SIZE 1000 > #define PGSTAT_MSG_PAYLOAD (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr)) > ... all else in pgstat.h as before ... > StaticAssertStmt(sizeof(PgStat_MsgTabstat) <= PGSTAT_MAX_MSG_SIZE, > 'bad PgStat_MsgTabstat size'); > ... and similarly for other pgstat message structs ... After further thought it seems to me that this is a desirable approach, because it is a direct check of the property we want, and will complain about *any* mistake that results in too-large struct sizes. The other ideas that were kicked around upthread still left a lot of ways to mess up the array size calculations, for instance referencing the wrong array element type in the sizeof calculation. So unless anyone has a concrete objection, I'll go put in something like this along with Mark's fix. 
regards, tom lane
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Thu, January 2, 2014 17:33, Erik Rijkers wrote:
> On Thu, January 2, 2014 13:36, Erik Rijkers wrote:
>> On Thu, January 2, 2014 13:05, David Rowley wrote:
>>> here's a slightly updated patch
>>> [inverse_transition_functions_v1.8.patch.gz ]
>>
>> patch applies, and compiles (although with new warnings).
>> But make check complains loudly

To figure out where this 'make check' failed, I lifted a few statements from the offending sql: src/test/regress/sql/window.sql (see snafu.sh below). That reliably crashes the server. It is caused by a SUM, but only when configured like this (i.e. *not* configured for speed):

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.inverse' '--bindir=/home/aardvark/pg_stuff/pg_installations/pgsql.inverse/bin' '--libdir=/home/aardvark/pg_stuff/pg_installations/pgsql.inverse/lib' '--with-pgport=6554' '--enable-depend' '--enable-cassert' '--enable-debug' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt' '--with-ossp-uuid' '--with-zlib'

$ cat snafu.sh
#!/bin/sh
echo "
--
-- WINDOW FUNCTIONS
--
drop TABLE if exists empsalary ;
-- CREATE TEMPORARY TABLE empsalary (
CREATE TABLE empsalary (
    depname varchar,
    empno bigint,
    salary int,
    enroll_date date
);
INSERT INTO empsalary VALUES
('develop', 10, 5200, '2007-08-01'),
('sales', 1, 5000, '2006-10-01'),
('personnel', 5, 3500, '2007-12-10'),
('sales', 4, 4800, '2007-08-08'),
('personnel', 2, 3900, '2006-12-23'),
('develop', 7, 4200, '2008-01-01'),
('develop', 9, 4500, '2008-01-01'),
('sales', 3, 4800, '2007-08-01'),
('develop', 8, 6000, '2006-10-01'),
('develop', 11, 5200, '2007-08-15');
--
SELECT depname, empno, salary, sum(salary) OVER (PARTITION BY depname)
FROM empsalary ORDER BY depname, salary;
" | psql
echo "select * from empsalary;" | psql
echo "
SELECT depname, empno, salary, sum(salary) OVER (PARTITION BY depname)
FROM empsalary ORDER BY depname, salary;
" | psql

$ ./snafu.sh
Timing is on.
DROP TABLE
Time: 1.093 ms
CREATE TABLE
Time: 80.161 ms
INSERT 0 10
Time: 11.964 ms
Timing is on.
  depname  | empno | salary | enroll_date
-----------+-------+--------+-------------
 develop   |    10 |   5200 | 2007-08-01
 sales     |     1 |   5000 | 2006-10-01
 personnel |     5 |   3500 | 2007-12-10
 sales     |     4 |   4800 | 2007-08-08
 personnel |     2 |   3900 | 2006-12-23
 develop   |     7 |   4200 | 2008-01-01
 develop   |     9 |   4500 | 2008-01-01
 sales     |     3 |   4800 | 2007-08-01
 develop   |     8 |   6000 | 2006-10-01
 develop   |    11 |   5200 | 2007-08-15
(10 rows)

Time: 1.854 ms
Timing is on.
connection to server was lost

So, to repeat, this runs fine on a server compiled for speed. I haven't looked any further (whether perhaps more statements are faulty)
Re: [HACKERS] Planning time in explain/explain analyze
On 01/02/2014 04:08 AM, Robert Haas wrote: I'm wondering whether the time should be stored inside the PlannedStmt node instead of passing it around separately. One possible problem with the way you've done things here is that, in the case of a prepared statement, EXPLAIN ANALYZE will emit the time needed to call GetCachedPlan(), even if that function didn't do any replanning. Now you could argue that this is the correct behavior, but I think there's a decent argument that what we ought to show there is the amount of time that was required to create the plan that we're displaying at the time it was created, rather than the amount of time that was required to figure out that we didn't need to replan. Since we support re-planning of prepared statements, I would argue the most useful thing is to print, as the planning time, the time it took to fetch the old plan or to create a new one; but I am not a heavy user of prepared statements. Also, I am inclined to think we ought to update the examples, rather than say this: +rewriting and parsing. We will not include this line in later examples in +this section. + Ok, I will fix this in the next version of the patch. -- Andreas Karlsson
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Jan 2, 2014 at 2:37 AM, Andres Freund wrote: > Locking the definitely visible row only works if there's a row matching > the index's columns. If the values of the new row don't have > corresponding values in all the indexes you have the same old race > conditions again. I still don't get it - perhaps you should break down exactly what you mean with an example. I'm talking about potentially doing multiple upserts per row proposed for insertion to handle multiple conflicts, perhaps with some deletes between upserts, not just one upsert with a single update part. > I think to be useful for many cases you really need to be able to ask > for a potentially conflicting row and be sure that if there's none you > are able to insert the row separately. Why? What work do you need to perform after reserving the right to insert but before inserting? Can't you just upsert resulting in insert, and then perform that work, potentially deleting the row inserted if and when you change your mind? Is there any real difference between what that does for you, and what any particular variety of promise tuple might do for you? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
I wrote:
> It occurs to me that, rather than trying to improve the struct definition
> methodology, maybe we should just add static asserts to catch any
> inconsistency here. It wouldn't be that hard:
>
> #define PGSTAT_MAX_MSG_SIZE 1000
> #define PGSTAT_MSG_PAYLOAD (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
> ... all else in pgstat.h as before ...
> StaticAssertStmt(sizeof(PgStat_MsgTabstat) <= PGSTAT_MAX_MSG_SIZE,
>                  "bad PgStat_MsgTabstat size");
> ... and similarly for other pgstat message structs ...

After further thought it seems to me that this is a desirable approach, because it is a direct check of the property we want, and will complain about *any* mistake that results in too-large struct sizes. The other ideas that were kicked around upthread still left a lot of ways to mess up the array size calculations, for instance referencing the wrong array element type in the sizeof calculation. So unless anyone has a concrete objection, I'll go put in something like this along with Mark's fix. regards, tom lane
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Jan 2, 2014 at 8:08 AM, Heikki Linnakangas wrote: >> Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking >> the tuple, rather than the XID. > > Well, that would be ideal, because we already have tuple locks. It would be > nice to use the same concept for this. It's a bit tricky, however. I guess > the most straightforward way to do it would be to grab a heavy-weight lock > after you've inserted the tuple, but before releasing the buffer lock. I > don't immediately see a problem with that, although it's a bit scary to > acquire a heavy-weight lock while holding a buffer lock. That's a really big modularity violation. Everything after RelationPutHeapTuple() but before the buffer unlock in heap_insert() is currently critical section. I'm not saying that it can't be done, but it certainly is scary. We also have heavyweight page locks, currently used by hash indexes. That approach does not require us to contort the row locking code, and certainly does not require us to acquire heavyweight locks with buffer locks already held. I could understand your initial disinclination to doing things this way, particularly when the unprincipled deadlocking problem was not well understood, but I think that this must tip the balance in favor of the approach I advocate. What I've done with heavyweight locks is a modest, localized, logical expansion on the existing mechanism, that is easy to reason about, with room for further optimization in the future, that still has reasonable performance characteristics today, including I believe better worst-case latency. Heavyweight locks on btree pages are very well precedented, if you look beyond Postgres. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ERROR: missing chunk number 0 for toast value
On 2014-01-02 16:05:09 -0500, Robert Haas wrote:
> On Thu, Jan 2, 2014 at 3:19 PM, Andres Freund wrote:
> >> I was wondering if we could somehow arrange to not
> >> release the subtransaction's AccessShareLock on the table, as long as it
> >> was protecting toasted references someplace.
> >
> > Sounds fairly ugly...
>
> I think the only principled fixes are to either retain the lock or
> forcibly detoast before releasing it.

I don't think that's sufficient. Unless I miss something, the problem isn't restricted to TRUNCATE and such at all. I think a plain VACUUM should be sufficient? I haven't tested it, but INSERTing a row with RETURNING toasted_col, storing the result in a record, and then aborting the subtransaction will allow the inserted row to be VACUUMed by a concurrent transaction. So I don't think anything along those lines will be sufficient. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] ERROR: missing chunk number 0 for toast value
On Thu, Jan 2, 2014 at 3:19 PM, Andres Freund wrote:
>> I was wondering if we could somehow arrange to not
>> release the subtransaction's AccessShareLock on the table, as long as it
>> was protecting toasted references someplace.
>
> Sounds fairly ugly...

I think the only principled fixes are to either retain the lock or forcibly detoast before releasing it. The main problem I see with retaining the lock is that you'd need a way of finding out the relation OIDs of all toast pointers you might later decide to expand. I don't have an amazingly good idea about how to figure that out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] CLUSTER FREEZE
On Mon, Dec 23, 2013 at 6:53 AM, Andres Freund wrote: > On 2013-12-22 20:45:02 -0500, Robert Haas wrote: >> I suspect we ought to extend this to rewriting variants of ALTER TABLE >> as well, but a little thought is needed there. ATRewriteTables() >> appears to just call heap_insert() for each updated row, which if I'm >> not mistaken is an MVCC violation - offhand, it seems like a >> transaction with an older MVCC snapshot could access the table for >> this first time after the rewriter commits, and fail to see rows which >> would have appeared to be there before the rewrite. (I haven't >> actually tested this, so watch me be wrong.) If we're OK with an MVCC >> violation here, we could just pass HEAP_INSERT_FROZEN and have a >> slightly different flavor of violation; if not, this needs some kind >> of more extensive surgery quite apart from what we do about freezing. > > Yes, rewriting ALTER TABLEs certainly aren't MVCC safe. I thought that > was documented, but apparently not. > I am not sure it can be fixed easily using the tricks CLUSTER plays, > there might be nasty edge-cases because of the changes in the column > definition. Certainly not a trivial project. > > I think we should leave ALTER TABLE as a completely separate project and > just improve CLUSTER for now. The practical impact of rewriting ALTER > TABLEs not freezing is far smaller, because they are very seldomly > performed in bigger databases. OK, I have committed my patch to make CLUSTER and VACUUM FULL freeze aggressively, and have left ALTER TABLE alone for now. It would be nice to get to the point where a database-wide VACUUM FULL would serve to bump datfrozenxid, so as to avoid having to give this sort of advice: http://www.postgresql.org/message-id/CA+Tgmobth=aqkwmwtcsqlaenv59gt4g3oqpqs45cb+fvg9m...@mail.gmail.com However, it doesn't, quite: a bare VACUUM FULL now bumps relfrozenxid for every table in the database *except pg_class*. 
It does call vac_update_datfrozenxid() afterwards, but that only helps if pg_class is not among the things holding back datfrozenxid. I haven't dug into this much yet, but I think it might be worth fixing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ERROR: missing chunk number 0 for toast value
On 2014-01-02 15:00:58 -0500, Tom Lane wrote:
> Andres Freund writes:
> > On 2014-01-02 21:21:15 +0200, Heikki Linnakangas wrote:
> >> I don't see any other realistic way to fix this, however, so maybe we
> >> should just bite the bullet and do it anyway.
> >
> > We could remember the subtransaction a variable was created in and error
> > out if the creating subtransaction aborted and it's not a
> > pass-by-value datum or similar.
>
> That would still result in throwing an error, though, so it isn't likely
> to make the OP happy.

Yea, it would give a better error message which might help diagnose the issue, but not more. We could disallow accessing such variables generally unless they had explicitly been detoasted; that would make people notice the problem more easily.

I briefly wondered if we couldn't "just" iterate over plpgsql variables and detoast them on subabort if created in the aborted xact, but that doesn't really work because we're in an aborted transaction where it might not be safe to access relations... Theoretically the subabort could be split into two phases to allow it, by only releasing the lock after safely switching to the upper transaction, but that sounds like too big a hammer for the problem.

> I was wondering if we could somehow arrange to not
> release the subtransaction's AccessShareLock on the table, as long as it
> was protecting toasted references someplace.

Sounds fairly ugly... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] preserving forensic information when we freeze
On 2014-01-02 14:44:34 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2014-01-02 12:46:34 -0500, Tom Lane wrote: > >> For real > >> forensics work, you need to be able to see all tuples, which makes me > >> think that something akin to pgstattuple is the right API; that is "return > >> a set of the header info for all tuples on such-and-such pages of this > >> relation". That should dodge any performance problem, because the > >> heap_open overhead could be amortized across lots of tuples, and it also > >> sidesteps all problems with adding new system columns. > > > The biggest problem with such an API is that it's painful to use - I've > > used pageinspect a fair bit, and not being able to easily get the > > content of the rows you're looking at makes it really far less useful in > > many scenarios. That could partially be improved by a neater API > > Surely. Why couldn't you join against the table on ctid? For the case of pageinspect it's because pageinspect doesn't return the ctid of a tuple in a useful way - its t_ctid is HeapTupleHeader->t_ctid, not HeapTuple->t_self... In many cases bulk access really isn't all that useful - you do a SELECT searching for data that's looking strange and then need the forensic data for those. That's just painful with any of the proposed fast APIs afaics. > > And I really don't see any page-at-a-time access that's going to be > > convenient. > > As I commented to Robert, the page-at-a-time behavior of pageinspect > is not an API detail we'd want to copy for this. I envision something > like > >select hdr.*, foo.* > from tuple_header_details('foo'::regclass) as hdr > left join foo on hdr.ctid = foo.ctid; > > On a large table you might want a version that restricts its scan > to pages M through N, but that's just optimization. More useful > would be to improve the planner's intelligence about joins on ctid ... That really makes for but ugly queries. E.g. 
the database I found the multixact bugs on was ~300GB and I had to look at about 80GB of it. So I would have had to write chunking code for individual tables. Not what you want to do when shit has hit the fan.

> >>> [ removing system columns from pg_attribute ]
> >> I think this will inevitably break a lot of code, not all of it ours,
> >> so I'm not in favor of pursuing that direction.
>
> > Are you thinking of client or extension code? From what I've looked at I
> > don't think it's all that likely to break much of either.
>
> It will break anything that assumes that every column is represented in
> pg_attribute. I think if you think this assumption is easily removed,
> you've not looked hard enough.

Uh. And how much code actually is that? Note that system columns already aren't in a Relation's TupleDesc. So it's not that they would be missing from that - and if that were the issue we could easily add them there when the cache entry is built.

There really isn't much code in postgres itself that iterates over all columns including system columns. Some bits around heap.c, tablecmd.c, lsyscache.c do lookups by name, but they are easily converted to SystemAttributeByName()/SystemAttributeDefinition(). Most non-DDL code doesn't care at all.

> > It would make pg_attribute a fair bit smaller, especially on systems
> > with lots of narrow relations.
>
> I'd like to do that too, but I think getting rid of xmin/xmax/cmin/cmax
> would be enough to get most of the benefit, and we could do that without
> any inconsistency if we stopped exposing those as system columns.

Well, I have yet to see any realistic proposal to get rid of them. Having to write significantly more complex and/or significantly more expensive queries doesn't qualify. And there really is code out there using xmin/xmax, so I think the rumor of nobody crying about their near death is just that, a rumor.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] preserving forensic information when we freeze
Robert Haas writes:
> Well, that's fair enough. I don't mind having two functions. Should
> the whole-table function also include invisible tuples?

Certainly, that's exactly why I was proposing it. You can do a join if you want to suppress them. regards, tom lane
Re: [HACKERS] preserving forensic information when we freeze
On Thu, Jan 2, 2014 at 2:56 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Thu, Jan 2, 2014 at 2:44 PM, Tom Lane wrote:
>>> As I commented to Robert, the page-at-a-time behavior of pageinspect
>>> is not an API detail we'd want to copy for this. I envision something
>>> like
>>>
>>> select hdr.*, foo.*
>>> from tuple_header_details('foo'::regclass) as hdr
>>> left join foo on hdr.ctid = foo.ctid;
>>>
>>> On a large table you might want a version that restricts its scan
>>> to pages M through N, but that's just optimization. More useful
>>> would be to improve the planner's intelligence about joins on ctid ...
>
>> /me blinks.
>
>> Surely you're not serious. That's going to scan the whole darn table
>> even if we only want the details for one row.
>
> And? The complaint about the other function was that it was inefficient
> when getting the details for a whole table, so I don't think you can
> complain about an approach that handles that case well. Use the other
> function if you just want info for one row (that you can see).

Well, that's fair enough. I don't mind having two functions. Should the whole-table function also include invisible tuples? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ERROR: missing chunk number 0 for toast value
Andres Freund writes:
> On 2014-01-02 21:21:15 +0200, Heikki Linnakangas wrote:
>> I don't see any other realistic way to fix this, however, so maybe we
>> should just bite the bullet and do it anyway.
>
> We could remember the subtransaction a variable was created in and error
> out if the creating subtransaction aborted and it's not a
> pass-by-value datum or similar.

That would still result in throwing an error, though, so it isn't likely to make the OP happy. I was wondering if we could somehow arrange to not release the subtransaction's AccessShareLock on the table, as long as it was protecting toasted references someplace. regards, tom lane
[HACKERS] Streaming replication bug in 9.3.2, "WAL contains references to invalid pages"
Greetings, We've had two clients experience a crash on the secondary of a streaming replication pair, running PostgreSQL 9.3.2. In both cases, the messages were close to this example: 2013-12-30 18:08:00.464 PST,,,23869,,52ab4839.5d3d,16,,2013-12-13 09:47:37 PST,1/0,0,WARNING,01000,"page 45785 of relation base/236971/365951 is uninitialized","xlog redo vacuum: rel 1663/236971/365951; blk 45794, lastBlockVacuumed 45784""" 2013-12-30 18:08:00.465 PST,,,23869,,52ab4839.5d3d,17,,2013-12-13 09:47:37 PST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages","xlog redo vacuum: rel 1663/236971/365951; blk 45794, lastBlockVacuumed 45784""" 2013-12-30 18:08:00.950 PST,,,23866,,52ab4838.5d3a,8,,2013-12-13 09:47:36 PST,,0,LOG,0,"startup process (PID 23869) was terminated by signal 6: Aborted","" In both cases, the indicated relation was a primary key index. In one case, rebuilding the primary key index caused the problem to go away permanently (to date). In the second case, the problem returned even after a full dump / restore of the master database (that is, after a dump / restore of the master, and reimaging the secondary, the problem returned at the same primary key index, although of course with a different OID value). It looks like this has been experienced on 9.2.6, as well: http://www.postgresql.org/message-id/flat/CAL_0b1s4QCkFy_55kk_8XWcJPs7wsgVWf8vn4=jxe6v4r7h...@mail.gmail.com Let me know if there's any further information I can provide. Best, -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
I decided to make at least a cursory attempt to measure or characterize the performance of each of our approaches to value locking. Being fair here is a non-trivial matter, because of the fact that upserts can behave quite differently based on the need to insert or update, lock contention and so on. Also, I knew that anything I came up with would not be comparing like with like: as things stand, the btree locking code is more or less correct, and the alternative exclusion constraint supporting implementation is more or less incorrect (of course, you may yet describe a way to fix the unprincipled deadlocking previously revealed by my testcase [1], but it is far from clear what impact this fix will have on performance). Still, something is better than nothing. This was run on a Linux server ("Linux 3.8.0-31-generic #46~precise1-Ubuntu") with these specifications: https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 . Everything fits in shared_buffers, but the I/O system is probably the weakest link here. To be 100% clear, I am comparing btreelock_insert_on_dup.v5.2013_12_28.patch.gz [2] with exclusion_insert_on_dup.2013_12_19.patch.gz [3]. I'm also testing a third approach, involving avoidance of exclusive buffer locks and heavyweight locks for upserts in the first phase of speculative insertion. That patch is unposted, but shows a modest improvement over [2]. 
I ran this against the table foo:

pgbench=# \d+ foo
                          Table "public.foo"
 Column |  Type   | Modifiers | Storage  | Stats target | Description
--------+---------+-----------+----------+--------------+-------------
 a      | integer | not null  | plain    |              |
 b      | integer |           | plain    |              |
 c      | text    |           | extended |              |
Indexes:
    "foo_pkey" PRIMARY KEY, btree (a)
Has OIDs: no

My custom script was:

\setrandom rec 1 :scale
with rej as (
    insert into foo(a, b, c)
    values (:rec, :rec, 'insert')
    on duplicate key lock for update
    returning rejects *
)
update foo set c = 'update' from rej where foo.a = rej.a;

I specified that each pgbench client in each run should last for 200k upserts (with 100k possible distinct key values), not that it should last some number of seconds. The idea is that there is a reasonable mix of inserts and updates initially, for lower client counts, but exactly the same number of queries are run for each patch, so as to normalize the effects of contention across both runs (this sure is hand-wavy, but likely better than nothing). I'm just looking for approximate numbers here, and I'm sure that you could find more than one way to benchmark this feature, with varying degrees of sympathy towards each of our two approaches to value locking. This benchmark isn't sympathetic to btree locking at all, because there is a huge amount of contention for the higher client counts, with 100% of possible rows updated by the time we're done at 16 clients, for example. To compensate somewhat for the relatively low duration of each run, I take an average-of-5, rather than an average-of-3, as representative for each client count + run/patch combination. The full report of results is here: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp/ My executive summary is that the exclusion patch performs about the same on lower client counts, presumably due to not having the additional window of btree lock contention. By 8 clients, the exclusion patch does noticeably better, but it's a fairly modest improvement. 
Forgive me if I'm belaboring the point, but even though I'm benchmarking the simplest possible upsert statements, had I chosen small pgbench scale factors (e.g. scales that would see 100 - 1000 possible distinct key values in total) the btree locking implementation would surely win very convincingly, just because the alternative implementation would spend almost all of its time deadlocked, waiting for the deadlock detector to free clients in one second deadlock_timeout cycles. My goal here was just to put a rough number on how these two approaches compare, while trying to be as fair as possible. I have to wonder about the extent to which the exclusion approach benefits from holding old value locks. So even if the unprincipled deadlocking issue can be fixed without much additional cost, it might be that the simple fact that that approach holds those pseudo "value locks" (i.e. old dead rows from previous iterations on the same tuple slot) indefinitely helps performance, and losing that property alone will hurt performance, even though it's necessary. For those that wonder what the effect on multiple unique index would be, that isn't really all that relevant, since contention on multiple unique indexes isn't expected with idiomatic usage (though I suppose an upsert's non-HOT update would have to compete). [1] http://www.postgresql.org/message-id/cam3swzshbe29kpod44cvc3vpzjgmder6k_6fghiszeozgmt...@mail.gmail.com [2] http://www.postgre
Re: [HACKERS] preserving forensic information when we freeze
Robert Haas writes:
> On Thu, Jan 2, 2014 at 2:44 PM, Tom Lane wrote:
>> As I commented to Robert, the page-at-a-time behavior of pageinspect
>> is not an API detail we'd want to copy for this. I envision something
>> like
>>
>> select hdr.*, foo.*
>> from tuple_header_details('foo'::regclass) as hdr
>> left join foo on hdr.ctid = foo.ctid;
>>
>> On a large table you might want a version that restricts its scan
>> to pages M through N, but that's just optimization. More useful
>> would be to improve the planner's intelligence about joins on ctid ...
>
> /me blinks.
>
> Surely you're not serious. That's going to scan the whole darn table
> even if we only want the details for one row.

And? The complaint about the other function was that it was inefficient when getting the details for a whole table, so I don't think you can complain about an approach that handles that case well. Use the other function if you just want info for one row (that you can see). regards, tom lane
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, Jan 3, 2014 at 5:33 AM, Erik Rijkers wrote:
> *** /var/data1/pg_stuff/pg_sandbox/pgsql.inverse/src/test/regress/expected/window.out	2014-01-02 16:19:48.0 +0100
> --- /var/data1/pg_stuff/pg_sandbox/pgsql.inverse/src/test/regress/results/window.out	2014-01-02 16:21:43.0 +0100
> ***************
> *** 1188,1195 ****
>    sum
>   ------
>    6.01
> !  5
> !  3
>   (3 rows)
>
>   SELECT i,COUNT(v) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
> --- 1188,1195 ----
>    sum
>   ------
>    6.01
> !  5.00
> !  3.00
>   (3 rows)

I've left those failures in for now in the hope of generating some discussion on whether we can use inverse transitions for sum(numeric). Please see my email before the previous one for details. To fix it, pg_aggregate.h just needs to be changed to remove the inverse transition for sum(numeric).
Re: [HACKERS] preserving forensic information when we freeze
Robert Haas writes:
> We could certainly add a function that returns SETOF record, taking
> e.g. regclass as an argument, but it doesn't seem a stretch to me to
> think that you might want to get tuple header information for some but
> not all tuples in the relation, and I don't see any real good way to
> tell the function exactly what tuples you want except by invoking it
> once per TID.

I have no objection to having a function that retrieves the details for a given TID alongside one that does it for a whole relation. The point here is just that we should be headed in the direction of removing as many system columns as we can, not adding more; especially not ones that (a) have no purpose except forensics and (b) are virtually certain to change across system versions. regards, tom lane
Re: [HACKERS] preserving forensic information when we freeze
On Thu, Jan 2, 2014 at 2:44 PM, Tom Lane wrote:
> As I commented to Robert, the page-at-a-time behavior of pageinspect
> is not an API detail we'd want to copy for this. I envision something
> like
>
> select hdr.*, foo.*
> from tuple_header_details('foo'::regclass) as hdr
> left join foo on hdr.ctid = foo.ctid;
>
> On a large table you might want a version that restricts its scan
> to pages M through N, but that's just optimization. More useful
> would be to improve the planner's intelligence about joins on ctid ...

/me blinks.

Surely you're not serious. That's going to scan the whole darn table even if we only want the details for one row. And making the planner smarter about joins on CTID doesn't help a bit, unless you can push the join qual down *inside the tuple_header_details() function call*. That'd be a pretty sweet thing to allow for SRFs in general, but it doesn't sound like something we're going to conjure up at the drop of a hat. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] preserving forensic information when we freeze
Andres Freund writes: > On 2014-01-02 12:46:34 -0500, Tom Lane wrote: >> For real >> forensics work, you need to be able to see all tuples, which makes me >> think that something akin to pgstattuple is the right API; that is "return >> a set of the header info for all tuples on such-and-such pages of this >> relation". That should dodge any performance problem, because the >> heap_open overhead could be amortized across lots of tuples, and it also >> sidesteps all problems with adding new system columns. > The biggest problem with such an API is that it's painful to use - I've > used pageinspect a fair bit, and not being able to easily get the > content of the rows you're looking at makes it really far less useful in > many scenarios. That could partially be improved by a neater API Surely. Why couldn't you join against the table on ctid? > And I really don't see any page-at-a-time access that's going to be > convenient. As I commented to Robert, the page-at-a-time behavior of pageinspect is not an API detail we'd want to copy for this. I envision something like select hdr.*, foo.* from tuple_header_details('foo'::regclass) as hdr left join foo on hdr.ctid = foo.ctid; On a large table you might want a version that restricts its scan to pages M through N, but that's just optimization. More useful would be to improve the planner's intelligence about joins on ctid ... >>> [ removing system columns from pg_attribute ]] >> I think this will inevitably break a lot of code, not all of it ours, >> so I'm not in favor of pursuing that direction. > Are you thinking of client or extension code? From what I've looked at I > don't think it's all that likely too break much of either. It will break anything that assumes that every column is represented in pg_attribute. I think if you think this assumption is easily removed, you've not looked hard enough. > It would make pg_attribute a fair bit smaller, especially on systems > with lots of narrow relations. 
I'd like to do that too, but I think getting rid of xmin/xmax/cmin/cmax would be enough to get most of the benefit, and we could do that without any inconsistency if we stopped exposing those as system columns. regards, tom lane
Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization
On 2014-01-02 11:35:57 -0800, Mark Dilger wrote:
> BTW, since the space shuttle has already left orbit, as you
> metaphorically put it, maybe there should be more
> visibility to the wider world about this? You can go to
> postgresql.org and find diddly squat about it. I grant you
> that it is not a completed project yet, and so maybe you
> want to wait before making major announcements, but
> the sort of people who would use this feature are probably
> the sort of people who would not mind hearing about it
> early.

Well, changeset extraction isn't committed yet, so it's not surprising that you don't find anything there ;). Parts of the patches (notably wal_level=logical, enriching the WAL to allow decoding) have landed; the others are being worked over in response to Robert's review comments. I am pretty sure there will be more publicity once it's committed ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] ERROR: missing chunk number 0 for toast value
On 2014-01-02 21:21:15 +0200, Heikki Linnakangas wrote:
> I don't see any other realistic way to fix this, however, so maybe we
> should just bite the bullet and do it anyway.

We could remember the subtransaction a variable was created in and error out if the creating subtransaction aborted and it's not a pass-by-value datum or similar.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] preserving forensic information when we freeze
On Thu, Jan 2, 2014 at 2:03 PM, Tom Lane wrote:
>> I both agree and disagree with this. I think that pgstattuple is
>> useful, and I further agree that adding a stat to it about how much of
>> the heap is frozen would be worthwhile. However, an aggregate number
>> isn't always what you want, and being able to scrutinize specific
>> tuples is, I think, a useful thing.
>
> Oh, I guess I referenced the wrong contrib module, because what I was
> trying to propose is a function that returns a setof record, one row for
> each physical tuple it finds. There are some functions in
> contrib/pageinspect that work like that, but not pgstattuple. I don't
> think pageinspect's API is ideal because it's tightly tied to processing
> one page at a time, but it does show information a bit like what we need
> here.

Sure, Álvaro mentioned the same thing upthread. Also, if you notice, the function I was proposing to add does basically the same thing as pageinspect, except one tuple at a time rather than one page at a time. I think both are useful, though. pageinspect is superuser-only, and needing to work a page at a time isn't always convenient.

I thought about making the pg_tuple_header() function I proposed scan using SnapshotAny rather than the active snapshot, but then I think it'd need to be superuser-only. I also thought about making it use SnapshotAny for superusers and the active snapshot for everybody else, but that seemed like it could be confusing.

We could certainly add a function that returns SETOF record, taking e.g. regclass as an argument, but it doesn't seem a stretch to me to think that you might want to get tuple header information for some but not all tuples in the relation, and I don't see any real good way to tell the function exactly what tuples you want except by invoking it once per TID.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
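To make the per-TID shape concrete: the function proposed in this thread (pg_tuple_header() here; it does not exist in released PostgreSQL, and the signature shown is illustrative) would be called once per tuple of interest, something like:

```sql
-- Hypothetical invocation of the proposed per-tuple header function,
-- restricted to exactly the tuples the user cares about.  Only tuples
-- visible to the current snapshot can be reached this way.
SELECT t.ctid, pg_tuple_header(t.tableoid, t.ctid)
FROM foo AS t
WHERE t.ctid = ANY ('{"(0,1)","(0,2)"}'::tid[]);
```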
Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization
Thanks to both of you for all the feedback. Your reasoning about why it is not worth implementing, what the problems with it would be, etc., are helpful. Sorry about using the word multimaster where it might have been better to say sharded. BTW, since the space shuttle has already left orbit, as you metaphorically put it, maybe there should be more visibility to the wider world about this? You can go to postgresql.org and find diddly squat about it. I grant you that it is not a completed project yet, and so maybe you want to wait before making major announcements, but the sort of people who would use this feature are probably the sort of people who would not mind hearing about it early. mark On Thursday, January 2, 2014 11:18 AM, Andres Freund wrote: On 2014-01-02 10:18:52 -0800, Mark Dilger wrote: > I anticipated that my proposal would require partitioning the catalogs. > For instance, autovacuum could only run on locally owned tables, and > would need to store the analyze stats data in a catalog partition belonging > to the local server, but that doesn't seem like a fundamental barrier to > it working. It would make every catalog lookup noticeably more expensive. > The partitioned catalog tables would get replicated like > everything else. The code that needs to open catalogs and look things > up could open the specific catalog partition needed if it already knew the > Oid of the table/index/whatever that it was interested in, as the catalog > partition desired would have the same modulus as the Oid of the object > being researched. Far, far, far from every lookup is by oid. Most prominently the names of database objects. Those will have to scan every catalog partition. Not fun. > Your point about increasing the runtime of pg_upgrade is taken. I will > need to think about that some more. It's not about increasing the runtime, it's about simply breaking it. 
pg_upgrade relies on binary compatibility of user relation's files and you're breaking that if you change the width of datatypes. > Your claim that what I describe is not multi-master is at least partially > correct, depending on how you think about the word "master". Certainly > every server is the master of its own chunk. Well, you're essentially just describing a sharded system - that's not usually coined multimaster. > Your claim that BDR doesn't have to be much slower than what I am > proposing is quite interesting, as if that is true I can ditch this idea and > use BDR instead. It is hard to empirically test, though, as I don't have > the alternate implementation on hand. Well, I can tell you that for the changeset extraction stuff (which is the basis for BDR) the biggest bottleneck so far seems to be the CRC computation when reading the WAL - and that's something plain WAL apply has to do as well. And it is optimizable. When actually testing decoding & apply, for workloads fitting into memory I had to try very hard to construe situations where apply was a big bottleneck. It is easier for seek bound workloads, where the standby is less powerful than the primary, since there's more random reads for those due to full page writes removing the need for reads in many cases. > I think the expectation that performance will be harmed if postgres > uses 8 byte Oids is not quite correct. > > Several years ago I ported postgresql sources to use 64bit everything. > Oids, varlena headers, variables tracking offsets, etc. It was a fair > amount of work, but all the doom and gloom predictions that I have > heard over the years about how 8-byte varlena headers would kill > performance, 8-byte Oids would kill performance, etc, turned out to > be quite inaccurate. Well, it can increase the size of the database, turning a system where the hot set fits into memory into one where it doesn't anymore. But really, the performance concerns were more about the catalog lookups. 
Fundamentally, I think there's nothing I see preventing such a scheme from being implemented - but I think there's about zap chance of it ever getting integrated; it's just far too invasive, with very high costs in scenarios where it's not used, for not all that much gain. Not to speak of the amount of engineering it would require to implement. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ERROR: missing chunk number 0 for toast value
Heikki Linnakangas writes: > The simplest fix would be to just detoast everything on assignment but > that was rejected on performance grounds in that previous thread. I > don't see any other realistic way to fix this, however, so maybe we > should just bite the bullet and do it anyway. Or just say "don't do that". TRUNCATE on a table that's in use by open transactions has all sorts of issues besides this one. The given example is a pretty narrow corner case anyway --- with a less contorted coding pattern, we'd still have AccessShareLock on the table, blocking the TRUNCATE from removing data. I'd still not want to blow up performance in order to make this example work. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] preserving forensic information when we freeze
On 2014-01-02 12:46:34 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2014-01-02 09:40:54 -0500, Tom Lane wrote: > >> Actually, I thought the function approach was a good proposal. You are > >> right that func(tab.*) isn't going to work, because it's going to get a > >> Datum-ified tuple not a pointer to raw on-disk storage. But an inspection > >> function that's handed a ctid could work. > > > Well, we discussed that upthread, and the overhead of going through a > > function is quite noticeable because the tuple needs to be fetched from > > the heap again. > > Yeah, I read those results, but that seems like it could probably be > optimized. I'm guessing the function was doing a new heap_open every > time? There's probably a way around that. How? Everything I can think of is either fragile or too ugly. > In any case, upon further reflection I'm not convinced that doing this > with a SELECT-based query is the right thing, no matter whether the query > looks at a function or a system column; because by definition, you'll only > be able to see tuples that are visible to your current snapshot. I think it really depends - quite often you really want to just investigate tuples you actually can see, because you can easily write normal queries and such. It sure wouldn't replace pageinspect. > For real > forensics work, you need to be able to see all tuples, which makes me > think that something akin to pgstattuple is the right API; that is "return > a set of the header info for all tuples on such-and-such pages of this > relation". That should dodge any performance problem, because the > heap_open overhead could be amortized across lots of tuples, and it also > sidesteps all problems with adding new system columns. The biggest problem with such an API is that it's painful to use - I've used pageinspect a fair bit, and not being able to easily get the content of the rows you're looking at makes it really far less useful in many scenarios. 
That could partially be improved by a neater API (returning t_self would help greatly, accepting a page without a wrapper function as well), but it still pretty fundamentally sucks. And I really don't see any page-at-a-time access that's going to be convenient. A big step would be returning the tuple's contents in a normal fashion, but that's not easy without big notational overhead.

> > Upthread there's a POC patch of mine, that started to explore what's
> > necessary to simply never store system columns (except maybe oid) in
> > pg_attribute. While it passes the regression tests it's not complete,
> > but the amount of work looks reasonable.
>
> I think this will inevitably break a lot of code, not all of it ours,
> so I'm not in favor of pursuing that direction.

Are you thinking of client or extension code? From what I've looked at I don't think it's all that likely to break much of either. Extension code isn't likely to do its own catalog lookups and thus doesn't really care. Most client code isn't interested in system catalog attribute numbers - the few examples of client code looking at pg_attribute I remember all explicitly excluded attnum < 0.

It would make pg_attribute a fair bit smaller, especially on systems with lots of narrow relations. There's also the possibility of only going that route for new system catalog attributes, leaving the old ones where they are and removing them in a release or two.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization
On Tue, Dec 31, 2013 at 3:51 PM, Mark Dilger wrote:
> The BDR documentation
> http://wiki.postgresql.org/images/7/75/BDR_Presentation_PGCon2012.pdf
> says,
>
> "Physical replication forces us to use just one
> node: multi-master required for write scalability"
>
> "Physical replication provides best read scalability"
>
> I am inclined to agree with the second statement, but
> I think my proposal invalidates the first statement, at
> least for a particular rigorous partitioning over which
> server owns which data.
>
> In my own workflow, I load lots of data from different
> sources. The partition the data loads into depends on
> which source it came from, and it is never mixed or
> cross referenced in any operation that writes the data.
> It is only "mixed" in the sense that applications query
> data from multiple sources.
>
> So for me, multi-master with physical replication seems
> possible, and would presumably provide the best
> read scalability. I doubt that I am the only database
> user who has this kind of workflow.
>
> The alternatives are ugly. I can load data from separate
> sources into separate database servers *without* replication
> between them, but then the application layer has to
> emulate queries across the data. (Yuck.) Or I can use
> logical replication such as BDR, but then the servers
> are spending more effort than with physical replication,
> so I get less bang for the buck when I purchase more
> servers to add to the cluster. Or I can use FDW to access
> data from other servers, but that means the same data
> may be pulled across the wire arbitrarily many times, with
> corresponding impact on the bandwidth.
>
> Am I missing something here? Does BDR really provide
> an equivalent solution?

I think BDR is better: while it only supports schema-equivalent replication, that is the typical case for distributed write systems like this.
Also, there are a lot fewer assumptions about the network architecture in the actual data itself (for example, what happens when you want to change owner/merge/split data?). IMNSHO, it's better that each node is managing WAL for itself, not the other way around, except in the very special case where you want an exact replica of the database on each node at all times, as with the current HS/SR.

A **huge** amount of work has been/is being put into WAL-based logical replication support (LLSR in the BDR docs) that should mostly combine the flexibility of trigger-based logical replication with the robustness of the WAL-based replication that we have in core now. LLSR is a low-level data transmission framework that can be wrapped by higher-level user-facing stuff like BDR. LLSR, by the way, does not come attached with the assumption that all databases have the same schema. If I were you, I'd be studying up on LLSR and seeing how it could be molded into the use cases you're talking about.

From a development point of view, the replication train hasn't just left the station, it's a space shuttle that just broke out of earth's orbit. By my reckoning a new 'from the ground up' implementation of replication requiring in-core changes has an exactly zero percent chance of being adopted.

merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ERROR: missing chunk number 0 for toast value
On 01/02/2014 02:24 PM, Rushabh Lathia wrote:

Hi All,

Test case:

drop table if exists t;
create table t(c text);
insert into t values ('x'), (repeat(md5('abcdefghijklmnop'), 1));
select pg_column_size(c), pg_column_size(c || '') FROM t;

CREATE OR REPLACE FUNCTION copy_toast_out() RETURNS VOID AS $$
declare
    v text;
BEGIN
    SELECT c INTO v FROM t WHERE c <> 'x';
    Select 1/0;
Exception When Others Then
    PERFORM pg_sleep(30); -- go run "TRUNCATE t" in a 2nd session
    raise notice 'length :%', length(v || ''); -- force detoast
END;
$$ language plpgsql;

postgres=# select copy_toast_out();
ERROR:  missing chunk number 0 for toast value 16390 in pg_toast_16384
CONTEXT:  PL/pgSQL function copy_toast_out() line 10 at RAISE

Analysis:

The basic problem here is that if the lock on the table is released before the toasted value is extracted, and in the meantime someone truncates the table, this error can occur. The error arises here because the PL block contains an Exception block (when there is an exception block, RollbackAndReleaseCurrentSubTransaction gets called). This is another variant of the bug discussed here: http://www.postgresql.org/message-id/0c41674c-fa02-4768-9e1b-548e56887...@quarantainenet.nl.

Do you think we should detoast the local variable before RollbackAndReleaseCurrentSubTransaction? Or any other options?

Hmm, that would fix this particular test case, but not the other case where you DROP or TRUNCATE the table in the same transaction. The simplest fix would be to just detoast everything on assignment, but that was rejected on performance grounds in that previous thread. I don't see any other realistic way to fix this, however, so maybe we should just bite the bullet and do it anyway. For simple variables like the one in your test case, it's a good bet to detoast the value immediately; it'll be detoasted as soon as you try to do anything with it anyway. But it's not a good bet for record or row variables, because you often fetch the whole row into a variable but only access a field or two.
Then again, if you run into that, at least you can work around it by changing your plpgsql code to only fetch the fields you need. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
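Heikki's suggested workaround, sketched (the table and column names here are illustrative, not from the thread): under eager detoast-on-assignment, fetching a whole row would pay the detoasting cost for every wide column, so a function should select only the columns it actually touches.

```sql
-- Costly under eager detoasting: copies (and would detoast) every column.
--   SELECT w.* INTO r FROM wide_table AS w WHERE id = 1;

-- Cheaper: fetch just the fields the function actually uses.
DO $$
DECLARE
  v_name text;
  v_flag boolean;
BEGIN
  SELECT name, flag INTO v_name, v_flag
  FROM wide_table
  WHERE id = 1;
END;
$$;
```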
Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization
On 2014-01-02 10:18:52 -0800, Mark Dilger wrote: > I anticipated that my proposal would require partitioning the catalogs. > For instance, autovacuum could only run on locally owned tables, and > would need to store the analyze stats data in a catalog partition belonging > to the local server, but that doesn't seem like a fundamental barrier to > it working. It would make every catalog lookup noticeably more expensive. > The partitioned catalog tables would get replicated like > everything else. The code that needs to open catalogs and look things > up could open the specific catalog partition needed if it already knew the > Oid of the table/index/whatever that it was interested in, as the catalog > partition desired would have the same modulus as the Oid of the object > being researched. Far, far, far from every lookup is by oid. Most prominently the names of database objects. Those will have to scan every catalog partition. Not fun. > Your point about increasing the runtime of pg_upgrade is taken. I will > need to think about that some more. It's not about increasing the runtime, it's about simply breaking it. pg_upgrade relies on binary compatibility of user relation's files and you're breaking that if you change the width of datatypes. > Your claim that what I describe is not multi-master is at least partially > correct, depending on how you think about the word "master". Certainly > every server is the master of its own chunk. Well, you're essentially just describing a sharded system - that's not usually coined multimaster. > Your claim that BDR doesn't have to be much slower than what I am > proposing is quite interesting, as if that is true I can ditch this idea and > use BDR instead. It is hard to empirically test, though, as I don't have > the alternate implementation on hand. 
Well, I can tell you that for the changeset extraction stuff (which is the basis for BDR) the biggest bottleneck so far seems to be the CRC computation when reading the WAL - and that's something plain WAL apply has to do as well. And it is optimizable. When actually testing decoding & apply, for workloads fitting into memory I had to try very hard to construe situations where apply was a big bottleneck. It is easier for seek bound workloads, where the standby is less powerful than the primary, since there's more random reads for those due to full page writes removing the need for reads in many cases.

> I think the expectation that performance will be harmed if postgres
> uses 8 byte Oids is not quite correct.
>
> Several years ago I ported postgresql sources to use 64bit everything.
> Oids, varlena headers, variables tracking offsets, etc. It was a fair
> amount of work, but all the doom and gloom predictions that I have
> heard over the years about how 8-byte varlena headers would kill
> performance, 8-byte Oids would kill performance, etc, turned out to
> be quite inaccurate.

Well, it can increase the size of the database, turning a system where the hot set fits into memory into one where it doesn't anymore. But really, the performance concerns were more about the catalog lookups.

Fundamentally, I think there's nothing I see preventing such a scheme from being implemented - but I think there's about zap chance of it ever getting integrated; it's just far too invasive, with very high costs in scenarios where it's not used, for not all that much gain. Not to speak of the amount of engineering it would require to implement.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] more psprintf() use
Peter Eisentraut wrote:
> psprintf() in place of hardcoded palloc(N) + sprintf() and the like.
>
> + values[j++] = psprintf("%d", stat.blkno);
> + values[j++] = psprintf("%c", stat.type);
> + values[j++] = psprintf("%d", stat.live_items);
> + values[j++] = psprintf("%d", stat.dead_items);
> + values[j++] = psprintf("%d", stat.avg_item_size);
> + values[j++] = psprintf("%d", stat.page_size);
> + values[j++] = psprintf("%d", stat.free_size);
> + values[j++] = psprintf("%d", stat.btpo_prev);
> + values[j++] = psprintf("%d", stat.btpo_next);
> + values[j++] = psprintf("%d", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
> + values[j++] = psprintf("%d", stat.btpo_flags);
>
> tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc), values);

In cases such as this one, I have often wondered whether it'd be better to write this as DatumGetSometype() plus heap_form_tuple, instead of printing to strings and then building a tuple from those.

-- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
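Álvaro's alternative would look roughly like this (an untested sketch against the backend APIs; the Datum conversion macros would have to match the SQL column types actually declared in the tuple descriptor, which is the price of skipping the in/out functions):

```c
/* Sketch: build the result tuple from Datums directly, skipping the
 * string round-trip.  "stat" is the struct from the quoted hunk; the
 * choice of Int32GetDatum vs. Int64GetDatum per field is illustrative. */
Datum   values[11];
bool    nulls[11] = {false};
int     j = 0;

values[j++] = Int64GetDatum(stat.blkno);
values[j++] = CharGetDatum(stat.type);
values[j++] = Int32GetDatum(stat.live_items);
values[j++] = Int32GetDatum(stat.dead_items);
values[j++] = Int32GetDatum(stat.avg_item_size);
values[j++] = Int32GetDatum(stat.page_size);
values[j++] = Int32GetDatum(stat.free_size);
values[j++] = Int64GetDatum(stat.btpo_prev);
values[j++] = Int64GetDatum(stat.btpo_next);
values[j++] = Int32GetDatum(stat.type == 'd' ? stat.btpo.xact : stat.btpo.level);
values[j++] = Int32GetDatum(stat.btpo_flags);

tuple = heap_form_tuple(tupleDesc, values, nulls);
```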
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
"Erik Rijkers" writes:
> The TRACE_POSTGRESQL_SORT_DONE warnings were not from your patch; sorry about
> that. They occur on HEAD too (with a debug compile).
>
> tuplesort.c:935:44: warning: comparison between pointer and integer [enabled by default]
>   TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
>                                             ^
> tuplesort.c:935:2: note: in expansion of macro 'TRACE_POSTGRESQL_SORT_DONE'
>   TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);

FWIW, I don't see any such warnings on either of the machines I have that will accept --enable-dtrace. state->tapeset is a pointer, so these warnings look a tad bogus.

regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] preserving forensic information when we freeze
Robert Haas writes: > On Thu, Jan 2, 2014 at 12:46 PM, Tom Lane wrote: >> In any case, upon further reflection I'm not convinced that doing this >> with a SELECT-based query is the right thing, no matter whether the query >> looks at a function or a system column; because by definition, you'll only >> be able to see tuples that are visible to your current snapshot. For real >> forensics work, you need to be able to see all tuples, which makes me >> think that something akin to pgstattuple is the right API; that is "return >> a set of the header info for all tuples on such-and-such pages of this >> relation". That should dodge any performance problem, because the >> heap_open overhead could be amortized across lots of tuples, and it also >> sidesteps all problems with adding new system columns. > I both agree and disagree with this. I think that pgstattuple is > useful, and I further agree that adding a stat to it about how much of > the heap is frozen would be worthwhile. However, an aggregate number > isn't always what you want, and being able to scrutinize specific > tuples is, I think, a useful thing. Oh, I guess I referenced the wrong contrib module, because what I was trying to propose is a function that returns a setof record, one row for each physical tuple it finds. There are some functions in contrib/pageinspect that work like that, but not pgstattuple. I don't think pageinspect's API is ideal because it's tightly tied to processing one page at a time, but it does show information a bit like what we need here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] preserving forensic information when we freeze
On Thu, Jan 2, 2014 at 12:46 PM, Tom Lane wrote: > Andres Freund writes: >> On 2014-01-02 09:40:54 -0500, Tom Lane wrote: >>> Actually, I thought the function approach was a good proposal. You are >>> right that func(tab.*) isn't going to work, because it's going to get a >>> Datum-ified tuple not a pointer to raw on-disk storage. But an inspection >>> function that's handed a ctid could work. > >> Well, we discussed that upthread, and the overhead of going through a >> function is quite noticeable because the tuple needs to be fetched from >> the heap again. > > Yeah, I read those results, but that seems like it could probably be > optimized. I'm guessing the function was doing a new heap_open every > time? There's probably a way around that. Yeah, it was. I don't see any plausible way around that, but I'm all ears. > In any case, upon further reflection I'm not convinced that doing this > with a SELECT-based query is the right thing, no matter whether the query > looks at a function or a system column; because by definition, you'll only > be able to see tuples that are visible to your current snapshot. For real > forensics work, you need to be able to see all tuples, which makes me > think that something akin to pgstattuple is the right API; that is "return > a set of the header info for all tuples on such-and-such pages of this > relation". That should dodge any performance problem, because the > heap_open overhead could be amortized across lots of tuples, and it also > sidesteps all problems with adding new system columns. I both agree and disagree with this. I think that pgstattuple is useful, and I further agree that adding a stat to it about how much of the heap is frozen would be worthwhile. However, an aggregate number isn't always what you want, and being able to scrutinize specific tuples is, I think, a useful thing. 
I'm not sure that it needs to be fast, which is why I think a function rather than a system column might be OK, but I think it ought to be possible. >> Upthread there's a POC patch of mine, that started to explore what's >> necessary to simply never store system columns (except maybe oid) in >> pg_attribute. While it passes the regression tests it's not complete, >> but the amount of work looks reasonable. > > I think this will inevitably break a lot of code, not all of it ours, > so I'm not in favor of pursuing that direction. I thought that that approach had been discussed previously and found desirable on the grounds that it would cut down on the size of pg_attribute, but it's not something I want to rush through when we have a release to get out the door. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: Async query processing
On Wed, Dec 18, 2013 at 1:50 PM, Florian Weimer wrote: > On 11/04/2013 02:51 AM, Claudio Freire wrote: >> >> On Sun, Nov 3, 2013 at 3:58 PM, Florian Weimer wrote: >>> >>> I would like to add truly asynchronous query processing to libpq, >>> enabling >>> command pipelining. The idea is to to allow applications to auto-tune to >>> the bandwidth-delay product and reduce the number of context switches >>> when >>> running against a local server. >> >> ... >>> >>> If the application is not interested in intermediate query results, it >>> would >>> use something like this: >> >> ... >>> >>> If there is no need to exit from the loop early (say, because errors are >>> expected to be extremely rare), the PQgetResultNoWait call can be left >>> out. >> >> >> It doesn't seem wise to me making such a distinction. It sounds like >> you're oversimplifying, and that's why you need "modes", to overcome >> the evidently restrictive limits of the simplified interface, and that >> it would only be a matter of (a short) time when some other limitation >> requires some other mode. > > > I need modes because I want to avoid unbound buffering, which means that > result data has to be consumed in the order queries are issued. ... > In any case, I don't want to change the wire protocol, I just want to enable > libpq clients to use more of its capabilities. I believe you will at least need to use TCP_CORK or some advanced socket options if you intend to decrease the number of packets without changing the protocol. Due to the interactive and synchronized nature of the protocol, TCP will immediately send the first query in a packet since it's already ready to do so. Buffering will only happen from the second query onwards, and this won't benefit a two-query loop as the one in your sample. As for expectations, they can be part of the connection object and not the wire protocol if you wish. 
The point I was making is that the expectation should be part of the query call, since that's less error prone than setting a "discard results" mode. Think of it as PQsendQueryParams with an extra "async" argument that defaults to PQASYNC_NOT (ie, sync). There you can tell libpq to expect either no results, expect and discard them, or whatever. The benefit here is a simplified usage: your example code will be part of libpq and thus all this complexity will be hidden from users. Furthermore, libpq will do the small sanity check of actually checking that the server returns no results when expecting no result.

>>> PGAsyncMode oldMode = PQsetsendAsyncMode(conn, PQASYNC_RESULT);
>>> bool more_data;
>>> do {
>>>    more_data = ...;
>>>    if (more_data) {
>>>       int ret = PQsendQueryParams(conn,
>>>          "INSERT ... RETURNING ...", ...);
>>>       if (ret == 0) {
>>>          // handle low-level error
>>>       }
>>>    }
>>>    // Consume all pending results.
>>>    while (1) {
>>>       PGresult *res;
>>>       if (more_data) {
>>>          res = PQgetResultNoWait(conn);
>>>       } else {
>>>          res = PQgetResult(conn);
>>>       }
>>
>> Somehow, that code looks backwards. I mean, really backwards. Wouldn't
>> that be !more_data?

> No, if more data is available to transfer to the server, the no-wait variant
> has to be used to avoid a needless synchronization with the server.

Ok, yeah. Now I get it. It's client-side more_data.

>> In any case, pipelining like that, without a clear distinction, in the
>> wire protocol, of which results pertain to which query, could be a
>> recipe for trouble when subtle bugs, either in lib usage or
>> implementation, mistakenly treat one query's result as another's.

> We already use pipelining in libpq (see pqFlush, PQsendQueryGuts and
> pqParseInput3), the server is supposed to support it, and there is a lack of
> a clear tit-for-tat response mechanism anyway because of NOTIFY/LISTEN and
> the way certain errors are reported.
pqFlush doesn't seem overly related, since the API specifically states that you cannot queue multiple PQsendQuery. It looks more like low-level buffering. I.e., when the command itself is larger than the OS buffer and nonblocking operation requires multiple send() calls for one PQsendQuery. Am I wrong? >>> Instead of buffering the results, we could buffer the encoded command >>> messages in PQASYNC_RESULT mode. This means that PQsendQueryParams would >>> not block when it cannot send the (complete) command message, but store >>> in >>> the connection object so that the subsequent PQgetResultNoWait and >>> PQgetResult would send it. This might work better with single-tuple >>> result >>> mode. We cannot avoid buffering either multiple queries or multiple >>> responses if we want to utilize the link bandwidth, or we'd risk >>> deadlocks. >> >> >> This is a non-solution. Such an implementation, at least as described, >> would remove neither network latency nor context switches; it >> would be a purely API change with no ex
Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization
My original email was mostly a question about whether WAL data could be merged from multiple servers, or whether I was overlooking some unsolvable difficulty. I'm still mostly curious about that question. I anticipated that my proposal would require partitioning the catalogs. For instance, autovacuum could only run on locally owned tables, and would need to store the analyze stats data in a catalog partition belonging to the local server, but that doesn't seem like a fundamental barrier to it working. The partitioned catalog tables would get replicated like everything else. The code that needs to open catalogs and look things up could open the specific catalog partition needed if it already knew the Oid of the table/index/whatever that it was interested in, as the catalog partition desired would have the same modulus as the Oid of the object being researched. Your point about increasing the runtime of pg_upgrade is taken. I will need to think about that some more. Your claim that what I describe is not multi-master is at least partially correct, depending on how you think about the word "master". Certainly every server is the master of its own chunk. I see that as a downside for some people, who want to be able to insert/update/delete any data on any server. But the ability to modify *anything anywhere* brings performance problems with it. Either the servers have to wait for each other before commits go through, in order to avoid incompatible data changes being committed on both ends, or the servers have to reject commits after they have already been reported to the client as successful. I expect my proposal to have better read scalability in a write-heavy environment, because the less work it takes to integrate data changes from other workers, the more resources remain per server to answer read queries. Your claim that BDR doesn't have to be much slower than what I am proposing is quite interesting, since if that is true I can ditch this idea and use BDR instead. 
It is hard to empirically test, though, as I don't have the alternate implementation on hand. I think the expectation that performance will be harmed if postgres uses 8 byte Oids is not quite correct. Several years ago I ported postgresql sources to use 64bit everything. Oids, varlena headers, variables tracking offsets, etc. It was a fair amount of work, but all the doom and gloom predictions that I have heard over the years about how 8-byte varlena headers would kill performance, 8-byte Oids would kill performance, etc, turned out to be quite inaccurate. The performance impact was ok for me. The disk space impact wasn't much either, as with 8-byte varlena headers, anything under 127 bytes had a 1-byte header, and anything under 16383 had a 2-byte header, with 8-bytes only used after that, which pretty much meant that disk usage for representing varlena data shrunk slightly rather than growing. Tom Lane had mentioned in a thread that he didn't want to make the #define for processing varlena headers any more complicated than it was, because it gets executed quite a lot. So I tried the 1,2,8 byte vs 1,8 byte varlena design both ways and found it made little difference to me which I chose. Of course, my analysis was based on my own usage patterns, my own schemas, my own data, and might not apply to everyone else. I tend to conflate the 8-byte Oid change with all these other changes from 4-byte to 8-byte, because that's what I did and what I have experience with. Having 8-byte everything with everything aligned allowed me to use SSE functions on some stuff that postgres was (at least at the time) doing less efficiently. Since then, I have noticed that the hash function for disk blocks is implemented with SSE in mind. With 8-byte aligned datums, SSE based hashing can be used without all the calls to realign the data. 
I was experimenting with forcing data to be 16-byte aligned to take advantage of newer SSE functions, but this was years ago and I didn't own any hardware with the newer SSE capabilities, so I never got to benchmark that. All this is to say that increasing to 8 bytes is not a pure performance loss. It is a trade-off, and one that I did not find particularly problematic. On the up side, I didn't need to worry about Oid exhaustion anymore, which allows removing the code that checks for it (though I left that code in place.) It allows using varlena objects instead of the large object interface, so I could yank that interface and make my code size smaller. (I never much used the LO interface to begin with, so I might not be the right person to ask about this.) It allows not worrying about accidentally bumping into the 1GB limit on varlenas, which means you don't have to code for that error condition in applications. mark On Thursday, January 2, 2014 1:19 AM, Andres Freund wrote: On 2013-12-31 13:51:08 -0800, Mark Dilger wrote: > The BDR documentation > http://wiki.postgresql.org/images/7/75/BDR_Present
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
Robert Haas writes: > On Tue, Dec 31, 2013 at 12:20 PM, Tom Lane wrote: >> I'm inclined to think that we should look for a less breakable way to >> manage the message size limit; if Robert missed this issue in that patch, >> other people are going to miss it in future patches. The existing coding >> isn't really right anyway when you consider possible alignment padding. >> (I think the present struct definitions probably don't involve any >> padding, but that's not a very future-proof assumption either.) It'd be >> better to somehow drive the calculation from offsetof(m_entry). I'm not >> seeing any way to do that that doesn't require some notational changes >> in the code, though. One way would be to rejigger it like this: >> ... > Rather than using a union, I'd be inclined to declare one type that's > just the header (PgStat_MsgTabstat_Hdr), and then work out > PGSTAT_NUM_TABENTRIES based on sizeof(PgStat_MsgTabstat_Hdr), and then > declare PgStat_MsgTabstat as a two-element struct, the header struct > followed by an array of entries. That is indeed a bit of notational > churn but it seems worth it to me. That approach would fail to account for any alignment padding occurring just before the m_entry array, though. That could happen if the array members contain some 8-byte fields while there are none in the header part. I think it would net out to the same amount of notational change in the code proper as my approach, since either way you end up with a nested struct containing the header fields. It occurs to me that, rather than trying to improve the struct definition methodology, maybe we should just add static asserts to catch any inconsistency here. It wouldn't be that hard: #define PGSTAT_MAX_MSG_SIZE 1000 #define PGSTAT_MSG_PAYLOAD (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr)) ... all else in pgstat.h as before ... StaticAssertStmt(sizeof(PgStat_MsgTabstat) <= PGSTAT_MAX_MSG_SIZE, "bad PgStat_MsgTabstat size"); ... 
and similarly for other pgstat message structs ... This would possibly lead to machine-dependent results if alignment comes into play, but it's reasonable to suppose that we'd find out about any mistakes as soon as they hit the buildfarm. So this might be an acceptable tradeoff for not having to change source code. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fix_PGSTAT_NUM_TABENTRIES_macro patch
On Tue, Dec 31, 2013 at 12:20 PM, Tom Lane wrote: > Mark Dilger writes: >> In src/include/pgstat.h, the PGSTAT_NUM_TABENTRIES macro >> attempts to subtract off the size of the PgStat_MsgTabstat >> struct up to the m_entry[] field. This macro was correct up >> until the fields m_block_read_time and m_block_write_time >> were added to that struct, as the macro was not changed to >> include their size. > > Yeah, that's a bug. Ick. >> This patch fixes the macro. > > I'm inclined to think that we should look for a less breakable way to > manage the message size limit; if Robert missed this issue in that patch, > other people are going to miss it in future patches. The existing coding > isn't really right anyway when you consider possible alignment padding. > (I think the present struct definitions probably don't involve any > padding, but that's not a very future-proof assumption either.) It'd be > better to somehow drive the calculation from offsetof(m_entry). I'm not > seeing any way to do that that doesn't require some notational changes > in the code, though. One way would be to rejigger it like this: > > #define PGSTAT_MAX_MSG_SIZE 1000 > > typedef union PgStat_MsgTabstat > { > struct { > PgStat_MsgHdr hdr; > Oid databaseid; > int nentries; > int xact_commit; > int xact_rollback; > PgStat_Counter block_read_time; /* times in microseconds */ > PgStat_Counter block_write_time; > PgStat_TableEntry entry[FLEXIBLE_ARRAY_MEMBER]; > } m; > char padding[PGSTAT_MAX_MSG_SIZE]; /* pad sizeof this union to target */ > } PgStat_MsgTabstat; > > #define PGSTAT_NUM_TABENTRIES ((PGSTAT_MAX_MSG_SIZE - offsetof(PgStat_MsgTabstat, m.entry)) / sizeof(PgStat_TableEntry)) > > but then we'd have to run around and replace "m_hdr" with "m.hdr" etc > in the code referencing the message types that use this mechanism. > Kind of annoying. 
Rather than using a union, I'd be inclined to declare one type that's just the header (PgStat_MsgTabstat_Hdr), and then work out PGSTAT_NUM_TABENTRIES based on sizeof(PgStat_MsgTabstat_Hdr), and then declare PgStat_MsgTabstat as a two-element struct, the header struct followed by an array of entries. That is indeed a bit of notational churn but it seems worth it to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Remove some duplicate if conditions
David Rowley writes: > I've attached a simple patch which removes some duplicate if conditions > that seemed to have found their way into the code. > These are per PVS-Studio's warnings. -1. If PVS-Studio is complaining about this type of coding, to hell with it; it should just optimize away the extra tests and be quiet about it, not opinionate about coding style. What you propose would cause logically independent pieces of code to become tied together, and I don't find that to be an improvement. It's the compiler's job to micro-optimize where possible, and most compilers that I've seen will do that rather than tell the programmer he ought to do it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] preserving forensic information when we freeze
Andres Freund writes: > On 2014-01-02 09:40:54 -0500, Tom Lane wrote: >> Actually, I thought the function approach was a good proposal. You are >> right that func(tab.*) isn't going to work, because it's going to get a >> Datum-ified tuple not a pointer to raw on-disk storage. But an inspection >> function that's handed a ctid could work. > Well, we discussed that upthread, and the overhead of going through a > function is quite noticeable because the tuple needs to be fetched from > the heap again. Yeah, I read those results, but that seems like it could probably be optimized. I'm guessing the function was doing a new heap_open every time? There's probably a way around that. In any case, upon further reflection I'm not convinced that doing this with a SELECT-based query is the right thing, no matter whether the query looks at a function or a system column; because by definition, you'll only be able to see tuples that are visible to your current snapshot. For real forensics work, you need to be able to see all tuples, which makes me think that something akin to pgstattuple is the right API; that is "return a set of the header info for all tuples on such-and-such pages of this relation". That should dodge any performance problem, because the heap_open overhead could be amortized across lots of tuples, and it also sidesteps all problems with adding new system columns. > Upthread there's a POC patch of mine, that started to explore what's > necessary to simply never store system columns (except maybe oid) in > pg_attribute. While it passes the regression tests it's not complete, > but the amount of work looks reasonable. I think this will inevitably break a lot of code, not all of it ours, so I'm not in favor of pursuing that direction. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Jan 2, 2014 at 11:08 AM, Heikki Linnakangas wrote: > On 01/02/2014 02:53 PM, Robert Haas wrote: >> On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan wrote: >>> >>> On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas >>> wrote: 1. PromiseTupleInsertionLockAcquire() 2. Insert heap tuple 3. Insert index tuples 4. Check if conflict happened. Kill the already-inserted tuple on conflict. 5. PromiseTupleInsertionLockRelease() IOW, the only change to the current patch is that you acquire the new kind of lock before starting the insertion, and you release it after you've killed the tuple, or you know you're not going to kill it. >>> >>> >>> Where does row locking fit in there? - you may need to retry when that >>> part is incorporated, of course. What if you have multiple promise >>> tuples from a contended attempt to insert a single slot, or multiple >>> broken promise tuples across multiple slots or even multiple commands >>> in the same xact? > > You can only have one speculative insertion in progress at a time. After > you've done all the index insertions and checked that you really didn't > conflict with anyone, you're not going to go back and kill the tuple > anymore. After that point, the insertion is not speculation anymore. Yeah... but how does someone examining the tuple know that? We need to avoid having them block on the promise-tuple insertion lock if we've reacquired it meanwhile for a new speculative insertion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Thu, January 2, 2014 13:36, Erik Rijkers wrote: > On Thu, January 2, 2014 13:05, David Rowley wrote: >> here's a slightly updated patch >> [inverse_transition_functions_v1.8.patch.gz ] > > patch applies, and compiles (although with new warnings). > But make check complains loudly: see attached. > > warnings: The TRACE_POSTGRESQL_SORT_DONE warnings were not from your patch; sorry about that. They occur on HEAD too (with a debug compile). tuplesort.c:935:44: warning: comparison between pointer and integer [enabled by default] TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed); ^ tuplesort.c:935:2: note: in expansion of macro 'TRACE_POSTGRESQL_SORT_DONE' TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed); The 'make check' failure remains a problem. The output I sent earlier today was for this configure: ./configure --prefix=/var/data1/pg_stuff/pg_installations/pgsql.inverse --with-pgport=6594 \ --bindir=/var/data1/pg_stuff/pg_installations/pgsql.inverse/bin \ --libdir=/var/data1/pg_stuff/pg_installations/pgsql.inverse/lib \ --quiet --enable-depend --enable-cassert --enable-debug --with-perl \ --with-openssl --with-libxml --enable-dtrace (and that's still repeatable) Perhaps this helps: with another configure: ./configure --prefix=/var/data1/pg_stuff/pg_installations/pgsql.inverse --with-pgport=6594 --bindir=/var/data1/pg_stuff/pg_installations/pgsql.inverse/bin.fast --libdir=/var/data1/pg_stuff/pg_installations/pgsql.inverse/lib.fast --quiet --enable-depend --with-perl --with-openssl --with-libxml I get only this single 'make check' error: *** /var/data1/pg_stuff/pg_sandbox/pgsql.inverse/src/test/regress/expected/window.out 2014-01-02 16:19:48.0 +0100 --- /var/data1/pg_stuff/pg_sandbox/pgsql.inverse/src/test/regress/results/window.out 2014-01-02 16:21:43.0 +0100 *** *** 1188,1195 sum -- 6.01 ! 5 ! 3 (3 rows) SELECT i,COUNT(v) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) --- 1188,1195 sum -- 6.01 ! 5.00 ! 
3.00 (3 rows) SELECT i,COUNT(v) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) == Centos 5.7 gcc 4.8.2 Thanks; and Happy New Year Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/02/2014 02:53 PM, Robert Haas wrote: On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan wrote: On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas wrote: 1. PromiseTupleInsertionLockAcquire() 2. Insert heap tuple 3. Insert index tuples 4. Check if conflict happened. Kill the already-inserted tuple on conflict. 5. PromiseTupleInsertionLockRelease() IOW, the only change to the current patch is that you acquire the new kind of lock before starting the insertion, and you release it after you've killed the tuple, or you know you're not going to kill it. Where does row locking fit in there? - you may need to retry when that part is incorporated, of course. What if you have multiple promise tuples from a contended attempt to insert a single slot, or multiple broken promise tuples across multiple slots or even multiple commands in the same xact? You can only have one speculative insertion in progress at a time. After you've done all the index insertions and checked that you really didn't conflict with anyone, you're not going to go back and kill the tuple anymore. After that point, the insertion is not speculation anymore. Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking the tuple, rather than the XID. Well, that would be ideal, because we already have tuple locks. It would be nice to use the same concept for this. It's a bit tricky, however. I guess the most straightforward way to do it would be to grab a heavy-weight lock after you've inserted the tuple, but before releasing the buffer lock. I don't immediately see a problem with that, although it's a bit scary to acquire a heavy-weight lock while holding a buffer lock. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] preserving forensic information when we freeze
On 2014-01-02 09:40:54 -0500, Tom Lane wrote: > Robert Haas writes: > > On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark wrote: > >> 1) it's a bit weird to go to this effort to eliminate system columns by > >> using a scheme that depends on having a system column -- ctid > > > At any rate, my goal isn't really to get rid of system columns, but to > > address Jim Nasby's gripe that the changes thus far committed make it > > difficult to use system columns to determine whether a given tuple has > > been frozen. It sounds like the consensus is that a system column is > > a better way of doing that than a function, so I suppose the next step > > is to try to hammer out the details. > > Actually, I thought the function approach was a good proposal. You are > right that func(tab.*) isn't going to work, because it's going to get a > Datum-ified tuple not a pointer to raw on-disk storage. But an inspection > function that's handed a ctid could work. Well, we discussed that upthread, and the overhead of going through a function is quite noticeable because the tuple needs to be fetched from the heap again. Check http://archives.postgresql.org/message-id/CA%2BTgmoYdWOZa_UiewbqE73AQCM2etpMjukkcmYPkGatH8kPTow%40mail.gmail.com To get rid of part of the performance problem, we could possibly invent a way to pass a HeapTuple instead of a blessed HeapTupleHeader to functions, but that might not be so easy. > And I really really *don't* want to see us bloating pg_attribute, nor > infringing further on application column namespace, by adding even more > exposed columns. So -1 for just blindly doing that. Upthread there's a POC patch of mine, that started to explore what's necessary to simply never store system columns (except maybe oid) in pg_attribute. While it passes the regression tests it's not complete, but the amount of work looks reasonable. 
Check http://archives.postgresql.org/message-id/20131223124504.GB19795%40alap2.anarazel.de Now, that obviously doesn't get rid of the namespace problem. I'd suggested a "system:" prefix, Robert a "_pg_" one. Neither is pretty, but imo should work. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] preserving forensic information when we freeze
Robert Haas writes: > On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark wrote: >> 1) it's a bit weird to go to this effort to eliminate system columns by >> using a scheme that depends on having a system column -- ctid > At any rate, my goal isn't really to get rid of system columns, but to > address Jim Nasby's gripe that the changes thus far committed make it > difficult to use system columns to determine whether a given tuple has > been frozen. It sounds like the consensus is that a system column is > a better way of doing that than a function, so I suppose the next step > is to try to hammer out the details. Actually, I thought the function approach was a good proposal. You are right that func(tab.*) isn't going to work, because it's going to get a Datum-ified tuple not a pointer to raw on-disk storage. But an inspection function that's handed a ctid could work. I don't think that Greg's objection above has much force. The fact of the matter is that we are never going to be able to get rid of the exposed tableoid, oid, or ctid columns, because there are too many situations where those are useful, and are being used, in actual client-application queries. This is less true however of xmin/xmax/cmin/cmax; we might hope to deprecate those at the SQL level someday, especially in view of the fact that we keep whacking around their detailed meanings, with few user complaints. And I really really *don't* want to see us bloating pg_attribute, nor infringing further on application column namespace, by adding even more exposed columns. So -1 for just blindly doing that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fixing pg_basebackup with tablespaces found in $PGDATA
--On 1. Januar 2014 23:53:46 +0100 Dimitri Fontaine wrote: Hi, As much as I've seen people frown upon $subject, it still happens in the wild, and Magnus seems to agree that the current failure mode of our pg_basebackup tool when confronted with the situation is a bug. So here's a fix, attached. I've seen having tablespaces under PGDATA as a policy within several data centres in the past. The main reasoning behind this seems to be that they strictly separate platform and database administration, and for database inventory reasons. They are regularly surprised if you tell them not to use tablespaces in such a way, since they absorbed this practice over the years from other database systems. So +1 for fixing this. -- Thanks Bernd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] more psprintf() use
Andres Freund writes: > On 2014-01-02 09:49:48 +0200, Heikki Linnakangas wrote: >> Is it legal to return a constant with PG_RETURN_CSTRING? Grepping around, I >> don't see that being done anywhere else, but there are places that do >> PG_RETURN_CSTRING(pstrdup())... > I don't see why it wouldn't be legal - constant strings have static > storage duration, i.e. the program lifetime. And I can see nothing that > would allow pfree()ing the return value of cstring returning functions > in the general case. Heikki is right and you are wrong. There is an ancient supposition that datatype output functions, in particular, always return palloc'd strings. I recently got rid of the pfree's in the main output path, cf commit b006f4ddb988568081f8290fac77f9402b137120, which might explain why this patch passes regression tests; but there are still places in the code (and even more likely in third-party code) that will try to pfree the results. So -1 for this particular change. The pstrdup that Heikki suggests would be safer practice. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fixing pg_basebackup with tablespaces found in $PGDATA
Magnus Hagander writes: > We can't get away with just comparing the relative part of the pathname. > Because it will fail if there is another path with exactly the same length, > containing the tablespace. Actually… yeah. > I think we might want to store a value in the tablespaceinfo struct > indicating whether it's actually inside PGDATA (since we have the full path > at that point), and then skip it based on that instead. Or store and pass > the value of getcwd() perhaps. I think it's best to stuff in the tablespaceinfo struct either NIL or the relative path of the tablespace when found in $PGDATA, as done in the attached. > I've attached a slightly updated patch - I changed around a bit of logic > order and updated some comments during my review. And added error-checking. Thanks! I started again from your version for v3. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support *** a/src/backend/replication/basebackup.c --- b/src/backend/replication/basebackup.c *** *** 45,51 typedef struct } basebackup_options; ! static int64 sendDir(char *path, int basepathlen, bool sizeonly); static int64 sendTablespace(char *path, bool sizeonly); static bool sendFile(char *readfilename, char *tarfilename, struct stat * statbuf, bool missing_ok); --- 45,51 } basebackup_options; ! static int64 sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces); static int64 sendTablespace(char *path, bool sizeonly); static bool sendFile(char *readfilename, char *tarfilename, struct stat * statbuf, bool missing_ok); *** *** 72,77 typedef struct --- 72,78 { char *oid; char *path; + char *rpath; /* relative path within PGDATA, or nil. 
*/ int64 size; } tablespaceinfo; *** *** 100,105 perform_base_backup(basebackup_options *opt, DIR *tblspcdir) --- 101,119 XLogRecPtr endptr; TimeLineID endtli; char *labelfile; + char cwd[MAXPGPATH]; + int rootpathlen; + + /* + * We need to compute rootpathlen to allow for skipping tablespaces + * installed within PGDATA. + */ + if (!getcwd(cwd, MAXPGPATH)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not determine current directory: %m"))); + + rootpathlen = strlen(cwd); backup_started_in_recovery = RecoveryInProgress(); *** *** 119,124 perform_base_backup(basebackup_options *opt, DIR *tblspcdir) --- 133,139 { char fullpath[MAXPGPATH]; char linkpath[MAXPGPATH]; + char *relpath = NULL; int rllen; /* Skip special stuff */ *** *** 145,153 perform_base_backup(basebackup_options *opt, DIR *tblspcdir) --- 160,178 } linkpath[rllen] = '\0'; + /* + * Relpath is the relative path of the tablespace linkpath when + * the realname is found within PGDATA, or NULL. + */ + if (rllen > rootpathlen + && strncmp(linkpath, cwd, rootpathlen) == 0 + && linkpath[rootpathlen] == '/') + relpath = linkpath + rootpathlen + 1; + ti = palloc(sizeof(tablespaceinfo)); ti->oid = pstrdup(de->d_name); ti->path = pstrdup(linkpath); + ti->rpath = relpath ? pstrdup(relpath) : NULL; ti->size = opt->progress ? sendTablespace(fullpath, true) : -1; tablespaces = lappend(tablespaces, ti); #else *** *** 165,171 perform_base_backup(basebackup_options *opt, DIR *tblspcdir) /* Add a node for the base directory at the end */ ti = palloc0(sizeof(tablespaceinfo)); ! ti->size = opt->progress ? sendDir(".", 1, true) : -1; tablespaces = lappend(tablespaces, ti); /* Send tablespace header */ --- 190,196 /* Add a node for the base directory at the end */ ti = palloc0(sizeof(tablespaceinfo)); ! ti->size = opt->progress ? 
sendDir(".", 1, true, tablespaces) : -1; tablespaces = lappend(tablespaces, ti); /* Send tablespace header */ *** *** 191,197 perform_base_backup(basebackup_options *opt, DIR *tblspcdir) sendFileWithContent(BACKUP_LABEL_FILE, labelfile); /* ... then the bulk of the files ... */ ! sendDir(".", 1, false); /* ... and pg_control after everything else. */ if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0) --- 216,222 sendFileWithContent(BACKUP_LABEL_FILE, labelfile); /* ... then the bulk of the files ... */ ! sendDir(".", 1, false, tablespaces); /* ... and pg_control after everything else. */ if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0) *** *** 778,785 sendTablespace(char *path, bool sizeonly) _tarWriteHeader(TABLESPACE_VERSION_DIRECTORY, NULL, &statbuf); size = 512; /* Size of the header just added */ ! /* Send all the files in the tablespace version directory */ ! size += sendDir(pathbuf, strlen(path), sizeonly); return size; } ---
Re: [HACKERS] preserving forensic information when we freeze
On Thu, Jan 2, 2014 at 7:40 AM, Andres Freund wrote: > On 2014-01-02 07:35:11 -0500, Robert Haas wrote: >> On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark wrote: >> >> I fail to see why. What is so ugly about this: >> > >> >> select x.* from pgbench_accounts a, pg_tuple_header(a.tableoid, a.ctid) x; >> > >> > Two points: >> > >> > 1) it's a bit weird to go to this effort to eliminate system columns by >> > using a scheme that depends on having a system column -- ctid >> > >> > 2) refetching a row could conceivably end up retrieving different data than >> > was present when the row was originally read. (In some cases that might >> > actually be the intended behaviour) >> > >> > If this came up earlier I'm sorry but I suppose it's too hard to have a >> > function like foo(tab.*) which magically can tell that the record is a heap >> > tuple and look in the header? And presumably throw an error if passed a non >> > heap tuple. >> >> Yeah, that was tried. Doesn't work: >> >> http://www.postgresql.org/message-id/ca+tgmozwzzw_wr8adhwc8iejweihxlpxranqyrfb8mw0sch...@mail.gmail.com >> >> At any rate, my goal isn't really to get rid of system columns, but to >> address Jim Nasby's gripe that the changes thus far committed make it >> difficult to use system columns to determine whether a given tuple has >> been frozen. It sounds like the consensus is that a system column is >> a better way of doing that than a function, so I suppose the next step >> is to try to hammer out the details. > > So, I think something like my POC patch would need to get in before we > can go there, right? I won't be able to work on it in the next 10 days or > so, there's a bunch of stuff that's more pressing unfortunately. There's > a fair bit of work left there... Well, that's the question, I guess. I think the options are: 1. Add pg_infomask now. Leave your patch for a future optimization. Suck up the catalog bloat. 2. Wait for your patch to be done. Then add pg_infomask afterwards. 
Accept that it may not make 9.4, or else accept that we may have to accept that patch after the nominal deadline. 3. Decide that exposing the frozen status of tuples is not a priority and just don't worry about it. One concern I have is that if we spend too much time now worrying about these system column problems, it may reduce the amount of optimization that gets done on top of this optimization in the 9.4 time frame. And that, I think, would be sad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
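For readers following along: the frozen status discussed above is, at the storage level, a matter of t_infomask bits (pageinspect's heap_page_items already exposes the raw value). Here is a small Python sketch of the decoding a hypothetical pg_infomask column would allow. The bit values are quoted from src/include/access/htup_details.h as an assumption to double-check against your tree; the "both xmin bits set means frozen" encoding is the scheme adopted by the 9.4 freeze work:

```python
# Assumed bit values from htup_details.h -- verify before relying on them.
HEAP_XMIN_COMMITTED = 0x0100   # t_xmin committed
HEAP_XMIN_INVALID   = 0x0200   # t_xmin invalid/aborted
HEAP_XMIN_FROZEN    = HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID  # 9.4+: both set

def xmin_status(infomask):
    """Classify a tuple's xmin from its infomask bits."""
    if infomask & HEAP_XMIN_FROZEN == HEAP_XMIN_FROZEN:
        return "frozen"
    if infomask & HEAP_XMIN_COMMITTED:
        return "committed"
    if infomask & HEAP_XMIN_INVALID:
        return "aborted"
    return "in progress"

assert xmin_status(0x0300) == "frozen"       # what the freeze patch now writes
assert xmin_status(0x0100) == "committed"    # ordinary committed tuple
assert xmin_status(0x0200) == "aborted"
assert xmin_status(0x0000) == "in progress"
```

This is exactly the kind of interpretation that became impossible from plain SQL once freezing stopped overwriting xmin, which is Jim Nasby's gripe.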
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
2013/12/15 David Rowley :
> I've been working on speeding up aggregate functions when used in the
> context of a window with non-fixed frame heads.
> 1. Fully implement negative transition functions for SUM and AVG.

I would like to mention that this functionality is also extremely useful for the incremental maintenance of materialized views that use aggregation (which IMHO is one of the most useful kinds). (Simply imagine a view of the form “SELECT a, agg_function(b) FROM table GROUP BY a”, a lot of rows in the table, a lot of rows in each group, and changes that both remove and add new rows.)

For this to work, two things are needed:

(1) A way to apply a value normally (already supported) and inversely (i.e., this patch) to the current “internal state” of an aggregation.

(2) A way to store the “internal state” of an aggregation in the materialized view’s “extent” (i.e., the physical rows that represent the view’s contents, which may or may not be slightly different from what you get when you do SELECT * FROM matview). As (AFAIK) that state is stored as a normal value, the maintenance code could just take the value, store it in the extent, retrieve it again next time, and perform normal or inverse transitions. When selecting from the matview, the state could be retrieved and the final function applied to it to yield the value to be returned.

To understand (2), assume that one wants to store an AVG() in a materialized view. To be able to update the value incrementally, one needs to actually store the SUM() and COUNT(), and perform the division when selecting from the materialized view. Or it could (initially) be decided that AVG() “does not support fast incremental maintenance,” requiring the user (if he/she wants fast incremental maintenance) to put SUM() and COUNT() in the materialized view manually, and to perform the division manually when retrieving the average.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
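The SUM()/COUNT() decomposition Nicolas describes is easy to sketch outside the server. The following Python toy (invented names, not PostgreSQL code) keeps AVG's "internal state" in a form that supports both the normal and the inverse transition, so absorbing a delete costs O(1) work instead of a rescan of the group:

```python
# Toy model of an aggregate with forward and inverse transitions.
# AVG's state is kept as (total, count); the final function divides,
# mirroring how SUM() and COUNT() would be stored in a matview's extent.
class AvgState:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def transfn(self, v):
        """Forward transition: absorb a newly inserted value."""
        self.total += v
        self.count += 1

    def inv_transfn(self, v):
        """Inverse transition: retract a deleted value."""
        self.total -= v
        self.count -= 1

    def finalfn(self):
        """Final function, applied only when the view is read."""
        return self.total / self.count


state = AvgState()
for v in (10, 20, 30, 40):          # initial contents of one group
    state.transfn(v)
assert state.finalfn() == 25.0

# Incremental maintenance: one row deleted, one inserted.
state.inv_transfn(10)
state.transfn(60)
assert state.finalfn() == 37.5
```

(For real data the retraction has caveats, e.g. floating-point drift and the trailing-zeros issue David mentions for numeric, which is why an inverse transition is not always applicable.)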
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan wrote: > On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas > wrote: >> 1. PromiseTupleInsertionLockAcquire() >> 2. Insert heap tuple >> 3. Insert index tuples >> 4. Check if conflict happened. Kill the already-inserted tuple on conflict. >> 5. PromiseTupleInsertionLockRelease() >> >> IOW, the only change to the current patch is that you acquire the new kind >> of lock before starting the insertion, and you release it after you've >> killed the tuple, or you know you're not going to kill it. > > Where does row locking fit in there? - you may need to retry when that > part is incorporated, of course. What if you have multiple promise > tuples from a contended attempt to insert a single slot, or multiple > broken promise tuples across multiple slots or even multiple commands > in the same xact? Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking the tuple, rather than the XID. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Show lossy heap block info in EXPLAIN ANALYZE for bitmap heap scan
On Thu, Jan 2, 2014 at 4:27 AM, Andres Freund wrote: > On 2014-01-01 21:15:46 -0500, Robert Haas wrote: >> [ sensible reasoning ] However, I'm not sure it's really worth it. >> I think what people really care about is knowing whether the bitmap >> lossified or not, and generally how much got lossified. The counts of >> exact and lossy pages are sufficient for that, without anything >> additional > > Showing the amount of memory currently required could tell you how soon > accurate bitmap scans will turn into lossy scans though. Which is not a > bad thing to know, some kinds of scans (e.g. tsearch over expression > indexes, postgis) can get ridiculously slow once lossy. Hmm, interesting. I have not encountered that myself. If we want that, I'm tempted to think that we should display statistics for each bitmap index scan - but I'd be somewhat inclined to see if we could get by with the values that are already stored in a TIDBitmap rather than adding new ones - e.g. show npages (the number of exact entries), nchunks (the number of lossy entries), and maxentries. From that, you can work out the percentage of available entries that were actually used. The only thing that's a bit annoying about that is that we'd probably have to copy those values out of the tid bitmap and into an executor state node, because the tid bitmap will subsequently get modified destructively. But I think that's probably OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
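Robert's npages/nchunks/maxentries suggestion can be pictured with a toy model. Nothing below is PostgreSQL's actual tidbitmap.c (the chunk size and the eviction policy here are invented), but it shows the mechanism: once the entry budget is exhausted, exact per-page entries are demoted into lossy multi-page chunks, after which the heap scan must recheck every tuple on those pages:

```python
# Toy lossification model (the real TIDBitmap differs in many details).
PAGES_PER_CHUNK = 4   # invented; real code derives chunk size from page size

class ToyTidBitmap:
    def __init__(self, maxentries):
        self.maxentries = maxentries
        self.exact = {}      # page -> set of offsets  ("exact" entries)
        self.chunks = set()  # chunk numbers           ("lossy" entries)

    def add(self, page, offset):
        if page // PAGES_PER_CHUNK in self.chunks:
            return  # page already lossy: exact offsets are forgotten
        self.exact.setdefault(page, set()).add(offset)
        if len(self.exact) + len(self.chunks) > self.maxentries:
            self._lossify_one_chunk()

    def _lossify_one_chunk(self):
        # Demote all exact pages of one chunk into a single lossy entry.
        chunk = min(self.exact) // PAGES_PER_CHUNK
        for page in [p for p in self.exact if p // PAGES_PER_CHUNK == chunk]:
            del self.exact[page]
        self.chunks.add(chunk)


tbm = ToyTidBitmap(maxentries=4)
for page in range(6):
    tbm.add(page, 1)
assert sorted(tbm.exact) == [4, 5]   # these pages are still exact
assert tbm.chunks == {0}             # pages 0-3 collapsed into one lossy chunk
```

Exposing npages (exact), nchunks (lossy), and maxentries, as suggested above, lets the user compute how close a scan came to lossifying.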
Re: [HACKERS] preserving forensic information when we freeze
On 2014-01-02 07:35:11 -0500, Robert Haas wrote:
> On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark wrote:
> >> I fail to see why. What is so ugly about this:
> >
> >> select x.* from pgbench_accounts a, pg_tuple_header(a.tableoid, a.ctid) x;
> >
> > Two points:
> >
> > 1) it's a bit weird to go to this effort to eliminate system columns by
> > using a scheme that depends on having a system column -- ctid
> >
> > 2) refetching a row could conceivably end up retrieving different data than
> > was present when the row was originally read. (In some cases that might
> > actually be the intended behaviour)
> >
> > If this came up earlier I'm sorry but I suppose it's too hard to have a
> > function like foo(tab.*) which magically can tell that the record is a heap
> > tuple and look in the header? And presumably throw an error if passed a non
> > heap tuple.
>
> Yeah, that was tried. Doesn't work:
>
> http://www.postgresql.org/message-id/ca+tgmozwzzw_wr8adhwc8iejweihxlpxranqyrfb8mw0sch...@mail.gmail.com
>
> At any rate, my goal isn't really to get rid of system columns, but to
> address Jim Nasby's gripe that the changes thus far committed make it
> difficult to use system columns to determine whether a given tuple has
> been frozen. It sounds like the consensus is that a system column is
> a better way of doing that than a function, so I suppose the next step
> is to try to hammer out the details.

So, I think something like my POC patch would need to get in before we can go there, right? I won't be able to work on it in the next 10 days or so, there's a bunch of stuff that's more pressing unfortunately. There's a fair bit of work left there...

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] preserving forensic information when we freeze
On Thu, Jan 2, 2014 at 12:26 AM, Greg Stark wrote: >> I fail to see why. What is so ugly about this: > >> select x.* from pgbench_accounts a, pg_tuple_header(a.tableoid, a.ctid) x; > > Two points: > > 1) it's a bit weird to go to this effort to eliminate system columns by > using a scheme that depends on having a system column -- ctid > > 2) refetching a row could conceivably end up retrieving different data than > was present when the row was originally read. (In some cases that might > actually be the intended behaviour) > > If this came up earlier I'm sorry but I suppose it's too hard to have a > function like foo(tab.*) which magically can tell that the record is a heap > tuple and look in the header? And presumably throw an error if passed a non > heap tuple. Yeah, that was tried. Doesn't work: http://www.postgresql.org/message-id/ca+tgmozwzzw_wr8adhwc8iejweihxlpxranqyrfb8mw0sch...@mail.gmail.com At any rate, my goal isn't really to get rid of system columns, but to address Jim Nasby's gripe that the changes thus far committed make it difficult to use system columns to determine whether a given tuple has been frozen. It sounds like the consensus is that a system column is a better way of doing that than a function, so I suppose the next step is to try to hammer out the details. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] ERROR: missing chunk number 0 for toast value
Hi All,

Test case:

drop table if exists t;
create table t(c text);
insert into t values ('x'), (repeat(md5('abcdefghijklmnop'), 1));
select pg_column_size(c), pg_column_size(c || '') FROM t;

CREATE OR REPLACE FUNCTION copy_toast_out() RETURNS VOID AS $$
declare
  v text;
BEGIN
  SELECT c INTO v FROM t WHERE c <> 'x';
  Select 1/0;
Exception When Others Then
  PERFORM pg_sleep(30); -- go run "TRUNCATE t" in a 2nd session
  raise notice 'length :%', length(v || ''); -- force detoast
END;
$$ language plpgsql;

postgres=# select copy_toast_out();
ERROR: missing chunk number 0 for toast value 16390 in pg_toast_16384
CONTEXT: PL/pgSQL function copy_toast_out() line 10 at RAISE

Analysis:

The basic problem here is that if the lock on the table is released before the toasted value is extracted, and in the meantime someone truncates the table, this error can occur. The error shows up here because the PL block contains an Exception block: when there is an exception block, plpgsql calls RollbackAndReleaseCurrentSubTransaction.

Do you think we should detoast the local variable before RollbackAndReleaseCurrentSubTransaction? Or any other options?

Regards,
Rushabh Lathia
www.EnterpriseDB.com
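The failure mode can be caricatured in a few lines of Python. This is an analogy, not the actual toast machinery: the local variable holds only a reference to out-of-line data, so once the concurrent TRUNCATE removes the toast chunks, forcing the detoast fails; eagerly detoasting, i.e. taking a flat copy while the data is still reachable, is what the proposed fix amounts to:

```python
# Hypothetical stand-in for a toast table: oid -> out-of-line payload.
toast_store = {16390: "x" * 2000}

class ToastRef:
    """A datum that is just a pointer into the toast store."""
    def __init__(self, oid):
        self.oid = oid

    def detoast(self):
        return toast_store[self.oid]   # fails if the chunks are gone

v = ToastRef(16390)        # SELECT c INTO v ... stores only the reference
toast_store.clear()        # concurrent TRUNCATE t removes the chunks

failed = False
try:
    v.detoast()            # length(v || '') forces the detoast -> error
except KeyError:
    failed = True
assert failed              # "missing chunk number 0 for toast value 16390"

# Detoasting before the lock is released keeps a self-contained copy:
toast_store[16390] = "x" * 2000
flat = toast_store[16390]  # eager flat copy of the value
toast_store.clear()        # a later TRUNCATE no longer matters
assert len(flat) == 2000
```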
Re: [HACKERS] Fixing pg_basebackup with tablespaces found in $PGDATA
On Wed, Jan 1, 2014 at 11:53 PM, Dimitri Fontaine wrote:
> Hi,
>
> As much as I've seen people frown upon $subject, it still happens in the
> wild, and Magnus seems to agree that the current failure mode of our
> pg_basebackup tool when confronted to the situation is a bug.
>
> So here's a fix, attached.
>
> To reproduce, mkdir -p $PGDATA/tbs/foo then CREATE TABLESPACE there, and
> then pg_basebackup your server. If doing so from the same server, as I
> did, then pick the tar format, as here:
>
> pg_basebackup -Ft -z -c fast -v -X fetch -D /tmp/backup
>
> Then use tar to see that the base backup contains the whole content of
> your foo tablespace, and if you did create another tablespace within
> $PGDATA/pg_tblspc (which is the other common way to trigger that issue)
> then add it to what you want to see:
>
> tar tzvf /tmp/backup/base.tar.gz pg_tblspc tbs/foo pg_tblspc/bar
>
> Note that empty directories are expected, so tar should output their
> entries. Those directories are where you need to be restoring the
> tablespace tarballs.
>
> When using pg_basebackup in plain mode, the error is that you get a copy
> of all your tablespaces first, then the main PGDATA is copied over, and
> as the destination directories already exist (and are not empty) the
> whole backup fails there.
>
> The bug should be fixed against all revisions of pg_basebackup, though I
> didn't try to apply this very patch on all target branches.

I had a second round of thought about this, and I don't think this one is going to work. Unfortunately, it's part of the advice I gave you yesterday that was bad, I think :)

We can't get away with just comparing the relative part of the pathname, because it will fail if there is another path with exactly the same length containing the tablespace. I think we might want to store a value in the tablespaceinfo struct indicating whether it's actually inside PGDATA (since we have the full path at that point), and then skip it based on that instead. Or store and pass the value of getcwd() perhaps. Or am I thinking wrong now instead? :)

I've attached a slightly updated patch - I changed around a bit of logic order and updated some comments during my review. And added error-checking.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***************
*** 45,51 ****
  typedef struct
  } basebackup_options;
! static int64 sendDir(char *path, int basepathlen, bool sizeonly);
  static int64 sendTablespace(char *path, bool sizeonly);
  static bool sendFile(char *readfilename, char *tarfilename,
  		struct stat * statbuf, bool missing_ok);
--- 45,52 ----
  } basebackup_options;
! static int64 sendDir(char *path, int basepathlen, int rootpathlen,
! 			bool sizeonly, List *tablespaces);
  static int64 sendTablespace(char *path, bool sizeonly);
  static bool sendFile(char *readfilename, char *tarfilename,
  		struct stat * statbuf, bool missing_ok);
***************
*** 100,105 **** perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
--- 101,119 ----
  	XLogRecPtr	endptr;
  	TimeLineID	endtli;
  	char	   *labelfile;
+ 	char		cwd[MAXPGPATH];
+ 	int			rootpathlen;
+
+ 	/*
+ 	 * We need to compute rootpathlen to allow for skipping tablespaces
+ 	 * installed within PGDATA.
+ 	 */
+ 	if (!getcwd(cwd, MAXPGPATH))
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not determine current directory: %m")));
+
+ 	rootpathlen = strlen(cwd) + 1;

  	backup_started_in_recovery = RecoveryInProgress();
***************
*** 165,171 **** perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  	/* Add a node for the base directory at the end */
  	ti = palloc0(sizeof(tablespaceinfo));
! 	ti->size = opt->progress ? sendDir(".", 1, true) : -1;
  	tablespaces = lappend(tablespaces, ti);

  	/* Send tablespace header */
--- 179,186 ----
  	/* Add a node for the base directory at the end */
  	ti = palloc0(sizeof(tablespaceinfo));
! 	ti->size = opt->progress ?
! 		sendDir(".", 1, rootpathlen, true, tablespaces) : -1;
  	tablespaces = lappend(tablespaces, ti);

  	/* Send tablespace header */
***************
*** 191,197 **** perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  	sendFileWithContent(BACKUP_LABEL_FILE, labelfile);

  	/* ... then the bulk of the files ... */
! 	sendDir(".", 1, false);

  	/* ... and pg_control after everything else. */
  	if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
--- 206,212 ----
  	sendFileWithContent(BACKUP_LABEL_FILE, labelfile);

  	/* ... then the bulk of the files ... */
! 	sendDir(".", 1, rootpathlen, false, tablespaces);

  	/* ... and pg_control after everything else. */
  	if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
***************
*** 778,785 **** sendTablespace(char *path, bool sizeonly)
  _tarWriteHeader(TABLESPACE_VERSI
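Magnus's objection, that comparing only the relative part of the pathname misclassifies any other path whose prefix happens to have the same length as $PGDATA, comes down to wanting a proper full-path containment test (or recording the answer in tablespaceinfo while the full path is still at hand). A hedged sketch with an invented helper name:

```python
import os

def is_inside_pgdata(pgdata, tblspc_path):
    """True iff tblspc_path lies inside pgdata (proper prefix check)."""
    pgdata = os.path.abspath(pgdata)
    tblspc = os.path.abspath(tblspc_path)
    # Require a path separator right after pgdata, so that
    # "/srv/pgdata2" is not mistaken for being inside "/srv/pgdata".
    return tblspc == pgdata or tblspc.startswith(pgdata + os.sep)

assert is_inside_pgdata("/srv/pgdata", "/srv/pgdata/tbs/foo")
assert not is_inside_pgdata("/srv/pgdata", "/srv/pgdata2/tbs/foo")
# Stripping a fixed number of leading bytes would wrongly "skip" this
# tablespace, since its prefix has the same length as $PGDATA:
assert not is_inside_pgdata("/srv/pgdata", "/srv/pgdat2/tbs/foo")
```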
[HACKERS] [PATCH] Remove some duplicate if conditions
I've attached a simple patch which removes some duplicate if conditions that seemed to have found their way into the code. These are per PVS-Studio's warnings. Regards David Rowley duplicate_if_test.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, Dec 27, 2013 at 1:36 AM, David Rowley wrote: > From what I can tell adding an inverse transition function to support AVG > for numeric does not affect the number of trailing zeros in the results, so > I've attached a patch which now has inverse transition functions to support > numeric types in AVG and all of the STDDEV* aggregates. > here's a slightly updated patch in which I've fixed up an comment that had become out-dated because of the patch. I also changed the wording in quite a few of my comments and made some changes to the docs. The only other change was an extra error check just in case window_gettupleslot was to fail to get a tuple that should always be in the tuple store. I'm now classing the patch is not WIP anymore. I think it's ready for the commitfest. Regards David Rowley inverse_transition_functions_v1.8.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire
Hi,

On 02/01/14 10:02, Andres Freund wrote:
> > Christian's idea of a context line seems plausible to me. I don't
> > care for this implementation too much --- a global variable? Ick.
>
> Yea, the data should be stored in ErrorContextCallback.arg instead.

Fixed. I also palloc() the ErrorContextCallback itself. That doesn't feel quite right, though, since I could not find any other piece of code doing so. What do you think?

> > Make a wrapper function for XactLockTableWait instead, please.
>
> I don't think that'd work out all too nicely - we do the
> XactLockTableWait() inside other functions like MultiXactIdWait(),
> passing all the detail along for those would end up being ugly. So using
> error context callbacks properly seems like the best way in the end.

Wrapping XactLockTableWait()/MultiXactIdWait() at least allows the ErrorContextCallback and ErrorContextCallback.arg to live on the stack. So we could either wrap this in a macro (like e.g. PG_TRY) or write two different functions - and I don't like the two-functions approach, since it means duplicating code. While writing this response I came to think that a PG_TRY-style macro for setting up and cleaning up the error context callback would be the best way to go.

> > And I'm not real sure that showing the whole tuple contents is a good
> > thing; I can see people calling that a security leak, not to mention
> > that the performance consequences could be dire.
>
> I don't think that the security argument has too much merit given
> today's PG - we already log the full tuple for various constraint
> violations. And we also accept the performance penalty there. I don't
> see any easy way to select a sensible subset of columns to print here,
> and printing the columns is what would make investigating issues around
> this so much easier. At the very least we'd need to print the pkey
> columns and the columns of the unique key that might have been violated.

I agree. Even printing only the PK values would be much more complex. As far as I can see we would even have to take another lock for this (see the comment for relationHasPrimaryKey() in src/backend/catalog/index.c).

Regards,

-- 
Christian Kruse               http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f8545c1..cfa49c2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2702,9 +2702,11 @@ l1:
 	if (infomask & HEAP_XMAX_IS_MULTI)
 	{
+		XactLockTableWaitSetupErrorContextCallback(relation, &tp);
 		/* wait for multixact */
 		MultiXactIdWait((MultiXactId) xwait, MultiXactStatusUpdate, NULL, infomask);
+		XactLockTableWaitCleanupErrorContextCallback();
 		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		/*
@@ -2730,7 +2732,10 @@ l1:
 	else
 	{
 		/* wait for regular transaction to end */
+		XactLockTableWaitSetupErrorContextCallback(relation, &tp);
+
 		XactLockTableWait(xwait);
+		XactLockTableWaitCleanupErrorContextCallback();
 		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		/*
@@ -3254,9 +3259,11 @@ l2:
 		TransactionId update_xact;
 		int			remain;
+		XactLockTableWaitSetupErrorContextCallback(relation, &oldtup);
 		/* wait for multixact */
 		MultiXactIdWait((MultiXactId) xwait, mxact_status, &remain, infomask);
+		XactLockTableWaitCleanupErrorContextCallback();
 		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		/*
@@ -3330,7 +3337,10 @@ l2:
 	else
 	{
 		/* wait for regular transaction to end */
+		XactLockTableWaitSetupErrorContextCallback(relation, &oldtup);
+
 		XactLockTableWait(xwait);
+		XactLockTableWaitCleanupErrorContextCallback();
 		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		/*
@@ -4398,7 +4408,11 @@ l3:
 							RelationGetRelationName(relation;
 				}
 				else
+				{
+					XactLockTableWaitSetupErrorContextCallback(relation, tuple);
 					MultiXactIdWait((MultiXactId) xwait, status, NULL, infomask);
+					XactLockTableWaitCleanupErrorContextCallback();
+				}
 				/* if there are updates, follow the update chain */
 				if (follow_updates &&
@@ -4453,7 +4467,11 @@ l3:
 							RelationGetRelationName(relation;
 				}
 				else
+				{
+					XactLockTableWaitSetupErrorContextCallback(relation, tuple);
 					XactLockTableWait(xwait);
+					XactLockTableWaitCleanupErrorContextCallback();
+				}
 				/* if there are updates, follow the update chain */
 				if (follow_updates &&
@@ -5139,9 +5157,14 @@ l4:
 			if (needwait)
 			{
+				XactLockTableWaitSetupErrorContextCallback(rel, &mytup);
+
 				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 				XactLockTableWait(members[i].xid);
 				pfree(members);
+
+				XactLockTableWaitCleanupErrorContextCallback();
+
 				goto l4;
 			}
 			if (res != HeapTupleMayBeUpdated)
@@ -5199,8 +5222,13 @@ l4:
 				&needwait);
 			if (needwait)
Re: [HACKERS] [PATCH] Store Extension Options
On 2014-01-02 08:26:20 -0200, Fabrízio de Royes Mello wrote: > On Thu, Jan 2, 2014 at 7:19 AM, Andres Freund > wrote: > > > > On 2013-12-31 13:37:59 +0100, Pavel Stehule wrote: > > > > We use the namespace "ext" to the internal code > > > > (src/backend/access/common/reloptions.c) skip some validations and > store > > > > the custom GUC. > > > > > > > > Do you think we don't need to use the "ext" namespace? > > > > > > > > > > yes - there be same mechanism as we use for GUC > > > > There is no existing mechanism to handle conflicts for GUCs. The > > difference is that for GUCs nearly no "namespaced" GUCs exist (plperl, > > plpgsql have some), but postgres defines at least autovacuum. and > > toast. namespaces for relation options. > > > > autovacuum. namespace ??? Yea, right, it's autovacuum_... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 2014-01-02 02:20:02 -0800, Peter Geoghegan wrote: > On Thu, Jan 2, 2014 at 1:49 AM, Andres Freund wrote: > >> Well, you're not totally on your own for something like that with this > >> feature. You can project the conflicter's tid, and possibly do a more > >> sophisticated recovery, like inspecting the locked row and iterating. > > > > Yea, but in that case I *do* conflict with more than one index and old > > values need to stay locked. Otherwise anything resembling > > forward-progress guarantee is lost. > > I'm not sure I understand. In a very real sense they do stay locked. > What is insufficient about locking the definitively visible row with > the value, rather than the value itself? Locking the definitely visible row only works if there's a row matching the index's columns. If the values of the new row don't have corresponding values in all the indexes you have the same old race conditions again. I think to be useful for many cases you really need to be able to ask for a potentially conflicting row and be sure that if there's none you are able to insert the row separately. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Store Extension Options
On Thu, Jan 2, 2014 at 7:19 AM, Andres Freund wrote:
>
> On 2013-12-31 13:37:59 +0100, Pavel Stehule wrote:
> > > We use the namespace "ext" to the internal code
> > > (src/backend/access/common/reloptions.c) skip some validations and store
> > > the custom GUC.
> > >
> > > Do you think we don't need to use the "ext" namespace?
> > >
> >
> > yes - there be same mechanism as we use for GUC
>
> There is no existing mechanism to handle conflicts for GUCs. The
> difference is that for GUCs nearly no "namespaced" GUCs exist (plperl,
> plpgsql have some), but postgres defines at least autovacuum. and
> toast. namespaces for relation options.
>

autovacuum. namespace ???

The HEAP_RELOPT_NAMESPACES constant (src/include/access/reloptions.h) defines only "toast" and "null" as valid relation option namespaces. Am I missing something?

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Jan 2, 2014 at 1:49 AM, Andres Freund wrote: >> Well, you're not totally on your own for something like that with this >> feature. You can project the conflicter's tid, and possibly do a more >> sophisticated recovery, like inspecting the locked row and iterating. > > Yea, but in that case I *do* conflict with more than one index and old > values need to stay locked. Otherwise anything resembling > forward-progress guarantee is lost. I'm not sure I understand. In a very real sense they do stay locked. What is insufficient about locking the definitively visible row with the value, rather than the value itself? What distinction are you making? On the first conflict you can delete the row you locked, and then re-try, possibly further merging some stuff from the just-deleted row when you next upsert. It's possible that an "earlier" unique index value that is unlocked before row locking proceeds will get a new would-be duplicate after you're returned a locked row, but it's not obvious that that's a problem for your use-case (a problem that can't be worked around), or that promise tuples get you anything better. >> That's probably not at all ideal, but then I can't even imagine what >> the best interface for what you describe here looks like. If there are >> multiple conflicts, do you delete or update some or all of them? How >> do you express that concept from a DML statement? > > For my usecases just getting the tid back is fine - it's in C > anyway. But I'd rather be in a position to do it from SQL as well... I believe you can. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire
On 2014-01-02 01:40:38 -0800, Peter Geoghegan wrote: > On Thu, Jan 2, 2014 at 12:56 AM, Andres Freund wrote: > > I agree that the message needs improvement, but I don't agree that we > > shouldn't lock the tuple's location. If you manually investigate the > > situation that's where you'll find the conflicting tuple - I don't see > > what we gain by not logging the ctid. > > I suppose so, but the tuple probably isn't going to be visible anyway, > at least when the message is initially logged. But that's the case for all these, otherwise we wouldn't wait on the row. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 2013-12-27 14:11:44 -0800, Peter Geoghegan wrote:
> On Fri, Dec 27, 2013 at 12:57 AM, Andres Freund
> wrote:
> > I don't think the current syntax the feature implements can be used as
> > the sole argument for what the feature should be able to support.
> >
> > If you think from the angle of an async MM replication solution
> > replicating a table with multiple unique keys, not having to specify a
> > single index we expect conflicts from is surely helpful.
>
> Well, you're not totally on your own for something like that with this
> feature. You can project the conflicter's tid, and possibly do a more
> sophisticated recovery, like inspecting the locked row and iterating.

Yea, but in that case I *do* conflict with more than one index and old values need to stay locked. Otherwise anything resembling forward-progress guarantee is lost.

> That's probably not at all ideal, but then I can't even imagine what
> the best interface for what you describe here looks like. If there are
> multiple conflicts, do you delete or update some or all of them? How
> do you express that concept from a DML statement?

For my usecases just getting the tid back is fine - it's in C anyway. But I'd rather be in a position to do it from SQL as well...

If there are multiple conflicts the conflicting row should be updated. If we didn't release the value locks on the individual indexes, we can know beforehand whether only one row is going to be affected. If there really are more than one, error out.

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire
On Thu, Jan 2, 2014 at 12:56 AM, Andres Freund wrote: > I agree that the message needs improvement, but I don't agree that we > shouldn't lock the tuple's location. If you manually investigate the > situation that's where you'll find the conflicting tuple - I don't see > what we gain by not logging the ctid. I suppose so, but the tuple probably isn't going to be visible anyway, at least when the message is initially logged. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Show lossy heap block info in EXPLAIN ANALYZE for bitmap heap scan
On 2014-01-01 21:15:46 -0500, Robert Haas wrote: > [ sensible reasoning ] However, I'm not sure it's really worth it. > I think what people really care about is knowing whether the bitmap > lossified or not, and generally how much got lossified. The counts of > exact and lossy pages are sufficient for that, without anything > additional Showing the amount of memory currently required could tell you how soon accurate bitmap scans will turn into lossy scans though. Which is not a bad thing to know, some kinds of scans (e.g. tsearch over expression indexes, postgis) can get ridiculously slow once lossy. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Store Extension Options
On 2013-12-31 13:37:59 +0100, Pavel Stehule wrote: > > We use the namespace "ext" to the internal code > > (src/backend/access/common/reloptions.c) skip some validations and store > > the custom GUC. > > > > Do you think we don't need to use the "ext" namespace? > > > > yes - there be same mechanism as we use for GUC There is no existing mechanism to handle conflicts for GUCs. The difference is that for GUCs nearly no "namespaced" GUCs exist (plperl, plpgsql have some), but postgres defines at least autovacuum. and toast. namespaces for relation options. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: multiple read-write masters in a cluster with wal-streaming synchronization
On 2013-12-31 13:51:08 -0800, Mark Dilger wrote:
> The BDR documentation
> http://wiki.postgresql.org/images/7/75/BDR_Presentation_PGCon2012.pdf
> says,
>
> "Physical replication forces us to use just one
> node: multi-master required for write scalability"
>
> "Physical replication provides best read scalability"
>
> I am inclined to agree with the second statement, but
> I think my proposal invalidates the first statement, at
> least for a particular rigorous partitioning over which
> server owns which data.

I think you *massively* underestimate the amount of work implementing this would require. For one, you'd need to have a catalog that is written to on only one server; you cannot have several nodes writing to the same table, even if it's to disparate oid ranges. So you'd need to partition the whole catalog by oid ranges - which would be a major efficiency loss for many, many cases. Not to speak of breaking pg_upgrade and noticeably increasing the size of the catalog due to bigger oids and additional relations.

> So for me, multi-master with physical replication seems
> possible, and would presumably provide the best
> read scalability.

What you describe isn't really multi-master though, as every row can only be written to by a single node (the owner). Also, why would this have better read scalability? Whether a row is written by streaming rep or not doesn't influence read speed.

> Or I can use logical replication such as BDR, but then the servers
> are spending more effort than with physical replication,
> so I get less bang for the buck when I purchase more
> servers to add to the cluster.

The efficiency difference really doesn't have to be big if done right. If you're so write-heavy that the difference is becoming a problem, you wouldn't implement a shared-everything architecture anyway.

> Am I missing something here? Does BDR really provide
> an equivalent solution?

Not yet, but the plan is to get there.
> Second, it seems that BDR leaves to the client the responsibility
> for making schemas the same everywhere. Perhaps this is just
> a limitation of the implementation so far, which will be resolved
> in the future?

Hopefully something that's going to get lifted.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] What exactly is drop-index-concurrently-1.spec trying to test?
On 2013-12-31 17:14:11 -0500, Tom Lane wrote:
> Peter pointed out in
> http://www.postgresql.org/message-id/527c0fe9.7000...@gmx.net
> that Kyotaro-san's patch to treat unique indexes as satisfying any sort
> condition that they are a prefix of broke the drop-index-concurrently-1
> isolation test. The latest iterations of the patch respond to that by
> changing the expected output. However, that seems rather wrongheaded,
> because AFAICT the intent of that part of the test is to demonstrate that
> a seqscan has particular behavior; so if the planner starts generating an
> indexscan instead, the test no longer proves anything of the kind.
>
> What I'm wondering though is what's the point of testing that a concurrent
> DROP INDEX doesn't affect a seqscan? That seems kinda silly, so it's
> tempting to address the patch's problem by just removing the steps
> involving the getrow_seq query, rather than hacking it up enough so we'd
> still get a seqscan plan.

The point is to show that an index scan returns the same rows a sequential scan would, even though the index is in the process of being dropped and has been updated *after* the DROP started. That was broken at some point. Now, you could argue that that would also be shown without the sequential scan, but I think that would make understanding the faulty output harder.

> I'd have thought the test would be designed to allow
> the DROP to complete and then re-test that the results of the prepared
> query are still sane, but it does no such thing.

We could add a permutation like this, but ISTM that it would just test plan invalidation?

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire
On 2013-12-31 11:36:36 -0500, Tom Lane wrote:
> Simon Riggs writes:
> > On 31 December 2013 09:12, Christian Kruse
> > wrote:
> >> Output with patch:
> >>
> >> LOG: process 24774 acquired ShareLock on transaction 696 after 11688.720
> >> ms
> >> CONTEXT: relation name: foo (OID 16385)
> >> tuple (ctid (0,1)): (1)
>
> > That is useful info.
>
> > I think the message should be changed to say this only, without a context
> > line
>
> > LOG: process 24774 acquired ShareLock on relation "foo" (OID 16385)
> > tuple (0,1) after 11688.720 ms
>
> > My reason is that pid 24774 was waiting for a *tuple lock* and it was
> > eventually granted, so that's what it should say.
>
> No, that wasn't what it was waiting for, and lying to the user like that
> isn't going to make things better.

Agreed.

> Christian's idea of a context line seems plausible to me. I don't
> care for this implementation too much --- a global variable? Ick.

Yea, the data should be stored in ErrorContextCallback.arg instead.

> Make a wrapper function for XactLockTableWait instead, please.

I don't think that'd work out all too nicely - we do the XactLockTableWait() inside other functions like MultiXactIdWait(), and passing all the detail along for those would end up being ugly. So using error context callbacks properly seems like the best way in the end.

> And I'm not real sure that showing the whole tuple contents is a good
> thing; I can see people calling that a security leak, not to mention
> that the performance consequences could be dire.

I don't think that the security argument has too much merit given today's PG - we already log the full tuple for various constraint violations. And we also accept the performance penalty there. I don't see any easy way to select a sensible subset of columns to print here, and printing the columns is what would make investigating issues around this so much easier.
At the very least we'd need to print the pkey columns and the columns of the unique key that might have been violated.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Patch: show relation and tuple infos of a lock to acquire
On 2013-12-31 13:56:53 -0800, Peter Geoghegan wrote:
> ISTM that you should be printing just the value and the unique index
> there, and not any information about the tuple proper. Doing any less
> could be highly confusing.

I agree that the message needs improvement, but I don't agree that we shouldn't log the tuple's location. If you manually investigate the situation that's where you'll find the conflicting tuple - I don't see what we gain by not logging the ctid.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] more psprintf() use
On 2014-01-02 09:49:48 +0200, Heikki Linnakangas wrote:
> On 01/02/2014 05:14 AM, Peter Eisentraut wrote:
> > diff --git a/contrib/hstore/hstore_io.c b/contrib/hstore/hstore_io.c
> > index 772a5ca..8331a56 100644
> > --- a/contrib/hstore/hstore_io.c
> > +++ b/contrib/hstore/hstore_io.c
> > @@ -1114,11 +1114,7 @@
> >  	HEntry *entries = ARRPTR(in);
> >
> >  	if (count == 0)
> > -	{
> > -		out = palloc(1);
> > -		*out = '\0';
> > -		PG_RETURN_CSTRING(out);
> > -	}
> > +		PG_RETURN_CSTRING("");
> >
> >  	buflen = 0;
>
> Is it legal to return a constant with PG_RETURN_CSTRING? Grepping around, I
> don't see that being done anywhere else, but there are places that do
> PG_RETURN_CSTRING(pstrdup())...

I don't see why it wouldn't be legal - constant strings have static storage duration, i.e. the program lifetime. And I can see nothing that would allow pfree()ing the return value of cstring returning functions in the general case.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] preserving forensic information when we freeze
Hi,

On 2014-01-02 05:26:26 +, Greg Stark wrote:
> 2) refetching a row could conceivably end up retrieving different data than
> was present when the row was originally read. (In some cases that might
> actually be the intended behaviour)

That's possible with system columns as well. In the normal cases we'll only have copied the HeapTuple, not the HeapTupleHeader, so it will be re-fetched from the (pinned) buffer.

> If this came up earlier I'm sorry but I suppose it's too hard to have a
> function like foo(tab.*) which magically can tell that the record is a heap
> tuple and look in the header? And presumably throw an error if passed a non
> heap tuple.

I don't see how that could be a good API. What happens if you have two relations in a query? Even if that wouldn't be a query, why would this be helpful? Seems like a poor reinvention of system columns.

Andres

PS: Could you not always include the full quoted message below your -- signature?

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services