Re: [HACKERS] New replication mode: write
On Wed, Jan 25, 2012 at 5:28 AM, Simon Riggs wrote:
> On Tue, Jan 24, 2012 at 11:00 AM, Simon Riggs wrote:
>> Yes, it might be too hard, but let's look.
>
> Your committer has timed out ;-)
>
> committed write mode only

Thanks for the commit! The apply mode is attractive, but I need more time to implement it completely, and I might not be able to finish it within this CF. So committing only the write mode is the right decision, I think. If I have time after all of the patches I'm interested in have been committed, I will try the apply mode again, but maybe for 9.3dev.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.
On Wed, Jan 25, 2012 at 5:35 AM, Jaime Casanova wrote:
> On Tue, Jan 24, 2012 at 3:22 PM, Simon Riggs wrote:
>> Add new replication mode synchronous_commit = 'write'.
>> Replication occurs only to memory on standby, not to disk,
>> so provides additional performance if user wishes to
>> reduce durability level slightly. Adds concept of multiple
>> independent sync rep queues.
>>
>> Fujii Masao and Simon Riggs
>
> i guess, you should add the new value in postgresql.conf too.

Yes, I forgot to do that. Patch attached.

> just a question, why not add a flush value and make it behave like on?
> actually sync rep is on in both cases and having the different names
> makes it easier to understand what is happening.
> i only want to keep on for compatibility but i wouldn't mind if we
> remove it in favor of more descriptive names.

Should we also add "async" as an alias for "off"?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
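For readers following along, here is a sketch of the postgresql.conf entry being discussed. The exact value list is an assumption based on this thread (9.2 development cycle); 'write' is the newly committed mode:

```
# Sketch of the setting under discussion (9.2 development cycle).
# 'write' is the new mode: wait for the standby to write the WAL to its
# OS memory, but not to flush it to disk.
synchronous_commit = on        # on, write, local, off
#synchronous_commit = write    # wait for standby write (not flush)
#synchronous_commit = local    # wait only for local WAL flush
#synchronous_commit = off      # no waiting (asynchronous)
```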
Re: [HACKERS] PgNext: CFP
On Tuesday, January 24, 2012, Stefan Kaltenbrunner wrote:
> On 01/24/2012 07:36 PM, Dave Page wrote:
>> What date & venue?
>>
>> On Tuesday, January 24, 2012, Joshua D. Drake wrote:
>>>
>>> Hello,
>>>
>>> The call for papers for PgNext (the old PgWest/PgEast) is now open:
>>>
>>> January 19th: Talk submission opens
>>> April 15th: Talk submission closes
>>> April 30th: Speaker notification
>>>
>>> Submit: https://www.postgresqlconference.org/talk_types
>
> the url has, in big letters, "Denver Convention Center" on June
> 26th-29th, 2012...

Not too visible on my phone...

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On Wed, Jan 25, 2012 at 2:23 AM, Robert Haas wrote: > Early yesterday morning, I was able to use Nate Boley's test machine > do a single 30-minute pgbench run at scale factor 300 using a variety > of trees built with various patches, and with the -l option added to > track latency on a per-transaction basis. All tests were done using > 32 clients and permanent tables. The configuration was otherwise > identical to that described here: > > http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com > > By doing this, I hoped to get a better understanding of (1) the > effects of a scale factor too large to fit in shared_buffers, (2) what > happens on a longer test run, and (3) how response time varies > throughout the test. First, here are the raw tps numbers: > > background-clean-slru-v2: tps = 2027.282539 (including connections > establishing) > buffreelistlock-reduction-v1: tps = 2625.155348 (including connections > establishing) > buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149 > (including connections establishing) > freelist-ok-v2: tps = 2467.065010 (including connections establishing) > group-commit-2012-01-21: tps = 2205.128609 (including connections > establishing) > master: tps = 2200.848350 (including connections establishing) > removebufmgrfreelist-v1: tps = 2679.453056 (including connections > establishing) > xloginsert-scale-6: tps = 3675.312202 (including connections establishing) > > Obviously these numbers are fairly noisy, especially since this is > just one run, so the increases and decreases might not be all that > meaningful. Time permitting, I'll try to run some more tests to get > my hands around that situation a little better, > This is nice. I am sure long running tests will point out many more issues. 
If we are doing these tests, it might be more effective to run even longer, so that we get at least 3-4 checkpoints/vacuums/analyzes (and other such events which can impact the final numbers either way) per test. Otherwise, one patch may stand out simply because it avoids, say, one checkpoint. It would definitely help to log the checkpoint and auto-vacuum/auto-analyze details and plot them on the graph, to see whether a drop in performance coincides with these activities. It might also be a good idea to collect pg statistics, such as relation sizes, at the end of the run.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
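Something along these lines could produce the per-second data to plot checkpoint and autovacuum events against. This is a hypothetical helper, assuming the 9.1/9.2-era pgbench -l line format (client_id, transaction_no, latency_us, file_no, epoch_secs, epoch_usecs):

```python
# Hypothetical helper for the analysis suggested above: bucket a pgbench -l
# transaction log into per-second completion counts, so checkpoint and
# autovacuum timestamps (taken from the server log) can be overlaid on the
# same time axis.
from collections import Counter

def tps_per_second(lines):
    """Return {epoch_second: number_of_transactions_completed}."""
    buckets = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip malformed lines
        epoch_secs = int(fields[4])  # completion time, whole seconds
        buckets[epoch_secs] += 1
    return dict(buckets)

# Toy log: two transactions finish in the same second, one in the next.
log = [
    "0 1 1500 0 1327400000 123456",
    "1 1 1800 0 1327400000 223456",
    "0 2 2100 0 1327400001 323456",
]
print(tps_per_second(log))  # {1327400000: 2, 1327400001: 1}
```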
Re: [HACKERS] Group commit, revised
On Fri, Jan 20, 2012 at 10:30 PM, Heikki Linnakangas wrote:
> I spent some time cleaning this up. Details below, but here are the
> highlights:
>
> * Reverted the removal of wal_writer_delay
> * Removed heuristic on big flushes

No contested viewpoints on anything there.

> * Doesn't rely on WAL writer. Any process can act as the "leader" now.
> * Uses PGSemaphoreLock/Unlock instead of latches

Thanks for producing an alternate version; it will allow us to comment on the various approaches. There is much yet to discuss, so please don't think about committing anything yet.

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_upgrade with plpython is broken
On Fri, Jan 20, 2012 at 07:01:46AM +0200, Peter Eisentraut wrote:
> On tor, 2012-01-19 at 17:04 -0500, Bruce Momjian wrote:
> > For that reason, I wonder if I should just hard-code the plpython
> > rename into the pg_upgrade test in check_loadable_libraries().
>
> Yes, I haven't come up with a better solution either.
>
> If this becomes a general problem, we might need to add a command line
> option to ignore certain names or something. But not yet.

Well, the problem is a little more complex than reported. It turns out that in PG 9.0 we kept the plpython.so file and symlinked plpython2.so to it. In PG 9.1, we removed plpython.so and kept only plpython2.so. So the problem exists in PG >= 9.1 and does not affect 9.0, which explains why we didn't get any 9.0 reports of the problem.

I have applied the attached patch to PG head and 9.1 to fix the library checking problem.

-- 
Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

diff --git a/contrib/pg_upgrade/function.c b/contrib/pg_upgrade/function.c
new file mode 100644
index 54f139a..4505e51
*** a/contrib/pg_upgrade/function.c
--- b/contrib/pg_upgrade/function.c
*** check_loadable_libraries(void)
*** 228,235 ****
  		char	   *cmd = (char *) pg_malloc(8 + 2 * llen + 1);
  		PGresult   *res;

  		strcpy(cmd, "LOAD '");
! 		PQescapeStringConn(conn, cmd + 6, lib, llen, NULL);
  		strcat(cmd, "'");

  		res = PQexec(conn, cmd);
--- 228,251 ----
  		char	   *cmd = (char *) pg_malloc(8 + 2 * llen + 1);
  		PGresult   *res;

+ 		/*
+ 		 * In Postgres 9.0, Python 3 support was added, and to do that, a
+ 		 * plpython2u language was created with library name plpython2.so
+ 		 * as a symbolic link to plpython.so.  In Postgres 9.1, only the
+ 		 * plpython2.so library was created, and both plpythonu and
+ 		 * plpython2u point to it.  For this reason, any reference to
+ 		 * library name "plpython" in an old PG <= 9.0 cluster must look
+ 		 * for "plpython2" in the new cluster.
+ 		 */
+ 		if (GET_MAJOR_VERSION(old_cluster.major_version) < 901 &&
+ 			strcmp(lib, "$libdir/plpython") == 0)
+ 		{
+ 			lib = "$libdir/plpython2";
+ 			llen = strlen(lib);
+ 		}
+
  		strcpy(cmd, "LOAD '");
! 		PQescapeStringConn(conn, cmd + strlen(cmd), lib, llen, NULL);
  		strcat(cmd, "'");

  		res = PQexec(conn, cmd);
[HACKERS] Fix for pg_upgrade tablespace function usage
I have applied the attached patch to git head to fix the new SQL tablespace location function usage in pg_upgrade, so that it properly checks cluster version numbers, and to fix a missing table alias. I found this problem during testing.

-- 
Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index e8361ce..692cdc2
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** get_db_infos(ClusterInfo *cluster)
*** 204,210 ****
  	/* we don't preserve pg_database.oid so we sort by name */
  			 "ORDER BY 2",
  	/* 9.2 removed the spclocation column */
! 			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
  			 "t.spclocation" : "pg_catalog.pg_tablespace_location(t.oid) AS spclocation");

  	res = executeQueryOrDie(conn, "%s", query);
--- 204,210 ----
  	/* we don't preserve pg_database.oid so we sort by name */
  			 "ORDER BY 2",
  	/* 9.2 removed the spclocation column */
! 			 (GET_MAJOR_VERSION(cluster->major_version) <= 901) ?
  			 "t.spclocation" : "pg_catalog.pg_tablespace_location(t.oid) AS spclocation");

  	res = executeQueryOrDie(conn, "%s", query);
*** get_rel_infos(ClusterInfo *cluster, DbIn
*** 287,293 ****
  	/* we preserve pg_class.oid so we sort by it to match old/new */
  			 "ORDER BY 1;",
  	/* 9.2 removed the spclocation column */
! 			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
  			 "t.spclocation" : "pg_catalog.pg_tablespace_location(t.oid) AS spclocation",
  	/* see the comment at the top of old_8_3_create_sequence_script() */
  			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 803) ?
--- 287,293 ----
  	/* we preserve pg_class.oid so we sort by it to match old/new */
  			 "ORDER BY 1;",
  	/* 9.2 removed the spclocation column */
! 			 (GET_MAJOR_VERSION(cluster->major_version) <= 901) ?
  			 "t.spclocation" : "pg_catalog.pg_tablespace_location(t.oid) AS spclocation",
  	/* see the comment at the top of old_8_3_create_sequence_script() */
  			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 803) ?
diff --git a/contrib/pg_upgrade/tablespace.c b/contrib/pg_upgrade/tablespace.c
new file mode 100644
index 11fd9d0..6b61f4b
*** a/contrib/pg_upgrade/tablespace.c
--- b/contrib/pg_upgrade/tablespace.c
*** get_tablespace_paths(void)
*** 53,59 ****
  			 " spcname != 'pg_global'",
  	/* 9.2 removed the spclocation column */
  			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
! 			 "t.spclocation" : "pg_catalog.pg_tablespace_location(oid) AS spclocation");

  	res = executeQueryOrDie(conn, "%s", query);
--- 53,59 ----
  			 " spcname != 'pg_global'",
  	/* 9.2 removed the spclocation column */
  			 (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
! 			 "spclocation" : "pg_catalog.pg_tablespace_location(oid) AS spclocation");

  	res = executeQueryOrDie(conn, "%s", query);
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Tue, Jan 24, 2012 at 8:13 PM, Tom Lane wrote:
>> Well, the bytea experience was IMNSHO a complete disaster (It was
>> earlier mentioned that jdbc clients were silently corrupting bytea
>> datums) and should be held up as an example of how *not* to do things;
>
> Yeah. In both cases, the (proposed) new output format is
> self-identifying *to clients that know what to look for*. Unfortunately
> it would only be the most anally-written pre-existing client code that
> would be likely to spit up on the unexpected variations. What's much
> more likely to happen, and did happen in the bytea case, is silent data
> corruption. The lack of redundancy in binary data makes this even more
> likely, and the documentation situation makes it even worse. If we had
> had a clear binary-data format spec from day one that told people that
> they must check for unexpected contents of the flag field and fail, then
> maybe we could get away with considering not doing so to be a
> client-side bug ... but I really don't think we have much of a leg to
> stand on given the poor documentation we've provided.
>
>> In regards to the array optimization, I think it's great -- but if you
>> truly want to avoid blowing up user applications, it needs to be
>> disabled automatically.
>
> Right. We need to fix things so that this format will not be sent to
> clients unless the client code has indicated ability to accept it.
> A GUC is a really poor proxy for that.

OK. It seems clear to me at this point that there is no appetite for this patch in its present form:

https://commitfest.postgresql.org/action/patch_view?id=715

Furthermore, while we haven't settled the question of exactly what a good negotiation facility would look like, we seem to agree that a GUC isn't it. I think that means this isn't going to happen for 9.2, so we should mark this patch Returned with Feedback and return to this topic for 9.3.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
Merlin Moncure writes:
> On Tue, Jan 24, 2012 at 11:55 AM, Robert Haas wrote:
>> I do wonder whether we are making a mountain out of a mole-hill here,
>> though. If I properly understand the proposal on the table, which
>> it's possible that I don't, but if I do, the new format is
>> self-identifying: when the optimization is in use, it sets a bit that
>> previously would always have been clear. So if we just go ahead and
>> change this, clients that have been updated to understand the new
>> format will work just fine. The server uses the proposed optimization
>> only for arrays that meet certain criteria, so any properly updated
>> client must still be able to handle the case where that bit isn't set.
>> On the flip side, clients that aren't expecting the new optimization
>> might break. But that's, again, no different than what happened when
>> we changed the default bytea output format. If you get bit, you
>> either update your client or shut off the optimization and deal with
>> the performance consequences of so doing.
>
> Well, the bytea experience was IMNSHO a complete disaster (It was
> earlier mentioned that jdbc clients were silently corrupting bytea
> datums) and should be held up as an example of how *not* to do things;

Yeah. In both cases, the (proposed) new output format is self-identifying *to clients that know what to look for*. Unfortunately it would only be the most anally-written pre-existing client code that would be likely to spit up on the unexpected variations. What's much more likely to happen, and did happen in the bytea case, is silent data corruption. The lack of redundancy in binary data makes this even more likely, and the documentation situation makes it even worse. If we had had a clear binary-data format spec from day one that told people that they must check for unexpected contents of the flag field and fail, then maybe we could get away with considering not doing so to be a client-side bug ... but I really don't think we have much of a leg to stand on given the poor documentation we've provided.

> In regards to the array optimization, I think it's great -- but if you
> truly want to avoid blowing up user applications, it needs to be
> disabled automatically.

Right. We need to fix things so that this format will not be sent to clients unless the client code has indicated ability to accept it. A GUC is a really poor proxy for that.

			regards, tom lane
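As a sketch of the kind of defensive client Tom describes (hypothetical code, not taken from any real driver; the header layout follows PostgreSQL's documented binary array format), a client should refuse flag bits it does not recognize instead of silently misreading a newer on-the-wire layout:

```python
# Defensive parsing of PostgreSQL's binary array header. Layout, in network
# byte order: int32 ndim, int32 flags, uint32 element type OID, then one
# (dimension, lower_bound) int32 pair per dimension.
import struct

KNOWN_FLAGS = 0x1  # bit 0: array contains NULLs

def parse_array_header(buf):
    ndim, flags, elemtype = struct.unpack_from("!iiI", buf, 0)
    # The check under discussion: fail loudly on unknown flag bits rather
    # than risk silent data corruption with a future format extension.
    if flags & ~KNOWN_FLAGS:
        raise ValueError("unrecognized array flag bits: 0x%x" % flags)
    dims = []
    offset = 12
    for _ in range(ndim):
        dim, lbound = struct.unpack_from("!ii", buf, offset)
        dims.append((dim, lbound))
        offset += 8
    return ndim, flags, elemtype, dims

# A one-dimensional int4 array (OID 23) of 3 elements, no NULLs.
hdr = struct.pack("!iiIii", 1, 0, 23, 3, 1)
print(parse_array_header(hdr))  # (1, 0, 23, [(3, 1)])
```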
Re: [HACKERS] basic pgbench runs with various performance-related patches
> My test was run with synchronous_commit=off, so I didn't expect the
> group commit patch to have much of an impact. I included it mostly to
> see whether by chance it helped anyway (since it also helps other WAL
> flushes, not just commits) or whether it caused any regression.

Oh, I see.

> One somewhat odd thing about these numbers is that, on permanent
> tables, all of the patches seemed to show regressions vs. master in
> single-client throughput. That's a slightly difficult result to
> believe, though, so it's probably a testing artifact of some kind.

Maybe a kernel cache effect?

-- 
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
[HACKERS] Different error messages executing CREATE TABLE or ALTER TABLE to create a column "xmin"
I took my first stab at hacking the sources to fix the bug reported here:

http://archives.postgresql.org/pgsql-bugs/2012-01/msg00152.php

It's such a simple patch, but it took me several hours with Google and IRC, and I'm sure I did many things wrong. It seems to work as advertised, though, so I'm submitting it.

I suppose since I have now submitted a patch, it is my moral obligation to review at least one. I'm not sure how helpful I'll be, but I'll bite the bullet and sign myself up anyway.

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
new file mode 100644
index cb8ac67..7555d19
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** ATExecAddColumn(List **wqueue, AlteredTa
*** 4235,4240 ****
--- 4235,4241 ----
  	List	   *children;
  	ListCell   *child;
  	AclResult	aclresult;
+ 	HeapTuple	attTuple;

  	/* At top level, permission check was done in ATPrepCmd, else do it */
  	if (recursing)
*** ATExecAddColumn(List **wqueue, AlteredTa
*** 4314,4326 ****
  	 * this test is deliberately not attisdropped-aware, since if one tries to
  	 * add a column matching a dropped column name, it's gonna fail anyway.
  	 */
! 	if (SearchSysCacheExists2(ATTNAME,
! 							  ObjectIdGetDatum(myrelid),
! 							  PointerGetDatum(colDef->colname)))
! 		ereport(ERROR,
! 				(errcode(ERRCODE_DUPLICATE_COLUMN),
! 				 errmsg("column \"%s\" of relation \"%s\" already exists",
! 						colDef->colname, RelationGetRelationName(rel))));

  	/* Determine the new attribute's number */
  	if (isOid)
--- 4315,4341 ----
  	 * this test is deliberately not attisdropped-aware, since if one tries to
  	 * add a column matching a dropped column name, it's gonna fail anyway.
  	 */
! 	attTuple = SearchSysCache2(ATTNAME,
! 							   ObjectIdGetDatum(myrelid),
! 							   PointerGetDatum(colDef->colname));
! 	if (HeapTupleIsValid(attTuple))
! 	{
! 		int			attnum;
!
! 		attnum = ((Form_pg_attribute) GETSTRUCT(attTuple))->attnum;
! 		if (attnum <= 0)
! 			ereport(ERROR,
! 					(errcode(ERRCODE_DUPLICATE_COLUMN),
! 					 errmsg("column \"%s\" conflicts with a system column name",
! 							colDef->colname)));
! 		else
! 			ereport(ERROR,
! 					(errcode(ERRCODE_DUPLICATE_COLUMN),
! 					 errmsg("column \"%s\" of relation \"%s\" already exists",
! 							colDef->colname, RelationGetRelationName(rel))));
!
! 		ReleaseSysCache(attTuple);
! 	}

  	/* Determine the new attribute's number */
  	if (isOid)
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
new file mode 100644
index e992549..8385afb
*** a/src/test/regress/expected/alter_table.out
--- b/src/test/regress/expected/alter_table.out
*** COMMENT ON TABLE tmp_wrong IS 'table com
*** 7,12 ****
--- 7,14 ----
  ERROR:  relation "tmp_wrong" does not exist
  COMMENT ON TABLE tmp IS 'table comment';
  COMMENT ON TABLE tmp IS NULL;
+ ALTER TABLE tmp ADD COLUMN xmin integer; -- fails
+ ERROR:  column "xmin" conflicts with a system column name
  ALTER TABLE tmp ADD COLUMN a int4 default 3;
  ALTER TABLE tmp ADD COLUMN b name;
  ALTER TABLE tmp ADD COLUMN c text;
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
new file mode 100644
index d9bf08d..d4e4c49
*** a/src/test/regress/sql/alter_table.sql
--- b/src/test/regress/sql/alter_table.sql
*** COMMENT ON TABLE tmp_wrong IS 'table com
*** 9,14 ****
--- 9,16 ----
  COMMENT ON TABLE tmp IS 'table comment';
  COMMENT ON TABLE tmp IS NULL;
+ ALTER TABLE tmp ADD COLUMN xmin integer; -- fails
+
  ALTER TABLE tmp ADD COLUMN a int4 default 3;
  ALTER TABLE tmp ADD COLUMN b name;
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On Tue, Jan 24, 2012 at 8:53 PM, Robert Haas wrote:
>
> do a single 30-minute pgbench run at scale factor 300 using a variety

Nice.

A minor but necessary point: repeated testing of the group commit patch when you have synch commit off is clearly pointless, so publishing numbers for that without saying very clearly that's what's happening doesn't really help.

> Graphs are here:
>
> http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

Nicer.

> There are two graphs for each branch. The first is a scatter plot of
> latency vs. transaction time. I found that graph hard to understand,
> though; I couldn't really tell what I was looking at. So I made a
> second set of graphs which graph number of completed transactions in a
> given second of the test against time. The results are also included
> on the previous page, below the latency graphs, and I find them much
> more informative.
>
> A couple of things stand out at me from these graphs. First, some of
> these transactions had really long latency. Second, there are a
> remarkable number of seconds throughout the test during which no
> transactions at all manage to complete, sometimes several seconds in a
> row. I'm not sure why. Third, all of the tests initially start off
> processing transactions very quickly, and get slammed down very hard,
> probably because the very high rate of transaction processing early on
> causes a checkpoint to occur around 200 s.

Check. I'm happy that this exposes characteristics I've seen and have been discussing for a while now.

It would be useful to calculate the slow-down contribution of the longer transactions: what % of total time is given over to the slowest 10% of transactions? Looking at that, I think you'll see why we should care about sorting out what happens in the worst cases.

> I didn't actually log when the checkpoints were occurring, but it seems
> like a good guess. It's also interesting to wonder whether the
> checkpoint I/O itself causes the drop-off, or the ensuing full page
> writes. Fourth, xloginsert-scale-6 helps quite a bit; in fact, it's the
> only patch that actually changes the whole shape of the tps graph. I'm
> speculating here, but that may be because it blunts the impact of full
> page writes by allowing backends to copy their full page images into
> the write-ahead log in parallel.

I think we should be working to commit XLogInsert and then Group Commit, then come back to the discussion. There's no way we're committing other patches but not those, so everything needs to be viewed with those two in, i.e. commit and then re-baseline. So I'd say no more tests just yet, but then lots of testing next week++.

> But I plan to drop buffreelistlock-reduction-v1 and freelist-ok-v2 from
> future test runs based on Simon's comments elsewhere. I'm including the
> results here just because these tests were already running when he made
> those comments.

Yep.

One patch you aren't testing is clog_history, which is designed to work in conjunction with background_clean_slru. That last one clearly needs a little tuning, though...

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
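A hypothetical sketch of that calculation, given the per-transaction latencies that pgbench -l already records:

```python
# What fraction of total transaction time is consumed by the slowest 10%
# of transactions? A large value means the worst cases dominate, which is
# the argument for caring about them.
def slowest_decile_share(latencies):
    """Fraction of summed latency attributable to the slowest 10% of txns."""
    ordered = sorted(latencies, reverse=True)
    n_slow = max(1, len(ordered) // 10)  # at least one transaction
    return sum(ordered[:n_slow]) / sum(ordered)

# Toy example: nine fast transactions and one very slow one.
lats = [10] * 9 + [910]
print(slowest_decile_share(lats))  # 0.91
```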
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Tue, Jan 24, 2012 at 11:55 AM, Robert Haas wrote: > I do wonder whether we are making a mountain out of a mole-hill here, > though. If I properly understand the proposal on the table, which > it's possible that I don't, but if I do, the new format is > self-identifying: when the optimization is in use, it sets a bit that > previously would always have been clear. So if we just go ahead and > change this, clients that have been updated to understand the new > format will work just fine. The server uses the proposed optimization > only for arrays that meet certain criteria, so any properly updated > client must still be able to handle the case where that bit isn't set. > On the flip side, clients that aren't expecting the new optimization > might break. But that's, again, no different than what happened when > we changed the default bytea output format. If you get bit, you > either update your client or shut off the optimization and deal with > the performance consequences of so doing. Well, the bytea experience was IMNSHO a complete disaster (It was earlier mentioned that jdbc clients were silently corrupting bytea datums) and should be held up as an example of how *not* to do things; it's better to avoid having to depend on the GUC or defensive programmatic intervention to prevent further occurrences of application failure since the former doesn't work and the latter won't be reliably done. Waiting for applications to break in the field only to point affected users at the GUC is weak sauce. It's creating a user culture that is terrified of database upgrades which hurts everybody. Database apps tend to have long lives in computer terms such that they can greatly outlive the service life of a particular postgres dot release or even the programmers who originally wrote the application. I'm not too concerned about the viability of a programming department with Robert Haas at the helm, but what about when he leaves? 
What about critical 3rd party software that is no longer maintained?

In regards to the array optimization, I think it's great -- but if you truly want to avoid blowing up user applications, it needs to be disabled automatically.

merlin
[HACKERS] some longer, larger pgbench tests with various performance-related patches
Early yesterday morning, I was able to use Nate Boley's test machine do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here: http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com By doing this, I hoped to get a better understanding of (1) the effects of a scale factor too large to fit in shared_buffers, (2) what happens on a longer test run, and (3) how response time varies throughout the test. First, here are the raw tps numbers: background-clean-slru-v2: tps = 2027.282539 (including connections establishing) buffreelistlock-reduction-v1: tps = 2625.155348 (including connections establishing) buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149 (including connections establishing) freelist-ok-v2: tps = 2467.065010 (including connections establishing) group-commit-2012-01-21: tps = 2205.128609 (including connections establishing) master: tps = 2200.848350 (including connections establishing) removebufmgrfreelist-v1: tps = 2679.453056 (including connections establishing) xloginsert-scale-6: tps = 3675.312202 (including connections establishing) Obviously these numbers are fairly noisy, especially since this is just one run, so the increases and decreases might not be all that meaningful. Time permitting, I'll try to run some more tests to get my hands around that situation a little better, Graphs are here: http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results There are two graphs for each branch. The first is a scatter plot of latency vs. transaction time. I found that graph hard to understand, though; I couldn't really tell what I was looking at. 
So I made a second set of graphs which graph the number of completed transactions in a given second of the test against time. The results are also included on the previous page, below the latency graphs, and I find them much more informative.

A couple of things stand out at me from these graphs. First, some of these transactions had really long latency. Second, there are a remarkable number of seconds throughout the test during which no transactions at all manage to complete, sometimes several seconds in a row. I'm not sure why. Third, all of the tests initially start off processing transactions very quickly, and then get slammed down very hard, probably because the very high rate of transaction processing early on causes a checkpoint to occur around 200 s. I didn't actually log when the checkpoints were occurring, but it seems like a good guess. It's also interesting to wonder whether the checkpoint I/O itself causes the drop-off, or the ensuing full page writes. Fourth, xloginsert-scale-6 helps quite a bit; in fact, it's the only patch that actually changes the whole shape of the tps graph. I'm speculating here, but that may be because it blunts the impact of full page writes by allowing backends to copy their full page images into the write-ahead log in parallel.

One thing I also noticed while running the tests is that the system was really not using much CPU time. It was mostly idle. That could be because waiting for I/O leads to waiting for locks, or it could be fundamental lock contention. I don't know which.

A couple of obvious further tests suggest themselves: (1) rerun some of the tests with full_page_writes=off, and (2) repeat this test with the remaining performance-related patches. It would be especially interesting, I think, to see what effect the checkpoint-related patches have on these graphs. But I plan to drop buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs based on Simon's comments elsewhere.
I'm including the results here just because these tests were already running when he made those comments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] New replication mode: write
On Tue, Jan 24, 2012 at 11:00 AM, Simon Riggs wrote:
> Yes, it might be too hard, but let's look.

Your committer has timed out ;-)

committed write mode only

-- 
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] PgNext: CFP
On 01/24/2012 07:36 PM, Dave Page wrote: > What date & venue? > > On Tuesday, January 24, 2012, Joshua D. Drake wrote: >> >> Hello, >> >> The call for papers for PgNext (the old PgWest/PgEast) is now open: >> >> January 19th: Talk submission opens >> April 15th: Talk submission closes >> April 30th: Speaker notification >> >> Submit: https://www.postgresqlconference.org/talk_types the url has, in big letters, "Denver Convention Center" on June 26th-29th, 2012... Stefan
Re: [HACKERS] controlling the location of server-side SSL files
On tor, 2012-01-19 at 15:44 -0500, Robert Haas wrote: > On Sat, Jan 14, 2012 at 8:40 AM, Peter Eisentraut wrote: > > On mån, 2012-01-02 at 06:32 +0200, Peter Eisentraut wrote: > >> I think I would like to have a set of GUC parameters to control the > >> location of the server-side SSL files. > > > > Here is the patch for this. > > > > One thing that is perhaps worth thinking about: Currently, we just > > ignore missing root.crt and root.crl files. With this patch, we still > > do this, even if the user has given a specific nondefault location. > > That seems a bit odd, but I can't think of a simple way to do it better. > > There's a review in the CF app for this finding only minor issues, so > I'm marking this patch therein as "Ready for Committer". OK, no one had any concerns about the missing file behavior I described above? If not, then I'll commit it soon.
Re: [HACKERS] PgNext: CFP
What date & venue? On Tuesday, January 24, 2012, Joshua D. Drake wrote: > > Hello, > > The call for papers for PgNext (the old PgWest/PgEast) is now open: > > January 19th: Talk submission opens > April 15th: Talk submission closes > April 30th: Speaker notification > > Submit: https://www.postgresqlconference.org/talk_types > > Sincerely, > > JD > > -- > Command Prompt, Inc. - http://www.commandprompt.com/ > PostgreSQL Support, Training, Professional Services and Development > The PostgreSQL Conference - http://www.postgresqlconference.org/ > @cmdpromptinc - @postgresconf - 509-416-6579 > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] lots of unused variable warnings in assert-free builds
Robert Haas writes: > Yes, that's what I meant when I suggested it originally. I'm just not > sure it's any nicer than adding ifdefs for USE_ASSERT_CHECKING. I'm inclined to think that it probably is nicer, just because of less vertical space used. But again, this opinion is contingent on what it will look like after pgindent gets done with it ... regards, tom lane
Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace
On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires wrote: > 2011/12/15 Alvaro Herrera : >> >> Uhm, surely you could compare the original toast tablespace to the heap >> tablespace, and if they differ, handle appropriately when creating the >> new toast table? Just pass down the toast tablespace into >> AlterTableCreateToastTable, instead of having it assume that >> rel->rd_rel->relnamespace is sufficient. This should be done in all >> cases where a toast tablespace is created, which shouldn't be more than >> a handful of them. > > Thank you, that way seems right. > Now, I distinguish before each creation of a TOAST table with > AlterTableCreateToastTable() : if it will create a new one or recreate > an existing one. > Thus, in create_toast_table() when toastTableSpace is equal to > InvalidOid, we are able : > - to fallback to the main table tablespace in case of new TOAST table creation > - to keep it previous tablespace in case of recreation. > > Here's a new version rebased against HEAD. To ask more directly the question that's come up a few times upthread, why do *you* think this is useful? What motivated you to want this behavior, and/or how do you think it could benefit other PostgreSQL users? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] lots of unused variable warnings in assert-free builds
On Tue, Jan 24, 2012 at 1:03 PM, Tom Lane wrote: > I wrote: >> Also, it occurs to me that an intermediate macro >> "PG_USED_FOR_ASSERTS_ONLY" would be a good idea, first because it >> documents *why* you want to mark the variable as possibly unused, >> and second because changing the macro definition would provide an easy way >> to check for totally-unused variables, in case we wanted to periodically >> make such checks. > > Uh, wait a second. Why not > > #ifdef USE_ASSERT_CHECKING > #define PG_USED_FOR_ASSERTS_ONLY > #else > #define PG_USED_FOR_ASSERTS_ONLY __attribute__((unused)) > #endif > > Then, when you build with asserts on, you *automatically* get told > if the variable is entirely unused. Yes, that's what I meant when I suggested it originally. I'm just not sure it's any nicer than adding ifdefs for USE_ASSERT_CHECKING. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] lots of unused variable warnings in assert-free builds
I wrote: > Also, it occurs to me that an intermediate macro > "PG_USED_FOR_ASSERTS_ONLY" would be a good idea, first because it > documents *why* you want to mark the variable as possibly unused, > and second because changing the macro definition would provide an easy way > to check for totally-unused variables, in case we wanted to periodically > make such checks. Uh, wait a second. Why not #ifdef USE_ASSERT_CHECKING #define PG_USED_FOR_ASSERTS_ONLY #else #define PG_USED_FOR_ASSERTS_ONLY __attribute__((unused)) #endif Then, when you build with asserts on, you *automatically* get told if the variable is entirely unused. regards, tom lane
Re: [HACKERS] lots of unused variable warnings in assert-free builds
Robert Haas writes: > On Tue, Jan 24, 2012 at 12:12 PM, Tom Lane wrote: >> Ouch. Is it really true that __attribute__((unused)) disables detection >> of use of uninitialized variables? > Oh, I think maybe I am confused. The downsides of disabling *unused* > variable detection are obviously much less than the downsides of > disabling *uninitialized* variable detection... although neither is > ideal. OK. MHO is that __attribute__((unused)) is probably less annoying than #ifdef overall. Also, it occurs to me that an intermediate macro "PG_USED_FOR_ASSERTS_ONLY" would be a good idea, first because it documents *why* you want to mark the variable as possibly unused, and second because changing the macro definition would provide an easy way to check for totally-unused variables, in case we wanted to periodically make such checks. This is all modulo the question of what pgindent will do with it, which I would still like to see tested before we commit to a method. regards, tom lane
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Tue, Jan 24, 2012 at 11:16 AM, Merlin Moncure wrote: >> Our current protocol allocates a 2-byte integer for the purposes of >> specifying the type of each parameter, and another 2-byte integer for >> the purpose of specifying the result type... but only one bit is >> really needed at present: text or binary. If we revise the protocol >> version at some point, we might want to use some of that bit space to >> allow some more fine-grained negotiation of the protocol version. So, >> for example, we might define the top 5 bits as reserved (always pass >> zero), the next bit as a text/binary flag, and the remaining 10 bits >> as a 10-bit "format version number". When a change like this comes >> along, we can bump the highest binary format version recognized by the >> server, and clients who request the new version can get it. >> >> Alternatively, we might conclude that a 2-byte integer for each >> parameter is overkill and try to cut back... but the point is there's >> a bunch of unused bitspace there now. In theory we could even do >> something this without bumping the protocol version since the >> documentation seems clear that any value other than 0 and 1 yields >> undefined behavior, but in practice that seems like it might be a bit >> too edgy. > > Yeah. But again, this isn't a contract between libpq and the server, > but between the application and the server... I don't see how this is relevant. The text/binary format flag is there in both libpq and the underlying protocol. > So I'd vote against any format code > beyond the text/binary switch that currently exists (which, by the > way, while useful, is one of the great sins of libpq that we have to > deal with basically forever). While wire formatting is granular down > to the type level, applications should not have to deal with that. > They should Just Work. So who decides what format code to stuff into > the protocol? Where are the codes defined? 
> > I'm very much in the camp that sometime, presumably during connection > startup, the protocol accepts a non-#defined-in-libpq token (database > version?) from the application that describes to the server what wire > formats can be used and the server sends one back. There probably has > to be some additional facilities for non-core types but let's put that > aside for the moment. Those two tokens allow the server to pick the > highest supported wire format (text and binary!) that everybody > understands. The server's token is useful if we're being fancy and we > want libpq to translate an older server's wire format to a newer one > for the application. This of course means moving some of the type > system into the client, which is something we might not want to do > since among other things it puts a heavy burden on non-libpq driver > authors (but then again, they can always stay on the v3 protocol, > which can benefit from being frozen in terms of wire formats). I think it's sensible for the server to advertise a version to the client, but I don't see how you can dismiss add-on types so blithely. The format used to represent any given type is logically a property of that type, and only for built-in types is that associated with the server version. I do wonder whether we are making a mountain out of a mole-hill here, though. If I properly understand the proposal on the table, which it's possible that I don't, but if I do, the new format is self-identifying: when the optimization is in use, it sets a bit that previously would always have been clear. So if we just go ahead and change this, clients that have been updated to understand the new format will work just fine. The server uses the proposed optimization only for arrays that meet certain criteria, so any properly updated client must still be able to handle the case where that bit isn't set. On the flip side, clients that aren't expecting the new optimization might break. 
But that's, again, no different than what happened when we changed the default bytea output format. If you get bit, you either update your client or shut off the optimization and deal with the performance consequences of so doing. In fact, the cases are almost perfectly analogous, because in each case the proposal was based on the size of the output format being larger than necessary, and wanting to squeeze it down to a smaller size for compactness. And more generally, does anyone really expect that we're never going to change the output format of any type we support ever again, without retaining infinite backward compatibility? I didn't hear any screams of outrage when we updated the hyphenation rules for contrib/isbn - well, ok, there were some howls, but that was because the rules were still incomplete and US-centric, not so much because people thought it was unacceptable for the hyphenation rules to be different in major release N+1 than they were in major release N. If the IETF goes and defines a new standard for formatting IPv6 addresses, we're lik
[HACKERS] PgNext: CFP
Hello, The call for papers for PgNext (the old PgWest/PgEast) is now open: January 19th: Talk submission opens April 15th: Talk submission closes April 30th: Speaker notification Submit: https://www.postgresqlconference.org/talk_types Sincerely, JD -- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development The PostgreSQL Conference - http://www.postgresqlconference.org/ @cmdpromptinc - @postgresconf - 509-416-6579
Re: [HACKERS] Page Checksums
On Jan 24, 2012, at 9:15 AM, Simon Riggs wrote: > On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat wrote: >>> And yes, I would for sure turn such functionality on if it were present. >>> >> >> That's nice to say, but most people aren't willing to take a 50% >> performance hit. Not saying what we end up with will be that bad, but >> I've seen people get upset about performance hits much lower than >> that. > When we talk about a 50% hit, are we discussing (1) checksums that are > checked on each I/O, or (2) checksums that are checked each time we > re-pin a shared buffer? The 50% hit was my estimate of (2) and has > not yet been measured, so shouldn't be used unqualified when > discussing checksums. Same thing is also true "I would use it" > comments, since we're not sure whether you're voting for (1) or (2). > > As to whether people will actually use (1), I have no clue. But I do > know is that many people request that feature, including people that > run heavy duty Postgres production systems and who also know about > filesystems. Do people need (2)? It's easy enough to add as an option, > once we have (1) and there is real interest. Some people will be able to take a 50% hit and will happily turn on checksumming every time a page is pinned. But I suspect a lot of folks can't afford that kind of hit, but would really like to have their filesystem cache protected (we're certainly in the latter camp). As for checksumming filesystems, I didn't see any answers about whether the filesystem *cache* was also protected by the filesystem checksum. Even if it is, the choice of checksumming filesystems is certainly limited... ZFS is the only one that seems to have real traction, but that forces you off of Linux, which is a problem for many shops. -- Jim C. 
Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] lots of unused variable warnings in assert-free builds
On Tue, Jan 24, 2012 at 12:12 PM, Tom Lane wrote: > Robert Haas writes: >> Spraying the code with __attribute__((unused)) is somewhat undesirable >> because it could mask a failure to properly initialize the variable in >> an assert-enabled build. > > Ouch. Is it really true that __attribute__((unused)) disables detection > of use of uninitialized variables? That would be nasty, and it's not > obvious to me why it should need to work like that. But if it is true, > then I agree that that makes this approach not too tenable. Oh, I think maybe I am confused. The downsides of disabling *unused* variable detection are obviously much less than the downsides of disabling *uninitialized* variable detection... although neither is ideal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] lots of unused variable warnings in assert-free builds
Robert Haas writes: > Spraying the code with __attribute__((unused)) is somewhat undesirable > because it could mask a failure to properly initialize the variable in > an assert-enabled build. Ouch. Is it really true that __attribute__((unused)) disables detection of use of uninitialized variables? That would be nasty, and it's not obvious to me why it should need to work like that. But if it is true, then I agree that that makes this approach not too tenable. regards, tom lane
Re: [HACKERS] lots of unused variable warnings in assert-free builds
On Sat, Jan 21, 2012 at 1:06 PM, Peter Eisentraut wrote: > So, here are three patches that could solve this issue. > > pg-cassert-unused-attribute.patch, the one I already showed, using > __attribute__((unused). > > pg-cassert-unused-ifdef.patch, using only additional #ifdef > USE_ASSERT_CHECKING. This does have additional documentation value, but > you can see that it gets bulky and complicated. (I introduced several > bugs merely while creating this patch.) > > pg-cassert-unused-void.patch is an alternative approach that avoids the > warnings by casting the arguments of Assert() to void if assertions are > disabled. This has the least code impact, but it changes the > traditional semantics of asserts, which is that they disappear > completely when not enabled. You can see how this creates a problem in > src/backend/replication/syncrep.c, where the nontrivial call to > SyncRepQueueIsOrderedByLSN() now becomes visible even in non-assert > builds. I played around with some other options like > __attribute__((pure)) to make the compiler optimize the function away in > that case, but that didn't appear to work. So this might not be a > workable solution (and it would be GCC-specific anyway). I think the third approach is unacceptable from a performance point of view. Some of these problems can be fixed without resorting to as much hackery as you have here. For example, in nodeWorkTableScan.c, you could easily fix the problem by getting rid of the estate variable and replacing its single use within the assertion with the expression used to initialize it on the previous line. I think this might be the cleanest solution in general. I'm not sure what to do about the cases where that isn't practical. Spraying the code with __attribute__((unused)) is somewhat undesirable because it could mask a failure to properly initialize the variable in an assert-enabled build. 
We could have a macro PG_UNUSED_IF_NO_ASSERTS or something, but that doesn't have any obvious advantage over just getting rid of the variable altogether in such cases. I lean toward the view that variables not needed in assertion-free builds should be #ifdef'd out altogether, as in your second patch, but we should try to minimize the number of places where we need to do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Multithread Query Planner
On Tue, Jan 24, 2012 at 11:25 AM, Tom Lane wrote: > Robert Haas writes: >> I doubt it. Almost nothing in the backend is thread-safe. You can't >> acquire a heavyweight lock, a lightweight lock, or a spinlock. You >> can't do anything that might elog() or ereport(). None of those >> things are reentrant. > > Not to mention palloc, another extremely fundamental and non-reentrant > subsystem. > > Possibly we could work on making all that stuff re-entrant, but it would > be a huge amount of work for a distant and uncertain payoff. Right. I think it makes more sense to try to get parallelism working first with the infrastructure we have. Converting to use threading, if we ever do it at all, should be something we view as a later performance optimization. But I suspect we won't want to do it anyway; I think there will be easier ways to get where we want to be. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Multithread Query Planner
Robert Haas writes: > I doubt it. Almost nothing in the backend is thread-safe. You can't > acquire a heavyweight lock, a lightweight lock, or a spinlock. You > can't do anything that might elog() or ereport(). None of those > things are reentrant. Not to mention palloc, another extremely fundamental and non-reentrant subsystem. Possibly we could work on making all that stuff re-entrant, but it would be a huge amount of work for a distant and uncertain payoff. regards, tom lane
Re: [HACKERS] Measuring relation free space
On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch wrote: > On Mon, Jan 23, 2012 at 04:56:24PM -0300, Alvaro Herrera wrote: >> >> Hm. Leaf pages hold as much tuples as non-leaf pages, no? I mean >> for each page element there's a value and a CTID. In non-leaf those >> CTIDs point to other index pages, one level down the tree; in leaf pages >> they point to the heap. > > That distinction seemed important when I sent my last message, but now I agree > that it's largely irrelevant for free space purposes. If someone feels like > doing it, +1 for making pgstattuple() count non-leaf free space. > actually i agreed that non-leaf pages are irrelevant... i just confirmed that in a production system with 300GB none of the indexes in an 84M rows table nor in a heavily updated one has more than 1 root page, all the rest are deleted, half_dead or leaf. so the posibility of bloat coming from non-leaf pages seems very odd but the possibility of bloat coming from the meta page doesn't exist, AFAIUI at least we need the most accurate value about usable free space, because the idea is to add a sampler mode to the function so we don't scan the whole relation. that's why we still need the function. btw... pgstattuple also has the problem that it's not using a ring buffer attached are two patches: - v5: is the same original patch but only track space in leaf, deleted and half_dead pages - v5.1: adds the same for all kind of indexes (problem is that this is inconsistent with the fact that pageinspect only manages btree indexes for everything else) -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile new file mode 100644 index 13ba6d3..63fab95 *** a/contrib/pageinspect/Makefile --- b/contrib/pageinspect/Makefile *** MODULE_big = pageinspect *** 4,10 OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o EXTENSION = pageinspect ! 
DATA = pageinspect--1.0.sql pageinspect--unpackaged--1.0.sql ifdef USE_PGXS PG_CONFIG = pg_config --- 4,12 OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o EXTENSION = pageinspect ! DATA = pageinspect--1.0.sql pageinspect--1.1.sql \ !pageinspect--1.0--1.1.sql \ !pageinspect--unpackaged--1.0.sql ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c new file mode 100644 index dbb2158..9c0b0fb *** a/contrib/pageinspect/btreefuncs.c --- b/contrib/pageinspect/btreefuncs.c *** *** 34,39 --- 34,40 #include "utils/builtins.h" #include "utils/rel.h" + #include "btreefuncs.h" extern Datum bt_metap(PG_FUNCTION_ARGS); extern Datum bt_page_items(PG_FUNCTION_ARGS); *** GetBTPageStatistics(BlockNumber blkno, B *** 155,160 --- 156,216 stat->avg_item_size = 0; } + /* + * GetBTRelationFreeSpace + * + * Get the free space for a btree index. + * This is a helper function for relation_free_space() + * + */ + float4 + GetBTRelationFreeSpace(Relation rel) + { + BTPageStat stat; + + Buffer buffer; + BlockNumber blkno; + BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel); + BlockNumber totalBlcksCounted = 0; + Size free_space = 0; + double free_percent = 0; + + BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD); + + /* Skip page 0 because it is a metapage */ + for (blkno = 1; blkno < totalBlcksInRelation; blkno++) + { + buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy); + /* + * get the statistics of the indexes and use that info + * to determine free space on the page + */ + GetBTPageStatistics(blkno, buffer, &stat); + /* + * Consider pages DELETED and HALF_DEAD as empty, + * besides those only consider LEAF pages + */ + if (stat.type == 'd' || stat.type == 'e') + { + free_space += stat.page_size; + totalBlcksCounted++; + } + else if (stat.type == 'l') + { + free_space += stat.free_size; + totalBlcksCounted++; + } + + ReleaseBuffer(buffer); + } + + if 
(totalBlcksCounted > 0) + free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ); + + return free_percent; + } + + /* --- * bt_page() * diff --git a/contrib/pageinspect/btreefuncs.h b/contrib/pageinspect/btreefuncs.h new file mode 100644 index ...549f878 *** a/contrib/pageinspect/btreefuncs.h --- b/contrib/pageinspect/btreefuncs.h *** *** 0 --- 1,5 + /* + * contrib/pageinspect/btreefuncs.h + */ + + float4 GetBTRelationFreeSpace(Relation); diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c new file mode 100644 index 260ccff..8e3961d *** a/contrib/pageinspect/heapfuncs.c --- b/contrib/pageinspect/heapfunc
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Tue, Jan 24, 2012 at 8:26 AM, Robert Haas wrote: > On Mon, Jan 23, 2012 at 5:49 PM, Merlin Moncure wrote: >> I'm not sure that you're getting anything with that user facing >> complexity. The only realistic case I can see for explicit control of >> wire formats chosen is to defend your application from format changes >> in the server when upgrading the server and/or libpq. This isn't a >> "let's get better compression problem", this is "I upgraded my >> database and my application broke" problem. >> >> Fixing this problem in non documentation fashion is going to require a >> full protocol change, period. > > Our current protocol allocates a 2-byte integer for the purposes of > specifying the type of each parameter, and another 2-byte integer for > the purpose of specifying the result type... but only one bit is > really needed at present: text or binary. If we revise the protocol > version at some point, we might want to use some of that bit space to > allow some more fine-grained negotiation of the protocol version. So, > for example, we might define the top 5 bits as reserved (always pass > zero), the next bit as a text/binary flag, and the remaining 10 bits > as a 10-bit "format version number". When a change like this comes > along, we can bump the highest binary format version recognized by the > server, and clients who request the new version can get it. > > Alternatively, we might conclude that a 2-byte integer for each > parameter is overkill and try to cut back... but the point is there's > a bunch of unused bitspace there now. In theory we could even do > something this without bumping the protocol version since the > documentation seems clear that any value other than 0 and 1 yields > undefined behavior, but in practice that seems like it might be a bit > too edgy. Yeah. 
But again, this isn't a contract between libpq and the server, but between the application and the server...unless you want libpq to do format translation to something the application can understand (but even then the application is still involved). I'm not very enthusiastic about encouraging libpq application authors to pass format #defines for every single parameter and consumed datum to get future proofing on wire formats. So I'd vote against any format code beyond the text/binary switch that currently exists (which, by the way, while useful, is one of the great sins of libpq that we have to deal with basically forever). While wire formatting is granular down to the type level, applications should not have to deal with that. They should Just Work. So who decides what format code to stuff into the protocol? Where are the codes defined? I'm very much in the camp that sometime, presumably during connection startup, the protocol accepts a non-#defined-in-libpq token (database version?) from the application that describes to the server what wire formats can be used and the server sends one back. There probably has to be some additional facilities for non-core types but let's put that aside for the moment. Those two tokens allow the server to pick the highest supported wire format (text and binary!) that everybody understands. The server's token is useful if we're being fancy and we want libpq to translate an older server's wire format to a newer one for the application. This of course means moving some of the type system into the client, which is something we might not want to do since among other things it puts a heavy burden on non-libpq driver authors (but then again, they can always stay on the v3 protocol, which can benefit from being frozen in terms of wire formats). merlin
Re: [HACKERS] Multithread Query Planner
On Mon, Jan 23, 2012 at 2:45 PM, Merlin Moncure wrote: > Yes, but OP is proposing to use multiple threads inside the forked > execution process. That's a completely different beast. Many other > databases support parallel execution of a single query and it might > very well be better/easier to do that with threads. I doubt it. Almost nothing in the backend is thread-safe. You can't acquire a heavyweight lock, a lightweight lock, or a spinlock. You can't do anything that might elog() or ereport(). None of those things are reentrant. Consequently, you can't do anything that involves reading or pinning a buffer, making a syscache lookup, or writing WAL. You can't even do something like parallelize the qsort() of a chunk of data that's already been read into a private buffer... because you'd have to call the comparison functions for the data type, and they might elog() or ereport(). Of course, in certain special cases (like int4) you could make it safe, but it's hard for me to imagine anyone wanting to go to that amount of effort for such a small payoff. If we're going to do parallel query in PG, and I think we are going to need to do that eventually, we're going to need a system where large chunks of work can be handed off, as in the oft-repeated example of parallelizing an append node by executing multiple branches concurrently. That's where the big wins are. And that means either overhauling the entire backend to make it thread-safe, or using multiple backends. The latter will be hard, but it'll still be a lot easier than the former. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Page Checksums
On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat wrote: >> And yes, I would for sure turn such functionality on if it were present. >> > > That's nice to say, but most people aren't willing to take a 50% > performance hit. Not saying what we end up with will be that bad, but > I've seen people get upset about performance hits much lower than > that. When we talk about a 50% hit, are we discussing (1) checksums that are checked on each I/O, or (2) checksums that are checked each time we re-pin a shared buffer? The 50% hit was my estimate of (2) and has not yet been measured, so shouldn't be used unqualified when discussing checksums. The same is also true of the "I would use it" comments, since we're not sure whether you're voting for (1) or (2). As to whether people will actually use (1), I have no clue. But I do know that many people request that feature, including people that run heavy duty Postgres production systems and who also know about filesystems. Do people need (2)? It's easy enough to add as an option, once we have (1) and there is real interest. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
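For concreteness, option (1) — verify a checksum each time a page comes off disk — would hook in roughly like this. The toy rolling checksum and the page layout (checksum stored in the first two bytes) are invented purely for illustration; they are not what any eventual patch would use:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192

/* Toy checksum over the page, skipping the checksum field itself
 * (bytes 0-1).  Illustrative only -- a real patch would use a proper
 * CRC or hash and the real page-header layout. */
static uint16_t
page_checksum(const uint8_t *page)
{
    uint32_t sum = 0;
    for (size_t i = 2; i < PAGE_SIZE; i++)
        sum = (sum * 31 + page[i]) & 0xFFFF;
    return (uint16_t) sum;
}

/* Would run just before the page is written out. */
static void
page_set_checksum(uint8_t *page)
{
    uint16_t c = page_checksum(page);
    page[0] = (uint8_t) (c & 0xFF);
    page[1] = (uint8_t) (c >> 8);
}

/* Option (1): called once per read I/O.  Option (2) would instead run
 * this every time a shared buffer is (re)pinned -- same check, many
 * more executions, hence the much larger estimated hit. */
static int
page_verify_checksum(const uint8_t *page)
{
    uint16_t stored = (uint16_t) (page[0] | (page[1] << 8));
    return stored == page_checksum(page);
}
```

The cost difference between (1) and (2) is entirely in how often `page_verify_checksum` runs, not in the check itself.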
Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)
On Mon, Jan 23, 2012 at 4:01 PM, Tom Lane wrote: > The real > problem there is that BufFreelistLock is also used to protect the > clock sweep pointer. Agreed > I think basically we gotta find a way to allow > multiple backends to run clock sweeps concurrently. Or else fix > things so that the freelist never (well, hardly ever) runs dry. Agreed. The only question is what do we do now. I'm happy with the thought of jam tomorrow, but I'd like to see some minor improvements now, given that when we say "we'll do that even better in the next release" it often doesn't happen. Which is where those patches come in. I've posted an improved lock wait analysis patch and have scheduled some runs on heavily used systems to see what this tells us. Results from live machines also greatly appreciated, since test systems seem likely to be unrepresentative. Patch copied here again. This is a key issue since RAM is cheap enough now that people are swamped by it, so large shared_buffers are very common. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)
On Mon, Jan 23, 2012 at 5:06 AM, Robert Haas wrote: > On Mon, Jan 23, 2012 at 12:12 AM, Tom Lane wrote: >> Robert Haas writes: >>> On Sat, Jan 21, 2012 at 5:29 PM, Jim Nasby wrote: We should also look at having the freelist do something useful, instead of just dropping it completely. Unfortunately that's probably more work... >> >>> That's kinda my feeling as well. The free list in its current form is >>> pretty much useless, but I don't think we'll save much by getting rid >>> of it, because that's just a single test. The expensive part of what >>> we do while holding BufFreelistLock is, I think, iterating through >>> buffers taking and releasing a spinlock on each one (!). >> >> Yeah ... spinlocks that, by definition, will be uncontested. > > What makes you think that they are uncontested? It is automatically partitioned over 131,072 spinlocks if shared_buffers=1GB, so the chances of contention seem pretty low. If a few very hot index root blocks are heavily contested, still that would only be encountered a few times out of the 131,072. I guess we could count them, similar to your LWLOCK_STATS changes. > Or for that matter, > that even an uncontested spinlock operation is cheap enough to do > while holding a badly contended LWLock? Is the concern that getting an uncontested spinlock has lots of instructions, or that the associated fences are expensive? > >> So I think >> it would be advisable to prove rather than just assume that that's a >> problem. > > It's pretty trivial to prove that there is a very serious problem with > BufFreelistLock. I'll admit I can't prove what the right fix is just > yet, and certainly measurement is warranted. From my own analysis and experiments, the mere act of getting the BufFreelistLock when it is contended is far more important than anything actually done while holding that lock.
When the clock-sweep is done often enough that the lock is contended, then it is done often enough to keep the average usagecount low, and so the average number of buffers visited during a clock sweep before finding a usable one was around 2.0. YMMV. Can we get some people who run busy high-CPU servers that churn the cache, and think they may be getting hit with this problem, to post the results of this query: select usagecount, count(*) from pg_buffercache group by usagecount; Cheers, Jeff
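For readers following along, the clock sweep being measured here works roughly as follows; the struct fields are simplified stand-ins for the real buffer headers, not the actual freelist.c code:

```c
#include <assert.h>

/* Simplified stand-in for the buffer-header state relevant here. */
typedef struct
{
    int usage_count;   /* decremented by the sweep; capped at a small max */
    int pinned;        /* non-zero if some backend has the buffer pinned */
} Buf;

/* Advance the clock hand over nbuf buffers, decrementing usage_count as
 * we go, and return the index of the first unpinned buffer that has
 * reached zero.  When the average usage_count is low (as in the
 * measurements above), this typically terminates after about 2 probes.
 * The loop bound assumes usage counts cap at 5, so a bounded number of
 * rotations suffices; returns -1 if everything is pinned. */
static int
clock_sweep(Buf *bufs, int nbuf, int *hand)
{
    for (int spins = 0; spins < nbuf * 6; spins++)
    {
        int cur = *hand;
        *hand = (cur + 1) % nbuf;
        if (bufs[cur].pinned)
            continue;
        if (bufs[cur].usage_count == 0)
            return cur;              /* victim found */
        bufs[cur].usage_count--;     /* give it another lap */
    }
    return -1;                       /* everything pinned */
}
```

The contention problem under discussion is that, today, the whole search runs while holding BufFreelistLock, so every backend needing a victim buffer serializes on it.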
Re: [HACKERS] Page Checksums
On Tue, Jan 24, 2012 at 3:02 AM, wrote: >> * Robert Treat: >> >>> Would it be unfair to assert that people who want checksums but aren't >>> willing to pay the cost of running a filesystem that provides >>> checksums aren't going to be willing to make the cost/benefit trade >>> off that will be asked for? Yes, it is unfair of course, but it's >>> interesting how small the camp of those using checksummed filesystems >>> is. >> >> Don't checksumming file systems currently come bundled with other >> features you might not want (such as certain vendors)? > > I would chip in and say that I would prefer sticking to well-known proved > filesystems like xfs/ext4 and let the application do the checksumming. > *shrug* You could use Illumos or BSD and you'd get generally vendor-free systems using ZFS, which I'd say offers more well-known and proved checksumming than anything cooking in linux land, or than the yet-to-be-written checksumming in postgres. > I don't foresee fully production-ready checksumming filesystems readily > available in the standard Linux distributions in the near future. > > And yes, I would for sure turn such functionality on if it were present. > That's nice to say, but most people aren't willing to take a 50% performance hit. Not saying what we end up with will be that bad, but I've seen people get upset about performance hits much lower than that. Robert Treat conjecture: xzilla.net consulting: omniti.com
Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)
On Sat, Jan 21, 2012 at 2:29 PM, Jim Nasby wrote: > On Jan 20, 2012, at 11:54 AM, Heikki Linnakangas wrote: >> So, you're proposing that we remove freelist altogether? Sounds reasonable, >> but that needs to be performance tested somehow. I'm not sure what exactly >> the test should look like, but it probably should involve an OLTP workload, >> and large tables being created, populated to fill the cache with pages from >> the table, and dropped, while the OLTP workload is running. I'd imagine that >> the worst case scenario looks something like that. > > We should also look at having the freelist do something useful, instead of > just dropping it completely. Unfortunately that's probably more work... If the head and tail are both protected by BufFreelistLock, I'm pretty sure this will make things worse, not better. If we change to having head and tail protected by separate spinlocks, then I don't see how you can remove the last buffer from the list, or add a buffer to an empty list, without causing havoc. Does anyone have ideas for implementing a cross-platform, lock-free, FIFO linked list? Without that, I don't see how we are going to get anywhere on this approach. Cheers, Jeff
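One data point on the separate-locks hazard raised above: the classic Michael & Scott two-lock queue sidesteps exactly the empty-list and last-element conflicts by keeping a permanent dummy node, so enqueue only ever touches the tail and dequeue only the head. It is not lock-free, but it shows the two locks never fighting over the same node. A minimal sketch, with pthread mutexes standing in for spinlocks (illustrative only, not proposed buffer-manager code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct QNode
{
    int           value;
    struct QNode *next;
} QNode;

typedef struct
{
    QNode          *head;       /* always points at the dummy node */
    QNode          *tail;
    pthread_mutex_t head_lock;
    pthread_mutex_t tail_lock;
} Queue;

static void
queue_init(Queue *q)
{
    QNode *dummy = calloc(1, sizeof(QNode));
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->head_lock, NULL);
    pthread_mutex_init(&q->tail_lock, NULL);
}

static void
queue_push(Queue *q, int value)
{
    QNode *node = calloc(1, sizeof(QNode));
    node->value = value;
    pthread_mutex_lock(&q->tail_lock);
    q->tail->next = node;       /* link after current tail (maybe the dummy) */
    q->tail = node;
    pthread_mutex_unlock(&q->tail_lock);
}

/* Returns 1 and stores the value on success, 0 if the queue is empty. */
static int
queue_pop(Queue *q, int *value)
{
    pthread_mutex_lock(&q->head_lock);
    QNode *dummy = q->head;
    QNode *first = dummy->next;
    if (first == NULL)          /* empty: only the dummy remains */
    {
        pthread_mutex_unlock(&q->head_lock);
        return 0;
    }
    *value = first->value;
    q->head = first;            /* first becomes the new dummy */
    pthread_mutex_unlock(&q->head_lock);
    free(dummy);
    return 1;
}
```

Because the dummy node absorbs the empty/one-element transitions, pushes and pops can proceed concurrently under different locks without the havoc described above; whether that is cheap enough compared to a single BufFreelistLock is exactly what would need measuring.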
Re: [HACKERS] Inline Extension
On Mon, Jan 23, 2012 at 7:59 PM, Daniel Farina wrote: > Things are still a bit ugly in the more complex cases: consider > PostGIS's linkage against libproj and other libraries. In order to > cover all cases, I feel that what I need is an optional hook (for the > sake of argument, let's say it's another "command" type hook, e.g. > "archive_command") to be executed when extension (un)installation > occurs on a primary or is replayed on a standby whereby I can acquire > the necessary dependencies for an extension or signal some kind of > error (as to exactly how that interfaces with the server is delicate, > should one want to supply good error messages to the user). There aren't a whole lot of good options for handling errors on the standby, in general. If the operation has already occurred on the master, it's too late to decide that we're not going to have it happen on the standby as well. Of course, if we confine ourselves to control and SQL files, then it doesn't really matter: the catalog entries from the master will make it to the standby regardless of the absence of the SQL and control files, and maybe we could simply decide that the operations which write in pg_extension aren't WAL-logged and can be separately performed on the standby if you want them there as well. But, that's a bit unlike how we normally do things. And if we're going to WAL-log the writing of the extension files, then I start to feel like we should go back to putting the data in a system catalog rather than the filesystem, because inventing a new WAL record type for this seems like it will make things quite a bit more complicated for no particular benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On Mon, Jan 23, 2012 at 5:49 PM, Merlin Moncure wrote: > I'm not sure that you're getting anything with that user facing > complexity. The only realistic case I can see for explicit control of > wire formats chosen is to defend your application from format changes > in the server when upgrading the server and/or libpq. This isn't a > "let's get better compression" problem, this is an "I upgraded my > database and my application broke" problem. > > Fixing this problem in non documentation fashion is going to require a > full protocol change, period. Our current protocol allocates a 2-byte integer for the purposes of specifying the type of each parameter, and another 2-byte integer for the purpose of specifying the result type... but only one bit is really needed at present: text or binary. If we revise the protocol version at some point, we might want to use some of that bit space to allow some more fine-grained negotiation of the protocol version. So, for example, we might define the top 5 bits as reserved (always pass zero), the next bit as a text/binary flag, and the remaining 10 bits as a 10-bit "format version number". When a change like this comes along, we can bump the highest binary format version recognized by the server, and clients who request the new version can get it. Alternatively, we might conclude that a 2-byte integer for each parameter is overkill and try to cut back... but the point is there's a bunch of unused bitspace there now. In theory we could even do something like this without bumping the protocol version since the documentation seems clear that any value other than 0 and 1 yields undefined behavior, but in practice that seems like it might be a bit too edgy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
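The bit layout sketched above could be packed and unpacked as follows; the exact bit positions and macro names are illustrative, not a proposal for the protocol document. Note that under today's protocol only values 0 (text) and 1 (binary) are defined, so this layout would only apply after the version bump described:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical carve-up of the existing 16-bit format field:
 * bits 15-11 reserved (must be zero), bit 10 the text/binary flag,
 * bits 9-0 a 10-bit format version number. */
#define FMT_BINARY_BIT   (1u << 10)
#define FMT_VERSION_MASK 0x03FFu

static inline uint16_t
fmt_pack(int is_binary, unsigned version)
{
    assert(version <= FMT_VERSION_MASK);   /* only 10 bits of version */
    return (uint16_t) ((is_binary ? FMT_BINARY_BIT : 0) | version);
}

static inline int
fmt_is_binary(uint16_t code)
{
    return (code & FMT_BINARY_BIT) != 0;
}

static inline unsigned
fmt_version(uint16_t code)
{
    return code & FMT_VERSION_MASK;
}
```

With this carve-up, `fmt_pack(0, 0)` still yields the legacy text value 0, while a client requesting, say, binary format revision 3 would send 1027, and a server that only knows revision 2 could refuse or fall back.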
Re: [HACKERS] basic pgbench runs with various performance-related patches
On Tue, Jan 24, 2012 at 1:26 AM, Tatsuo Ishii wrote: >> ** pgbench, permanent tables, scale factor 100, 300 s ** >> 1 group-commit-2012-01-21 614.425851 -10.4% >> 8 group-commit-2012-01-21 4705.129896 +6.3% >> 16 group-commit-2012-01-21 7962.131701 +2.0% >> 24 group-commit-2012-01-21 13074.939290 -1.5% >> 32 group-commit-2012-01-21 12458.962510 +4.5% >> 80 group-commit-2012-01-21 12907.062908 +2.8% > > Interesting. Comparing with this: > http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php > you achieved very small enhancement. Do you think of any reason which > makes the difference? My test was run with synchronous_commit=off, so I didn't expect the group commit patch to have much of an impact. I included it mostly to see whether by chance it helped anyway (since it also helps other WAL flushes, not just commits) or whether it caused any regression. One somewhat odd thing about these numbers is that, on permanent tables, all of the patches seemed to show regressions vs. master in single-client throughput. That's a slightly difficult result to believe, though, so it's probably a testing artifact of some kind. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)
Hi! New version of patch is attached. On Thu, Dec 22, 2011 at 11:46 AM, Jeff Davis wrote: > A few comments: > > * In range_gist_picksplit, it would be nice to have a little bit more > intuitive description of what's going on with the nonEmptyCount and > nonInfCount numbers. For instance, it appears to depend on the fact that > a range must either be in nonEmptyCount or in nonInfCount. Also, can you > describe the reason you're multiplying by two and taking the absolute > value? It seems to work, but I am missing the intuition behind those > operations. > total_count - 2*nonEmptyCount = (total_count - nonEmptyCount) - nonEmptyCount = emptyCount - nonEmptyCount So, it's really not evident. I've simplified it. > * The penalty function is fairly hard to read still. At a high level, I > think we're trying to accomplish a few things (in order from most to > least important): > - Keep normal ranges separate. > - Avoid broadening the class of the original predicate (e.g. turning > single-infinite into double-infinite). > - Avoid broadening (as determined by subtype_diff) the original > predicate. > - Favor adding ranges to narrower original predicates. > > Do you agree? If so, perhaps we can attack those more directly and it > might be a little more readable. > > Additionally, the arbitrary numbers might become a problem. Can we > choose better constants there? They would still be arbitrary when > compared with real numbers derived from subtype_diff, but maybe we can > still do better than what's there. > I've changed some comments and added constants for the penalty values. > * Regarding the leftover "common" entries that can go to either side: > what is the "delta" measure trying to accomplish? When I work through > some examples, it seems to favor putting larger common ranges on the > left (small delta) and smaller common ranges on the right (smaller > delta). Why is that good? Or did I misread the code?
> > Intuitively, I would think that we'd want to push the ranges with lower > upper bounds to the left and higher lower bounds to the right -- in > other words, recurse. Obviously, we'd need to make sure it terminated at > some point, but splitting the common entries does seem like a smaller > version of the original problem. Thoughts? > That was a bug. Actually, no "abs" is needed. Indeed it doesn't affect the result significantly. - With best regards, Alexander Korotkov. rangetypegist-0.6.patch.gz
Re: [HACKERS] New replication mode: write
On Tue, Jan 24, 2012 at 10:47 AM, Fujii Masao wrote: >>> I'm afraid I could not understand your idea. Could you explain it in >>> more detail? >> >> We either tell libpqwalreceiver about the latch, or we tell >> walreceiver about the socket used by libpqwalreceiver. >> >> In either case we share a pointer from one module to another. > > The former seems difficult because it's not easy to link libpqwalreceiver.so > to the latch. I will consider the latter. Yes, it might be too hard, but let's look. >>> You mean to change the meaning of apply_location? Currently it indicates >>> the end + 1 of the last replayed WAL record, regardless of whether it's >>> a commit record or not. So too many replies can be sent per incoming >>> message because it might contain many WAL records. But you mean to >>> change apply_location only when a commit record is replayed? >> >> There is no change to the meaning of apply_location. The only change >> is that we send that message only when it has an updated value of >> committed lsn. > > This means that apply_location might return a different location than > pg_last_xlog_replay_location() on the standby, though in 9.1 they return > the same. Which might confuse a user. No? The two values only match on a quiet system anyway, since both are moving forwards. They will still match on a quiet system. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Online base backup from the hot-standby
On Tue, Jan 24, 2012 at 10:54 AM, Simon Riggs wrote: > On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao wrote: > >>> I'll proceed to commit for this now. >> >> Thanks a lot! > > Can I just check a few things? Just to clarify, not expecting another patch version, just reply here and I can edit. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Online base backup from the hot-standby
On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao wrote: >> I'll proceed to commit for this now. > > Thanks a lot! Can I just check a few things? You say /* +* Update full_page_writes in shared memory and write an +* XLOG_FPW_CHANGE record before resource manager writes cleanup +* WAL records or checkpoint record is written. +*/ why does it need to be before the cleanup and checkpoint? You say /* +* Currently only non-exclusive backup can be taken during recovery. +*/ why? You mention in the docs "The backup history file is not created in the database cluster backed up." but we need to explain the bad effect, if any. You say "If the standby is promoted to the master during online backup, the backup fails." but no explanation of why? I could work those things out, but I don't want to have to, plus we may disagree if I did. There are some good explanations in comments of other things, just not everywhere needed. What happens if we shut down the WALwriter and then issue SIGHUP? Are we sure we want to make the change of file format mandatory? That means earlier versions of clients such as pg_basebackup will fail against this version. Should we assume the backup was taken from the master if BACKUP FROM is missing? The main docs don't yet explain that the new feature is available, or its restrictions. I expect you will add that later after commit. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] New replication mode: write
On Mon, Jan 23, 2012 at 9:53 PM, Simon Riggs wrote: > On Mon, Jan 23, 2012 at 10:03 AM, Fujii Masao wrote: To make the walreceiver call WaitLatchOrSocket(), we would need to merge it and libpq_select() into one function. But the former is a backend function and the latter is a frontend one. I have no good idea for merging them cleanly. >>> >>> We can wait on the socket wherever it comes from. poll/select doesn't >>> care how we got the socket. >>> >>> So we just need a common handler that calls either >>> walreceiver/libpqwalreceiver function as required to handle the >>> wakeup. >> >> I'm afraid I could not understand your idea. Could you explain it in >> more detail? > > We either tell libpqwalreceiver about the latch, or we tell > walreceiver about the socket used by libpqwalreceiver. > > In either case we share a pointer from one module to another. The former seems difficult because it's not easy to link libpqwalreceiver.so to the latch. I will consider the latter. If we send back the reply as soon as the Apply pointer is changed, I'm afraid quite a lot of reply messages would be sent frequently, which might cause a performance problem. This is also one of the reasons why I didn't implement the quick-response feature. To address this problem, we might need to change the master so that it sends the Wait pointer to the standby, and change the standby so that it replies whenever the Apply pointer catches up with the Wait one. This can reduce the number of useless replies from the standby about the Apply pointer. >>> >>> We send back one reply per incoming message. The incoming messages >>> don't know request state and checking that has a cost which I don't >>> think is an appropriate payment since we only need this info when the >>> link goes quiet. >>> >>> When the link goes quiet we still need to send replies if we have >>> apply mode, but we only need to send apply messages if the lsn has >>> changed because of a commit.
That will considerably reduce the >>> messages sent so I don't see a problem. >> >> You mean to change the meaning of apply_location? Currently it indicates >> the end + 1 of the last replayed WAL record, regardless of whether it's >> a commit record or not. So too many replies can be sent per incoming >> message because it might contain many WAL records. But you mean to >> change apply_location only when a commit record is replayed? > > There is no change to the meaning of apply_location. The only change > is that we send that message only when it has an updated value of > committed lsn. This means that apply_location might return a different location than pg_last_xlog_replay_location() on the standby, though in 9.1 they return the same. Which might confuse a user. No? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] WAL Restore process during recovery
On Tue, Jan 24, 2012 at 6:49 PM, Simon Riggs wrote: > On Tue, Jan 24, 2012 at 9:43 AM, Fujii Masao wrote: >> On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs wrote: >>> On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao wrote: Why does walrestore need to be invoked even when restore_command is not specified? It seems to be useless. We invoke walreceiver only when primary_conninfo is specified now. Similarly we should invoke walrestore only when restore_command is specified? >>> >>> walreceiver is shutdown and restarted in case of failed connection. >>> That never happens with walrestore because the command is run each >>> time - when we issue system(3) a new process is forked to run the >>> command. So there is no specific cleanup to perform and so no reason >>> for a managed cleanup process. >>> >>> So I can't see a specific reason to change that. Do you think it makes >>> a difference? >> >> Yes. When restore_command is not specified in recovery.conf, walrestore >> process doesn't do any useful activity and just wastes CPU cycles. Which >> might be harmless for the functionality of recovery, but ISTM it's better not >> to start up walrestore in that case to avoid the waste of cycles. > > It just sleeps on a latch when it has nothing to do, so no wasted cycles. Right, since the walrestore process wakes up just every 10 seconds, the waste of cycles is low. But what I feel uncomfortable with is that the walrestore process has nothing to do *from start to end*, when restore_command is not specified, but it's started up. I guess that many people would be surprised at that. Of course, if restore_command could be changed without restarting the server, I would agree with you because the walrestore process might do something useful later. But currently not. > Asking the postmaster seemed the easier option, I guess I could have > chosen the other way also. > > I'll look at this when this is the last thing left to resolve to see > if that improves things. Okay.
Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Online base backup from the hot-standby
On Mon, Jan 23, 2012 at 10:11 PM, Simon Riggs wrote: > On Mon, Jan 23, 2012 at 10:29 AM, Fujii Masao wrote: >> On Fri, Jan 20, 2012 at 11:34 PM, Simon Riggs wrote: >>> On Fri, Jan 20, 2012 at 12:54 PM, Fujii Masao wrote: Thanks for the review! On Fri, Jan 20, 2012 at 8:15 PM, Simon Riggs wrote: > I'm looking at this patch and wondering why we're doing so many > press-ups to ensure the full_page_writes parameter is on. This will still > fail if you use a utility that removes the full page writes, but fail > silently. > > I think it would be beneficial to explicitly check that all WAL > records have full page writes actually attached to them until we > achieve consistency. I agree that it's worth adding such a safeguard. That can be a self-contained feature, so I'll submit a separate patch for that, to keep each patch small. >>> >>> Maybe, but you mean do this now as well? Not sure I like silent errors. >> >> If many people think the patch is not acceptable without such a safeguard, >> I will do that right now. Otherwise, I'd like to take more time to do >> that, i.e., >> add it to 9.2dev Open Items. > >> I've not come up with a good idea. An ugly idea is to keep track of all replays of >> full_page_writes for every buffer page (i.e., prepare a 1-bit-per-buffer-page >> table and set the specified bit to 1 when full_page_writes is applied), >> and then check whether full_page_writes has already been applied when >> replaying a normal WAL record... Do you have any better idea? > > Not sure. > > I think the only possible bug here is one introduced by an outside utility. > > In that case, I don't think it should be the job of the backend to go > too far to protect against such an atypical error. So if we can't solve > it fairly easily and with no overhead then I'd say let's skip it. We > could easily introduce a bug here just by having faulty checking code. > > So let's add it to 9.2 open items as a non-priority item. Agreed. > I'll proceed to commit for this now. Thanks a lot!
Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] WAL Restore process during recovery
On Tue, Jan 24, 2012 at 9:43 AM, Fujii Masao wrote: > On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs wrote: >> On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao wrote: >>> Why does walrestore need to be invoked even when restore_command is >>> not specified? It seems to be useless. We invoke walreceiver only when >>> primary_conninfo is specified now. Similarly we should invoke walrestore >>> only when restore_command is specified? >> >> walreceiver is shutdown and restarted in case of failed connection. >> That never happens with walrestore because the command is run each >> time - when we issue system(3) a new process is forked to run the >> command. So there is no specific cleanup to perform and so no reason >> for a managed cleanup process. >> >> So I can't see a specific reason to change that. Do you think it makes >> a difference? > > Yes. When restore_command is not specified in recovery.conf, walrestore > process doesn't do any useful activity and just wastes CPU cycles. Which > might be harmless for the functionality of recovery, but ISTM it's better not > to start up walrestore in that case to avoid the waste of cycles. It just sleeps on a latch when it has nothing to do, so no wasted cycles. Asking the postmaster seemed the easier option, I guess I could have chosen the other way also. I'll look at this when this is the last thing left to resolve to see if that improves things. >> Cleaned up the points noted, new patch attached in case you wish to >> review further. >> >> Still has bug, so still with me to fix. > > Thanks! Will review further. Much appreciated. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WAL Restore process during recovery
On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs wrote: > On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao wrote: >> Why does walrestore need to be invoked even when restore_command is >> not specified? It seems to be useless. We invoke walreceiver only when >> primary_conninfo is specified now. Similarly we should invoke walrestore >> only when restore_command is specified? > > walreceiver is shutdown and restarted in case of failed connection. > That never happens with walrestore because the command is run each > time - when we issue system(3) a new process is forked to run the > command. So there is no specific cleanup to perform and so no reason > for a managed cleanup process. > > So I can't see a specific reason to change that. Do you think it makes > a difference? Yes. When restore_command is not specified in recovery.conf, walrestore process doesn't do any useful activity and just wastes CPU cycles. Which might be harmless for the functionality of recovery, but ISTM it's better not to start up walrestore in that case to avoid the waste of cycles. > Cleaned up the points noted, new patch attached in case you wish to > review further. > > Still has bug, so still with me to fix. Thanks! Will review further. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Page Checksums
> I would chip in and say that I would prefer sticking to well-known proved > filesystems like xfs/ext4 and let the application do the checksumming. Yes, that's a different way of putting my concern. If you want a proven file system with checksumming (and an fsck), options are really quite limited. > And yes, I would for sure turn such functionality on if it were present. Same here. I already use page-level checksums with Berkeley DB. -- Florian Weimer BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Re: [HACKERS] Page Checksums
> * Robert Treat: > >> Would it be unfair to assert that people who want checksums but aren't >> willing to pay the cost of running a filesystem that provides >> checksums aren't going to be willing to make the cost/benefit trade >> off that will be asked for? Yes, it is unfair of course, but it's >> interesting how small the camp of those using checksummed filesystems >> is. > > Don't checksumming file systems currently come bundled with other > features you might not want (such as certain vendors)? I would chip in and say that I would prefer sticking to well-known proved filesystems like xfs/ext4 and let the application do the checksumming. I don't foresee fully production-ready checksumming filesystems readily available in the standard Linux distributions in the near future. And yes, I would for sure turn such functionality on if it were present. -- Jesper