[HACKERS] minor spelling error fix (btis -> bits)
Attached please find a minor spelling error fix, changing "btis" to "bits".

-- Jon

From f590a6dce6677bc5b8a409d40fd651ecb69b27bb Mon Sep 17 00:00:00 2001
From: Jon Nelson
Date: Sun, 23 Mar 2014 08:23:48 -0500
Subject: [PATCH] - fix minor spelling error

---
 src/backend/utils/adt/network.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index 1f8469a..b126494 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -1136,7 +1136,7 @@ network_scan_first(Datum in)
 /*
  * return "last" IP on a given network. It's the broadcast address,
- * however, masklen has to be set to its max btis, since
+ * however, masklen has to be set to its max bits, since
  * 192.168.0.255/24 is considered less than 192.168.0.255/32
  *
  * inet_set_masklen() hacked to max out the masklength to 128 for IPv6
--
2.10.2

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send
On Thu, Mar 16, 2017 at 9:59 AM, David Steele wrote:
> On 1/9/17 11:33 PM, Jon Nelson wrote:
> > On Sat, Jan 7, 2017 at 7:48 PM, Jim Nasby <mailto:jim.na...@bluetreble.com> wrote:
> > > On 1/5/17 12:55 PM, Jonathon Nelson wrote:
> > > > Attached please find a patch for PostgreSQL 9.4 which changes the
> > > > maximum amount of data that the wal sender will send at any point in
> > > > time from the hard-coded value of 128KiB to a user-controllable
> > > > value up to 16MiB. It has been primarily tested under 9.4 but there
> > > > has been some testing with 9.5.
> > >
> > > To make sure this doesn't get lost, please add it to
> > > https://commitfest.postgresql.org. Please verify the patch will
> > > apply against current HEAD and pass make check-world.
> >
> > Attached please find a revision of the patch, changed in the following
> > ways:
> >
> > 1. removed a call to debug2.
> > 2. applies cleanly against master (as of
> > 8c5722948e831c1862a39da2bb5d793a6f2aabab)
> > 3. one small indentation fix, one small verbiage fix.
> > 4. switched to calculating the upper bound using XLOG_SEG_SIZE rather
> > than hard-coding 16384.
> > 5. the git author is - obviously - different.
> >
> > make check-world passes.
> > I have added it to the commitfest.
> > I have verified with strace that up to 16MB sends are being used.
> > I have verified that the GUC properly grumps about values greater than
> > XLOG_SEG_SIZE / 1024 or smaller than 4.
>
> This patch applies cleanly on cccbdde and compiles. However,
> documentation in config.sgml is needed.
>
> The concept is simple enough though there seems to be some argument
> about whether or not the patch is necessary. In my experience 128K
> should be more than large enough for a chunk size, but I'll buy the
> argument that libpq is acting as a barrier in this case.
>
> I'm marking this patch "Waiting on Author" for required documentation.

Thank you for testing and the comments.
I have some updates:

- I set up a network at home and - in some very quick testing - was unable
  to observe any obvious performance difference regardless of chunk size.
- Before I could get any real testing done, one of the machines I was using
  for testing died and won't even POST, which has put a damper on said
  testing (as you might imagine).
- There is a small issue with the patch: a lower-bound of 4 is not
  appropriate; it should be XLOG_BLCKSZ / 1024 (I can submit an updated
  patch if that is appropriate).
- I am, at this time, unable to replicate the earlier results; however, I
  can't rule them out, either.

-- Jon
Re: [HACKERS] Faster methods for getting SPI results
On Thu, Mar 2, 2017 at 10:03 AM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:
> On 12/20/16 23:14, Jim Nasby wrote:
> > I've been looking at the performance of SPI calls within plpython.
> > There's a roughly 1.5x difference from equivalent python code just in
> > pulling data out of the SPI tuplestore. Some of that is due to an
> > inefficiency in how plpython is creating result dictionaries, but fixing
> > that is ultimately a dead-end: if you're dealing with a lot of results
> > in python, you want a tuple of arrays, not an array of tuples.
>
> There is nothing that requires us to materialize the results into an
> actual list of actual rows. We could wrap the SPI_tuptable into a
> Python object and implement __getitem__ or __iter__ to emulate sequence
> or mapping access.

Python objects have a small (but non-zero) overhead in terms of both memory
and speed. A built-in dictionary is probably one of the least-expensive
(memory/cpu) choices, although how the dictionary is constructed also
impacts performance. Another choice is a tuple. Avoiding Py_BuildValue(...)
in exchange for a bit more complexity (via PyTuple_New(...) and
PyTuple_SetItem(...)) is also a nice performance win in my experience.

-- Jon
Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send
On Sat, Jan 7, 2017 at 7:48 PM, Jim Nasby wrote:
> On 1/5/17 12:55 PM, Jonathon Nelson wrote:
>> Attached please find a patch for PostgreSQL 9.4 which changes the
>> maximum amount of data that the wal sender will send at any point in
>> time from the hard-coded value of 128KiB to a user-controllable value up
>> to 16MiB. It has been primarily tested under 9.4 but there has been some
>> testing with 9.5.
>
> To make sure this doesn't get lost, please add it to
> https://commitfest.postgresql.org. Please verify the patch will apply
> against current HEAD and pass make check-world.

Attached please find a revision of the patch, changed in the following ways:

1. removed a call to debug2.
2. applies cleanly against master (as of
8c5722948e831c1862a39da2bb5d793a6f2aabab)
3. one small indentation fix, one small verbiage fix.
4. switched to calculating the upper bound using XLOG_SEG_SIZE rather
than hard-coding 16384.
5. the git author is - obviously - different.

make check-world passes.
I have added it to the commitfest.
I have verified with strace that up to 16MB sends are being used.
I have verified that the GUC properly grumps about values greater than
XLOG_SEG_SIZE / 1024 or smaller than 4.

-- Jon

From 7286e290daec32e12260e9e1e44a490c13ed2245 Mon Sep 17 00:00:00 2001
From: Jon Nelson
Date: Mon, 9 Jan 2017 20:00:41 -0600
Subject: [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

---
 src/backend/replication/walsender.c | 13 +++--
 src/backend/utils/misc/guc.c        | 12
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f3082c3..4a9eb16 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -91,7 +91,7 @@
  * because signals are checked only between messages. 128kB (with
  * default 8k blocks) seems like a reasonable guess for now.
  */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
+int max_wal_send_guc = 0;

 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;

@@ -2194,7 +2194,7 @@ retry:
 /*
  * Send out the WAL in its normal physical/stored form.
  *
- * Read up to MAX_SEND_SIZE bytes of WAL that's been flushed to disk,
+ * Read up to max_wal_send kilobytes of WAL that's been flushed to disk,
  * but not yet sent to the client, and buffer it in the libpq output
  * buffer.
  *
@@ -2208,6 +2208,7 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	int			max_wal_send = max_wal_send_guc * 1024;

 	if (streamingDoneSending)
 	{
@@ -2346,8 +2347,8 @@ XLogSendPhysical(void)
 	/*
 	 * Figure out how much to send in one message. If there's no more than
-	 * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
-	 * MAX_SEND_SIZE bytes, but round back to logfile or page boundary.
+	 * max_wal_send bytes to send, send everything. Otherwise send
+	 * max_wal_send bytes, but round back to logfile or page boundary.
 	 *
 	 * The rounding is not only for performance reasons. Walreceiver relies on
 	 * the fact that we never split a WAL record across two messages. Since a
@@ -2357,7 +2358,7 @@ XLogSendPhysical(void)
 	 */
 	startptr = sentPtr;
 	endptr = startptr;
-	endptr += MAX_SEND_SIZE;
+	endptr += max_wal_send;

 	/* if we went beyond SendRqstPtr, back off */
 	if (SendRqstPtr <= endptr)
@@ -2376,7 +2377,7 @@ XLogSendPhysical(void)
 	}

 	nbytes = endptr - startptr;
-	Assert(nbytes <= MAX_SEND_SIZE);
+	Assert(nbytes <= max_wal_send);

 	/*
 	 * OK to read and send the slice.

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5b23dbf..b9a0c47 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -112,6 +112,7 @@
 extern char *default_tablespace;
 extern char *temp_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool synchronize_seqscans;
+extern int	max_wal_send_guc;

 #ifdef TRACE_SYNCSCAN
 extern bool trace_syncscan;
@@ -2342,6 +2343,17 @@ static struct config_int ConfigureNamesInt[] =
 	},

 	{
+		{"max_wal_send", PGC_SIGHUP, REPLICATION_SENDING,
+			gettext_noop("Sets the maximum WAL payload size for WAL replication."),
+			NULL,
+			GUC_UNIT_KB
+		},
+		&max_wal_send_guc,
+		128, 4, XLOG_SEG_SIZE / 1024,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"commit_delay", PGC_SUSET, WAL_SETTINGS,
			gettext_noop("Sets the delay in microseconds between transaction commit and "
						 "flushing WAL to disk."),
--
2.10.2
Re: [HACKERS] macaddr 64 bit (EUI-64) datatype support
On Tue, Nov 22, 2016 at 8:42 AM, Tom Lane wrote:
> Haribabu Kommi writes:
>> Any suggestions for the name to be used for the new datatype that can
>> work for both 48 and 64 bit MAC addresses?
>
> The precedent of int4/int8/float4/float8 is that SQL data types should
> be named after their length in bytes. So I'd be inclined to call this
> "macaddr8" not "macaddr64". That would suggest taking the simple
> approach of always storing values in the 8-byte format, rather than
> dealing with the complexities of having two formats internally, two
> display formats, etc.

Be that as it may, almost everybody else (outside the db world?) uses
bits. The C types, for example, are expressed in bits (int8_t, int64_t,
etc...).

>> While comparing a 48 bit MAC address with a 64 bit MAC address, ignore
>> the two bytes if the contents in those bytes are reserved bytes.
>
> Um ... I don't follow. Surely these must compare different:
>
> 01-01-01-FF-FE-01-01-01
> 01-01-01-FF-0E-01-01-01

What's more, that approach requires two comparisons and some logic versus
the possibility of a single memcmp.

-- Jon
Re: [HACKERS] Change pg_cancel_*() to ignore current backend
On May 20, 2015 6:43 AM, "David Steele" wrote:
> On 5/20/15 1:40 AM, Jim Nasby wrote:
>> On 5/19/15 9:19 PM, Fabrízio de Royes Mello wrote:
>>> We could add a second parameter to the current functions:
>>> allow_own_pid DEFAULT false. To me that seems better than an
>>> entirely separate set of functions.
>>>
>>> +1 to add a second parameter to current functions.
>>
>> Instead of allow_own_pid, I went with skip_own_pid. I have the function
>> still returning true even when it skips its own PID... that seems a bit
>> weird, but I think it's better than returning false. Unless someone
>> thinks it should return NULL, but I don't see that as any better either.
>
> +1. I agree that cancelling/killing your own process should not be the
> default behavior.

-1. It breaks backwards compatibility. I use this function a fair bit to
terminate the current backend. Changing the signature is fine, but make
the default behavior retain the current behavior.

> --
> - David Steele
> da...@pgmasters.net
Re: [HACKERS] Change pg_cancel_*() to ignore current backend
On Wed, May 20, 2015 at 9:09 AM, Tom Lane wrote:
> I think backwards compatibility probably trumps that argument. I have
> no objection to providing a different call that behaves this way, but
> changing the behavior of existing applications will face a *much*
> higher barrier to acceptance. Especially since a real use-case for
> the current behavior was shown upthread, which means you can't argue
> that it's simply a bug.
>
> regards, tom lane

Agreed. It breaks backwards compatibility; I use this function a fair bit
to terminate the current backend.

-- Jon
Re: [HACKERS] CTE that result in repeated sorting of the data
On Thu, May 15, 2014 at 4:50 PM, David G Johnston wrote:
> Jon Nelson-14 wrote
>> I was watching a very large recursive CTE get built today and this CTE
>> involves on the order of a dozen or so "loops" joining the initial
>> table against existing tables. It struck me that - every time through
>> the loop the tables were sorted and then joined and that it would be
>> much more efficient if the tables remained in a sorted state and could
>> avoid being re-sorted each time through the loop. Am I missing
>> something here? I am using PG 8.4 if that matters.
>
> I'm not sure what you mean by "watching" but maybe this is as simple as
> changing your CTE to use "UNION ALL" instead of "UNION [DISTINCT]"?

In fact, I'm using UNION ALL.

> If you really think it could be improved upon maybe you can help and provide
> a minimal self-contained example query and data that exhibits the behavior
> you describe so others can see it and test changes? It would be nice to
> know if other versions than one that is basically no longer supported
> exhibit the same behavior.

Pretty much any CTE that looks like this:

with cte AS (
    select stuff from A
    UNION ALL
    select more_stuff from B, cte WHERE ...
)
SELECT * FROM cte;

*and* where the planner chooses to join B and cte by sorting and doing a
merge join. I'll see if I can come up with a self-contained example.

-- Jon
[HACKERS] CTE that result in repeated sorting of the data
I was watching a very large recursive CTE get built today, and this CTE involves on the order of a dozen or so "loops" joining the initial table against existing tables. It struck me that every time through the loop the tables were sorted and then joined, and that it would be much more efficient if the tables remained in a sorted state and could avoid being re-sorted each time through the loop. Am I missing something here? I am using PG 8.4, if that matters.

-- Jon
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
What - if anything - do I need to do to get this on the commitfest list for the next commitfest?
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
What are my next steps here? I believe the concept is sound, the code appears to work and doesn't crash, and the result does show a performance win in most cases (sometimes a big win). It's also incomplete, at least insofar as it doesn't interface with the cost models at all, etc.

-- Jon
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On Wed, Jan 22, 2014 at 3:26 PM, Tom Lane wrote:
> Jeremy Harris writes:
>> On 22/01/14 03:53, Tom Lane wrote:
>>> Jon Nelson writes:
>>>> - in createplan.c, eliding duplicate tuples is enabled if we are
>>>> creating a unique plan which involves sorting first
>>>
>>> [ raised eyebrow ... ] And what happens if the planner drops the
>>> unique step and then the sort doesn't actually go to disk?
>>
>> I don't think Jon was suggesting that the planner drop the unique step.
>
> Hm, OK, maybe I misread what he said there. Still, if we've told
> tuplesort to remove duplicates, why shouldn't we expect it to have
> done the job? Passing the data through a useless Unique step is
> not especially cheap.

That's correct - I do not propose to drop the unique step. Duplicates are
only dropped if it's convenient to do so. In one case, it's a zero-cost
drop (no extra comparison is made). In most other cases, an extra
comparison is made, typically right before writing a tuple to tape. If it
compares as identical to the previously-written tuple, it's thrown out
instead of being written.

The output of the modified code is still sorted, still *might* (and in
most cases, probably will) contain duplicates, but will (probably) contain
fewer duplicates.

-- Jon
Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
On Tue, Jan 21, 2014 at 9:53 PM, Tom Lane wrote:
> Jon Nelson writes:
>> A rough summary of the patch follows:
>>
>> - a GUC variable enables or disables this capability
>> - in nodeAgg.c, eliding duplicate tuples is enabled if the number of
>> distinct columns is equal to the number of sort columns (and both are
>> greater than zero).
>> - in createplan.c, eliding duplicate tuples is enabled if we are
>> creating a unique plan which involves sorting first
>> - ditto planner.c
>> - all of the remaining changes are in tuplesort.c, which consist of:
>>   + a new macro, DISCARDTUP, and a new structure member, discardtup, are
>>     both defined and operate similar to COMPARETUP, COPYTUP, etc...
>>   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
>>     throw out the new tuple if it compares as identical to the tuple at
>>     the top of the heap. Since we're already performing this comparison,
>>     this is essentially free.
>>   + in mergeonerun, we may discard a tuple if it compares as identical
>>     to the *last written tuple*. This is a comparison that did not take
>>     place before, so it's not free, but it saves a write I/O.
>>   + We perform the same logic in dumptuples
>
> [ raised eyebrow ... ] And what happens if the planner drops the
> unique step and then the sort doesn't actually go to disk?

I'm not familiar enough with the code to be able to answer your question
with any sort of authority, but I believe that if the state TSS_BUILDRUNS
is never hit, then basically nothing new happens.

-- Jon
[HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
Greetings -hackers:

I have worked up a patch to PostgreSQL which elides tuples during an
external sort. The primary use case is when sorted input is being used to
feed a DISTINCT operation. The idea is to throw out tuples that compare as
identical whenever it's convenient, predicated on the assumption that even
a single I/O is more expensive than some number of (potentially extra)
comparisons. Obviously, this is where a cost model comes in, which has not
been implemented. This patch is a work-in-progress. Being very new to
hacking PostgreSQL, I've probably done a bunch of things wrong, but I've
tried to test it thoroughly.

A rough summary of the patch follows:

- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
  distinct columns is equal to the number of sort columns (and both are
  greater than zero).
- in createplan.c, eliding duplicate tuples is enabled if we are creating
  a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
  + a new macro, DISCARDTUP, and a new structure member, discardtup, are
    both defined and operate similar to COMPARETUP, COPYTUP, etc...
  + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
    throw out the new tuple if it compares as identical to the tuple at
    the top of the heap. Since we're already performing this comparison,
    this is essentially free.
  + in mergeonerun, we may discard a tuple if it compares as identical
    to the *last written tuple*. This is a comparison that did not take
    place before, so it's not free, but it saves a write I/O.
  + We perform the same logic in dumptuples

I have tested this patch with a bunch of different data, and the results
are summarized below.

Test 1: I used a corpus of approximately 1.5TB of textual data, consisting
of 7.4 billion input rows, 12.6% of which is unique. This is a
'real-world' test.
Before the patch: 44.25 hours using 274GB of temporary files
After the patch: 36.44 hours using 125GB of temporary files
Difference: -7.8 hours, 18% faster

Test 2: Acquire the unique set of strings from a totally random set that
contains no duplicates. This is a 'worst case'.

Input: 700 million rows, 32GB.
Before: 18,796 seconds.
After: 19,358 seconds.
Difference: +562 seconds, *slower* by 3%.

In this particular case, the performance varies from faster to slower, and
I'm not sure why. Statistical noise?

Test 3: Acquire the unique set of integers from a set of 100 million, all
of which happen to have the value '1'. This is a 'best case'.

Input: 100 million rows, 3.4GB on-disk
Before: 26.1 seconds using 1.3GB of temporary files
After: 8.9 seconds using 8KB of disk
Difference: -17.2 seconds, 65% faster

Test 4: Similar to test 3, but using 1 billion rows.

Before: 1450 seconds, 13 GB of temporary files
After: 468 seconds, 8 KB of temporary files
Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumping
about changes in pg_settings). Tests 1 through 3 were performed against
PostgreSQL 9.4 HEAD as of early September 2013. I have rebased the patch
against PostgreSQL 9.4 HEAD as of 19 January 2014, and re-run tests 2, 3,
and 4 with similar results.

Regarding the patch: I've tried to conform to the indenting and style
rules, including using pg_bsd_indent and pg_indent, but I've not had 100%
success.

The details on test 4: a simple example, using 1 billion rows all with the
value '1' (integer). This is a fabricated *best case*.
postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=10 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=10 width=4) (actual time=34.765..254595.990 rows=10 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
 -
Re: [HACKERS] 9.4 regression
On Thu, Oct 24, 2013 at 5:41 AM, Thom Brown wrote:
> On 5 September 2013 22:24, Bruce Momjian wrote:
>> On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote:
>>> * Andres Freund (and...@2ndquadrant.com) wrote:
>>>> I vote for adapting the patch to additionally zero out the file via
>>>> write(). In your tests that seemed to perform at least as good as the
>>>> old method... It also has the advantage that we can use it a little bit
>>>> more as a testbed for possibly using it for heap extensions one day.
>>>> We're pretty early in the cycle, so I am not worried about this too
>>>> much...
>>>
>>> I dunno, I'm pretty disappointed that this doesn't actually improve
>>> things. Just following this casually, it looks like it might be some
>>> kind of locking issue in the kernel that's causing it to be slower; or
>>> at least some code path that isn't exercised terribly much and therefore
>>> hasn't been given the love that it should.
>>>
>>> Definitely interested in what Ts'o says, but if we can't figure out why
>>> it's slower *without* writing out the zeros, I'd say we punt on this
>>> until Linux and the other OS folks improve the situation.
>>
>> FYI, the patch has been reverted.
>
> Is there an updated patch available for this? And did anyone hear from
> Ts'o?

After the patch was reverted, it was not re-submitted. I have tried three
or four times to get more information out of Ts'o, without luck.

-- Jon
Re: [HACKERS] 9.4 regression
On Fri, Aug 16, 2013 at 3:57 PM, Bruce Momjian wrote:
> On Thu, Aug 15, 2013 at 12:08:57PM -0500, Jon Nelson wrote:
>>> Where are we on this issue?
>>
>> I've been able to replicate it pretty easily with PostgreSQL and
>> continue to look into it. I've contacted Theodore Ts'o and have gotten
>> some useful information, however I'm unable to replicate the behavior
>> with the test program (even one that's been modified). What I've
>> learned is:
>>
>> - XLogWrite appears to take approx. 2.5 times longer when writing to a
>> file allocated with posix_fallocate, but only the first time the file
>> contents are overwritten. This is partially explained by how ext4
>> handles extents and uninitialized data, but 2.5x is MUCH more
>> expensive than anticipated or expected here.
>> - Writing zeroes to a file allocated with posix_fallocate (essentially
>> adding a posix_fallocate step before the usual write-zeroes-in-a-loop
>> approach) not only doesn't seem to hurt performance, it seems to help
>> or at least have parity, *and* the space is guaranteed to exist on
>> disk. At the very least that seems useful.
>
> Is it time to revert this patch until we know more?

While I'm not qualified to say, my inclination is to say yes. It can
always be added back later. The only caveat there would be that - perhaps
- a small modification of the patch would be warranted. Specifically, with
posix_fallocate, I saw no undesirable behavior when the (newly allocated)
file was manually zeroed anyway. The only advantages (that I can see) to
doing it this way versus not using posix_fallocate at all are (a) a
potential reduction in the number of extents and (b) the space is
guaranteed to be on disk if posix_fallocate succeeds. My reading of the
patch is that even if posix_fallocate fails due to out-of-space
conditions, we will still try to create the file by writing out zeroes, so
perhaps the out-of-disk-space scenario isn't all that useful anyway.
I'm awaiting more information from Theodore Ts'o, but I don't expect things to materially change in the near future.

-- Jon
[HACKERS] pg_ctl initdb takes options, but pg_ctl --help doesn't document them?
Taking a look at PostgreSQL HEAD today, I noticed that pg_ctl --help documents that "pg_ctl initdb" takes options, but doesn't describe what they are (unlike for start and others). Is this intentional?

-- Jon
Re: [HACKERS] 9.4 regression
> Where are we on this issue?

I've been able to replicate it pretty easily with PostgreSQL and continue
to look into it. I've contacted Theodore Ts'o and have gotten some useful
information; however, I'm unable to replicate the behavior with the test
program (even one that's been modified). What I've learned is:

- XLogWrite appears to take approx. 2.5 times longer when writing to a
  file allocated with posix_fallocate, but only the first time the file
  contents are overwritten. This is partially explained by how ext4
  handles extents and uninitialized data, but 2.5x is MUCH more expensive
  than anticipated or expected here.
- Writing zeroes to a file allocated with posix_fallocate (essentially
  adding a posix_fallocate step before the usual write-zeroes-in-a-loop
  approach) not only doesn't seem to hurt performance, it seems to help or
  at least have parity, *and* the space is guaranteed to exist on disk. At
  the very least that seems useful.

-- Jon
Re: [HACKERS] 9.4 regression
On Thu, Aug 8, 2013 at 9:27 PM, Andres Freund wrote:
> On 2013-08-08 16:12:06 -0500, Jon Nelson wrote:
...
>> At this point I'm convinced that the issue is a pathological case in
>> ext4. The performance impact disappears as soon as the unwritten
>> extent(s) are written to with real data. Thus, even though allocating
>> files with posix_fallocate is - frequently - orders of magnitude
>> quicker than doing it with write(2), the subsequent re-write can be
>> more expensive. At least, that's what I'm gathering from the various
>> threads.
>>
>> Why this issue didn't crop up in earlier testing and why I
>> can't seem to make test_fallocate do it (even when I modify
>> test_fallocate to write to the newly-allocated file in a mostly-random
>> fashion) has me baffled.
>
> It might be kernel version specific and concurrency seems to play a
> role. If you reproduce the problem, could you run a "perf record -ga" to
> collect a systemwide profile?

Finally, an excuse to learn how to use 'perf'! I'll try to provide that
info when I am able.

> There's some more things to test:
> - is the slowdown dependent on the scale? I.e is it visible with -j 1 -c 1?

scale=1 (-j 1 -c 1): with fallocate: 685 tps; without: 727
scale=20: with fallocate: 129; without: 402
scale=40: with fallocate: 163; without: 511

> - Does it also occur in synchronous_commit=off configurations? Those
>   don't fdatasync() from so many backends, that might play a role.

With synchronous_commit=off, the performance is vastly improved.
Interestingly, the fallocate case is (immaterially) faster than the
non-fallocate case: 3766 tps vs. 3700 tps.

I tried a few other wal_sync_methods besides the default of fdatasync, all
with scale=80.

fsync: 198 tps (with fallocate) vs. 187
open_sync: 195 tps (with fallocate) vs. 192

> - do bulkloads see it? E.g. the initial pgbench load?

time pgbench -s 200 -p 54320 -d pgb -i
with fallocate: 2m47s
without: 2m50s

Hopefully the above is useful.
-- Jon
Re: [HACKERS] 9.4 regression
On Thu, Aug 8, 2013 at 5:25 PM, Jon Nelson wrote:
> On Thu, Aug 8, 2013 at 5:23 PM, Tom Lane wrote:
>> Jon Nelson writes:
>>> On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane wrote:
>>>> Does your test program use all the same writing options that the real
>>>> WAL writes do (like O_DIRECT)?
>>
>>> I believe so.
>>
>>>> From xlog.c:
>>
>>> /* do not use get_sync_bit() here --- want to fsync only at end of fill */
>>> fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
>>>                    S_IRUSR | S_IWUSR);
>>
>>> and from the test program:
>>
>>> fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
>>
>> Maybe I misunderstood, but I thought the performance complaint had to do
>> with the actual writes of WAL data, not with the pre-fill. That is, you
>> should not just be measuring how long the pre-fill takes, but what is the
>> speed of real writes to the file later on (which will be using
>> get_sync_bit, for various values of the sync settings).
>
> Ah, no, I misunderstood your question.
> I'm fairly certain the test program doesn't open up files with any
> sort of sync bit set.
> I'll have to see what PostgreSQL is using, exactly, and get back to you.

My reading of xlog.c seems to indicate that if the sync mode is fdatasync,
then the file is not opened with any special sync bit, but an fdatasync
occurs at the end. That is also what the test program does.

-- Jon
Re: [HACKERS] 9.4 regression
On Thu, Aug 8, 2013 at 5:23 PM, Tom Lane wrote:
> Jon Nelson writes:
>> On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane wrote:
>>> Does your test program use all the same writing options that the real
>>> WAL writes do (like O_DIRECT)?
>
>> I believe so.
>
>>> From xlog.c:
>
>> /* do not use get_sync_bit() here --- want to fsync only at end of fill */
>> fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
>>                    S_IRUSR | S_IWUSR);
>
>> and from the test program:
>
>> fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
>
> Maybe I misunderstood, but I thought the performance complaint had to do
> with the actual writes of WAL data, not with the pre-fill. That is, you
> should not just be measuring how long the pre-fill takes, but what is the
> speed of real writes to the file later on (which will be using
> get_sync_bit, for various values of the sync settings).

Ah, no, I misunderstood your question. I'm fairly certain the test
program doesn't open up files with any sort of sync bit set. I'll have
to see what PostgreSQL is using, exactly, and get back to you.

-- Jon
Re: [HACKERS] 9.4 regression
On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane wrote:
> Jon Nelson writes:
>> At this point I'm convinced that the issue is a pathological case in
>> ext4. The performance impact disappears as soon as the unwritten
>> extent(s) are written to with real data. Thus, even though allocating
>> files with posix_fallocate is - frequently - orders of magnitude
>> quicker than doing it with write(2), the subsequent re-write can be
>> more expensive. At least, that's what I'm gathering from the various
>> threads. Why this issue didn't crop up in earlier testing and why I
>> can't seem to make test_fallocate do it (even when I modify
>> test_fallocate to write to the newly-allocated file in a mostly-random
>> fashion) has me baffled.
>
> Does your test program use all the same writing options that the real
> WAL writes do (like O_DIRECT)?

I believe so. From xlog.c:

/* do not use get_sync_bit() here --- want to fsync only at end of fill */
fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
                   S_IRUSR | S_IWUSR);

and from the test program:

fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

PG_BINARY expands to 0 on non-Windows. I also tried using O_WRONLY in
xlog.c, without change.

-- Jon
Re: [HACKERS] 9.4 regression
On Thu, Aug 8, 2013 at 4:24 PM, Bruce Momjian wrote: > On Thu, Aug 8, 2013 at 04:12:06PM -0500, Jon Nelson wrote: >> On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing wrote: >> > On 08/08/2013 05:28 PM, Jon Nelson wrote: >> ... >> >> > Just an idea - can you check if using a fillfactor different form 100 >> > changes anything >> > >> > pgbench -s 20 -p 54320 -d pgb -F 90 -i >> > >> > >> >> pgbench -j 80 -c 80 -T 120 -p 54320 pgb >> >> pg_ctl -D tt -w stop >> > >> > That is, does extending tables and indexes to add updated tuples play >> > any role here >> >> I gave it a go - it didn't make any difference at all. >> At this point I'm convinced that the issue is a pathological case in >> ext4. The performance impact disappears as soon as the unwritten >> extent(s) are written to with real data. Thus, even though allocating >> files with posix_fallocate is - frequently - orders of magnitude >> quicker than doing it with write(2), the subsequent re-write can be >> more expensive. At least, that's what I'm gathering from the various >> threads. Why this issue didn't crop up in earlier testing and why I >> can't seem to make test_fallocate do it (even when I modify >> test_fallocate to write to the newly-allocated file in a mostly-random >> fashion) has me baffled. Should this feature be reconsidered? > > How much slower would it be if we wrote it with zeros after > posix_fallocate() --- that would still give use single extents. Has > anyone tested to see if the write without test_fallocate() still gives > us one extent? Actually, I did that test - I found that there was no longer a performance drop and possibly a slight performance benefit. I couldn't quite parse your last sentence, I assume you mean "does skipping the posix_fallocate altogether still produce one extent?" Sadly, I can't really test that properly as the ext4 filesystem I'm testing on is quite fresh and has lots and lots of free space. 
-- Jon
Re: [HACKERS] 9.4 regression
On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing wrote:
> On 08/08/2013 05:28 PM, Jon Nelson wrote:
> ...
> Just an idea - can you check if using a fillfactor different from 100
> changes anything
>
> pgbench -s 20 -p 54320 -d pgb -F 90 -i
>
>> pgbench -j 80 -c 80 -T 120 -p 54320 pgb
>> pg_ctl -D tt -w stop
>
> That is, does extending tables and indexes to add updated tuples play
> any role here

I gave it a go - it didn't make any difference at all. At this point I'm
convinced that the issue is a pathological case in ext4. The performance
impact disappears as soon as the unwritten extent(s) are written to with
real data. Thus, even though allocating files with posix_fallocate is -
frequently - orders of magnitude quicker than doing it with write(2),
the subsequent re-write can be more expensive. At least, that's what I'm
gathering from the various threads. Why this issue didn't crop up in
earlier testing and why I can't seem to make test_fallocate do it (even
when I modify test_fallocate to write to the newly-allocated file in a
mostly-random fashion) has me baffled. Should this feature be
reconsidered?

-- Jon
Re: [HACKERS] 9.4 regression
A follow-up. I found this thread that seems to explain some things:

http://comments.gmane.org/gmane.comp.file-systems.ext4/33024

Short version: if we are writing into the "middle" of the
newly-fallocated file on ext4 (with extents) the extent tree can
fragment badly, causing poor performance due to extent merging.

I also ran some tests with xfs, and the results were even stranger: xfs
performed very slightly better with fallocate vs. without.

xfs (fallocate) 216 tps vs. (no fallocate) 206
ext4 (no fallocate) 605 vs. (fallocate) 134

I made an ext4 filesystem without extents using the same block device as
above - a real, single hard drive with nothing else on it. On this
filesystem, the performance remained the same (or - perhaps - improved
very slightly): 633 tps (with fallocate) vs. 607.

Using the following additions to postgresql.conf:

max_connections = 500
shared_buffers = 1GB
effective_cache_size = 1GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
commit_delay = 50
commit_siblings = 15
checkpoint_segments = 256
log_checkpoints = on

and this partial script for testing:

pg_ctl initdb -D tt -w
cp ~/postgresql.conf tt
pg_ctl -D tt -w start
createdb -p 54320 pgb
pgbench -s 20 -p 54320 -d pgb -i
pgbench -j 80 -c 80 -T 120 -p 54320 pgb
pg_ctl -D tt -w stop

-- Jon
Re: [HACKERS] 9.4 regression
On Wed, Aug 7, 2013 at 10:05 PM, Andres Freund wrote: > On 2013-08-07 20:23:55 +0100, Thom Brown wrote: >> >>> 269e78 was the commit immediately after 8800d8, so it appears that >> >>> introduced the regression. >> >>> >> >>> "Use posix_fallocate() for new WAL files, where available." >> >> >> >> This is curious. Could you either run a longer test before/after the >> >> commit or reduce checkpoint_timeout to something like 3min? >> > >> > Okay, I'll rerun the test for both those commits at 1 hour each with >> > checkpoint_timeout set at 3mins, but with all other configuration >> > settings the same >> >> Results >> (checkpoint_timeout = 3min) >> >> pgbench -j 80 -c 80 -T 3600 >> >> 269e78: 606.268013 >> 8800d8: 779.583129 > > Ok, so the performance difference is lower. But, it seems to be still to > high to be explaineable by performance differences during > initialization/fallocate. > In a very quick test, I don't see the same on my laptop. I am currently > travelling and I don't have convenient access to anything else. > > Could you: > - run filefrag on a couple of wal segments of both clusters after the > test and post the results here? For me, there is no meaningful difference, but I have a relatively fresh filesystem (ext4) with lots of free space. > - enable log_checkpoints, post the lines spat out together with the results > - run pgbench without reinitializing the cluster/pgbench tables > inbetween? When I do this (as I note below), the performance difference (for me) disappears. > Basically, I'd like to know whether its the initialization that's slow > (measurable because we're regularly creating new files because of a too > low checkpoint_segments) or whether it's the actual writes. What I've found so far is very confusing. 
I start by using initdb to initialize a temporary cluster, copy in a
postgresql.conf (with the modifications from Thom Brown tweaked for my
small laptop), start the cluster, create a test database, initialize it
with pgbench, and then run. I'm also only running for two minutes at
this time. Every time I test, the non-fallocate version is faster. I
modified xlog.c to use posix_fallocate (or not) based on an environment
variable. Once the WAL files have been rewritten at least once, then it
doesn't seem to matter which method is used to allocate them (although
posix_fallocate seems to have a slight edge). I even went so far as to
modify the code to posix_fallocate the file *and then zero it out
anyway*, and it was almost as fast as zeroing alone, and faster than
using posix_fallocate alone.

>> Jon, here are the test results you asked for:
>>
>> $ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
>> method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
>> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
>> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
>> method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
>> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
>> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
>> method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
>> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
>> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
>> method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
>> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
>> method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
>> method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
>> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
>> method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s
>
> Ok, this seems to indicate that fallocate works nicely. Jon, wasn't
> there a version of the test that rewrote the file afterwards?

Yes. If you use a different number besides '1' as the third argument in
the command line above, it will rewrite the file that many times.

-- Jon
Re: [HACKERS] 9.4 regression
On Wed, Aug 7, 2013 at 4:18 PM, Kevin Grittner wrote:
> Thom Brown wrote:
>
>> pgbench -j 80 -c 80 -T 3600
>>
>> 269e78: 606.268013
>> 8800d8: 779.583129

I have also been running some tests and - as yet - they are
inconclusive. What I can say about them so far is that - at times - they
agree with these results and do show a performance hit. I took the
settings as posted and adjusted them slightly for my much lower-powered
personal laptop, changing checkpoint_completion_target to 1.0 and
checkpoint_timeout to 1min. I am testing with the latest git HEAD,
turning fallocate support on and off by editing xlog.c directly.
Furthermore, before each run I would try to reduce the number of
existing WAL files by making a "pre" run with checkpoint_segments = 3
before changing it to 32.

For reasons I don't entirely understand, when WAL files are being
created (rather than recycled) I found a performance hit, but when they
were being recycled I got a slight performance gain (this may be due to
reduced fragmentation), though the gain was never more than 10% and
frequently less than that. I can't explain - yet (if ever!) - why my
initial tests (and many of those done subsequently) showed improvement -
it may very well have something to do with how the code interacts with
these settings.

-- Jon
Re: [HACKERS] 9.4 regression
On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown wrote: > On 7 August 2013 17:54, Jon Nelson wrote: >> On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown wrote: >>> Hi all, >>> >>> I recently tried a simple benchmark to see how far 9.4 had come since >>> 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for >>> performance. After checking 9.2 and 9.3 (as per Kevin Grittner's >>> suggestion), I found that those were fine, so the issue must be in >>> 9.4devel. I used identical configurations for each test, and used >>> 9.1's pgbench to ensure changes in pgbench didn't affect the outcome. >>> The common config changes were: >> >> ... >> >>> 8.4: 812.482108 >>> 9.4 HEAD: 355.397658 >>> 9.4 e5592c (9th July): 356.485625 >>> 9.4 537227 (7th July): 365.992518 >>> 9.4 9b2543 (7th July): 362.587339 >>> 9.4 269e78 (5th July): 359.439143 >>> 9.4 8800d8 (5th July): 821.933082 >>> 9.4 568d41 (2nd July): 822.991160 >>> >>> 269e78 was the commit immediately after 8800d8, so it appears that >>> introduced the regression. >>> >>> "Use posix_fallocate() for new WAL files, where available." >>> >>> Ironically, that was intended to be a performance improvement. >> >> Would it be possible for you to download, compile, and run the test >> program as described and located in this email: >> >> http://www.postgresql.org/message-id/cakuk5j1acml-1cgbhnrzed-vh4og+8hkmfhy2eca-8jbj-6...@mail.gmail.com > > I shall do after the 2 x 1 hour benchmarks are complete. > >> I also wonder if there is a problem with the 3.8.0 kernel specifically. > > Well my laptop has the same kernel (and also 64-bit Linux Mint), so > took 3 quick sample benchmarks on those two commits, and I get the > following (all without --enable-cassert): > > 269e78: 1162.593665 / 1158.644302 / 1153.955611 > 8800d8: 2446.087618 / 2443.797252 / 2321.876431 > > And running your test program gives the following (again, just on my laptop): > > for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done > method: classic. 
1 open/close iterations, 1 rewrite in 0.6380s
> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
> method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
> method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
> method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
> method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
> method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
> method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s

OK. That's interesting, and a good data point. One thing you could try
is manually disabling the use of posix_fallocate in 269e78. After
running:

./configure --stuff-here

edit src/include/pg_config.h and comment out the following line (on or
around line 374):

#define HAVE_POSIX_FALLOCATE 1

*then* build PostgreSQL and see if the performance hit is still there.

-- Jon
Re: [HACKERS] 9.4 regression
On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown wrote:
> Hi all,
>
> I recently tried a simple benchmark to see how far 9.4 had come since
> 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
> performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
> suggestion), I found that those were fine, so the issue must be in
> 9.4devel. I used identical configurations for each test, and used
> 9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
> The common config changes were:

...

> 8.4: 812.482108
> 9.4 HEAD: 355.397658
> 9.4 e5592c (9th July): 356.485625
> 9.4 537227 (7th July): 365.992518
> 9.4 9b2543 (7th July): 362.587339
> 9.4 269e78 (5th July): 359.439143
> 9.4 8800d8 (5th July): 821.933082
> 9.4 568d41 (2nd July): 822.991160
>
> 269e78 was the commit immediately after 8800d8, so it appears that
> introduced the regression.
>
> "Use posix_fallocate() for new WAL files, where available."
>
> Ironically, that was intended to be a performance improvement.

Would it be possible for you to download, compile, and run the test
program as described and located in this email:

http://www.postgresql.org/message-id/cakuk5j1acml-1cgbhnrzed-vh4og+8hkmfhy2eca-8jbj-6...@mail.gmail.com

I also wonder if there is a problem with the 3.8.0 kernel specifically.

-- Jon
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Fri, Jul 5, 2013 at 2:23 AM, Greg Smith wrote:
> On 7/5/13 2:50 AM, Jeff Davis wrote:
>>
>> So, my simple conclusion is that glibc emulation should be about the
>> same as what we're doing now, so there's no reason to avoid it. That
>> means, if posix_fallocate() is present, we should use it, because it's
>> either the same (if emulated in glibc) or significantly faster (if
>> implemented in the kernel).
>
> That's what I'm seeing everywhere too. I'm happy that we've spent enough
> time chasing after potential issues without finding anything now. Pull out
> the GUC that was added for default and this is ready to commit.

Wonderful. Is removing the GUC something that I should do, or should
that be done by somebody that knows more about what they are doing? (I
am happy to give it a go!) Should the small test program that I made
also be included somewhere in the source tree?

-- Jon
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Sun, Jun 30, 2013 at 11:52 PM, Greg Smith wrote:
> On 6/30/13 9:28 PM, Jon Nelson wrote:
>>
>> The performance of the latter (new) test sometimes seems to perform
>> worse and sometimes seems to perform better (usually worse) than
>> either of the other two. In all cases, posix_fallocate performs
>> better, but I don't have a sufficiently old kernel to test with.
>
> This updated test program looks reliable now. The numbers are very tight
> when I'd expect them to be, and there's nowhere with the huge differences I
> saw in the earlier test program.
>
> Here's results from a few sets of popular older platforms:

If you found yourself with a spare moment, could you run these again
with the number of open/close cycles set high (say, 100) and the number
of rewrites set to 0 and also to 1? Most of the time is actually spent
overwriting the files, so by reducing or eliminating that aspect it
might be easier to get a handle on the actual performance difference.

-- Jon
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Sun, Jun 30, 2013 at 6:49 PM, Greg Smith wrote:
> On 5/28/13 10:00 PM, Jon Nelson wrote:
>
>> A note: The attached test program uses *fsync* instead of *fdatasync*
>> after calling fallocate (or writing out 16MB of zeroes), per an
>> earlier suggestion.
>
> I tried this out on the RHEL5 platform I'm worried about now. There's
> something weird about the test program there. If I run it once it shows
> posix_fallocate running much faster:
>
> without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
> with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

Assuming the platform chosen is using the glibc approach of pwrite(4
bytes) every 4KiB, then the results ought to be similar, and I'm at a
loss to explain why it's performing better (unless - grasping at straws
- simply the *volume* of data transferred from userspace to the kernel
is at play, in which case posix_fallocate will result in 4096 calls to
pwrite at 4 bytes each versus 2048 calls to write at 8KiB each).
Ultimately the same amount of data gets written to disk (one would
imagine), but otherwise I can't really think of much.

I have also found several errors in the test_fallocate.c program I
posted, corrected below. One of them is that it is missing two pairs of
parentheses around two #defines:

#define SIXTEENMB 1024*1024*16
#define EIGHTKB 1024*8

should be:

#define SIXTEENMB (1024*1024*16)
#define EIGHTKB (1024*8)

Otherwise the program will end up writing 131072 8KiB blocks instead of
2048. This actually makes the comparison between writing 8KiB blocks and
using posix_fallocate favor the latter more strongly in the results
(also seen below).
> The problem is that I'm seeing the gap between the two get smaller the more
> iterations I run, which makes me wonder if the test is completely fair:
>
> without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
> with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s
>
> without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
> with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s
>
> without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
> with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s
>
> You didn't show any output from the latest program on your system, so I'm
> not sure how it behaved for you here.

On the platform I use - openSUSE (12.3, x86_64, kernel 3.9.7 currently)
- I never see posix_fallocate perform worse. Typically better, sometimes
much better. To set the number of times the file is overwritten to just
1 (one):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done

I am including a revised version of test_fallocate.c that corrects the
above-noted error and one typo (from when I changed fdatasync to fsync)
that did not alter program behavior, corrects a misplaced gettimeofday
(which does change the results), and includes a new test that aims
(perhaps poorly) to emulate the glibc style of pwrite(4 bytes) for every
4KiB, and tests the resulting file size to make sure it is 16MiB. The
performance of the latter (new) test is sometimes worse and sometimes
better (usually worse) than either of the other two. In all cases,
posix_fallocate performs better, but I don't have a sufficiently old
kernel to test with. The new results on one machine are below.

With 0 (zero) rewrites (testing *just*
open/some_type_of_allocation/fsync/close):

method: classic. 100 open/close iterations, 0 rewrite in 29.6060s
method: posix_fallocate. 100 open/close iterations, 0 rewrite in 2.1054s
method: glibc emulation. 100 open/close iterations, 0 rewrite in 31.7445s

And with the same number of rewrites as open/close cycles:

method: classic. 1 open/close iterations, 1 rewrite in 0.6297s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3028s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.5521s
method: classic. 2 open/close iterations, 2 rewrite in 1.6455s
method: posix_fallocate. 2 open/close iterations, 2 rewrite in 1.0409s
method: glibc emulation. 2 open/close iterations, 2 rewrite in 1.5604s
method: classic. 5 open/close iterations, 5 rewrite in 7.5916s
method: posix_fallocate. 5 open/close iterations, 5 rewrite in 6.9177s
method: glibc emulation. 5 open/close iterations, 5 rewrite in 8.1137s
method: classic. 10 open/close iterations, 10 rewrite in 29.2816s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 28.4400s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 31.2693s

-- Jon

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

#define SIXTEENMB (1024*1024*16)
#define FOURKB (1024*4)
#define EIGHTKB (1024*8)

void writeout(int fd, char *buf)
{
    int i;
    for (i = 0; i < SIXTEENMB / EIGHTKB; ++i) {
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Sun, Jun 30, 2013 at 5:55 PM, Greg Smith wrote:
>
> pwrite(4, "\0", 1, 16769023) = 1
> pwrite(4, "\0", 1, 16773119) = 1
> pwrite(4, "\0", 1, 16777215) = 1
>
> That's glibc helpfully converting your call to posix_fallocate into small
> writes, because the OS doesn't provide a better way in that kernel. It's
> not hard to imagine this being slower than what the WAL code is doing right
> now. I'm not worried about correctness issues anymore, but my gut paranoia
> about this not working as expected on older systems was justified. Everyone
> who thought I was just whining owes me a cookie.

I had noted in the very early part of the thread that glibc emulates
posix_fallocate when the (Linux-specific) 'fallocate' system call fails.
In this case, it's writing 4 bytes of zeros and then essentially seeking
forward 4092 (4096-4) bytes. This prevents files with holes in them
because the holes have to be at least 4KiB in size, if I recall
properly. It's *not* writing out 16MiB in 4-byte increments.

-- Jon
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Sun, Jun 16, 2013 at 9:59 PM, Jon Nelson wrote:
> On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis wrote:
>> On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
..
>> * You check for the presence of posix_fallocate at configure time, but
>> don't #ifdef the call site. It looks like you removed this from the v2
>> patch, was there a reason for that? Won't that cause build errors for
>> platforms without it?
>
> Indeed! I believe I have corrected the error and will be sending out a
> revised patch (v5), once the performance and back-testing have
> completed.

Please find attached v5 of the patch, with the above correction in
place. The only change from the v4 patch is wrapping the if
(wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE.

Thanks!

-- Jon

fallocate-v5.patch
Description: Binary data
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis wrote:
> On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
>> Ack. I've revised the patch to always have the GUC (for now), default
>> to false, and if configure can't find posix_fallocate (or the user
>> disables it by way of pg_config_manual.h) then it remains a GUC that
>> simply can't be changed.
>
> Why have a GUC here at all? Perhaps this was already discussed, and I
> missed it? Is it just for testing purposes, or did you intend for it to
> be in the final version?

As Greg Smith noted, right now it's only for testing purposes.

> * The other code assumes that no errno means ENOSPC. We should be
> consistent about that assumption, and do the same thing in the fallocate
> case.

Unlike write(2), posix_fallocate does *not* set errno, but instead
returns a subset of the errno values as its return code. The current
approach was suggested to me by Andres Freund and Alvaro Herrera.

> * You check for the presence of posix_fallocate at configure time, but
> don't #ifdef the call site. It looks like you removed this from the v2
> patch, was there a reason for that? Won't that cause build errors for
> platforms without it?

Indeed! I believe I have corrected the error and will be sending out a
revised patch (v5), once the performance and back-testing have
completed.

-- Jon
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Tue, Jun 11, 2013 at 12:49 PM, Stefan Drees wrote:
> On 2013-06-11 19:45 CEST, Greg Smith wrote:
>>
>> On 6/11/13 12:22 PM, Merlin Moncure wrote:
>>
>>> Personally I think this patch should go in regardless -- the concerns
>>> made IMNSHO are specious.
>>
>> That's nice, but we have this process for validating whether features go
>> in or not that relies on review instead of opinions.
>
> ;-) that's why I played with the test_fallocate.c, as it was easy to do and
> I understood, the author (of the patch) wanted to trigger some reviews ... I
> do not (yet) know anything about the core codes, so I leave this to the
> hackers. My review result was, that with newer gcc's you should specify an
> open mode flag as third argument of the fopen call, as only with the test
> tool nothing important found.

If you change line 68 to read:

fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

then it should work just fine (assuming you have not already done so).

-- Jon
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
There hasn't been much activity here recently. I'm curious, then, if
there are questions that I can answer. It may be useful to summarize
some things here:

- the purpose of the patch is to use posix_fallocate when creating new
  WAL files, because it's (usually) much quicker
- using posix_fallocate is *one* system call versus 2048 calls to write(2)
- additionally, using posix_fallocate /guarantees/ that the filesystem
  has space for the WAL file (by spec)
- reportedly (difficult to test or prove), using posix_fallocate *may*
  reduce file fragmentation
- the (limited) testing I've done bears this out: the more new WAL file
  creation there is, the more the improvement. Once the number of WAL
  files reaches a constant point, there does not appear to be either a
  positive or a negative performance impact. This is as expected.
- a test program (C) was also written and used which creates, allocates,
  and then writes to files as fast as possible. This test program also
  shows the expected performance benefits.
- the performance benefits range from a few percent up to about 15 percent

Concerns:

- some were concerned that the spec makes no claims about posix_fallocate
  being able to guarantee that the space allocated has zeroes in it. This
  was discussed here and on the Linux kernel mailing list, wherein the
  expected behavior is that it does provide zeroes
- most systems don't allocate a great many new WAL files, so the
  performance benefit is small

Benefits:

- new WAL file allocation is much quicker and more efficient (fewer
  system calls)
- the patch is (reportedly - I'm not a good judge here!) quite small

-- Jon
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Tue, May 28, 2013 at 10:36 AM, Greg Smith wrote: > On 5/28/13 11:12 AM, Jon Nelson wrote: >> >> It opens a new file, fallocates 16MB, calls fdatasync. > > > Outside of the run for performance testing, I think it would be good at this > point to validate that there is really a 16MB file full of zeroes resulting > from these operations. I am not really concerned that posix_fallocate might > be slower in some cases; that seems unlikely. I am concerned that it might > result in a file that isn't structurally the same as the 16MB of zero writes > implementation used now. util-linux comes with fallocate which (might?) suffice for testing in that respect, no? If that is a real concern, it could be made part of the autoconf testing, perhaps. > The timing program you're writing has some aspects that are similar to the > contrib/pg_test_fsync program. You might borrow some code from there > usefully. Thanks! If it looks like what I'm attaching will not do, then I'll look at that as a possible next step. > To clarify the suggestion I was making before about including performance > test results: that doesn't necessarily mean the testing code must run using > only the database. That's better if possible, but as Robert says it may not > be for some optimizations. The important thing is to have something > measuring the improvement that a reviewer can duplicate, and if that's a > standalone benchmark problem that's still very useful. The main thing I'm > wary of is any "this should be faster" claims that don't come with any > repeatable measurements at all. Very often theories about the fastest way > to do something don't match what's actually seen in testing. Ack. A note: The attached test program uses *fsync* instead of *fdatasync* after calling fallocate (or writing out 16MB of zeroes), per an earlier suggestion. 
-- Jon

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>

#define SIXTEENMB 1024*1024*16
#define EIGHTKB 1024*8

void writeout(int fd, char *buf)
{
	int i;

	for (i = 0; i < SIXTEENMB / EIGHTKB; ++i) {
		if (write(fd, buf, EIGHTKB) != EIGHTKB) {
			fprintf(stderr, "Error in write: %m!\n");
			exit(1);
		}
	}
}

int main(int argc, char *argv[])
{
	int with_fallocate, open_close_iterations, rewrite_iterations;
	int fd, i, j;
	double tt;
	struct timeval tv1, tv2;
	char *buf0, *buf1;
	const char *filename; /* for convenience */

	if (argc != 4) {
		fprintf(stderr, "Usage: %s <filename> <open_close_iterations> <rewrite_iterations>\n", argv[0]);
		exit(1);
	}
	filename = argv[1];
	open_close_iterations = atoi(argv[2]);
	if (open_close_iterations < 0) {
		fprintf(stderr, "Error parsing 'open_close_iterations'!\n");
		exit(1);
	}
	rewrite_iterations = atoi(argv[3]);
	if (rewrite_iterations < 0) {
		fprintf(stderr, "Error parsing 'rewrite_iterations'!\n");
		exit(1);
	}

	buf0 = malloc(SIXTEENMB);
	if (!buf0) {
		fprintf(stderr, "Unable to allocate memory!\n");
		exit(1);
	}
	memset(buf0, 0, SIXTEENMB);

	buf1 = malloc(SIXTEENMB);
	if (!buf1) {
		fprintf(stderr, "Unable to allocate memory!\n");
		exit(1);
	}
	memset(buf1, 1, SIXTEENMB);

	for (with_fallocate = 0; with_fallocate < 2; ++with_fallocate) {
		gettimeofday(&tv1, NULL);	/* start timing before the first open */
		for (i = 0; i < open_close_iterations; ++i) {
			fd = open(filename, O_CREAT | O_EXCL | O_WRONLY);
			if (fd < 0) {
				fprintf(stderr, "Error opening file: %m\n");
				exit(1);
			}
			if (with_fallocate) {
				if (posix_fallocate(fd, 0, SIXTEENMB) != 0) {
					fprintf(stderr, "Error in posix_fallocate!\n");
					exit(1);
				}
			} else {
				writeout(fd, buf0);
			}
			if (fsync(fd)) {
				fprintf(stderr, "Error in fsync: %m!\n");
				exit(1);
			}
			for (j = 0; j < rewrite_iterations; ++j) {
				lseek(fd, 0, SEEK_SET);
				writeout(fd, buf1);
				if (fdatasync(fd)) {
					fprintf(stderr, "Error in fdatasync: %m!\n");
					exit(1);
				}
			}
			if (close(fd)) {
				fprintf(stderr, "Error in close: %m!\n");
				exit(1);
			}
			unlink(filename); /* don't check for error */
		}
		gettimeofday(&tv2, NULL);
		/* elapsed time in seconds (timeval carries microseconds) */
		tt = (tv2.tv_usec + tv2.tv_sec * 1000000.0) - (tv1.tv_usec + tv1.tv_sec * 1000000.0);
		tt /= 1000000.0;
		fprintf(stderr, "with%s posix_fallocate: %d open/close iterations, %d rewrites in %0.4fs\n",
				(with_fallocate ? "" : "out"), open_close_iterations, rewrite_iterations, tt);
		sleep(5);
	}

	/* cleanup */
	free(buf0);
	free(buf1);
	return 0;
}

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
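Greg's structural concern above, namely that a posix_fallocate'd file really is "16MB full of zeroes", can be checked with a small read-back helper along these lines (an illustrative sketch, not part of the patch; the function name is made up):

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if the first 'len' bytes of 'fd' read back as zeroes, else 0.
 * Run against a freshly posix_fallocate'd file, this verifies it is
 * structurally the same as one built from explicit zero writes. */
static int reads_as_zeroes(int fd, long len)
{
    unsigned char buf[8192];
    long done = 0;

    if (lseek(fd, 0, SEEK_SET) != 0)
        return 0;
    while (done < len) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n <= 0)
            return 0;               /* short file or read error */
        for (ssize_t k = 0; k < n; ++k)
            if (buf[k] != 0)
                return 0;
        done += n;
    }
    return 1;
}
```

The util-linux fallocate tool plus cmp against /dev/zero would accomplish the same thing from the shell.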
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Tue, May 28, 2013 at 9:19 AM, Robert Haas wrote: > On Tue, May 28, 2013 at 10:15 AM, Andres Freund > wrote: >> On 2013-05-28 10:03:58 -0400, Robert Haas wrote: >>> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson >>> wrote: >>> >> The biggest thing missing from this submission is information about what >>> >> performance testing you did. Ideally performance patches are submitted >>> >> with >>> >> enough information for a reviewer to duplicate the same test the author >>> >> did, >>> >> as well as hard before/after performance numbers from your test system. >>> >> It >>> >> often turns tricky to duplicate a performance gain, and being able to run >>> >> the same test used for initial development eliminates a lot of the >>> >> problems. >>> > >>> > This has been a bit of a struggle. While it's true that WAL file >>> > creation doesn't happen with great frequency, and while it's also true >>> > that - with strace and other tests - it can be proven that >>> > fallocate(16MB) is much quicker than writing it zeroes by hand, >>> > proving that in the larger context of a running install has been >>> > challenging. >>> >>> It's nice to be able to test things in the context of a running >>> install, but sometimes a microbenchmark is just as good. I mean, if >>> posix_fallocate() is faster, then it's just faster, right? >> >> Well, it's a bit more complex than that. Fallocate doesn't actually >> initializes the disk space in most filesystems, just marks it as >> allocated and zeroed which is one of the reasons it can be noticeably >> faster. But that can make the runtime overhead of writing to those pages >> higher. > > Maybe it would be good to measure that impact. Something like this: > > 1. Write 16MB of zeroes to an empty file in the same size chunks we're > currently using (8kB?). Time that. Rewrite the file with real data. > Time that. > 2. posix_fallocate() an empty file out to 16MB. Time that. Rewrite > the fie with real data. Time that. 
> > Personally, I have trouble believing that writing 16MB of zeroes by > hand is "better" than telling the OS to do it for us. If that's so, > the OS is just stupid, because it ought to be able to optimize the > crap out of that compared to anything we can do. Of course, it is > more than possible that the OS is in fact stupid. But I'd like to > hope not. I wrote a little C program to do something very similar to that (which I'll hope to post later today). It opens a new file, fallocates 16MB, calls fdatasync. Then it loops 10 times: seek to the start of the file, writes 16MB of ones, calls fdatasync. Then it closes and removes the file, re-opens it, and this time writes out 16MB of zeroes, calls fdatasync, and then does the same loop as above. The program times the process from file open to file unlink, inclusive. The results - for me - are pretty consistent: using fallocate is 12-13% quicker than writing out zeroes. I used fdatasync twice to (attempt) to mimic what the WAL writer does. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Thu, May 16, 2013 at 7:05 PM, Greg Smith wrote: > On 5/16/13 9:16 AM, Jon Nelson wrote: >> >> Am I doing this the right way? Should I be posting the full patch each >> time, or incremental patches? > > > There are guidelines for getting your patch in the right format at > https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git > that would improve this one. You have some formatting issues with tab > spacing at lines 120 through 133 in your v3 patch. And it looks like there > was a formatting change on line 146 that is making the diff larger than it > needs to be. I've corrected the formatting change (end-of-line whitespace was stripped) on line 146. The other whitespace changes are - I think - due to newly-indented code due to a new code block. Included please find a v4 patch which uses context diffs per the above url. > The biggest thing missing from this submission is information about what > performance testing you did. Ideally performance patches are submitted with > enough information for a reviewer to duplicate the same test the author did, > as well as hard before/after performance numbers from your test system. It > often turns tricky to duplicate a performance gain, and being able to run > the same test used for initial development eliminates a lot of the problems. This has been a bit of a struggle. While it's true that WAL file creation doesn't happen with great frequency, and while it's also true that - with strace and other tests - it can be proven that fallocate(16MB) is much quicker than writing it zeroes by hand, proving that in the larger context of a running install has been challenging. Attached you'll find a small test script (t.sh) which creates a new cluster in 'foo', changes some config values, starts the cluster, and then times how long it takes pgbench to prepare a database. I've used "wal_level = hot_standby" in the hopes that this generates the largest number of WAL files (and I set the number of such files to 1024). 
The hardware is an AMD 9150e with a 2-disk software RAID1 (SATA disks) on kernel 3.9.2 and ext4 (x86_64, openSUSE 12.3). The test results are not that surprising. The longer the test (the larger the scale factor) the less of a difference using posix_fallocate makes. With a scale factor of 100, I see an average of 10-11% reduction in the time taken to initialize the database. With 300, it's about 5.5% and with 900, it's between 0 and 1.2%. I will be doing more testing but this is what I started with. I'm very open to suggestions. > Second bit of nitpicking. There are already some GUC values that appear or > disappear based on compile time options. They're all debugging related > things though. I would prefer not to see this one go away when it's > implementation isn't available. That's going to break any scripts that SHOW > the setting to see if it's turned on or not as a first problem. I think the > right model to follow here is the IFDEF setup used for > effective_io_concurrency. I wouldn't worry about this too much though. > Having a wal_use_fallocate GUC is good for testing. But if it works out > well, when it's ready for commit I don't see why anyone would want it turned > off on platforms where it works. There are already too many performance > tweaking GUCs. Something has to be very likely to be changed from the > default before its worth adding one for it. Ack. I've revised the patch to always have the GUC (for now), default to false, and if configure can't find posix_fallocate (or the user disables it by way of pg_config_manual.h) then it remains a GUC that simply can't be changed. I'll also be re-running the tests. -- Jon <> fallocate-v4.patch Description: Binary data t.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Wed, May 15, 2013 at 10:36 PM, Jon Nelson wrote: > On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera > wrote: >> Jon Nelson escribió: >>> On Wed, May 15, 2013 at 4:46 PM, Jon Nelson >>> wrote: >> >>> > That's true. I originally wrote the patch using fallocate(2). What >>> > would be appropriate here? Should I switch on the return value and the >>> > six (6) or so relevant error codes? >>> >>> I addressed this, hopefully in a reasonable way. >> >> Would it work to just assign the value you got from posix_fallocate (if >> nonzero) to errno and then use %m in the errmsg() call in ereport()? > > That strikes me as a better way. I'll work something up soon. > Thanks! Please find attached version 3. Am I doing this the right way? Should I be posting the full patch each time, or incremental patches? -- Jon fallocate-v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera wrote: > Jon Nelson escribió: >> On Wed, May 15, 2013 at 4:46 PM, Jon Nelson >> wrote: > >> > That's true. I originally wrote the patch using fallocate(2). What >> > would be appropriate here? Should I switch on the return value and the >> > six (6) or so relevant error codes? >> >> I addressed this, hopefully in a reasonable way. > > Would it work to just assign the value you got from posix_fallocate (if > nonzero) to errno and then use %m in the errmsg() call in ereport()? That strikes me as a better way. I'll work something up soon. Thanks! -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
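Alvaro's suggestion amounts to something like the following. This is a hypothetical sketch, not the actual patch: fallocate_set_errno is a made-up name, and the backend's ereport()/%m call is only referenced in a comment since it cannot run outside PostgreSQL:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>

/* posix_fallocate returns its error code directly rather than setting
 * errno. Copying a nonzero return value into errno lets an existing
 * %m-based message (e.g. ereport(... errmsg("...: %m"))) work unchanged. */
static int fallocate_set_errno(int fd, off_t offset, off_t len)
{
    int rc = posix_fallocate(fd, offset, len);

    if (rc != 0)
        errno = rc;
    return rc;
}
```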
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Wed, May 15, 2013 at 4:46 PM, Jon Nelson wrote: > On Wed, May 15, 2013 at 4:34 PM, Andres Freund wrote: .. >> Some where quick comments, without thinking about this: > > Thank you for the kind feedback. > >> * needs a configure check for posix_fallocate. The current version will >> e.g. fail to compile on windows or many other non linux systems. Check >> how its done for posix_fadvise. The following patch includes the changes to configure.in. I had to make other changes (not included here) because my local system uses autoconf 2.69, but I did test this successfully. > That's true. I originally wrote the patch using fallocate(2). What > would be appropriate here? Should I switch on the return value and the > six (6) or so relevant error codes? I addressed this, hopefully in a reasonable way. -- Jon fallocate.patch-v2 Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Wed, May 15, 2013 at 4:34 PM, Andres Freund wrote: > Hi, > > On 2013-05-15 16:26:15 -0500, Jon Nelson wrote: >> >> I have written up a patch to use posix_fallocate in new WAL file >> >> creation, including configuration by way of a GUC variable, but I've >> >> not contributed to the PostgreSQL project before. Therefore, I'm >> >> fairly certain the patch is not formatted properly or conforms to the >> >> appropriate style guides. Currently, the patch is based on 9.2, and is >> >> quite small in size - 3.6KiB. >> >> I have re-based and reformatted the code, and basic testing shows a >> reduction in WAL-file creation time of a fairly significant amount. >> I ran 'make test' and did additional local testing without issue. >> Therefore, I am attaching the patch. I will try to add it to the >> commitfest page. > > Some where quick comments, without thinking about this: Thank you for the kind feedback. > * needs a configure check for posix_fallocate. The current version will > e.g. fail to compile on windows or many other non linux systems. Check > how its done for posix_fadvise. I will address as soon as I am able. > * Is wal file creation performance actually relevant? Is the performance > of a system running on fallocate()d wal files any different? In my limited testing, I noticed a drop of approx. 100ms per WAL file. I do not have a good idea for how to really stress the WAL-file creation area without calling pg_start_backup and pg_stop_backup over and over (with archiving enabled). However, a file allocated with fallocate is (supposed to be) less fragmented than one created by the traditional means. > * According to the man page posix_fallocate doesn't set errno but rather > returns the error code. That's true. I originally wrote the patch using fallocate(2). What would be appropriate here? Should I switch on the return value and the six (6) or so relevant error codes? > * I wonder whether we ever want to actually disable this? 
Afair the libc > contains emulation for posix_fadvise if the filesystem doesn't support > it. I know that glibc does, but I don't know about other libc implementations. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Tue, May 14, 2013 at 9:43 PM, Robert Haas wrote: > On Mon, May 13, 2013 at 9:54 PM, Jon Nelson wrote: >> Pertinent to another thread titled >> [HACKERS] corrupt pages detected by enabling checksums >> I hope to explore the possibility of using fallocate (or >> posix_fallocate) for new WAL file creation. >> >> Most modern Linux filesystems support fast fallocate/posix_fallocate, >> reducing extent fragmentation (where extents are used) and frequently >> offering a pretty significant speed improvement. In my tests, using >> posix_fallocate (followed by pg_fsync) is at least 28 times quicker >> than using the current method (which writes zeroes followed by >> pg_fsync). >> >> I have written up a patch to use posix_fallocate in new WAL file >> creation, including configuration by way of a GUC variable, but I've >> not contributed to the PostgreSQL project before. Therefore, I'm >> fairly certain the patch is not formatted properly or conforms to the >> appropriate style guides. Currently, the patch is based on 9.2, and is >> quite small in size - 3.6KiB. I have re-based and reformatted the code, and basic testing shows a reduction in WAL-file creation time of a fairly significant amount. I ran 'make test' and did additional local testing without issue. Therefore, I am attaching the patch. I will try to add it to the commitfest page. -- Jon 0001-enhance-GUC-and-xlog-with-wal_use_fallocate-boolean-.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
Pertinent to another thread titled [HACKERS] corrupt pages detected by enabling checksums I hope to explore the possibility of using fallocate (or posix_fallocate) for new WAL file creation. Most modern Linux filesystems support fast fallocate/posix_fallocate, reducing extent fragmentation (where extents are used) and frequently offering a pretty significant speed improvement. In my tests, using posix_fallocate (followed by pg_fsync) is at least 28 times quicker than using the current method (which writes zeroes followed by pg_fsync). I have written up a patch to use posix_fallocate in new WAL file creation, including configuration by way of a GUC variable, but I've not contributed to the PostgreSQL project before. Therefore, I'm fairly certain the patch is not formatted properly or conforms to the appropriate style guides. Currently, the patch is based on 9.2, and is quite small in size - 3.6KiB. Advice on how to proceed is appreciated. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
On Mon, May 13, 2013 at 8:32 AM, Andres Freund wrote: > On 2013-05-12 19:41:26 -0500, Jon Nelson wrote: >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby wrote: >> > On 5/10/13 1:06 PM, Jeff Janes wrote: >> >> >> >> Of course the paranoid DBA could turn off restart_after_crash and do a >> >> manual investigation on every crash, but in that case the database would >> >> refuse to restart even in the case where it perfectly clear that all the >> >> following WAL belongs to the recycled file and not the current file. >> > >> > >> > Perhaps we should also allow for zeroing out WAL files before reuse (or >> > just >> > disable reuse). I know there's a performance hit there, but the reuse idea >> > happened before we had bgWriter. Theoretically the overhead creating a new >> > file would always fall to bgWriter and therefore not be a big deal. >> >> For filesystems like btrfs, re-using a WAL file is suboptimal to >> simply creating a new one and removing the old one when it's no longer >> required. Using fallocate (or posix_fallocate) (I have a patch for >> that!) to create a new one is - by my tests - 28 times faster than the >> currently-used method. > > I don't think the comparison between just fallocate()ing and what we > currently do is fair. fallocate() doesn't guarantee that the file is the > same size after a crash, so you would still need an fsync() or we > couldn't use fdatasync() anymore. And I'd guess the benefits aren't all > that big anymore in that case? fallocate (16MB) + fsync is still almost certainly faster than write+write+write... + fsync. The test I performed at the time did exactly that .. posix_fallocate + pg_fsync. > That said, using posix_fallocate seems like a good idea in lots of > places inside pg, its just not all that easy to do in some of the > places. I should not derail this thread any further. Perhaps, if interested parties would like to discuss the use of fallocate/posix_fallocate, a new thread might be more appropriate? 
-- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
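The fallocate-plus-fsync pairing discussed above, where the fsync makes the new file length durable since fallocate alone does not guarantee it after a crash, looks roughly like this (create_prealloc_file is a hypothetical helper, not code from the patch):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Reserve 'size' bytes in one call, then fsync so the new length and
 * metadata reach disk. Returns 0 on success, an errno value on failure. */
static int create_prealloc_file(const char *path, off_t size)
{
    int rc;
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
        return errno;
    rc = posix_fallocate(fd, 0, size);
    if (rc == 0 && fsync(fd) != 0)
        rc = errno;
    close(fd);
    return rc;
}
```

Even with the fsync included, this is still one allocation call instead of thousands of 8kB writes.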
Re: [HACKERS] corrupt pages detected by enabling checksums
On Mon, May 13, 2013 at 7:49 AM, k...@rice.edu wrote: > On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote: >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby wrote: >> > On 5/10/13 1:06 PM, Jeff Janes wrote: >> >> >> >> Of course the paranoid DBA could turn off restart_after_crash and do a >> >> manual investigation on every crash, but in that case the database would >> >> refuse to restart even in the case where it perfectly clear that all the >> >> following WAL belongs to the recycled file and not the current file. >> > >> > >> > Perhaps we should also allow for zeroing out WAL files before reuse (or >> > just >> > disable reuse). I know there's a performance hit there, but the reuse idea >> > happened before we had bgWriter. Theoretically the overhead creating a new >> > file would always fall to bgWriter and therefore not be a big deal. >> >> For filesystems like btrfs, re-using a WAL file is suboptimal to >> simply creating a new one and removing the old one when it's no longer >> required. Using fallocate (or posix_fallocate) (I have a patch for >> that!) to create a new one is - by my tests - 28 times faster than the >> currently-used method. >> > What about for less cutting (bleeding) edge filesystems? The patch would be applicable for any filesystem which implements the fallocate/posix_fallocate calls in an efficient fashion. xfs and ext4 would both work, if I recall properly. I'm certain there are others, and the technique would probably work on other operating systems like the *BSDs, etc.. Additionally, it's improbable that there would be a performance hit for other filesystems versus simply writing zeroes, since that's the approach that is taken anyway as a fallback. Another win is reduction in fragmentation. I would love to be able to turn off WAL recycling to perform more useful testing. 
-- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] corrupt pages detected by enabling checksums
On Sun, May 12, 2013 at 3:46 PM, Jim Nasby wrote: > On 5/10/13 1:06 PM, Jeff Janes wrote: >> >> Of course the paranoid DBA could turn off restart_after_crash and do a >> manual investigation on every crash, but in that case the database would >> refuse to restart even in the case where it perfectly clear that all the >> following WAL belongs to the recycled file and not the current file. > > > Perhaps we should also allow for zeroing out WAL files before reuse (or just > disable reuse). I know there's a performance hit there, but the reuse idea > happened before we had bgWriter. Theoretically the overhead creating a new > file would always fall to bgWriter and therefore not be a big deal. For filesystems like btrfs, re-using a WAL file is suboptimal to simply creating a new one and removing the old one when it's no longer required. Using fallocate (or posix_fallocate) (I have a patch for that!) to create a new one is - by my tests - 28 times faster than the currently-used method. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Considering Gerrit for CFs
On Fri, Feb 8, 2013 at 1:43 PM, Phil Sorber wrote: > On Fri, Feb 8, 2013 at 10:20 AM, Peter Eisentraut wrote: >> On 2/8/13 5:23 AM, Magnus Hagander wrote: >>> But do you have any actual proof that the problem is in "we >>> lose reviewers because we're relying on email"? >> >> Here is one: Me. >> >> Just yesterday I downloaded a piece of software that was previously >> unknown to me from GitHub and found a bug. Within 15 minutes or so I >> had fixed the bug, made a fork, sent a pull request. Today I read, the >> fix was merged last night, and I'm happy. >> >> How would this go with PostgreSQL? You can use the bug form on the web >> site, but you can't attach any code, so the bug will just linger and >> ultimately put more burden on a core contributor to deal with the >> minutiae of developing, testing, and committing a trivial fix and >> sending feedback to the submitter. Or the user could take the high road >> and develop a patch and submit it. Just make sure it's in context >> diff format! Search the wiki if you don't know how to do that! Send it >> to -hackers, your email will be held for moderation. We won't actually >> do anything with your patch, but we will tell you to add it to that >> commitfest app over there. You need to sign up for an account to use >> that. We will deal with your patch in one or two months. But only if >> you review another patch. And you should sign up for that other mailing >> list, to make sure you're doing it right. Chances are, the first review >> you're going to get is that your patch doesn't apply anymore, by >> which time you will have lost interest in the patch anyway. > > This. This times 1000. I, too, could not agree more. > I'm not sure if Gerrit specifically is the answer, but there are > definitely better ways to do code review like this.
> I really like the way github allows you to post a patch and then have
> conversation around it, offer comments on specific lines of code, and add
> updates to the patch all in one interface. Another benefit is that a lot more
> people are familiar and comfortable with this work flow. There are
> even some open source work-a-likes that we could use so we don't
> have to rely on a 3rd party like github. Gerrit seems to do it slightly
> differently with side by side diffs and patch revisions, but either
> way would be an improvement.

Please take this for what it's worth - I'm not a code reviewer or committer - just a pretty heavy user, and I lurk on (most?) of the mailing lists. Mostly I find bugs and ask others to fix them, since I lack the necessary intimate knowledge of postgresql internals to produce a meaningful patch. That said, I believe that - from my perspective - postgresql's interaction with its *large* community would only be improved by using something like github. I am far more likely to try to introduce a new feature, minor bugfix, code improvement, et cetera when using github than I would be if the interaction starts with a post to a mailing list and at least /looks/ like it might involve rather more than that. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] question: foreign key constraints and AccessExclusive locks
On Sun, Jan 6, 2013 at 4:14 AM, Simon Riggs wrote: > On 6 January 2013 03:08, Jon Nelson wrote: >> When adding a foreign key constraint on tableA which references >> tableB, why is an AccessExclusive lock on tableB necessary? Wouldn't a >> lock that prevents writes be sufficient, or does PostgreSQL have to >> modify *both* tables in some fashion? I'm using PostgreSQL 8.4 on >> Linux. > > FKs are enforced by triggers currently. Adding triggers requires > AccessExclusiveLock because of catalog visibility issues; you are > right that a lower lock is eventually possible. > > SQLStandard requires the check to be symmetrical, so adding FKs > requires a trigger on each table and so an AEL is placed on tableB. I've read and re-read this a few times, and I think I understand. However, could you clarify "you are right that a lower lock is eventually possible" for me, please? -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] question: foreign key constraints and AccessExclusive locks
When adding a foreign key constraint on tableA which references tableB, why is an AccessExclusive lock on tableB necessary? Wouldn't a lock that prevents writes be sufficient, or does PostgreSQL have to modify *both* tables in some fashion? I'm using PostgreSQL 8.4 on Linux. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] xmalloc => pg_malloc
On Wed, Oct 3, 2012 at 11:36 PM, Tom Lane wrote: > Peter Eisentraut writes: >> xmalloc, xstrdup, etc. are pretty common names for functions that do >> alloc-or-die (another possible naming scheme ;-) ). The naming >> pg_malloc etc. on the other hand suggests that the allocation is being >> done in a PostgreSQL-specific way, and anyway sounds too close to >> palloc. > >> So I'd be more in favor of xmalloc <= pg_malloc. > > Meh. The fact that other people use that name is not really an > advantage from where I sit. I'm concerned about possible name > collisions, eg in libraries loaded into the backend. > > There are probably not any actual risks of collision right now, given > that all these functions are currently in our client-side programs --- > but it's foreseeable that we might use this same naming convention in > more-exposed places in future. In fact, somebody was already proposing > creating such functions in the core backend. > > But having said that, I'm not absolutely wedded to these names; they > were just the majority of existing cases. Why not split the difference and use pg_xmalloc? As in: "PostgreSQL-special malloc that dies on failure." -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
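For reference, the alloc-or-die wrapper being named in this thread is only a few lines. The sketch below uses pg_xmalloc, the compromise name proposed in this message; it is not an actual PostgreSQL symbol:

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate memory or exit: the "alloc-or-die" pattern under discussion.
 * Client programs use such wrappers so they need not check every call. */
static void *pg_xmalloc(size_t size)
{
    void *p = malloc(size);

    if (p == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    return p;
}
```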
Re: [HACKERS] Posix Shared Mem patch
On Thu, Jun 28, 2012 at 8:57 AM, Robert Haas wrote: > On Thu, Jun 28, 2012 at 9:47 AM, Jon Nelson wrote: >> Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)? I >> seem to think that's what I did when I needed this functionality oh so >> many moons ago. > > From the reading I've done on this topic, that seems to be a trick > invented on Solaris that is considered grotty and awful by everyone > else. The thing is that you want the mapping to be shared with the > processes that inherit the mapping from you. You do *NOT* want the > mapping to be shared with EVERYONE who has mapped that file for any > reason, which is the usual meaning of MAP_SHARED on a file. Maybe > this happens to work correctly on some or all platforms, but I would > want to have some convincing evidence that it's more widely supported > (with the correct semantics) than MAP_ANON before relying on it. When I did this (I admit, it was on Linux but it was a long time ago) only the inherited file descriptor + mmap structure mattered - modifications were private to the process and its children - other apps always saw their "own" /dev/zero. A quick google suggests that - according to qnx, sco, and some others - mmap'ing /dev/zero retains the expected privacy. Given how /dev/zero works I'd be very surprised if it was otherwise. I would love to see links that suggest that /dev/zero is nasty (or, in fact, in any way fundamentally different than mmap'ing /dev/zero) - feel free to send them to me privately to avoid polluting the list. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
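The /dev/zero trick and its fork-inheritance behavior can be demonstrated with a short sketch (Linux-style semantics assumed; both helper names are made up). Each open of /dev/zero gets its own backing pages, so unrelated processes mapping /dev/zero see only their own zeroes, while children that inherit the mapping across fork() share it:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* The pre-MAP_ANONYMOUS trick: a MAP_SHARED mapping of /dev/zero. */
static void *map_dev_zero_shared(size_t len)
{
    void *p;
    int fd = open("/dev/zero", O_RDWR);

    if (fd < 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping survives the close */
    return (p == MAP_FAILED) ? NULL : p;
}

/* Show that a forked child shares the mapping with its parent. */
static int shared_across_fork(void)
{
    volatile int *p = map_dev_zero_shared(4096);
    pid_t pid;
    int status;

    if (p == NULL || p[0] != 0)
        return 0;
    pid = fork();
    if (pid < 0)
        return 0;
    if (pid == 0) {
        p[0] = 42;                  /* child writes ... */
        _exit(0);
    }
    waitpid(pid, &status, 0);
    return p[0] == 42;              /* ... and the parent sees it */
}
```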
Re: [HACKERS] Posix Shared Mem patch
On Thu, Jun 28, 2012 at 6:05 AM, Magnus Hagander wrote: > On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas wrote: >> On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane wrote: >>> Robert Haas writes: On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane wrote: > Would Posix shmem help with that at all? Why did you choose not to > use the Posix API, anyway? >>> It seemed more complicated. If we use the POSIX API, we've got to have code to find a non-colliding name for the shm, and we've got to arrange to clean it up at process exit. Anonymous shm doesn't require a name and goes away automatically when it's no longer in use. >>> >>> I see. Those are pretty good reasons ... >> >> So, should we do it this way? >> >> I did a little research and discovered that Linux 2.3.51 (released >> 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS. >> That combination is documented to work beginning in Linux 2.4.0. How >> worried should we be about people trying to run PostgreSQL 9.3 on >> pre-2.4 kernels? If we want to worry about it, we could try mapping a >> one-page shared MAP_SHARED|MAP_ANONYMOUS segment first. If that >> works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS >> facility and try to allocate the whole segment plus a minimal sysv >> shm. If the single page allocation fails with EINVAL, we could fall >> back to allocating the entire segment as sysv shm. Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)? I seem to think that's what I did when I needed this functionality oh so many moons ago. -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
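The probe Robert describes, trying a one-page anonymous shared mapping first and falling back to pure SysV shm if the kernel rejects the combination (EINVAL on pre-2.4 Linux), might look like the following illustrative sketch (not the actual patch):

```c
#define _DEFAULT_SOURCE             /* exposes MAP_ANONYMOUS on glibc */
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if MAP_SHARED|MAP_ANONYMOUS works on this kernel, else 0.
 * On 0, the caller would allocate the entire segment as SysV shm. */
static int anon_shared_mmap_works(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, (size_t) pagesz, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return 0;
    munmap(p, (size_t) pagesz);
    return 1;
}
```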