[HACKERS] minor spelling error fix (btis -> bits)

2017-03-23 Thread Jon Nelson
Attached please find a minor spelling error fix, changing "btis" to "bits".




-- 
Jon
From f590a6dce6677bc5b8a409d40fd651ecb69b27bb Mon Sep 17 00:00:00 2001
From: Jon Nelson <jnel...@jamponi.net>
Date: Sun, 23 Mar 2014 08:23:48 -0500
Subject: [PATCH] - fix minor spelling error

---
 src/backend/utils/adt/network.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index 1f8469a..b126494 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -1136,7 +1136,7 @@ network_scan_first(Datum in)
 
 /*
  * return "last" IP on a given network. It's the broadcast address,
- * however, masklen has to be set to its max btis, since
+ * however, masklen has to be set to its max bits, since
  * 192.168.0.255/24 is considered less than 192.168.0.255/32
  *
  * inet_set_masklen() hacked to max out the masklength to 128 for IPv6
-- 
2.10.2
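For context, the corrected comment relies on how inet values order: address first, then masklen, which is why the "last" IP must carry the maximum masklen. A Python sketch of that rule (helper names are mine, not PostgreSQL's):

```python
import ipaddress

def inet_sort_key(addr, masklen):
    """Sketch of inet ordering: compare the address first, masklen second."""
    return (int(ipaddress.ip_address(addr)), masklen)

def network_last(cidr):
    """'Last' IP on a network: the broadcast address with masklen maxed out."""
    net = ipaddress.ip_network(cidr, strict=False)
    max_bits = net.max_prefixlen  # 32 for IPv4, 128 for IPv6
    return (str(net.broadcast_address), max_bits)

# 192.168.0.255/24 sorts below 192.168.0.255/32, hence the max-bits rule.
```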


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

2017-03-16 Thread Jon Nelson
On Thu, Mar 16, 2017 at 9:59 AM, David Steele <da...@pgmasters.net> wrote:

> On 1/9/17 11:33 PM, Jon Nelson wrote:
> >
> > On Sat, Jan 7, 2017 at 7:48 PM, Jim Nasby <jim.na...@bluetreble.com
> > <mailto:jim.na...@bluetreble.com>> wrote:
> >
> > On 1/5/17 12:55 PM, Jonathon Nelson wrote:
> >
> > Attached please find a patch for PostgreSQL 9.4 which changes the
> > maximum amount of data that the wal sender will send at any point in
> > time from the hard-coded value of 128KiB to a user-controllable value
> > up to 16MiB. It has been primarily tested under 9.4 but there has
> > been some testing with 9.5.
> >
> >
> > To make sure this doesn't get lost, please add it to
> > https://commitfest.postgresql.org
> > <https://commitfest.postgresql.org>. Please verify the patch will
> > apply against current HEAD and pass make check-world.
> >
> >
> > Attached please find a revision of the patch, changed in the following
> ways:
> >
> > 1. removed a call to debug2.
> > 2. applies cleanly against master (as of
> > 8c5722948e831c1862a39da2bb5d793a6f2aabab)
> > 3. one small indentation fix, one small verbiage fix.
> > 4. switched to calculating the upper bound using XLOG_SEG_SIZE rather
> > than hard-coding 16384.
> > 5. the git author is - obviously - different.
> >
> > make check-world passes.
> > I have added it to the commitfest.
> > I have verified with strace that up to 16MB sends are being used.
> > I have verified that the GUC properly grumps about values greater than
> > XLOG_SEG_SIZE / 1024 or smaller than 4.
>
> This patch applies cleanly on cccbdde and compiles.  However,
> documentation in config.sgml is needed.
>
> The concept is simple enough though there seems to be some argument
> about whether or not the patch is necessary.  In my experience 128K
> should be more than large enough for a chunk size, but I'll buy the
> argument that libpq is acting as a barrier in this case.
> I'm marking this patch "Waiting on Author" for required documentation.
>

Thank you for testing and the comments.  I have some updates:

- I set up a network at home and - in some very quick testing - was unable
to observe any obvious performance difference regardless of chunk size
- Before I could get any real testing done, one of the machines I was using
for testing died and won't even POST, which has put a damper on said
testing (as you might imagine).
- There is a small issue with the patch: a lower bound of 4 is not
appropriate; it should be XLOG_BLCKSZ / 1024. (I can submit an updated
patch if that is appropriate.)
- I am, at this time, unable to replicate the earlier results however I
can't rule them out, either.
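For concreteness, a sketch of the corrected bounds, assuming the common build defaults XLOG_BLCKSZ = 8192 and XLOG_SEG_SIZE = 16MB (both are compile-time options, so the real limits vary by build):

```python
# Assumed default build-time values; both are configurable at compile time.
XLOG_BLCKSZ = 8192                # WAL block size in bytes
XLOG_SEG_SIZE = 16 * 1024 * 1024  # WAL segment size in bytes

MIN_WAL_SEND_KB = XLOG_BLCKSZ // 1024    # proposed lower bound, not 4
MAX_WAL_SEND_KB = XLOG_SEG_SIZE // 1024  # upper bound used by the patch

def valid_max_wal_send(kb):
    """Sketch of the GUC range check: would this setting be accepted?"""
    return MIN_WAL_SEND_KB <= kb <= MAX_WAL_SEND_KB
```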


--
Jon


Re: [HACKERS] Faster methods for getting SPI results

2017-03-02 Thread Jon Nelson
On Thu, Mar 2, 2017 at 10:03 AM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 12/20/16 23:14, Jim Nasby wrote:
> > I've been looking at the performance of SPI calls within plpython.
> > There's a roughly 1.5x difference from equivalent python code just in
> > pulling data out of the SPI tuplestore. Some of that is due to an
> > inefficiency in how plpython is creating result dictionaries, but fixing
> > that is ultimately a dead-end: if you're dealing with a lot of results
> > in python, you want a tuple of arrays, not an array of tuples.
>
> There is nothing that requires us to materialize the results into an
> actual list of actual rows.  We could wrap the SPI_tuptable into a
> Python object and implement __getitem__ or __iter__ to emulate sequence
> or mapping access.
>

Python objects have a small (but non-zero) overhead in terms of both memory
and speed. A built-in dictionary is probably one of the least-expensive
(memory/cpu) choices, although how the dictionary is constructed also
impacts performance.  Another choice is a tuple.

Avoiding Py_BuildValue(...) in exchange for a bit more complexity (via
PyTuple_New(..) and PyTuple_SetItem(...)) is also a nice performance win in
my experience.
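For illustration only (plain Python rather than the C API, and with hypothetical column names), the two result shapes under discussion look like this:

```python
# Hypothetical result set: rows as a tuplestore would yield them.
rows = [(1, "a"), (2, "b"), (3, "c")]

# Row-oriented: a list of per-row dicts (the shape plpython builds today).
as_dicts = [{"id": i, "val": v} for i, v in rows]

# Column-oriented: one tuple per column ("tuple of arrays"), the cheaper
# shape when the consumer wants whole columns at once.
as_columns = tuple(zip(*rows))
```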

-- 
Jon


Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

2017-01-09 Thread Jon Nelson
On Sat, Jan 7, 2017 at 7:48 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:

> On 1/5/17 12:55 PM, Jonathon Nelson wrote:
>
>> Attached please find a patch for PostgreSQL 9.4 which changes the
>> maximum amount of data that the wal sender will send at any point in
>> time from the hard-coded value of 128KiB to a user-controllable value up
>> to 16MiB. It has been primarily tested under 9.4 but there has been some
>> testing with 9.5.
>>
>
> To make sure this doesn't get lost, please add it to
> https://commitfest.postgresql.org. Please verify the patch will apply
> against current HEAD and pass make check-world.


Attached please find a revision of the patch, changed in the following ways:

1. removed a call to debug2.
2. applies cleanly against master (as of 8c5722948e831c1862a39da2bb5d79
3a6f2aabab)
3. one small indentation fix, one small verbiage fix.
4. switched to calculating the upper bound using XLOG_SEG_SIZE rather than
hard-coding 16384.
5. the git author is - obviously - different.

make check-world passes.
I have added it to the commitfest.
I have verified with strace that up to 16MB sends are being used.
I have verified that the GUC properly grumps about values greater than
XLOG_SEG_SIZE / 1024 or smaller than 4.

-- 
Jon
From 7286e290daec32e12260e9e1e44a490c13ed2245 Mon Sep 17 00:00:00 2001
From: Jon Nelson <jnel...@jamponi.net>
Date: Mon, 9 Jan 2017 20:00:41 -0600
Subject: [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

---
 src/backend/replication/walsender.c | 13 +++++++------
 src/backend/utils/misc/guc.c        | 12 ++++++++++++
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f3082c3..4a9eb16 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -91,7 +91,7 @@
  * because signals are checked only between messages.  128kB (with
  * default 8k blocks) seems like a reasonable guess for now.
  */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
+int	max_wal_send_guc = 0;
 
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
@@ -2194,7 +2194,7 @@ retry:
 /*
  * Send out the WAL in its normal physical/stored form.
  *
- * Read up to MAX_SEND_SIZE bytes of WAL that's been flushed to disk,
+ * Read up to max_wal_send kilobytes of WAL that's been flushed to disk,
  * but not yet sent to the client, and buffer it in the libpq output
  * buffer.
  *
@@ -2208,6 +2208,7 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	int max_wal_send = max_wal_send_guc * 1024;
 
 	if (streamingDoneSending)
 	{
@@ -2346,8 +2347,8 @@ XLogSendPhysical(void)
 
 	/*
 	 * Figure out how much to send in one message. If there's no more than
-	 * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
-	 * MAX_SEND_SIZE bytes, but round back to logfile or page boundary.
+	 * max_wal_send bytes to send, send everything. Otherwise send
+	 * max_wal_send bytes, but round back to logfile or page boundary.
 	 *
 	 * The rounding is not only for performance reasons. Walreceiver relies on
 	 * the fact that we never split a WAL record across two messages. Since a
@@ -2357,7 +2358,7 @@ XLogSendPhysical(void)
 	 */
 	startptr = sentPtr;
 	endptr = startptr;
-	endptr += MAX_SEND_SIZE;
+	endptr += max_wal_send;
 
 	/* if we went beyond SendRqstPtr, back off */
 	if (SendRqstPtr <= endptr)
@@ -2376,7 +2377,7 @@ XLogSendPhysical(void)
 	}
 
 	nbytes = endptr - startptr;
-	Assert(nbytes <= MAX_SEND_SIZE);
+	Assert(nbytes <= max_wal_send);
 
 	/*
 	 * OK to read and send the slice.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5b23dbf..b9a0c47 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -112,6 +112,7 @@ extern char *default_tablespace;
 extern char *temp_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool synchronize_seqscans;
+extern int  max_wal_send_guc;
 
 #ifdef TRACE_SYNCSCAN
 extern bool trace_syncscan;
@@ -2342,6 +2343,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"max_wal_send", PGC_SIGHUP, REPLICATION_SENDING,
+			gettext_noop("Sets the maximum WAL payload size for WAL replication."),
+			NULL,
+			GUC_UNIT_KB
+		},
+		&max_wal_send_guc,
+		128, 4, XLOG_SEG_SIZE / 1024,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"commit_delay", PGC_SUSET, WAL_SETTINGS,
 			gettext_noop("Sets the delay in microseconds between transaction commit and "
 		 "flushing WAL to disk."),
-- 
2.10.2




Re: [HACKERS] macaddr 64 bit (EUI-64) datatype support

2016-11-22 Thread Jon Nelson
On Tue, Nov 22, 2016 at 8:42 AM, Tom Lane  wrote:

> Haribabu Kommi  writes:
> > Any suggestions for the name to be used for the new datatype the can
> > work for both 48 and 64 bit MAC addresses?
>
> The precedent of int4/int8/float4/float8 is that SQL data types should
> be named after their length in bytes.  So I'd be inclined to call this
> "macaddr8" not "macaddr64".  That would suggest taking the simple
> approach of always storing values in the 8-byte format, rather than
> dealing with the complexities of having two formats internally, two
> display formats, etc.
>

Be that as it may, almost everybody else (outside the db world?) uses
bits.  The C types, for example, are expressed in bits (int8_t, int64_t,
etc.).

> > While comparing a 48 bit MAC address with 64 bit MAC address, Ignore
> > the two bytes if the contents in those bytes are reserved bytes.
>
> Um ... I don't follow.  Surely these must compare different:
>
> 01-01-01-FF-FE-01-01-01
> 01-01-01-FF-0E-01-01-01
>

What's more, it now requires 2 comparisons and some logic versus the
possibility of a single memcmp.
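Tom's example can be checked with a quick sketch: store both addresses in the canonical 8-byte form and compare bytewise (the analog of a single memcmp). The helper below is mine, not PostgreSQL code:

```python
def macaddr8(s):
    """Parse 'AA-BB-...' into a canonical 8-byte value (illustrative only)."""
    return bytes(int(part, 16) for part in s.split("-"))

a = macaddr8("01-01-01-FF-FE-01-01-01")
b = macaddr8("01-01-01-FF-0E-01-01-01")
# A single bytewise comparison settles both equality and ordering;
# no special-casing of "reserved" bytes is needed.
```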

-- 
Jon


Re: [HACKERS] Change pg_cancel_*() to ignore current backend

2015-05-20 Thread Jon Nelson
On May 20, 2015 6:43 AM, David Steele da...@pgmasters.net wrote:

 On 5/20/15 1:40 AM, Jim Nasby wrote:
  On 5/19/15 9:19 PM, Fabrízio de Royes Mello wrote:
  We could add a second parameter to the current functions:
  allow_own_pid DEFAULT false. To me that seems better than an
  entirely separate set of functions.
 
 
  +1 to add a second parameter to current functions.
 
  Instead of allow_own_pid, I went with skip_own_pid. I have the function
  still returning true even when it skips it's own PID... that seems a bit
  weird, but I think it's better than returning false. Unless someone
  thinks it should return NULL, but I don't see that as any better either.

 +1.  I agree that cancelling/killing your own process should not be the
 default behavior.

-1.  It breaks backwards compatibility. I use this function a fair bit to
terminate the current backend.  Changing the signature is fine, but make
the default behavior retain the current behavior.


 --
 - David Steele
 da...@pgmasters.net



Re: [HACKERS] Change pg_cancel_*() to ignore current backend

2015-05-20 Thread Jon Nelson
On Wed, May 20, 2015 at 9:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 I think backwards compatibility probably trumps that argument.  I have
 no objection to providing a different call that behaves this way, but
 changing the behavior of existing applications will face a *much*
 higher barrier to acceptance.  Especially since a real use-case for
 the current behavior was shown upthread, which means you can't argue
 that it's simply a bug.

 regards, tom lane


Agreed. It breaks backwards compatibility; I use this function a fair
bit to terminate the current backend.

-- 
Jon




[HACKERS] CTE that result in repeated sorting of the data

2014-05-15 Thread Jon Nelson
I was watching a very large recursive CTE get built today. This CTE
involves on the order of a dozen or so loops joining the initial
table against existing tables. It struck me that every time through
the loop the tables were sorted and then joined, and that it would be
much more efficient if the tables remained in a sorted state and could
avoid being re-sorted each time through the loop. Am I missing
something here? I am using PG 8.4, if that matters.

-- 
Jon




Re: [HACKERS] CTE that result in repeated sorting of the data

2014-05-15 Thread Jon Nelson
On Thu, May 15, 2014 at 4:50 PM, David G Johnston
david.g.johns...@gmail.com wrote:
 Jon Nelson-14 wrote
 I was watching a very large recursive CTE get built today and this CTE
 involves on the order of a dozen or so loops joining the initial
 table against existing tables. It struck me that - every time through
 the loop the tables were sorted and then joined and that it would be
 much more efficient if the tables remained in a sorted state and could
 avoid being re-sorted each time through the loop. Am I missing
 something here? I am using PG 8.4 if that matters.

 I'm not sure what you mean by watching but maybe this is a simple as
 changing your CTE to use UNION ALL instead of UNION [DISTINCT]?

In fact, I'm using UNION ALL.

 If you really think it could be improved upon maybe you can help and provide
 a minimal self-contained example query and data that exhibits the behavior
 you describe so others can see it and test changes?  It would be nice to
 know if other versions than one that is basically no longer supported
 exhibits the same behavior.

Pretty much any CTE that looks like this:

with recursive cte AS (
  select stuff from A
  UNION ALL
  select more_stuff from B, cte WHERE join conditions
) SELECT * FROM cte;

*and* where the planner chooses to join B and cte by sorting and doing
a merge join.

I'll see if I can come up with a self-contained example.
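In the meantime, the avoidable work can be illustrated with a Python sketch of the iteration pattern (not planner code; function names are mine). One version re-sorts the accumulated working set on every pass, the other keeps it sorted and merges in each sorted delta:

```python
import heapq

def fixpoint_resort(batches):
    """Analog of the observed plan: re-sort the working set each pass."""
    working = []
    for batch in batches:
        working = sorted(working + batch)  # O(n log n) on every iteration
    return working

def fixpoint_merge(batches):
    """Keep the working set sorted; merge in each sorted batch instead."""
    working = []
    for batch in batches:
        working = list(heapq.merge(working, sorted(batch)))  # linear merge
    return working
```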


-- 
Jon




Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-02-04 Thread Jon Nelson
What, if anything, do I need to do to get this into the next commitfest?




Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-26 Thread Jon Nelson
What are my next steps here?

I believe the concept is sound, the code appears to work and
doesn't crash, and the results show a performance win in most
cases (sometimes a big win).  It's also incomplete, at least insofar
as it doesn't interface with the cost model at all, etc.

-- 
Jon




Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-22 Thread Jon Nelson
On Wed, Jan 22, 2014 at 3:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeremy Harris j...@wizmail.org writes:
 On 22/01/14 03:53, Tom Lane wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 - in createplan.c, eliding duplicate tuples is enabled if we are
 creating a unique plan which involves sorting first

 [ raised eyebrow ... ]  And what happens if the planner drops the
 unique step and then the sort doesn't actually go to disk?

 I don't think Jon was suggesting that the planner drop the unique step.

 Hm, OK, maybe I misread what he said there.  Still, if we've told
 tuplesort to remove duplicates, why shouldn't we expect it to have
 done the job?  Passing the data through a useless Unique step is
 not especially cheap.

That's correct - I do not propose to drop the unique step. Duplicates
are only dropped if it's convenient to do so. In one case, it's a
zero-cost drop (no extra comparison is made). In most other cases, an
extra comparison is made, typically right before writing a tuple to
tape. If it compares as identical to the previously-written tuple,
it's thrown out instead of being written.

The output of the modified code is still sorted, still *might* (and in
most cases, probably will) contain duplicates, but will (probably)
contain fewer duplicates.

-- 
Jon




[HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-21 Thread Jon Nelson
Greetings -hackers:

I have worked up a patch to PostgreSQL which elides tuples during an
external sort. The primary use case is when sorted input is being used
to feed a DISTINCT operation. The idea is to throw out tuples that
compare as identical whenever it's convenient, predicated on the
assumption that even a single I/O is more expensive than some number
of (potentially extra) comparisons.  Obviously, this is where a cost
model comes in, which has not been implemented. This patch is a
work-in-progress.

Being very new to hacking PostgreSQL, I've probably done a bunch of
things wrong, but I've tried to test it thoroughly.

A rough summary of the patch follows:

- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
  distinct columns is equal to the number of sort columns (and both are
  greater than zero).
- in createplan.c, eliding duplicate tuples is enabled if we are
  creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
  + a new macro, DISCARDTUP and a new structure member, discardtup, are
both defined and operate similar to COMPARETUP, COPYTUP, etc...
  + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
throw out the new tuple if it compares as identical to the tuple at
the top of the heap. Since we're already performing this comparison,
this is essentially free.
  + in mergeonerun, we may discard a tuple if it compares as identical
to the *last written tuple*. This is a comparison that did not take
place before, so it's not free, but it saves a write I/O.
  + We perform the same logic in dumptuples
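The core elision step (the DISCARDTUP idea as applied in mergeonerun and dumptuples) can be sketched in Python. This is an illustration of the logic only, not the tuplesort.c code:

```python
def dump_run(sorted_tuples):
    """Write out a sorted run, eliding any tuple equal to the last one
    written (sketch of DISCARDTUP). The output stays sorted and may still
    contain duplicates across runs, just fewer of them."""
    run = []
    last = object()  # sentinel that never compares equal to a real tuple
    for tup in sorted_tuples:
        if tup == last:
            continue  # a would-be write I/O saved
        run.append(tup)
        last = tup
    return run
```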

I have tested this patch with a bunch of different data, and the
results are summarized below.

Test 1:
I used a corpus of approximately 1.5TB of textual data, consisting of
7.4 billion input rows, 12.6% of which is unique. This is a 'real-world'
test.

Before the patch: 44.25 hours using 274GB of temporary files
After the patch: 36.44 hours using 125GB of temporary files
Difference: -7.6 hours, 18% faster

Test 2:
Acquire the unique set of strings from a totally random set that
contains no duplicates. This is a 'worst case'.

Input: 700 million rows, 32GB.
Before: 18,796 seconds.
After: 19,358 seconds
Difference: +562 seconds, *slower* by 3%.
In this particular case, the performance varies from faster to slower, and I'm
not sure why. Statistical noise?

Test 3:
Acquire the unique set of integers from a set of 100 million, all
happen to have the value '1'. This is a 'best case'.

Input: 100 million rows, 3.4GB on-disk
Before: 26.1 seconds using 1.3GB of temporary files
After: 8.9 seconds using 8KB of disk
Difference: -17.2 seconds, 65% faster

Test 4:
Similar to test 3, but using 1 billion rows.
Before: 1450 seconds, 13 GB of temporary files
After: 468 seconds, 8 KB of temporary files
Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumping about
changes in pg_settings).

Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of
early September 2013.  I have rebased the patch against
PostgreSQL 9.4 HEAD as of 19 January 2014, and re-run tests
2, 3, and 4 with similar results.

Regarding the patch: I've tried to conform to the indenting and style
rules, including using pg_bsd_indent and pg_indent, but I've not had
100% success.

The details on test 4:

A simple example, using 1 billion rows all with the value '1' (integer):
This is a fabricated *best case*.

postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
 QUERY PLAN
-
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual
time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=10 width=4)
(actual time=468188.897..468188.897 rows=1 loops=1)
 Output: i
 Sort Key: i.i
 Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00
rows=10 width=4) (actual time=34.765..254595.990
rows=10 loops=1)
   Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
 QUERY PLAN
-
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual
time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   -  

Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-21 Thread Jon Nelson
On Tue, Jan 21, 2014 at 9:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 A rough summary of the patch follows:

 - a GUC variable enables or disables this capability
 - in nodeAgg.c, eliding duplicate tuples is enabled if the number of
   distinct columns is equal to the number of sort columns (and both are
   greater than zero).
 - in createplan.c, eliding duplicate tuples is enabled if we are
   creating a unique plan which involves sorting first
 - ditto planner.c
 - all of the remaining changes are in tuplesort.c, which consist of:
   + a new macro, DISCARDTUP and a new structure member, discardtup, are
 both defined and operate similar to COMPARETUP, COPYTUP, etc...
   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
 throw out the new tuple if it compares as identical to the tuple at
 the top of the heap. Since we're already performing this comparison,
 this is essentially free.
   + in mergeonerun, we may discard a tuple if it compares as identical
 to the *last written tuple*. This is a comparison that did not take
 place before, so it's not free, but it saves a write I/O.
   + We perform the same logic in dumptuples

 [ raised eyebrow ... ]  And what happens if the planner drops the
 unique step and then the sort doesn't actually go to disk?

I'm not familiar enough with the code to be able to answer your
question with any sort of authority, but I believe that if the state
TSS_BUILDRUNS is never hit, then basically nothing new happens.

-- 
Jon




Re: [HACKERS] 9.4 regression

2013-10-24 Thread Jon Nelson
On Thu, Oct 24, 2013 at 5:41 AM, Thom Brown t...@linux.com wrote:
 On 5 September 2013 22:24, Bruce Momjian br...@momjian.us wrote:
 On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  I vote for adapting the patch to additionally zero out the file via
  write(). In your tests that seemed to perform at least as good as the
  old method... It also has the advantage that we can use it a littlebit
  more as a testbed for possibly using it for heap extensions one day.
  We're pretty early in the cycle, so I am not worried about this too 
  much...

 I dunno, I'm pretty disappointed that this doesn't actually improve
 things.  Just following this casually, it looks like it might be some
 kind of locking issue in the kernel that's causing it to be slower; or
 at least some code path that isn't exercise terribly much and therefore
 hasn't been given the love that it should.

 Definitely interested in what Ts'o says, but if we can't figure out why
 it's slower *without* writing out the zeros, I'd say we punt on this
 until Linux and the other OS folks improve the situation.

 FYI, the patch has been reverted.

 Is there an updated patch available for this?  And did anyone hear from Ts'o?

After the patch was reverted, it was not re-submitted. I have tried 3
or 4 times to get more info out of Ts'o, without luck.

-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-19 Thread Jon Nelson
On Fri, Aug 16, 2013 at 3:57 PM, Bruce Momjian br...@momjian.us wrote:
 On Thu, Aug 15, 2013 at 12:08:57PM -0500, Jon Nelson wrote:
  Where are we on this issue?

 I've been able to replicate it pretty easily with PostgreSQL and
 continue to look into it. I've contacted Theodore Ts'o and have gotten
 some useful information, however I'm unable to replicate the behavior
 with the test program (even one that's been modified). What I've
 learned is:

 - XLogWrite appears to take approx. 2.5 times longer when writing to a
 file allocated with posix_fallocate, but only the first time the file
 contents are overwritten. This is partially explained by how ext4
 handles extents and uninitialized data, but 2.5x is MUCH more
 expensive than anticipated or expected here.
 - Writing zeroes to a file allocated with posix_fallocate (essentially
 adding a posix_fallocate step before the usual write-zeroes-in-a-loop
 approach) not only doesn't seem to hurt performance, it seems to help
 or at least have parity, *and* the space is guaranteed to exist on
 disk. At the very least that seems useful.

 Is it time to revert this patch until we know more?

While I'm not qualified to say, my inclination is to say yes. It can
always be added back later. The only caveat is that a small
modification of the patch might be warranted. Specifically, with
posix_fallocate, I saw no undesirable behavior when the (newly
allocated) file was manually zeroed anyway.
advantages (that I can see) to doing it this way versus not using
posix_fallocate at all is (a) a potential reduction in the number of
extents and (b) the space is guaranteed to be on disk if
posix_fallocate succeeds. My reading of the patch is that even if
posix_fallocate fails due to out of space conditions, we will still
try to create the file by writing out zeroes, so perhaps the
out-of-disk-space scenario isn't all that useful anyway.
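The fallocate-then-zero variant described above can be sketched in Python, where os.posix_fallocate is the direct analog (Linux-specific; the 16MB segment size is an assumed default and the function name is mine):

```python
import os

XLOG_SEG_SIZE = 16 * 1024 * 1024  # assumed 16MB WAL segment size
BLOCK = 8192                      # assumed WAL block size

def allocate_wal_segment(path, size=XLOG_SEG_SIZE):
    """Sketch of the variant discussed above: reserve the space up front
    with posix_fallocate, then zero-fill as before so the extents are
    initialized and the space is guaranteed to exist on disk."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # space guaranteed (or it raises)
        zeros = bytes(BLOCK)
        for _ in range(size // BLOCK):   # the usual write-zeroes loop
            os.write(fd, zeros)
        os.fsync(fd)
    finally:
        os.close(fd)
```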

I'm awaiting more information from Theodore Ts'o, but I don't expect
things to materially change in the near future.


-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-15 Thread Jon Nelson
 Where are we on this issue?

I've been able to replicate it pretty easily with PostgreSQL and
continue to look into it. I've contacted Theodore Ts'o and have gotten
some useful information, however I'm unable to replicate the behavior
with the test program (even one that's been modified). What I've
learned is:

- XLogWrite appears to take approx. 2.5 times longer when writing to a
file allocated with posix_fallocate, but only the first time the file
contents are overwritten. This is partially explained by how ext4
handles extents and uninitialized data, but 2.5x is MUCH more
expensive than anticipated or expected here.
- Writing zeroes to a file allocated with posix_fallocate (essentially
adding a posix_fallocate step before the usual write-zeroes-in-a-loop
approach) not only doesn't seem to hurt performance, it seems to help
or at least have parity, *and* the space is guaranteed to exist on
disk. At the very least that seems useful.


-- 
Jon




[HACKERS] pg_ctl initdb takes options, but pg_ctl --help doesn't document them?

2013-08-15 Thread Jon Nelson
Taking a look at PostgreSQL HEAD today, I noticed that pg_ctl --help
shows that pg_ctl initdb takes OPTIONS but doesn't describe them
(unlike for start and others).

Is this intentional?


-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-08 Thread Jon Nelson
A follow-up.
I found this thread that seems to explain some things:

http://comments.gmane.org/gmane.comp.file-systems.ext4/33024

Short version: if we are writing into the middle of the
newly-fallocated file on ext4 (with extents) the extent tree can
fragment badly, causing poor performance due to extent merging.

I also ran some tests with xfs, and the results were even stranger:

xfs performed very slightly better with fallocate vs. without.
xfs (fallocate) 216 tps vs. (no fallocate) 206
ext4 (no fallocate) 605 vs (fallocate) 134.

I made an ext4 filesystem without extents using the same block device
as above - a real, single hard drive with nothing else on it.
On this filesystem, the performance remained the same, or perhaps
improved very slightly: 633 tps (with fallocate) vs. 607.


Using the following additions to postgresql.conf:

max_connections = 500
shared_buffers = 1GB
effective_cache_size = 1GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
commit_delay = 50
commit_siblings = 15
checkpoint_segments = 256
log_checkpoints = on

and this partial script for testing:

pg_ctl initdb -D tt -w
cp ~/postgresql.conf tt
pg_ctl -D tt -w start
createdb -p 54320 pgb
pgbench -s 20 -p 54320 -d pgb -i
pgbench -j 80 -c 80 -T 120 -p 54320 pgb
pg_ctl -D tt -w stop


-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-08 Thread Jon Nelson
On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 08/08/2013 05:28 PM, Jon Nelson wrote:
...

 Just an idea - can you check if using a fillfactor different form 100
 changes anything

 pgbench -s 20 -p 54320 -d pgb -F 90 -i


 pgbench -j 80 -c 80 -T 120 -p 54320 pgb
 pg_ctl -D tt -w stop

 That is, does extending tables and indexes to add updated tuples play
 any role here

I gave it a go - it didn't make any difference at all.
At this point I'm convinced that the issue is a pathological case in
ext4. The performance impact disappears as soon as the unwritten
extent(s) are written to with real data. Thus, even though allocating
files with posix_fallocate is - frequently - orders of magnitude
quicker than doing it with write(2), the subsequent re-write can be
more expensive.  At least, that's what I'm gathering from the various
threads.  Why this issue didn't crop up in earlier testing and why I
can't seem to make test_fallocate do it (even when I modify
test_fallocate to write to the newly-allocated file in a mostly-random
fashion) has me baffled. Should this feature be reconsidered?


-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-08 Thread Jon Nelson
On Thu, Aug 8, 2013 at 4:24 PM, Bruce Momjian br...@momjian.us wrote:
 On Thu, Aug  8, 2013 at 04:12:06PM -0500, Jon Nelson wrote:
 On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
  On 08/08/2013 05:28 PM, Jon Nelson wrote:
 ...

  Just an idea - can you check if using a fillfactor different form 100
  changes anything
 
  pgbench -s 20 -p 54320 -d pgb -F 90 -i
 
 
  pgbench -j 80 -c 80 -T 120 -p 54320 pgb
  pg_ctl -D tt -w stop
 
  That is, does extending tables and indexes to add updated tuples play
  any role here

 I gave it a go - it didn't make any difference at all.
 At this point I'm convinced that the issue is a pathological case in
 ext4. The performance impact disappears as soon as the unwritten
 extent(s) are written to with real data. Thus, even though allocating
 files with posix_fallocate is - frequently - orders of magnitude
 quicker than doing it with write(2), the subsequent re-write can be
 more expensive.  At least, that's what I'm gathering from the various
 threads.  Why this issue didn't crop up in earlier testing and why I
 can't seem to make test_fallocate do it (even when I modify
 test_fallocate to write to the newly-allocated file in a mostly-random
 fashion) has me baffled. Should this feature be reconsidered?

 How much slower would it be if we wrote it with zeros after
 posix_fallocate() --- that would still give use single extents.  Has
 anyone tested to see if the write without test_fallocate() still gives
 us one extent?

Actually, I did that test - I found that there was no longer a
performance drop and possibly a slight performance benefit.
I couldn't quite parse your last sentence; I assume you mean: does
skipping posix_fallocate altogether still produce one extent?
Sadly, I can't really test that properly as the ext4 filesystem I'm
testing on is quite fresh and has lots and lots of free space.


-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-08 Thread Jon Nelson
On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 At this point I'm convinced that the issue is a pathological case in
 ext4. The performance impact disappears as soon as the unwritten
 extent(s) are written to with real data. Thus, even though allocating
 files with posix_fallocate is - frequently - orders of magnitude
 quicker than doing it with write(2), the subsequent re-write can be
 more expensive.  At least, that's what I'm gathering from the various
 threads.  Why this issue didn't crop up in earlier testing and why I
 can't seem to make test_fallocate do it (even when I modify
 test_fallocate to write to the newly-allocated file in a mostly-random
 fashion) has me baffled.

 Does your test program use all the same writing options that the real
 WAL writes do (like O_DIRECT)?

I believe so.

From xlog.c:

/* do not use get_sync_bit() here --- want to fsync only at end of fill */
fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
   S_IRUSR | S_IWUSR);

and from the test program:

fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

PG_BINARY expands to 0 on non-Windows.  I also tried using O_WRONLY in
xlog.c without change.


-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-08 Thread Jon Nelson
On Thu, Aug 8, 2013 at 5:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Does your test program use all the same writing options that the real
 WAL writes do (like O_DIRECT)?

 I believe so.

 From xlog.c:

 /* do not use get_sync_bit() here --- want to fsync only at end of fill 
 */
 fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
S_IRUSR | S_IWUSR);

 and from the test program:

 fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

 Maybe I misunderstood, but I thought the performance complaint had to do
 with the actual writes of WAL data, not with the pre-fill.  That is, you
 should not just be measuring how long the pre-fill takes, but what is the
 speed of real writes to the file later on (which will be using
 get_sync_bit, for various values of the sync settings).

Ah, no, I misunderstood your question.
I'm fairly certain the test program doesn't open up files with any
sort of sync bit set.
I'll have to see what PostgreSQL is using, exactly, and get back to you.



-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-08 Thread Jon Nelson
On Thu, Aug 8, 2013 at 5:25 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Thu, Aug 8, 2013 at 5:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Does your test program use all the same writing options that the real
 WAL writes do (like O_DIRECT)?

 I believe so.

 From xlog.c:

 /* do not use get_sync_bit() here --- want to fsync only at end of fill 
 */
 fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
S_IRUSR | S_IWUSR);

 and from the test program:

 fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

 Maybe I misunderstood, but I thought the performance complaint had to do
 with the actual writes of WAL data, not with the pre-fill.  That is, you
 should not just be measuring how long the pre-fill takes, but what is the
 speed of real writes to the file later on (which will be using
 get_sync_bit, for various values of the sync settings).

 Ah, no, I misunderstood your question.
 I'm fairly certain the test program doesn't open up files with any
 sort of sync. bit set.
 I'll have to see what PostgreSQL is using, exactly, and get back to you.

My reading of xlog.c seems to indicate that if the sync mode is
fdatasync, then the file is not opened with any special sync bit, but
that an fdatasync occurs at the end.  That is also what the test
program does.


-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-08 Thread Jon Nelson
On Thu, Aug 8, 2013 at 9:27 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-08-08 16:12:06 -0500, Jon Nelson wrote:
...

 At this point I'm convinced that the issue is a pathological case in
 ext4. The performance impact disappears as soon as the unwritten
 extent(s) are written to with real data. Thus, even though allocating
 files with posix_fallocate is - frequently - orders of magnitude
 quicker than doing it with write(2), the subsequent re-write can be
 more expensive.  At least, that's what I'm gathering from the various
 threads.


  Why this issue didn't crop up in earlier testing and why I
 can't seem to make test_fallocate do it (even when I modify
 test_fallocate to write to the newly-allocated file in a mostly-random
 fashion) has me baffled.

 It might be kernel version specific and concurrency seems to play a
 role. If you reproduce the problem, could you run a perf record -ga to
 collect a systemwide profile?

Finally, an excuse to learn how to use 'perf'! I'll try to provide
that info when I am able.

 There's some more things to test:
 - is the slowdown dependent on the scale? I.e is it visible with -j 1 -c
   1?

scale=1 (-j 1 -c 1):
with fallocate: 685 tps
without: 727

scale=20
with fallocate: 129
without: 402

scale=40
with fallocate: 163
without: 511

 - Does it also occur in synchronous_commit=off configurations? Those
   don't fdatasync() from so many backends, that might play a role.

With synchronous_commit=off, the performance is vastly improved.
Interestingly, the fallocate case is (immaterially) faster than the
non-fallocate case: 3766 tps vs 3700 tps.

I tried a few other wal_sync_methods besides the default of fdatasync,
all with scale=80.

fsync:
198 tps (with fallocate) vs 187.

open_sync:
195 tps (with fallocate) vs. 192

 - do bulkloads see it? E.g. the initial pgbench load?

time pgbench -s 200 -p 54320 -d pgb -i

with fallocate: 2m47s
without: 2m50s

Hopefully the above is useful.

-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-07 Thread Jon Nelson
On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown t...@linux.com wrote:
 Hi all,

 I recently tried a simple benchmark to see how far 9.4 had come since
 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
 performance.  After checking 9.2 and 9.3 (as per Kevin Grittner's
 suggestion), I found that those were fine, so the issue must be in
 9.4devel.  I used identical configurations for each test, and used
 9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
  The common config changes were:

...

 8.4: 812.482108
 9.4 HEAD: 355.397658
 9.4 e5592c (9th July): 356.485625
 9.4 537227 (7th July): 365.992518
 9.4 9b2543 (7th July): 362.587339
 9.4 269e78 (5th July): 359.439143
 9.4 8800d8 (5th July): 821.933082
 9.4 568d41 (2nd July): 822.991160

 269e78 was the commit immediately after 8800d8, so it appears that
 introduced the regression.

 Use posix_fallocate() for new WAL files, where available.

 Ironically, that was intended to be a performance improvement.

Would it be possible for you to download, compile, and run the test
program as described and located in this email:

http://www.postgresql.org/message-id/cakuk5j1acml-1cgbhnrzed-vh4og+8hkmfhy2eca-8jbj-6...@mail.gmail.com

I also wonder if there is a problem with the 3.8.0 kernel specifically.

-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-07 Thread Jon Nelson
On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown t...@linux.com wrote:
 On 7 August 2013 17:54, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown t...@linux.com wrote:
 Hi all,

 I recently tried a simple benchmark to see how far 9.4 had come since
 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
 performance.  After checking 9.2 and 9.3 (as per Kevin Grittner's
 suggestion), I found that those were fine, so the issue must be in
 9.4devel.  I used identical configurations for each test, and used
 9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
  The common config changes were:

 ...

 8.4: 812.482108
 9.4 HEAD: 355.397658
 9.4 e5592c (9th July): 356.485625
 9.4 537227 (7th July): 365.992518
 9.4 9b2543 (7th July): 362.587339
 9.4 269e78 (5th July): 359.439143
 9.4 8800d8 (5th July): 821.933082
 9.4 568d41 (2nd July): 822.991160

 269e78 was the commit immediately after 8800d8, so it appears that
 introduced the regression.

 Use posix_fallocate() for new WAL files, where available.

 Ironically, that was intended to be a performance improvement.

 Would it be possible for you to download, compile, and run the test
 program as described and located in this email:

 http://www.postgresql.org/message-id/cakuk5j1acml-1cgbhnrzed-vh4og+8hkmfhy2eca-8jbj-6...@mail.gmail.com

 I shall do after the 2 x 1 hour benchmarks are complete.

 I also wonder if there is a problem with the 3.8.0 kernel specifically.

 Well my laptop has the same kernel (and also 64-bit Linux Mint), so
 took 3 quick sample benchmarks on those two commits, and I get the
 following (all without --enable-cassert):

 269e78: 1162.593665 / 1158.644302 / 1153.955611
 8800d8: 2446.087618 / 2443.797252 / 2321.876431

 And running your test program gives the following (again, just on my laptop):

 for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
 method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
 method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
 method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
 method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
 method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
 method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
 method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
 method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
 method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
 method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
 method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
 method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
 method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
 method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
 method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s

OK. That's interesting, and a good data point.

One thing you could try is manually disabling the use of posix_fallocate in 269e78.
After running ./configure --stuff-here,
edit src/include/pg_config.h and comment out the following line (on or
around line 374):
#define HAVE_POSIX_FALLOCATE 1

*then* build postgresql and see if the performance hit is still there.

-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-07 Thread Jon Nelson
On Wed, Aug 7, 2013 at 4:18 PM, Kevin Grittner kgri...@ymail.com wrote:
 Thom Brown t...@linux.com wrote:

 pgbench -j 80 -c 80 -T 3600

 269e78: 606.268013
 8800d8: 779.583129

I have also been running some tests and - as yet - they are
inconclusive. What I can say about them so far is that - at times -
they agree with these results and do show a performance hit.

I took the settings as posted and adjusted them slightly for my much
lower-powered personal laptop, changing checkpoint_completion_target
to 1.0 and checkpoint_timeout to 1min.

I am testing with the latest git HEAD, turning fallocate support on
and off by editing xlog.c directly. Furthermore, before each run I
would try to reduce the number of existing WAL files by making a pre
run with checkpoint_segments = 3 before changing it to 32.

For reasons I don't entirely understand, when WAL files are being
created (rather than recycled) I found a performance hit. When they
were being recycled, I saw a slight performance gain (this may be due
to reduced fragmentation), but the gain was never more than 10% and
frequently less than that.

I can't explain - yet (if ever!) - why my initial tests (and many of
those done subsequently) showed improvement - it may very well have
something to do with how the code interacts with these settings.



-- 
Jon




Re: [HACKERS] 9.4 regression

2013-08-07 Thread Jon Nelson
On Wed, Aug 7, 2013 at 10:05 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-08-07 20:23:55 +0100, Thom Brown wrote:
  269e78 was the commit immediately after 8800d8, so it appears that
  introduced the regression.
 
  Use posix_fallocate() for new WAL files, where available.
 
  This is curious. Could you either run a longer test before/after the
  commit or reduce checkpoint_timeout to something like 3min?
 
  Okay, I'll rerun the test for both those commits at 1 hour each with
  checkpoint_timeout set at 3mins, but with all other configuration
  settings the same

 Results
 (checkpoint_timeout = 3min)

 pgbench -j 80 -c 80 -T 3600

 269e78: 606.268013
 8800d8: 779.583129

 Ok, so the performance difference is lower. But, it seems to be still to
 high to be explaineable by performance differences during
 initialization/fallocate.
 In a very quick test, I don't see the same on my laptop. I am currently
 travelling and I don't have convenient access to anything else.

 Could you:
 - run filefrag on a couple of wal segments of both clusters after the
   test and post the results here?

For me, there is no meaningful difference, but I have a relatively
fresh filesystem (ext4) with lots of free space.

 - enable log_checkpoints, post the lines spat out together with the results
 - run pgbench without reinitializing the cluster/pgbench tables
   inbetween?

When I do this (as I note below), the performance difference (for me)
disappears.

 Basically, I'd like to know whether its the initialization that's slow
 (measurable because we're regularly creating new files because of a too
 low checkpoint_segments) or whether it's the actual writes.

What I've found so far is very confusing.
I start by using initdb to initialize a temporary cluster, copy in a
postgresql.conf (with the modifcations from Thom Brown tweaked for my
small laptop), start the cluster, create a test database, initialize
it with pgbench, and then run. I'm also only running for two minutes
at this time.

Every time I test, the non-fallocate version is faster. I modified
xlog.c to use posix_fallocate (or not) based on an environment
variable.
Once the WAL files have been rewritten at least once, then it doesn't
seem to matter which method is used to allocate them (although
posix_fallocate seems to have a slight edge). I even went so far as to
modify the code to posix_fallocate the file *and then zero it out
anyway*, and it was almost as fast as zeroing alone, and faster than
using posix_fallocate alone.

 Jon, here are the test results you asked for:

 $ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
 method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
 method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
 method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
 method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
 method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
 method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
 method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
 method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
 method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
 method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
 method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
 method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
 method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
 method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
 method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s

 Ok, this seems to indicate that fallocate works nicely. Jon, wasn't
 there a version of the test that rewrote the file afterwards?

Yes. If you use a different number besides '1' as the third argument
in the command line above, it will rewrite the file that many times.

-- 
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-05 Thread Jon Nelson
On Fri, Jul 5, 2013 at 2:23 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 7/5/13 2:50 AM, Jeff Davis wrote:

 So, my simple conclusion is that glibc emulation should be about the
 same as what we're doing now, so there's no reason to avoid it. That
 means, if posix_fallocate() is present, we should use it, because it's
 either the same (if emulated in glibc) or significantly faster (if
 implemented in the kernel).


 That's what I'm seeing everywhere too.  I'm happy that we've spent enough
 time chasing after potential issues without finding anything now.  Pull out
 the GUC that was added for default and this is ready to commit.

Wonderful. Is removing the GUC something that I should do or should
that be done by somebody that knows more about what they are doing? (I
am happy to give it a go!)

Should the small test program that I made also be included somewhere
in the source tree?

--
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Jon Nelson
On Sun, Jun 30, 2013 at 11:52 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 6/30/13 9:28 PM, Jon Nelson wrote:

 The performance of the latter (new) test sometimes seems to perform
 worse and sometimes seems to perform better (usually worse) than
 either of the other two. In all cases, posix_fallocate performs
 better, but I don't have a sufficiently old kernel to test with.


 This updated test program looks reliable now.  The numbers are very tight
 when I'd expect them to be, and there's nowhere with the huge differences I
 saw in the earlier test program.

 Here's results from a few sets of popular older platforms:

If you found yourself with a spare moment, could you run these again
with the number of open/close cycles set high (say, 100) and the
number of rewrites set to 0 and also to 1? Most of the time is
actually spent overwriting the files, so by reducing or eliminating
that aspect it might be easier to get a handle on the actual
performance difference.



--
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Jon Nelson
On Sun, Jun 30, 2013 at 5:55 PM, Greg Smith g...@2ndquadrant.com wrote:


 pwrite(4, \0, 1, 16769023)= 1
 pwrite(4, \0, 1, 16773119)= 1
 pwrite(4, \0, 1, 16777215)= 1

 That's glibc helpfully converting your call to posix_fallocate into small
 writes, because the OS doesn't provide a better way in that kernel.  It's
 not hard to imagine this being slower than what the WAL code is doing right
 now.  I'm not worried about correctness issues anymore, but my gut paranoia
 about this not working as expected on older systems was justified.  Everyone
 who thought I was just whining owes me a cookie.

I had noted in the very early part of the thread that glibc emulates
posix_fallocate when the (Linux-specific) 'fallocate' system call
fails. In this case, it's writing a single zero byte at the end of
each 4 KiB block (as the strace above shows), essentially seeking
forward 4095 bytes between writes. This prevents files with holes in
them, because a hole has to be at least 4 KiB in size, if I recall
properly. It's *not* writing out the full 16 MiB in tiny increments.

--
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Jon Nelson
On Sun, Jun 30, 2013 at 6:49 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 5/28/13 10:00 PM, Jon Nelson wrote:

 A note: The attached test program uses *fsync* instead of *fdatasync*
 after calling fallocate (or writing out 16MB of zeroes), per an
 earlier suggestion.


 I tried this out on the RHEL5 platform I'm worried about now.  There's
 something weird about the test program there.  If I run it once it shows
 posix_fallocate running much faster:

 without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
 with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

Assuming the platform chosen is using the glibc approach of one
single-byte pwrite every 4 KiB, the results ought to be similar, and
I'm at a loss to explain why it's performing better (unless - grasping
at straws - simply the *volume* of data transferred from userspace to
the kernel is at play, in which case posix_fallocate will result in
4096 calls to pwrite at one byte each versus 2048 calls to write at
8 KiB each). Ultimately the same amount of data gets written to disk
(one would imagine), but otherwise I can't really think of much.

I have also found several errors in the test_fallocate.c program I
posted, corrected below.
One of them: it is missing two pairs of parentheses around two #defines:

#define SIXTEENMB 1024*1024*16
#define EIGHTKB 1024*8

should be:

#define SIXTEENMB (1024*1024*16)
#define EIGHTKB (1024*8)

Otherwise the program will end up writing (131072) 8KiB blocks instead of 2048.

This actually makes the comparison between writing 8KiB blocks and
using posix_fallocate favor the latter more strongly in the results
(also seen below).
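
The precedence trap can be checked directly (a standalone illustration, not part of test_fallocate.c; the _BAD names and helper functions are made up):

```c
/* As originally posted: no parentheses around the expansions. */
#define SIXTEENMB_BAD 1024*1024*16
#define EIGHTKB_BAD   1024*8
/* Corrected versions. */
#define SIXTEENMB     (1024*1024*16)
#define EIGHTKB       (1024*8)

/* SIXTEENMB_BAD / EIGHTKB_BAD expands textually to 1024*1024*16/1024*8,
 * which groups left-to-right as ((1024*1024*16)/1024)*8 = 131072,
 * not the intended 16 MiB / 8 KiB = 2048 blocks. */
int blocks_bad(void)  { return SIXTEENMB_BAD / EIGHTKB_BAD; }
int blocks_good(void) { return SIXTEENMB / EIGHTKB; }
```

That factor-of-64 difference is exactly the 131072-vs-2048 block count described above.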

 The problem is that I'm seeing the gap between the two get smaller the more
 iterations I run, which makes me wonder if the test is completely fair:

 without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
 with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s


 without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
 with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s

 without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
 with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s

 You didn't show any output from the latest program on your system, so I'm
 not sure how it behaved for you here.

On the platform I use - openSUSE (12.3, x86_64, kernel 3.9.7
currently) - I never see posix_fallocate perform worse. Typically
better, sometimes much better.


To set the number of times the file is overwritten to just 1 (one):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done

I am including a revised version of test_fallocate.c that corrects the
above noted error, one typo (from when I changed fdatasync to fsync)
that did not alter program behavior, corrects a mis-placed
gettimeofday (which does change the results) and includes a new test
that aims (perhaps poorly) to emulate the glibc style of pwrite(4
bytes) for every 4KiB, and tests the resulting file size to make sure
it is 16MiB in size.

The performance of the latter (new) test sometimes seems to perform
worse and sometimes seems to perform better (usually worse) than
either of the other two. In all cases, posix_fallocate performs
better, but I don't have a sufficiently old kernel to test with.

The new results on one machine are below.

With 0 (zero) rewrites (testing *just*
open/some_type_of_allocation/fsync/close):

method: classic. 100 open/close iterations, 0 rewrite in 29.6060s
method: posix_fallocate. 100 open/close iterations, 0 rewrite in 2.1054s
method: glibc emulation. 100 open/close iterations, 0 rewrite in 31.7445s

And with the same number of rewrites as open/close cycles:

method: classic. 1 open/close iterations, 1 rewrite in 0.6297s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3028s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.5521s

method: classic. 2 open/close iterations, 2 rewrite in 1.6455s
method: posix_fallocate. 2 open/close iterations, 2 rewrite in 1.0409s
method: glibc emulation. 2 open/close iterations, 2 rewrite in 1.5604s

method: classic. 5 open/close iterations, 5 rewrite in 7.5916s
method: posix_fallocate. 5 open/close iterations, 5 rewrite in 6.9177s
method: glibc emulation. 5 open/close iterations, 5 rewrite in 8.1137s

method: classic. 10 open/close iterations, 10 rewrite in 29.2816s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 28.4400s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 31.2693s



--
Jon
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

#define SIXTEENMB (1024*1024*16)
#define FOURKB (1024*4)
#define EIGHTKB (1024*8)

void writeout(int fd, char *buf)
{
	int i;
	for (i = 0; i < SIXTEENMB / EIGHTKB; ++i) {
		if (write(fd, buf, EIGHTKB) != EIGHTKB) {
			fprintf(stderr, "Error in write: %m!\n");
			exit(1);
		}
	}
}

Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-16 Thread Jon Nelson
On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis pg...@j-davis.com wrote:
 On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
 Ack.  I've revised the patch to always have the GUC (for now), default
 to false, and if configure can't find posix_fallocate (or the user
 disables it by way of pg_config_manual.h) then it remains a GUC that
 simply can't be changed.

 Why have a GUC here at all? Perhaps this was already discussed, and I
 missed it? Is it just for testing purposes, or did you intend for it to
 be in the final version?

As Greg Smith noted, right now it's only for testing purposes.

 * The other code assumes that no errno means ENOSPC. We should be
 consistent about that assumption, and do the same thing in the fallocate
 case.

Unlike write(2), posix_fallocate does *not* set errno, but instead
returns a subset of the errno values as its return value. The current
approach was suggested to me by Andres Freund and Alvaro Herrera.
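
A sketch of that convention (illustrative only, not the patch's actual code; allocate_segment is a made-up wrapper):

```c
#include <errno.h>
#include <fcntl.h>

/* posix_fallocate() reports failure through its return value (a
 * subset of the errno codes); errno itself is left untouched.
 * Copying the return value into errno lets callers keep using the
 * usual %m / strerror(errno) reporting.  Returns 0 or -1. */
int allocate_segment(int fd, off_t size)
{
    int rc = posix_fallocate(fd, 0, size);

    if (rc != 0)
    {
        errno = rc;
        return -1;
    }
    return 0;
}
```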

 * You check for the presence of posix_fallocate at configure time, but
 don't #ifdef the call site. It looks like you removed this from the v2
 patch, was there a reason for that? Won't that cause build errors for
 platforms without it?

Indeed! I believe I have corrected the error and will be sending out a
revised patch (v5), once the performance and back-testing have
completed.

--
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-16 Thread Jon Nelson
On Sun, Jun 16, 2013 at 9:59 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis pg...@j-davis.com wrote:
 On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
..
 * You check for the presence of posix_fallocate at configure time, but
 don't #ifdef the call site. It looks like you removed this from the v2
 patch, was there a reason for that? Won't that cause build errors for
 platforms without it?

 Indeed! I believe I have corrected the error and will be sending out a
 revised patch (v5), once the performance and back-testing have
 completed.

Please find attached v5 of the patch, with the above correction in place.
The only change from the v4 patch is wrapping the if
(wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE.
Thanks!

--
Jon


fallocate-v5.patch
Description: Binary data



Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Jon Nelson
There hasn't been much activity here recently.  I'm curious, then, if
there are questions that I can answer.
It may be useful to summarize some things here:

- the purpose of the patch is to use posix_fallocate when creating new
WAL files, because it's (usually) much quicker
- using posix_fallocate is *one* system call versus 2048 calls to write(2)
- additionally, using posix_fallocate /guarantees/ that the filesystem
has space for the WAL file (by spec)
- reportedly (difficult to test or prove), using posix_fallocate *may*
reduce file fragmentation
- the (limited) testing I've done bears this out: the more new WAL
file creation there is, the greater the improvement. Once the number of
WAL files reaches a steady state, there does not appear to be either
a positive or a negative performance impact. This is as expected.
- a test program (C) was also written and used which creates,
allocates, and then writes to files as fast as possible. This test
program also shows the expected performance benefits.
- the performance benefits range from a few percent up to about 15 percent

Concerns:
- some were concerned that the spec makes no claims about
posix_fallocate being able to guarantee that the space allocated has
zeroes in it. This was discussed here and on the Linux Kernel mailing
list, wherein the expected behavior is that it does provide zeroes
- most systems don't allocate a great many new WAL files, so the
performance benefit is small
- your concern here

Benefits:
- new WAL file allocation is much quicker and more efficient (fewer system calls)
- the patch is (reportedly - I'm not a good judge here!) quite small


--
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Jon Nelson
On Tue, Jun 11, 2013 at 12:49 PM, Stefan Drees ste...@drees.name wrote:
 On 2013-06-11 19:45 CEST, Greg Smith wrote:

 On 6/11/13 12:22 PM, Merlin Moncure wrote:

 Personally I think this patch should go in regardless -- the concerns
 made IMNSHO are specious.


 That's nice, but we have this process for validating whether features go
 in or not that relies on review instead of opinions.

 ;-) that's why I played with the test_fallocate.c, as it was easy to do and
 I understood, the author (of the patch) wanted to trigger some reviews ... I
 do not (yet) know anything about the core codes, so I leave this to the
 hackers. My review result was, that with newer gcc's you should specify an
 open mode flag as third argument of the fopen call, as only with the test
 tool nothing important found.

If you change line 68 to read:

fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

then it should work just fine (assuming you have not already done so).


--
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Jon Nelson
On Tue, May 28, 2013 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, May 28, 2013 at 10:15 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
 On Sat, May 25, 2013 at 2:55 PM, Jon Nelson jnelson+pg...@jamponi.net 
 wrote:
  The biggest thing missing from this submission is information about what
  performance testing you did.  Ideally performance patches are submitted 
  with
  enough information for a reviewer to duplicate the same test the author 
  did,
  as well as hard before/after performance numbers from your test system.  
  It
  often turns tricky to duplicate a performance gain, and being able to run
  the same test used for initial development eliminates a lot of the 
  problems.
 
  This has been a bit of a struggle. While it's true that WAL file
  creation doesn't happen with great frequency, and while it's also true
  that - with strace and other tests - it can be proven that
  fallocate(16MB) is much quicker than writing it zeroes by hand,
  proving that in the larger context of a running install has been
  challenging.

 It's nice to be able to test things in the context of a running
 install, but sometimes a microbenchmark is just as good.  I mean, if
 posix_fallocate() is faster, then it's just faster, right?

 Well, it's a bit more complex than that. Fallocate doesn't actually
 initializes the disk space in most filesystems, just marks it as
 allocated and zeroed which is one of the reasons it can be noticeably
 faster. But that can make the runtime overhead of writing to those pages
 higher.

 Maybe it would be good to measure that impact.  Something like this:

 1. Write 16MB of zeroes to an empty file in the same size chunks we're
 currently using (8kB?).  Time that.  Rewrite the file with real data.
 Time that.
 2. posix_fallocate() an empty file out to 16MB.  Time that.  Rewrite
 the fie with real data.  Time that.

 Personally, I have trouble believing that writing 16MB of zeroes by
 hand is better than telling the OS to do it for us.  If that's so,
 the OS is just stupid, because it ought to be able to optimize the
 crap out of that compared to anything we can do.  Of course, it is
 more than possible that the OS is in fact stupid.  But I'd like to
 hope not.

I wrote a little C program to do something very similar to that (which
I'll hope to post later today).
It opens a new file, fallocates 16MB, calls fdatasync.  Then it loops
10 times:  seek to the start of the file, writes 16MB of ones, calls
fdatasync.
Then it closes and removes the file, re-opens it, and this time writes
out 16MB of zeroes, calls fdatasync, and then does the same loop as
above. The program times the process from file open to file unlink,
inclusive.

The results - for me - are pretty consistent: using fallocate is
12-13% quicker than writing out zeroes. I used fdatasync twice to
(attempt) to mimic what the WAL writer does.

--
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Jon Nelson
On Tue, May 28, 2013 at 10:36 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 5/28/13 11:12 AM, Jon Nelson wrote:

 It opens a new file, fallocates 16MB, calls fdatasync.


 Outside of the run for performance testing, I think it would be good at this
 point to validate that there is really a 16MB file full of zeroes resulting
 from these operations.  I am not really concerned that posix_fallocate might
 be slower in some cases; that seems unlikely.  I am concerned that it might
 result in a file that isn't structurally the same as the 16MB of zero writes
 implementation used now.

util-linux comes with fallocate which (might?) suffice for testing in
that respect, no?
If that is a real concern, it could be made part of the autoconf
testing, perhaps.
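
As a sketch (not part of the patch), the zero-content check Greg describes
could also be done directly in C: read an fallocated file back and confirm
every byte is zero, as the zero-write implementation guarantees. The helper
below is hypothetical.

```c
/* Sketch: verify that a posix_fallocate'd region reads back as zeroes,
 * matching what the current 16MB-of-zero-writes implementation produces.
 * Not patch code. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the first "len" bytes of a freshly fallocated temporary
 * file are all zero, 0 otherwise (or on error). */
static int
fallocated_file_is_zeroed(off_t len)
{
	char	path[] = "/tmp/zerocheck_XXXXXX";
	char	buf[8192];
	int		fd = mkstemp(path);
	off_t	seen = 0;
	int		ok = 1;
	ssize_t	n;

	if (fd < 0 || posix_fallocate(fd, 0, len) != 0)
		return 0;
	lseek(fd, 0, SEEK_SET);
	while (seen < len && (n = read(fd, buf, sizeof(buf))) > 0)
	{
		ssize_t i;

		for (i = 0; i < n; ++i)
			if (buf[i] != 0)
				ok = 0;
		seen += n;
	}
	if (seen != len)			/* a short file would also be a failure */
		ok = 0;
	close(fd);
	unlink(path);
	return ok;
}
```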

 The timing program you're writing has some aspects that are similar to the
 contrib/pg_test_fsync program.  You might borrow some code from there
 usefully.

Thanks! If it looks like what I'm attaching will not do, then I'll
look at that as a possible next step.

 To clarify the suggestion I was making before about including performance
 test results:  that doesn't necessarily mean the testing code must run using
 only the database.  That's better if possible, but as Robert says it may not
 be for some optimizations.  The important thing is to have something
 measuring the improvement that a reviewer can duplicate, and if that's a
 standalone benchmark problem that's still very useful.  The main thing I'm
 wary of is any this should be faster claims that don't come with any
 repeatable measurements at all.  Very often theories about the fastest way
 to do something don't match what's actually seen in testing.

Ack.
A note: The attached test program uses *fsync* instead of *fdatasync*
after calling fallocate (or writing out 16MB of zeroes), per an
earlier suggestion.

--
Jon
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>

#define SIXTEENMB (1024 * 1024 * 16)
#define EIGHTKB (1024 * 8)

/* Write 16MB to fd in 8kB chunks, mimicking the current zero-fill method. */
void writeout(int fd, char *buf)
{
	int i;
	for (i = 0; i < SIXTEENMB / EIGHTKB; ++i) {
		if (write(fd, buf, EIGHTKB) != EIGHTKB) {
			fprintf(stderr, "Error in write: %m!\n");
			exit(1);
		}
	}
}

int main(int argc, char *argv[])
{
	int with_fallocate, open_close_iterations, rewrite_iterations;
	int fd, i, j;
	double tt;
	struct timeval tv1, tv2;
	char *buf0, *buf1;
	const char *filename; /* for convenience */

	if (argc != 4) {
		fprintf(stderr, "Usage: %s <filename> <open/close iterations> <rewrite iterations>\n", argv[0]);
		exit(1);
	}

	filename = argv[1];

	open_close_iterations = atoi(argv[2]);
	if (open_close_iterations < 0) {
		fprintf(stderr, "Error parsing 'open_close_iterations'!\n");
		exit(1);
	}

	rewrite_iterations = atoi(argv[3]);
	if (rewrite_iterations < 0) {
		fprintf(stderr, "Error parsing 'rewrite_iterations'!\n");
		exit(1);
	}

	buf0 = malloc(SIXTEENMB);
	if (!buf0) {
		fprintf(stderr, "Unable to allocate memory!\n");
		exit(1);
	}
	memset(buf0, 0, SIXTEENMB);

	buf1 = malloc(SIXTEENMB);
	if (!buf1) {
		fprintf(stderr, "Unable to allocate memory!\n");
		exit(1);
	}
	memset(buf1, 1, SIXTEENMB);

	for (with_fallocate = 0; with_fallocate < 2; ++with_fallocate) {
		gettimeofday(&tv1, NULL);	/* time all iterations, not just the last */
		for (i = 0; i < open_close_iterations; ++i) {
			/* an explicit mode is required with O_CREAT */
			fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
			if (fd < 0) {
				fprintf(stderr, "Error opening file: %m\n");
				exit(1);
			}
			if (with_fallocate) {
				if (posix_fallocate(fd, 0, SIXTEENMB) != 0) {
					fprintf(stderr, "Error in posix_fallocate!\n");
					exit(1);
				}
			} else {
				writeout(fd, buf0);
			}
			if (fsync(fd)) {
				fprintf(stderr, "Error in fsync: %m!\n");
				exit(1);
			}
			for (j = 0; j < rewrite_iterations; ++j) {
				lseek(fd, 0, SEEK_SET);
				writeout(fd, buf1);
				if (fdatasync(fd)) {
					fprintf(stderr, "Error in fdatasync: %m!\n");
					exit(1);
				}
			}
			if (close(fd)) {
				fprintf(stderr, "Error in close: %m!\n");
				exit(1);
			}
			unlink(filename);		/* don't check for error */
		}
		gettimeofday(&tv2, NULL);
		tt = (tv2.tv_usec + tv2.tv_sec * 1000000.0) - (tv1.tv_usec + tv1.tv_sec * 1000000.0);
		tt /= 1000000.0;
		fprintf(stderr,
			"with%s posix_fallocate: %d open/close iterations, %d rewrites in %0.4fs\n",
			(with_fallocate ? "" : "out"), open_close_iterations, rewrite_iterations, tt);
		sleep(5);
	}
	/* cleanup */
	free(buf0);
	free(buf1);
	return 0;
}



Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-25 Thread Jon Nelson
On Thu, May 16, 2013 at 7:05 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 5/16/13 9:16 AM, Jon Nelson wrote:

 Am I doing this the right way? Should I be posting the full patch each
 time, or incremental patches?


 There are guidelines for getting your patch in the right format at
 https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
 that would improve this one.  You have some formatting issues with tab
 spacing at lines 120 through 133 in your v3 patch.  And it looks like there
 was a formatting change on line 146 that is making the diff larger than it
 needs to be.

I've corrected the formatting change (end-of-line whitespace was
stripped) on line 146.
The other whitespace changes are - I think - due to newly-indented
code due to a new code block.
Included please find a v4 patch which uses context diffs per the above url.

 The biggest thing missing from this submission is information about what
 performance testing you did.  Ideally performance patches are submitted with
 enough information for a reviewer to duplicate the same test the author did,
 as well as hard before/after performance numbers from your test system.  It
 often turns tricky to duplicate a performance gain, and being able to run
 the same test used for initial development eliminates a lot of the problems.

This has been a bit of a struggle. While it's true that WAL file
creation doesn't happen with great frequency, and while it's also true
that - with strace and other tests - it can be proven that
fallocate(16MB) is much quicker than writing it zeroes by hand,
proving that in the larger context of a running install has been
challenging.

Attached you'll find a small test script (t.sh) which creates a new
cluster in 'foo', changes some config values, starts the cluster, and
then times how long it takes pgbench to prepare a database. I've used
wal_level = hot_standby in the hopes that this generates the largest
number of WAL files (and I set the number of such files to 1024). The
hardware is an AMD 9150e with a 2-disk software RAID1 (SATA disks) on
kernel 3.9.2 and ext4 (x86_64, openSUSE 12.3). The test results are
not that surprising. The longer the test (the larger the scale factor)
the less of a difference using posix_fallocate makes. With a scale
factor of 100, I see an average of 10-11% reduction in the time taken
to initialize the database. With 300, it's about 5.5% and with 900,
it's between 0 and 1.2%. I will be doing more testing but this is what
I started with. I'm very open to suggestions.

 Second bit of nitpicking.  There are already some GUC values that appear or
 disappear based on compile time options.  They're all debugging related
 things though.  I would prefer not to see this one go away when it's
 implementation isn't available.  That's going to break any scripts that SHOW
 the setting to see if it's turned on or not as a first problem.  I think the
 right model to follow here is the IFDEF setup used for
 effective_io_concurrency.  I wouldn't worry about this too much though.
 Having a wal_use_fallocate GUC is good for testing.  But if it works out
 well, when it's ready for commit I don't see why anyone would want it turned
 off on platforms where it works.  There are already too many performance
 tweaking GUCs.  Something has to be very likely to be changed from the
 default before its worth adding one for it.

Ack.  I've revised the patch to always have the GUC (for now), default
to false, and if configure can't find posix_fallocate (or the user
disables it by way of pg_config_manual.h) then it remains a GUC that
simply can't be changed.

I'll also be re-running the tests.

--
Jon
attachment: plot.png

fallocate-v4.patch
Description: Binary data


t.sh
Description: Bourne shell script



Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-16 Thread Jon Nelson
On Wed, May 15, 2013 at 10:36 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 Jon Nelson escribió:
 On Wed, May 15, 2013 at 4:46 PM, Jon Nelson jnelson+pg...@jamponi.net 
 wrote:

  That's true. I originally wrote the patch using fallocate(2). What
  would be appropriate here? Should I switch on the return value and the
  six (6) or so relevant error codes?

 I addressed this, hopefully in a reasonable way.

 Would it work to just assign the value you got from posix_fallocate (if
 nonzero) to errno and then use %m in the errmsg() call in ereport()?

 That strikes me as a better way. I'll work something up soon.
 Thanks!

Please find attached version 3.
Am I doing this the right way? Should I be posting the full patch each
time, or incremental patches?


--
Jon


fallocate-v3.patch
Description: Binary data



Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Jon Nelson
On Tue, May 14, 2013 at 9:43 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, May 13, 2013 at 9:54 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 Pertinent to another thread titled
 [HACKERS] corrupt pages detected by enabling checksums
 I hope to explore the possibility of using fallocate (or
 posix_fallocate) for new WAL file creation.

 Most modern Linux filesystems support fast fallocate/posix_fallocate,
 reducing extent fragmentation (where extents are used) and frequently
 offering a pretty significant speed improvement. In my tests, using
 posix_fallocate (followed by pg_fsync) is at least 28 times quicker
 than using the current method (which writes zeroes followed by
 pg_fsync).

 I have written up a patch to use posix_fallocate in new WAL file
 creation, including configuration by way of a GUC variable, but I've
 not contributed to the PostgreSQL project before. Therefore, I'm
 fairly certain the patch is not formatted properly or conforms to the
 appropriate style guides. Currently, the patch is based on 9.2, and is
 quite small in size - 3.6KiB.

I have re-based and reformatted the code, and basic testing shows a
fairly significant reduction in WAL-file creation time.
I ran 'make test' and did additional local testing without issue.
Therefore, I am attaching the patch. I will try to add it to the
commitfest page.


--
Jon


0001-enhance-GUC-and-xlog-with-wal_use_fallocate-boolean-.patch
Description: Binary data



Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Jon Nelson
On Wed, May 15, 2013 at 4:34 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On 2013-05-15 16:26:15 -0500, Jon Nelson wrote:
  I have written up a patch to use posix_fallocate in new WAL file
  creation, including configuration by way of a GUC variable, but I've
  not contributed to the PostgreSQL project before. Therefore, I'm
  fairly certain the patch is not formatted properly or conforms to the
  appropriate style guides. Currently, the patch is based on 9.2, and is
  quite small in size - 3.6KiB.

 I have re-based and reformatted the code, and basic testing shows a
 reduction in WAL-file creation time of a fairly significant amount.
 I ran 'make test' and did additional local testing without issue.
 Therefore, I am attaching the patch. I will try to add it to the
 commitfest page.

 Some where quick comments, without thinking about this:

Thank you for the kind feedback.

 * needs a configure check for posix_fallocate. The current version will
   e.g. fail to compile on windows or many other non linux systems. Check
   how its done for posix_fadvise.

I will address as soon as I am able.
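
The usual shape of such a guard - sketched here with HAVE_POSIX_FALLOCATE
hand-defined in place of the real configure result, and with made-up
function names - is roughly:

```c
/* Sketch of a configure-guarded call site: use posix_fallocate when
 * available, otherwise fall back to writing zeroes.  HAVE_POSIX_FALLOCATE
 * is hand-defined here as a stand-in for configure's output; the function
 * name is hypothetical. */
#define HAVE_POSIX_FALLOCATE 1

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEGMENT_SIZE (16 * 1024 * 1024)
#define BLOCK_SIZE 8192

/* Returns 0 on success, -1 on failure. */
static int
zero_fill_file(int fd, size_t len)
{
#ifdef HAVE_POSIX_FALLOCATE
	/* one system call instead of len / BLOCK_SIZE write() calls */
	return posix_fallocate(fd, 0, len) == 0 ? 0 : -1;
#else
	char	buf[BLOCK_SIZE];
	size_t	written;

	memset(buf, 0, sizeof(buf));
	for (written = 0; written < len; written += BLOCK_SIZE)
		if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE)
			return -1;
	return 0;
#endif
}
```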

 * Is wal file creation performance actually relevant? Is the performance
   of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

However, a file allocated with fallocate is (supposed to be) less
fragmented than one created by the traditional means.

 * According to the man page posix_fallocate doesn't set errno but rather
   returns the error code.

That's true. I originally wrote the patch using fallocate(2). What
would be appropriate here? Should I switch on the return value and the
six (6) or so relevant error codes?

 * I wonder whether we ever want to actually disable this? Afair the libc
   contains emulation for posix_fadvise if the filesystem doesn't support
   it.

I know that glibc does, but I don't know about other libc implementations.

--
Jon




Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Jon Nelson
On Wed, May 15, 2013 at 4:46 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, May 15, 2013 at 4:34 PM, Andres Freund and...@2ndquadrant.com wrote:
..
 Some where quick comments, without thinking about this:

 Thank you for the kind feedback.

 * needs a configure check for posix_fallocate. The current version will
   e.g. fail to compile on windows or many other non linux systems. Check
   how its done for posix_fadvise.

The following patch includes the changes to configure.in.
I had to make other changes (not included here) because my local
system uses autoconf 2.69, but I did test this successfully.

 That's true. I originally wrote the patch using fallocate(2). What
 would be appropriate here? Should I switch on the return value and the
 six (6) or so relevant error codes?

I addressed this, hopefully in a reasonable way.

--
Jon


fallocate.patch-v2
Description: Binary data



Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Jon Nelson
On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Jon Nelson escribió:
 On Wed, May 15, 2013 at 4:46 PM, Jon Nelson jnelson+pg...@jamponi.net 
 wrote:

  That's true. I originally wrote the patch using fallocate(2). What
  would be appropriate here? Should I switch on the return value and the
  six (6) or so relevant error codes?

 I addressed this, hopefully in a reasonable way.

 Would it work to just assign the value you got from posix_fallocate (if
 nonzero) to errno and then use %m in the errmsg() call in ereport()?

That strikes me as a better way. I'll work something up soon.
Thanks!

--
Jon




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Jon Nelson
On Mon, May 13, 2013 at 7:49 AM, k...@rice.edu k...@rice.edu wrote:
 On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote:
 On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
  On 5/10/13 1:06 PM, Jeff Janes wrote:
 
  Of course the paranoid DBA could turn off restart_after_crash and do a
  manual investigation on every crash, but in that case the database would
  refuse to restart even in the case where it perfectly clear that all the
  following WAL belongs to the recycled file and not the current file.
 
 
  Perhaps we should also allow for zeroing out WAL files before reuse (or 
  just
  disable reuse). I know there's a performance hit there, but the reuse idea
  happened before we had bgWriter. Theoretically the overhead creating a new
  file would always fall to bgWriter and therefore not be a big deal.

 For filesystems like btrfs, re-using a WAL file is suboptimal to
 simply creating a new one and removing the old one when it's no longer
 required. Using fallocate (or posix_fallocate) (I have a patch for
 that!) to create a new one is - by my tests - 28 times faster than the
 currently-used method.


 What about for less cutting (bleeding) edge filesystems?

The patch would be applicable to any filesystem that implements the
fallocate/posix_fallocate calls in an efficient fashion. xfs and ext4
would both work, if I recall properly. I'm certain there are others,
and the technique would probably work on other operating systems like
the *BSDs, etc.. Additionally, it's improbable that there would be a
performance hit for other filesystems versus simply writing zeroes,
since that's the approach that is taken anyway as a fallback.  Another
win is reduction in fragmentation. I would love to be able to turn off
WAL recycling to perform more useful testing.

--
Jon




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Jon Nelson
On Mon, May 13, 2013 at 8:32 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
 On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
  On 5/10/13 1:06 PM, Jeff Janes wrote:
 
  Of course the paranoid DBA could turn off restart_after_crash and do a
  manual investigation on every crash, but in that case the database would
  refuse to restart even in the case where it perfectly clear that all the
  following WAL belongs to the recycled file and not the current file.
 
 
  Perhaps we should also allow for zeroing out WAL files before reuse (or 
  just
  disable reuse). I know there's a performance hit there, but the reuse idea
  happened before we had bgWriter. Theoretically the overhead creating a new
  file would always fall to bgWriter and therefore not be a big deal.

 For filesystems like btrfs, re-using a WAL file is suboptimal to
 simply creating a new one and removing the old one when it's no longer
 required. Using fallocate (or posix_fallocate) (I have a patch for
 that!) to create a new one is - by my tests - 28 times faster than the
 currently-used method.

 I don't think the comparison between just fallocate()ing and what we
 currently do is fair. fallocate() doesn't guarantee that the file is the
 same size after a crash, so you would still need an fsync() or we
 couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
 that big anymore in that case?

fallocate (16MB) + fsync is still almost certainly faster than
write+write+write... + fsync.
The test I performed at the time did exactly that .. posix_fallocate + pg_fsync.

 That said, using posix_fallocate seems like a good idea in lots of
 places inside pg, its just not all that easy to do in some of the
 places.

I should not derail this thread any further. Perhaps, if interested
parties would like to discuss the use of fallocate/posix_fallocate, a
new thread might be more appropriate?


--
Jon




[HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-13 Thread Jon Nelson
Pertinent to another thread titled
[HACKERS] corrupt pages detected by enabling checksums
I hope to explore the possibility of using fallocate (or
posix_fallocate) for new WAL file creation.

Most modern Linux filesystems support fast fallocate/posix_fallocate,
reducing extent fragmentation (where extents are used) and frequently
offering a pretty significant speed improvement. In my tests, using
posix_fallocate (followed by pg_fsync) is at least 28 times quicker
than using the current method (which writes zeroes followed by
pg_fsync).

I have written up a patch to use posix_fallocate in new WAL file
creation, including configuration by way of a GUC variable, but I've
not contributed to the PostgreSQL project before. Therefore, I'm
fairly certain the patch is not formatted properly or conforms to the
appropriate style guides. Currently, the patch is based on 9.2, and is
quite small in size - 3.6KiB.

Advice on how to proceed is appreciated.

--
Jon




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-12 Thread Jon Nelson
On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
 On 5/10/13 1:06 PM, Jeff Janes wrote:

 Of course the paranoid DBA could turn off restart_after_crash and do a
 manual investigation on every crash, but in that case the database would
 refuse to restart even in the case where it perfectly clear that all the
 following WAL belongs to the recycled file and not the current file.


 Perhaps we should also allow for zeroing out WAL files before reuse (or just
 disable reuse). I know there's a performance hit there, but the reuse idea
 happened before we had bgWriter. Theoretically the overhead creating a new
 file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal to
simply creating a new one and removing the old one when it's no longer
required. Using fallocate (or posix_fallocate) (I have a patch for
that!) to create a new one is - by my tests - 28 times faster than the
currently-used method.


--
Jon




Re: [HACKERS] Considering Gerrit for CFs

2013-02-08 Thread Jon Nelson
On Fri, Feb 8, 2013 at 1:43 PM, Phil Sorber p...@omniti.com wrote:
 On Fri, Feb 8, 2013 at 10:20 AM, Peter Eisentraut pete...@gmx.net wrote:
 On 2/8/13 5:23 AM, Magnus Hagander wrote:
 But do you have any actual proof that the problem is in we
 loose reviewers because we're relying on email?

 Here is one: Me.

 Just yesterday I downloaded a piece of software that was previously
 unknown to me from GitHub and found a bug.  Within 15 minutes or so I
 had fixed the bug, made a fork, sent a pull request.  Today I read, the
 fix was merged last night, and I'm happy.

 How would this go with PostgreSQL?  You can use the bug form on the web
 site, but you can't attach any code, so the bug will just linger and
 ultimately put more burden on a core contributor to deal with the
 minutiae of developing, testing, and committing a trivial fix and
 sending feedback to the submitter.  Or the user could take the high road
 and develop and patch and submit it.  Just make sure it's in context
 diff format!  Search the wiki if you don't know how to do that!  Send it
 to -hackers, your email will be held for moderation.  We won't actually
 do anything with your patch, but we will tell you to add it to that
 commitfest app over there.  You need to sign up for an account to use
 that.  We will deal with your patch in one or two months.  But only if
 you review another patch.  And you should sign up for that other mailing
 list, to make sure you're doing it right.  Chances are, the first review
 you're going to get is that your patch doesn't apply anymore, but which
 time you will have lost interest in the patch anyway.

 This. This times 1000.

I, too, could not agree more.

 I'm not sure if Gerrit specifically is the answer, but there are
 definitely better ways to do code review like this. I really like the
 way github allows you to post a patch and then have conversation
 around it, offer comments on specific lines of code, and add updates
 to the patch all in one interface. Another benefit is that a lot more
 people are familiar and comfortable with this work flow. There are
 even some open source work-a-likes that we could use to we don't have
 to rely on a 3rd party like github. Gerrit seems to do it slightly
 differently with side by side diff's and patch revisions, but either
 way would be an improvement.

Please take this for what it's worth - I'm not a code reviewer or
committer - just a pretty heavy user, and I lurk on (most?) of the
mailing lists.

Mostly I find bugs and ask others to fix them, since I lack the
necessary intimate knowledge of postgresql internals to produce a
meaningful patch. That said, I believe that - from my perspective -
PostgreSQL's interaction with its *large* community would only
be improved by using something like GitHub. I am far more likely to
try to introduce a new feature, minor bugfix, code improvement, et
cetera when using GitHub than I would be if the interaction starts
with a post to a mailing list and at least /looks/ like it might
involve rather more than that.

-- 
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] question: foreign key constraints and AccessExclusive locks

2013-01-06 Thread Jon Nelson
On Sun, Jan 6, 2013 at 4:14 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 6 January 2013 03:08, Jon Nelson jnelson+pg...@jamponi.net wrote:
 When adding a foreign key constraint on tableA which references
 tableB, why is an AccessExclusive lock on tableB necessary? Wouldn't a
 lock that prevents writes be sufficient, or does PostgreSQL have to
 modify *both* tables in some fashion? I'm using PostgreSQL 8.4 on
 Linux.

 FKs are enforced by triggers currently. Adding triggers requires
 AccessExclusiveLock because of catalog visibility issues; you are
 right that a lower lock is eventually possible.

 SQLStandard requires the check to be symmetrical, so adding FKs
 requires a trigger on each table and so an AEL is placed on tableB.

I've read and re-read this a few times, and I think I understand.
However, could you clarify "you are right that a lower lock is
eventually possible" for me, please?

-- 
Jon




[HACKERS] question: foreign key constraints and AccessExclusive locks

2013-01-05 Thread Jon Nelson
When adding a foreign key constraint on tableA which references
tableB, why is an AccessExclusive lock on tableB necessary? Wouldn't a
lock that prevents writes be sufficient, or does PostgreSQL have to
modify *both* tables in some fashion? I'm using PostgreSQL 8.4 on
Linux.

-- 
Jon




Re: [HACKERS] xmalloc = pg_malloc

2012-10-04 Thread Jon Nelson
On Wed, Oct 3, 2012 at 11:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Eisentraut pete...@gmx.net writes:
 xmalloc, xstrdup, etc. are pretty common names for functions that do
 alloc-or-die (another possible naming scheme ;-) ).  The naming
 pg_malloc etc. on the other hand suggests that the allocation is being
 done in a PostgreSQL-specific way, and anyway sounds too close to
 palloc.

 So I'd be more in favor of xmalloc = pg_malloc.

 Meh.  The fact that other people use that name is not really an
 advantage from where I sit.  I'm concerned about possible name
 collisions, eg in libraries loaded into the backend.

 There are probably not any actual risks of collision right now, given
 that all these functions are currently in our client-side programs ---
 but it's foreseeable that we might use this same naming convention in
 more-exposed places in future.  In fact, somebody was already proposing
 creating such functions in the core backend.

 But having said that, I'm not absolutely wedded to these names; they
 were just the majority of existing cases.

Why not split the difference and use pg_xmalloc?
As in: PostgreSQL-special malloc that dies on failure.

-- 
Jon




Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Jon Nelson
On Thu, Jun 28, 2012 at 6:05 AM, Magnus Hagander mag...@hagander.net wrote:
 On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Would Posix shmem help with that at all?  Why did you choose not to
 use the Posix API, anyway?

 It seemed more complicated.  If we use the POSIX API, we've got to
 have code to find a non-colliding name for the shm, and we've got to
 arrange to clean it up at process exit.  Anonymous shm doesn't require
 a name and goes away automatically when it's no longer in use.

 I see.  Those are pretty good reasons ...

 So, should we do it this way?

 I did a little research and discovered that Linux 2.3.51 (released
 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
 That combination is documented to work beginning in Linux 2.4.0.  How
 worried should we be about people trying to run PostgreSQL 9.3 on
 pre-2.4 kernels?  If we want to worry about it, we could try mapping a
 one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
 works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
 facility and try to allocate the whole segment plus a minimal sysv
 shm.  If the single page allocation fails with EINVAL, we could fall
 back to allocating the entire segment as sysv shm.

Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)?  I
seem to think that's what I did when I needed this functionality oh so
many moons ago.

-- 
Jon



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Jon Nelson
On Thu, Jun 28, 2012 at 8:57 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 28, 2012 at 9:47 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)?  I
 seem to think that's what I did when I needed this functionality oh so
 many moons ago.

 From the reading I've done on this topic, that seems to be a trick
 invented on Solaris that is considered grotty and awful by everyone
 else.  The thing is that you want the mapping to be shared with the
 processes that inherit the mapping from you.  You do *NOT* want the
 mapping to be shared with EVERYONE who has mapped that file for any
 reason, which is the usual meaning of MAP_SHARED on a file.  Maybe
 this happens to work correctly on some or all platforms, but I would
 want to have some convincing evidence that it's more widely supported
 (with the correct semantics) than MAP_ANON before relying on it.

When I did this (I admit, it was on Linux but it was a long time ago)
only the inherited file descriptor + mmap structure mattered -
modifications were private to the process and its children - other
apps always saw their own /dev/zero. A quick google suggests that -
according to qnx, sco, and some others - mmap'ing /dev/zero retains
the expected privacy. Given how /dev/zero works I'd be very surprised
if it was otherwise.

I would love to see links that suggest that /dev/zero is nasty (or, in
fact, in any way fundamentally different than mmap'ing /dev/zero) -
feel free to send them to me privately to avoid polluting the list.

-- 
Jon
