Re: [HACKERS] _mdfd_getseg can be expensive

2016-09-08 Thread Andres Freund
On 2016-08-31 15:15:16 -0700, Peter Geoghegan wrote:
> On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund  wrote:
> > On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan  wrote:
> >
> >>In other painfully pedantic news, I should point out that
> >>sizeof(size_t) isn't necessarily word size (the most generic
> >>definition of word size for the architecture), contrary to my reading
>>of the 0002-* patch comments. I'm mostly thinking about x86_64
> >>here, of course.
> >
> > Uh?
> 
> Sorry, I really shouldn't have said anything. It is true that x86_64
> word size is sometimes reported as 16 and/or 32 bits [1], because of
> legacy issues.

I think "native word size" describes the issue well enough. And, more
importantly, I can't think of an equally short but more accurate
description.

I've pushed the patches.  Thanks for the review.

Andres




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Peter Geoghegan
On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund  wrote:
> On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan  wrote:
>
>>In other painfully pedantic news, I should point out that
>>sizeof(size_t) isn't necessarily word size (the most generic
>>definition of word size for the architecture), contrary to my reading
>>of the 0002-* patch comments. I'm mostly thinking about x86_64
>>here, of course.
>
> Uh?

Sorry, I really shouldn't have said anything. It is true that x86_64
word size is sometimes reported as 16 and/or 32 bits [1], because of
legacy issues.

[1] 
https://en.wikipedia.org/wiki/Word_(computer_architecture)#Table_of_word_sizes
-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Andres Freund


On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan  wrote:

>In other painfully pedantic news, I should point out that
>sizeof(size_t) isn't necessarily word size (the most generic
>definition of word size for the architecture), contrary to my reading
>of the 0002-* patch comments. I'm mostly thinking about x86_64
>here, of course.

Uh?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Peter Geoghegan
On Wed, Aug 31, 2016 at 2:37 PM, Andres Freund  wrote:
>> This looks good.
>
> Thanks for looking!

No problem.

In other painfully pedantic news, I should point out that
sizeof(size_t) isn't necessarily word size (the most generic
definition of word size for the architecture), contrary to my reading
of the 0002-* patch comments. I'm mostly thinking about x86_64
here, of course.
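
For illustration (a standalone toy example, nothing from the patch):
on x86_64, sizeof(size_t) is 8 bytes even though "word" in the ISA's
own manuals still means 16 bits, a legacy of the 8086.

#include <stdio.h>

int
main(void)
{
	/* both print 8 on x86_64/LP64, despite the 16-bit "word" terminology */
	printf("sizeof(size_t) = %zu\n", sizeof(size_t));
	printf("sizeof(void *) = %zu\n", sizeof(void *));
	return 0;
}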


-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Andres Freund
On 2016-08-31 14:09:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund  wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
> 
> This looks good.

Thanks for looking!


> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically. In other words, we don't separately keep track
> of the array size and allocated size, and grow the allocated size
> using the doubling strategy that you see in many places.

I can't really see that being worth the code here. The cost of
open()/lseek()'ing etc. is going to dwarf the cost of this.


Regards,

Andres




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Peter Geoghegan
On Wed, Aug 31, 2016 at 2:09 PM, Peter Geoghegan  wrote:
> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically.

That new function looks like this:

> +static void
> +_fdvec_resize(SMgrRelation reln,
> + ForkNumber forknum,
> + int nseg)
>  {
> -   return (MdfdVec *) MemoryContextAlloc(MdCxt, sizeof(MdfdVec));
> +   if (nseg == 0)
> +   {
> +   if (reln->md_num_open_segs[forknum] > 0)
> +   {
> +   pfree(reln->md_seg_fds[forknum]);
> +   reln->md_seg_fds[forknum] = NULL;
> +   }
> +   }
> +   else if (reln->md_num_open_segs[forknum] == 0)
> +   {
> +   reln->md_seg_fds[forknum] =
> +   MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
> +   }
> +   else
> +   {
> +   reln->md_seg_fds[forknum] =
> +   repalloc(reln->md_seg_fds[forknum],
> +sizeof(MdfdVec) * nseg);
> +   }
> +
> +   reln->md_num_open_segs[forknum] = nseg;
>  }

Another tiny tweak that you might consider occurs to me here: It
couldn't hurt to "Assert(reln->md_seg_fds[forknum] == NULL)" within
the "else if (reln->md_num_open_segs[forknum] == 0)" path here, prior
to the MemoryContextAlloc(). If it's worth setting it to NULL when
there are 0 segs (i.e., "reln->md_seg_fds[forknum] = NULL"), then
perhaps it's worth checking that things are found that way later.
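
Concretely, that'd be something like this (just a sketch of the
suggestion against the excerpt above, not a tested change):

	else if (reln->md_num_open_segs[forknum] == 0)
	{
		/* zero open segments implies the array pointer was NULLed */
		Assert(reln->md_seg_fds[forknum] == NULL);
		reln->md_seg_fds[forknum] =
			MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
	}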

I guess that this is at odds with remarks in my last mail about
considering geometric allocations for the "reln->md_seg_fds[forknum]"
array. Both feedback items are more or less just things that bothered
me ever so slightly, which I don't necessarily expect you to act on,
but assume you'd like to hear.

-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Peter Geoghegan
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund  wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

This looks good.

The only thing that stuck out to any degree is that we don't grow the
"reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
function geometrically. In other words, we don't separately keep track
of the array size and allocated size, and grow the allocated size
using the doubling strategy that you see in many places.

Now, I freely admit that that probably doesn't matter, given what that
array tracks. I'm just pointing out that that aspect did give me
pause. The struct MdfdVec is small enough that that *might* be
worthwhile.
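
For what it's worth, the usual pattern would look roughly like this (a
hypothetical sketch only -- md_seg_capacity is an invented field that
the patch does not have):

	/* grow geometrically, tracking capacity separately from length */
	if (nseg > reln->md_seg_capacity[forknum])
	{
		int		newcap = Max(reln->md_seg_capacity[forknum], 1);

		while (newcap < nseg)
			newcap *= 2;

		if (reln->md_seg_capacity[forknum] == 0)
			reln->md_seg_fds[forknum] =
				MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * newcap);
		else
			reln->md_seg_fds[forknum] =
				repalloc(reln->md_seg_fds[forknum],
						 sizeof(MdfdVec) * newcap);
		reln->md_seg_capacity[forknum] = newcap;
	}
	reln->md_num_open_segs[forknum] = nseg;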

-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-18 Thread Peter Geoghegan
On Thu, Aug 18, 2016 at 5:42 PM, Andres Freund  wrote:
> How large was the index & table in question? I mean this really only
> comes into effect at 100+ segments.

Not that big, but I see no reason to take the chance, I suppose.

-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-18 Thread Andres Freund
On 2016-08-18 17:35:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:28 PM, Andres Freund  wrote:
> >> I can review this next week.
> >
> > Thanks
> 
> Given the time frame that you have in mind, I won't revisit the
> parallel CLUSTER CPU bottleneck issue until this is committed. The
> patch might change things enough that that would be a waste of time.

How large was the index & table in question? I mean this really only
comes into effect at 100+ segments.




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-18 Thread Peter Geoghegan
On Thu, Aug 18, 2016 at 5:28 PM, Andres Freund  wrote:
>> I can review this next week.
>
> Thanks

Given the time frame that you have in mind, I won't revisit the
parallel CLUSTER CPU bottleneck issue until this is committed. The
patch might change things enough that that would be a waste of time.

-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-18 Thread Andres Freund
On 2016-08-18 17:27:59 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund  wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
> 
> I can review this next week.

Thanks




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-18 Thread Peter Geoghegan
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund  wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

I can review this next week.


-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-18 Thread Andres Freund
On 2016-06-30 18:14:15 -0700, Peter Geoghegan wrote:
> On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund  wrote:
> > Took a while. But here we go. The attached version is a significantly
> > revised version of my earlier patch. Notably I've pretty much entirely
> > revised the code in _mdfd_getseg() to be more similar to the state in
> > master. Also some comment policing.
> 
> Are you planning to pick this up again, Andres?

Rebased version attached. A review would be welcome. Plan to push this
forward otherwise in the not too far away future.

Andres
From 62b0d36864b23b91961bb800c1b0814f20d00a8e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 18 Aug 2016 16:04:33 -0700
Subject: [PATCH 1/2] Improve scalability of md.c for large relations.

Previously several routines in md.c were O(#segments) - a problem these
days, where it's not uncommon to have tables in the multi TB range.

Replace the linked list of segments hanging off SMgrRelationData with an
array of opened segments. That allows O(1) access to individual
segments, if they've previously been opened.
---
 src/backend/storage/smgr/md.c   | 408 ++--
 src/backend/storage/smgr/smgr.c |   4 +-
 src/include/storage/smgr.h  |   8 +-
 3 files changed, 234 insertions(+), 186 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f329d15..0992a7e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -92,27 +92,23 @@
  *	out to an unlinked old copy of a segment file that will eventually
  *	disappear.
  *
- *	The file descriptor pointer (md_fd field) stored in the SMgrRelation
- *	cache is, therefore, just the head of a list of MdfdVec objects, one
- *	per segment.  But note the md_fd pointer can be NULL, indicating
- *	relation not open.
+ *	File descriptors are stored in md_seg_fds arrays inside SMgrRelation. The
+ *	length of these arrays is stored in md_num_open_segs.  Note that
+ *	md_num_open_segs having a specific value does not necessarily mean the
+ *	relation doesn't have additional segments; we may just not have opened the
+ *	next segment yet.  (We could not have "all segments are in the array" as
+ *	an invariant anyway, since another backend could extend the relation while
+ *	we aren't looking.)  We do not have entries for inactive segments,
+ *	however; as soon as we find a partial segment, we assume that any
+ *	subsequent segments are inactive.
  *
- *	Also note that mdfd_chain == NULL does not necessarily mean the relation
- *	doesn't have another segment after this one; we may just not have
- *	opened the next segment yet.  (We could not have "all segments are
- *	in the chain" as an invariant anyway, since another backend could
- *	extend the relation when we weren't looking.)  We do not make chain
- *	entries for inactive segments, however; as soon as we find a partial
- *	segment, we assume that any subsequent segments are inactive.
- *
- *	All MdfdVec objects are palloc'd in the MdCxt memory context.
+ *	The entire MdfdVec array is palloc'd in the MdCxt memory context.
  */
 
 typedef struct _MdfdVec
 {
 	File		mdfd_vfd;		/* fd number in fd.c's pool */
 	BlockNumber mdfd_segno;		/* segment number, from 0 */
-	struct _MdfdVec *mdfd_chain;	/* next segment, or NULL */
 } MdfdVec;
 
 static MemoryContext MdCxt;		/* context for all MdfdVec objects */
@@ -189,7 +185,9 @@ static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
 static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
 	   MdfdVec *seg);
 static void register_unlink(RelFileNodeBackend rnode);
-static MdfdVec *_fdvec_alloc(void);
+static void _fdvec_resize(SMgrRelation reln,
+			  ForkNumber forknum,
+			  int nseg);
 static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
 			  BlockNumber segno);
 static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
@@ -298,13 +296,14 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
 void
 mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 {
+	MdfdVec    *mdfd;
 	char	   *path;
 	File		fd;
 
-	if (isRedo && reln->md_fd[forkNum] != NULL)
+	if (isRedo && reln->md_num_open_segs[forkNum] > 0)
 		return;	/* created and opened already... */
 
-	Assert(reln->md_fd[forkNum] == NULL);
+	Assert(reln->md_num_open_segs[forkNum] == 0);
 
 	path = relpath(reln->smgr_rnode, forkNum);
 
@@ -334,11 +333,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 
 	pfree(path);
 
-	reln->md_fd[forkNum] = _fdvec_alloc();
-
-	reln->md_fd[forkNum]->mdfd_vfd = fd;
-	reln->md_fd[forkNum]->mdfd_segno = 0;
-	reln->md_fd[forkNum]->mdfd_chain = NULL;
+	_fdvec_resize(reln, forkNum, 1);
+	mdfd = &reln->md_seg_fds[forkNum][0];
+	mdfd->mdfd_vfd = fd;
+	mdfd->mdfd_segno = 0;
 }
 
 /*
@@ -583,8 +581,8 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
 	File		fd;
 
 	/* No work if already open */
-	if (reln->md_fd[forknum])

Re: [HACKERS] _mdfd_getseg can be expensive

2016-07-01 Thread Peter Geoghegan
On Thu, Jun 30, 2016 at 7:08 PM, Andres Freund  wrote:
> If you have a big enough index (maybe ~150GB+), sure. Before that,
> probably not.
>
> It's usually pretty easy to see in cpu profiles whether this issue
> exists.

I think that this is a contributing factor to why merging in parallel
CREATE INDEX becomes much more CPU bound when building such very large
indexes, which Corey Huinker has benchmarked using an advanced copy of
the patch. He has shown cases that are sped up by 3.6x when 8 parallel
workers are used (compared to a serial CREATE INDEX), but a several
hundred gigabyte index case only sees a speedup of about 1.5x. (This
bottleneck affects serial CREATE INDEX merging just as much as
parallel, since that part isn't parallelized, but it's far more
noticeable with parallel CREATE INDEX simply because merging in the
leader becomes a huge bottleneck).

Those two cases were not exactly comparable in perhaps several other
ways, but even still my sense is that this can be at least
partially explained by md.c bottlenecks. This is something that we'll
need to confirm through profiling. Hopefully it's just this one
bottleneck.

--
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-06-30 Thread Andres Freund
On 2016-06-30 18:34:20 -0700, Peter Geoghegan wrote:
> On Thu, Jun 30, 2016 at 6:23 PM, Andres Freund  wrote:
> > I plan to, once the tree opens again. Likely needs some considerable
> > updates for recent changes.
> 
> Offhand, do you think that CREATE INDEX calls to smgrextend() could be
> appreciably affected by this bottleneck? If that's a very involved or
> difficult question, then no need to answer now.

If you have a big enough index (maybe ~150GB+), sure. Before that,
probably not.

It's usually pretty easy to see in cpu profiles whether this issue
exists.

Andres




Re: [HACKERS] _mdfd_getseg can be expensive

2016-06-30 Thread Peter Geoghegan
On Thu, Jun 30, 2016 at 6:23 PM, Andres Freund  wrote:
> I plan to, once the tree opens again. Likely needs some considerable
> updates for recent changes.

Offhand, do you think that CREATE INDEX calls to smgrextend() could be
appreciably affected by this bottleneck? If that's a very involved or
difficult question, then no need to answer now.

-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-06-30 Thread Andres Freund
On 2016-06-30 18:14:15 -0700, Peter Geoghegan wrote:
> On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund  wrote:
> > Took a while. But here we go. The attached version is a significantly
> > revised version of my earlier patch. Notably I've pretty much entirely
> > revised the code in _mdfd_getseg() to be more similar to the state in
> > master. Also some comment policing.
> 
> Are you planning to pick this up again, Andres?

I plan to, once the tree opens again. Likely needs some considerable
updates for recent changes.

Andres




Re: [HACKERS] _mdfd_getseg can be expensive

2016-06-30 Thread Peter Geoghegan
On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund  wrote:
> Took a while. But here we go. The attached version is a significantly
> revised version of my earlier patch. Notably I've pretty much entirely
> revised the code in _mdfd_getseg() to be more similar to the state in
> master. Also some comment policing.

Are you planning to pick this up again, Andres?

-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2015-12-15 Thread Andres Freund
On 2014-11-01 18:23:47 +0100, Andres Freund wrote:
> On 2014-11-01 12:57:40 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> > >> While the basic idea is sound, this particular implementation seems
> > >> pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
> > >> array typed size_t?
> > 
> > > It stores the length of the array of _MdfdVec entries.
> > 
> > Oh.  "seg_no" seems like not a very good choice of name then.
> > Perhaps "md_seg_count" or something like that would be more intelligible.
> 
> That's fine with me.

Went with md_num_open_segs - when reading the code, that was easier to
understand.

> So I'll repost a version with those fixes.

Took a while. But here we go. The attached version is a significantly
revised version of my earlier patch. Notably I've pretty much entirely
revised the code in _mdfd_getseg() to be more similar to the state in
master. Also some comment policing.

As it's more than a bit painful to test with large numbers of 1GB
segments, I've rebuilt my local postgres with 100kb segments. Running a
pgbench -s 300 with 128MB shared buffers clearly shows the efficiency
differences: 320 tps vs. 4900 in an assertion-enabled build. Obviously
that's kind of an artificial situation...

But I've also verified this with a) fake relations built out of sparse
files, b) actually large relations. The latter shows performance
benefits as well, but my patience limited the amount of testing I was
willing to do...

Kevin, Robert: I've CCed you because Robert commented in the recent
mdnblocks() thread that Kevin lamented the O(n)'ness of md.c...

Andres
From a0b30154714df3e13bb668fa9590f7be40e4de35 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 15 Dec 2015 19:01:36 +0100
Subject: [PATCH 1/2] Faster PageIsVerified() for the all zeroes case.

---
 src/backend/storage/page/bufpage.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index df77bb2..de9800f 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -81,7 +81,7 @@ bool
 PageIsVerified(Page page, BlockNumber blkno)
 {
 	PageHeader	p = (PageHeader) page;
-	char	   *pagebytes;
+	size_t	   *pagebytes;
 	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
@@ -118,10 +118,17 @@ PageIsVerified(Page page, BlockNumber blkno)
 			return true;
 	}
 
-	/* Check all-zeroes case */
+	/*
+	 * Check all-zeroes case. Luckily BLCKSZ is guaranteed to always be a
+	 * multiple of size_t - and it's much faster to compare memory using the
+	 * processor's native word size.
+	 */
+	StaticAssertStmt(BLCKSZ == (BLCKSZ / sizeof(size_t)) * sizeof(size_t),
+	 "BLCKSZ has to be a multiple of sizeof(size_t)");
+
 	all_zeroes = true;
-	pagebytes = (char *) page;
-	for (i = 0; i < BLCKSZ; i++)
+	pagebytes = (size_t *) page;
+	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
 	{
 		if (pagebytes[i] != 0)
 		{
-- 
2.6.0.rc3

From 15150d1ba7f17d3a241be3073a03e6bb9a36a937 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 15 Dec 2015 19:01:36 +0100
Subject: [PATCH 2/2] Improve scalability of md.c for large relations.

Previously several routines in md.c were O(#segments) - a problem these
days, where it's not uncommon to have tables in the multi TB range.

Replace the linked list of segments hanging off SMgrRelationData with an
array of opened segments. That allows O(1) access to individual
segments, if they've previously been opened.
---
 src/backend/storage/smgr/md.c   | 337 +++-
 src/backend/storage/smgr/smgr.c |   4 +-
 src/include/storage/smgr.h  |   8 +-
 3 files changed, 200 insertions(+), 149 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..cb163f4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -92,27 +92,23 @@
  *	out to an unlinked old copy of a segment file that will eventually
  *	disappear.
  *
- *	The file descriptor pointer (md_fd field) stored in the SMgrRelation
- *	cache is, therefore, just the head of a list of MdfdVec objects, one
- *	per segment.  But note the md_fd pointer can be NULL, indicating
- *	relation not open.
+ *	File descriptors are stored in md_seg_fds arrays inside SMgrRelation. The
+ *	length of these arrays is stored in md_num_open_segs.  Note that
+ *	md_num_open_segs having a specific value does not necessarily mean the
+ *	relation doesn't have additional segments; we may just not have opened the
+ *	next segment yet.  (We could not have "all segments are in the array" as
+ *	an invariant anyway, since another backend could extend the relation while
+ *	we aren't looking.)  We do not have entries for inactive segments,
+ *	however; as soon as we find a partial segment, we assume that any
+ *	

Re: [HACKERS] _mdfd_getseg can be expensive

2014-11-01 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
> On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
>> While the basic idea is sound, this particular implementation seems
>> pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
>> array typed size_t?

> It stores the length of the array of _MdfdVec entries.

Oh.  "seg_no" seems like not a very good choice of name then.
Perhaps "md_seg_count" or something like that would be more intelligible.

And personally I'd have made it an int, because we are certainly not doing
segment-number arithmetic in anything wider than int anywhere else.
Introducing size_t into the mix won't do anything except create a risk of
signed-vs-unsigned logic bugs.
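
To spell out the risk with a standalone example of my own:

#include <stdio.h>

int
main(void)
{
	int		segno = -1;		/* e.g. a "nothing open yet" sentinel */
	size_t	nsegs = 10;

	/*
	 * The usual arithmetic conversions turn segno into a size_t here,
	 * so -1 becomes a huge unsigned value and the test is false.
	 */
	if (segno < nsegs)
		printf("taken\n");
	else
		printf("not taken: -1 compared as a huge unsigned value\n");
	return 0;
}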

regards, tom lane




Re: [HACKERS] _mdfd_getseg can be expensive

2014-11-01 Thread Andres Freund
On 2014-11-01 12:57:40 -0400, Tom Lane wrote:
> Andres Freund and...@2ndquadrant.com writes:
>> On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
>>> While the basic idea is sound, this particular implementation seems
>>> pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
>>> array typed size_t?
>
>> It stores the length of the array of _MdfdVec entries.
>
> Oh.  "seg_no" seems like not a very good choice of name then.
> Perhaps "md_seg_count" or something like that would be more intelligible.

That's fine with me.

> And personally I'd have made it an int, because we are certainly not doing
> segment-number arithmetic in anything wider than int anywhere else.

Fine with me too. I picked size_t by habit, because there's projects
that don't allow anything else to be used for lengths of memory...

I've, during testing, also noticed it has accidentally introduced a
vfd/memory leak...

So I'll repost a version with those fixes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] _mdfd_getseg can be expensive

2014-10-31 Thread Andres Freund
Hi,

On 2014-03-31 12:10:01 +0200, Andres Freund wrote:
> I recently have seen some perf profiles in which _mdfd_getseg() was in
> the top #3 when VACUUMing large (~200GB) relations. Called by mdread(),
> mdwrite(). Looking at its implementation, I am not surprised. It
> iterates over all segment entries a relation has, for every read or
> write. That's not painful for smaller relations, but at a couple of
> hundred GB it starts to be annoying. Especially if kernel readahead has
> already read in all data from disk.
>
> I don't have a good idea what to do about this yet, but it seems like
> something that should be fixed mid-term.
>
> The best I can come up with is caching the last mdvec used, but that's
> fairly ugly. Alternatively it might be a good idea to not store MdfdVec
> as a linked list, but as a densely allocated array.

I've seen this a couple times more since. On larger relations it gets
even more pronounced. When sequentially scanning a 2TB relation,
_mdfd_getseg() gets up to 80% proportionate CPU time towards the end of
the scan.

I wrote the attached patch that gets rid of that essentially quadratic
behaviour, by replacing the mdfd chain/singly linked list with an
array. Since we seldom grow files by a whole segment, I can't see the
slightly bigger memory reallocations mattering significantly. In pretty
much every other case the array is bound to be a winner.
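
To sketch the difference (simplified -- the real lookup in
_mdfd_getseg() is more involved; "targetseg" stands in for the segment
number we want):

	/* before: walk the chain from the head -- O(#segments) per access */
	MdfdVec    *v = reln->md_fd[forknum];

	while (v != NULL && v->mdfd_segno != targetseg)
		v = v->mdfd_chain;

	/* after: index the array directly -- O(1), if already opened */
	MdfdVec    *v2 = NULL;

	if (targetseg < reln->md_seg_no[forknum])
		v2 = &reln->md_seg_fds[forknum][targetseg];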

Does anybody have fundamental arguments against that idea?

With some additional work we could save a bit more memory by getting rid
of the mdfd_segno as it's essentially redundant - but that's not
entirely trivial and I'm unsure if it's worth it.

I've also attached a second patch that makes PageIsVerified() noticeably
faster when the page is new. That's helpful and related because it makes
it easier to test the correctness of the md.c rewrite by faking full 1GB
segments. It's also pretty simple, so there's imo little reason not to
do it.

Greetings,

Andres Freund

PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From f7bf48137d5c3c4268e0fb6231ca6996d9fe Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Thu, 30 Oct 2014 13:24:20 +0100
Subject: [PATCH 1/2] Faster PageIsVerified() for the all zeroes case.

---
 src/backend/storage/page/bufpage.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..c175ed0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -81,7 +81,7 @@ bool
 PageIsVerified(Page page, BlockNumber blkno)
 {
 	PageHeader	p = (PageHeader) page;
-	char	   *pagebytes;
+	size_t	   *pagebytes;
 	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
@@ -118,10 +118,17 @@ PageIsVerified(Page page, BlockNumber blkno)
 			return true;
 	}
 
-	/* Check all-zeroes case */
+	/*
+	 * Check all-zeroes case. Luckily BLCKSZ is guaranteed to always be a
+	 * multiple of size_t - and it's much faster to compare memory using the
+	 * processor's native word size.
+	 */
+	StaticAssertStmt(BLCKSZ == (BLCKSZ / sizeof(size_t)) * sizeof(size_t),
+	 "BLCKSZ has to be a multiple of sizeof(size_t)");
+
 	all_zeroes = true;
-	pagebytes = (char *) page;
-	for (i = 0; i < BLCKSZ; i++)
+	pagebytes = (size_t *) page;
+	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
 	{
 		if (pagebytes[i] != 0)
 		{
-- 
2.0.0.rc2.4.g1dc51c6.dirty

From 5bc40c11081a0d034f2857b45cb3d45c5b63fa58 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Thu, 30 Oct 2014 13:24:52 +0100
Subject: [PATCH 2/2] Much faster/O(1) mfdvec.

---
 src/backend/storage/smgr/md.c   | 307 +---
 src/backend/storage/smgr/smgr.c |   4 +-
 src/include/storage/smgr.h  |   4 +-
 3 files changed, 166 insertions(+), 149 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 167d61c..6a7afa8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -92,27 +92,22 @@
  *	out to an unlinked old copy of a segment file that will eventually
  *	disappear.
  *
- *	The file descriptor pointer (md_fd field) stored in the SMgrRelation
- *	cache is, therefore, just the head of a list of MdfdVec objects, one
- *	per segment.  But note the md_fd pointer can be NULL, indicating
- *	relation not open.
+ *	File descriptors are stored in the md_seg_fds arrays inside
+ *	SMgrRelation. The length of these arrays is stored in md_seg_no.  Note
+ *	that md_seg_no having a specific value does not necessarily mean the
+ *	relation doesn't have additional segments; we may just not have opened the
+ *	next segment yet.  (We could not have "all segments are in the array" as an
+ *	

Re: [HACKERS] _mdfd_getseg can be expensive

2014-10-31 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
> I wrote the attached patch that gets rid of that essentially quadratic
> behaviour, by replacing the mdfd chain/singly linked list with an
> array. Since we seldom grow files by a whole segment, I can't see the
> slightly bigger memory reallocations mattering significantly. In pretty
> much every other case the array is bound to be a winner.
>
> Does anybody have fundamental arguments against that idea?

While the basic idea is sound, this particular implementation seems
pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
array typed size_t?  IOW, why didn't you *just* replace the linked
list with an array?  This patch seems to be making some other changes
that you've failed to explain.

regards, tom lane




Re: [HACKERS] _mdfd_getseg can be expensive

2014-10-31 Thread Andres Freund
On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> Andres Freund and...@2ndquadrant.com writes:
>> I wrote the attached patch that gets rid of that essentially quadratic
>> behaviour, by replacing the mdfd chain/singly linked list with an
>> array. Since we seldom grow files by a whole segment, I can't see the
>> slightly bigger memory reallocations mattering significantly. In pretty
>> much every other case the array is bound to be a winner.
>
>> Does anybody have fundamental arguments against that idea?
>
> While the basic idea is sound, this particular implementation seems
> pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
> array typed size_t?
> IOW, why didn't you *just* replace the linked list with an array?

It stores the length of the array of _MdfdVec entries. To know whether
it's safe to access some element we first need to check whether we've
loaded that many entries. It's size_t[] because that seemed to be the
most appropriate type for the length of an array. It's an array because
md.c had already chosen to represent relation forks via an array indexed
by the fork.

So
   size_t   md_seg_no[MAX_FORKNUM + 1];
contains the length of the _MdfdVec array for each fork. These arrays
are stored in:
   struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];

> This patch seems to be making some other changes that you've failed to
> explain.

I'm not aware of any that aren't just a consequence of not iterating
through the linked list anymore. What change are you thinking of?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

