Re: [HACKERS] _mdfd_getseg can be expensive
On 2016-08-31 15:15:16 -0700, Peter Geoghegan wrote:
> On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund wrote:
> > On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan wrote:
> > >> In other painfully pedantic news, I should point out that
> > >> sizeof(size_t) isn't necessarily word size (the most generic
> > >> definition of word size for the architecture), contrary to my reading
> > >> of the 0002-* patch comments. I'm mostly thinking about x86_64
> > >> here, of course.
> >
> > Uh?
>
> Sorry, I really should have not said anything. It is true that x86_64
> word size is sometimes reported as 16 and/or 32 bits [1], because of
> legacy issues.

I think "native word size" describes the issue well enough. And, more
importantly, I can't think of an equally short but more accurate
description.

I've pushed the patches. Thanks for the review.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] _mdfd_getseg can be expensive
On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund wrote:
> On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan wrote:
> >> In other painfully pedantic news, I should point out that
> >> sizeof(size_t) isn't necessarily word size (the most generic
> >> definition of word size for the architecture), contrary to my reading
> >> of the 0002-* patch comments. I'm mostly thinking about x86_64
> >> here, of course.
>
> Uh?

Sorry, I really should have not said anything. It is true that x86_64
word size is sometimes reported as 16 and/or 32 bits [1], because of
legacy issues.

[1] https://en.wikipedia.org/wiki/Word_(computer_architecture)#Table_of_word_sizes

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan wrote:
> In other painfully pedantic news, I should point out that
> sizeof(size_t) isn't necessarily word size (the most generic
> definition of word size for the architecture), contrary to my reading
> of the 0002-* patch comments. I'm mostly thinking about x86_64
> here, of course.

Uh?

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [HACKERS] _mdfd_getseg can be expensive
On Wed, Aug 31, 2016 at 2:37 PM, Andres Freund wrote:
>> This looks good.
>
> Thanks for looking!

No problem.

In other painfully pedantic news, I should point out that
sizeof(size_t) isn't necessarily word size (the most generic
definition of word size for the architecture), contrary to my reading
of the 0002-* patch comments. I'm mostly thinking about x86_64
here, of course.

--
Peter Geoghegan
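[Editor's note: the distinction Peter is drawing can be illustrated with a few lines of C. This is a hypothetical demo, not part of either patch: on mainstream flat-memory platforms sizeof(size_t) matches the pointer width, even though x86_64's nominal "word" is 16 bits for legacy reasons.]

```c
#include <stddef.h>

/*
 * sizeof(size_t) tracks the platform's native data/pointer width, which
 * need not match the architecture's historically-defined "word" size.
 * On x86_64 Linux both helpers below return 64, while the legacy x86
 * "word" is 16 bits.
 */
static size_t
size_t_width_bits(void)
{
	return sizeof(size_t) * 8;	/* assumes CHAR_BIT == 8, true on all
								 * platforms PostgreSQL supports */
}

static size_t
pointer_width_bits(void)
{
	return sizeof(void *) * 8;
}
```

Note that the C standard does not require size_t and void * to have the same width (segmented-memory platforms differ); the equality below only holds on the flat-memory platforms in common use.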
Re: [HACKERS] _mdfd_getseg can be expensive
On 2016-08-31 14:09:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
>
> This looks good.

Thanks for looking!

> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically. In other words, we don't separately keep track
> of the array size and allocated size, and grow the allocated size
> using the doubling strategy that you see in many places.

I can't really see that being worth the code here. The cost of
open()/lseek()'ing etc. is going to dwarf the cost of this.

Regards,

Andres
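[Editor's note: for readers unfamiliar with the "doubling strategy" being discussed, here is a minimal sketch of geometric array growth. The names (SegArray, seg_array_append, FakeMdfdVec) are illustrative stand-ins, not md.c code, and plain realloc stands in for repalloc:]

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the real MdfdVec. */
typedef struct
{
	int			mdfd_vfd;
	int			mdfd_segno;
} FakeMdfdVec;

/*
 * Geometric growth: track the valid-entry count and the allocated
 * capacity separately, and reallocate (doubling) only when capacity is
 * exhausted, giving amortized O(1) appends.
 */
typedef struct
{
	FakeMdfdVec *segs;
	int			nsegs;		/* number of valid entries */
	int			capacity;	/* number of allocated entries */
} SegArray;

static void
seg_array_append(SegArray *a, FakeMdfdVec v)
{
	if (a->nsegs == a->capacity)
	{
		a->capacity = (a->capacity == 0) ? 4 : a->capacity * 2;
		a->segs = realloc(a->segs, sizeof(FakeMdfdVec) * a->capacity);
		assert(a->segs != NULL);	/* a real caller would handle OOM */
	}
	a->segs[a->nsegs++] = v;
}
```

Andres's point stands regardless: since _fdvec_resize() is called once per segment open/close, and each of those involves a syscall, the reallocation cost is in the noise.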
Re: [HACKERS] _mdfd_getseg can be expensive
On Wed, Aug 31, 2016 at 2:09 PM, Peter Geoghegan wrote:
> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically.

That new function looks like this:

> +static void
> +_fdvec_resize(SMgrRelation reln,
> +			  ForkNumber forknum,
> +			  int nseg)
> {
> -	return (MdfdVec *) MemoryContextAlloc(MdCxt, sizeof(MdfdVec));
> +	if (nseg == 0)
> +	{
> +		if (reln->md_num_open_segs[forknum] > 0)
> +		{
> +			pfree(reln->md_seg_fds[forknum]);
> +			reln->md_seg_fds[forknum] = NULL;
> +		}
> +	}
> +	else if (reln->md_num_open_segs[forknum] == 0)
> +	{
> +		reln->md_seg_fds[forknum] =
> +			MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
> +	}
> +	else
> +	{
> +		reln->md_seg_fds[forknum] =
> +			repalloc(reln->md_seg_fds[forknum],
> +					 sizeof(MdfdVec) * nseg);
> +	}
> +
> +	reln->md_num_open_segs[forknum] = nseg;
> }

Another tiny tweak that you might consider occurs to me here: It
couldn't hurt to "Assert(reln->md_seg_fds[forknum] == NULL)" within
the "else if (reln->md_num_open_segs[forknum] == 0)" path here, prior
to the MemoryContextAlloc(). If it's worth setting it to NULL when
there are 0 segs (i.e., "reln->md_seg_fds[forknum] = NULL"), then
perhaps it's worth checking that things are found that way later.

I guess that this is at odds with remarks in my last mail about
considering geometric allocations for the "reln->md_seg_fds[forknum]"
array. Both feedback items are more or less just things that bothered
me ever so slightly, which I don't necessarily expect you to act on,
but assume you'd like to hear.

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

This looks good.

The only thing that stuck out to any degree is that we don't grow the
"reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
function geometrically. In other words, we don't separately keep track
of the array size and allocated size, and grow the allocated size
using the doubling strategy that you see in many places.

Now, I freely admit that that probably doesn't matter, given what that
array tracks. I'm just pointing out that that aspect did give me
pause. The struct MdfdVec is small enough that that *might* be
worthwhile.

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On Thu, Aug 18, 2016 at 5:42 PM, Andres Freund wrote:
> How large was the index & table in question? I mean this really only
> comes into effect at 100+ segments.

Not that big, but I see no reason to take the chance, I suppose.

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On 2016-08-18 17:35:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:28 PM, Andres Freund wrote:
> >> I can review this next week.
> >
> > Thanks
>
> Given the time frame that you have in mind, I won't revisit the
> question of the parallel CLUSTER CPU bottleneck issue until this is
> committed. The patch might change things enough that that would be a
> waste of time.

How large was the index & table in question? I mean this really only
comes into effect at 100+ segments.
Re: [HACKERS] _mdfd_getseg can be expensive
On Thu, Aug 18, 2016 at 5:28 PM, Andres Freund wrote:
>> I can review this next week.
>
> Thanks

Given the time frame that you have in mind, I won't revisit the
question of the parallel CLUSTER CPU bottleneck issue until this is
committed. The patch might change things enough that that would be a
waste of time.

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On 2016-08-18 17:27:59 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
>
> I can review this next week.

Thanks
Re: [HACKERS] _mdfd_getseg can be expensive
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

I can review this next week.

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On 2016-06-30 18:14:15 -0700, Peter Geoghegan wrote: > On Tue, Dec 15, 2015 at 10:04 AM, Andres Freundwrote: > > Took a while. But here we go. The attached version is a significantly > > revised version of my earlier patch. Notably I've pretty much entirely > > revised the code in _mdfd_getseg() to be more similar to the state in > > master. Also some comment policing. > > Are you planning to pick this up again, Andres? Rebased version attached. A review would be welcome. Plan to push this forward otherwise in the not too far away future. Andres >From 62b0d36864b23b91961bb800c1b0814f20d00a8e Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 18 Aug 2016 16:04:33 -0700 Subject: [PATCH 1/2] Improve scalability of md.c for large relations. Previously several routines in md.c were O(#segments) - a problem these days, where it's not uncommon to have tables in the multi TB range. Replace the linked list of segments hanging of SMgrRelationData, with an array of opened segments. That allows O(1) access to individual segments, if they've previously been opened. --- src/backend/storage/smgr/md.c | 408 ++-- src/backend/storage/smgr/smgr.c | 4 +- src/include/storage/smgr.h | 8 +- 3 files changed, 234 insertions(+), 186 deletions(-) diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index f329d15..0992a7e 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -92,27 +92,23 @@ * out to an unlinked old copy of a segment file that will eventually * disappear. * - * The file descriptor pointer (md_fd field) stored in the SMgrRelation - * cache is, therefore, just the head of a list of MdfdVec objects, one - * per segment. But note the md_fd pointer can be NULL, indicating - * relation not open. + * File descriptors are stored in md_seg_fds arrays inside SMgrRelation. The + * length of these arrays is stored in md_num_open_segs. 
Note that + * md_num_open_segs having a specific value does not necessarily mean the + * relation doesn't have additional segments; we may just not have opened the + * next segment yet. (We could not have "all segments are in the array" as + * an invariant anyway, since another backend could extend the relation while + * we aren't looking.) We do not have entries for inactive segments, + * however; as soon as we find a partial segment, we assume that any + * subsequent segments are inactive. * - * Also note that mdfd_chain == NULL does not necessarily mean the relation - * doesn't have another segment after this one; we may just not have - * opened the next segment yet. (We could not have "all segments are - * in the chain" as an invariant anyway, since another backend could - * extend the relation when we weren't looking.) We do not make chain - * entries for inactive segments, however; as soon as we find a partial - * segment, we assume that any subsequent segments are inactive. - * - * All MdfdVec objects are palloc'd in the MdCxt memory context. + * The entire MdfdVec array is palloc'd in the MdCxt memory context. 
*/ typedef struct _MdfdVec { File mdfd_vfd; /* fd number in fd.c's pool */ BlockNumber mdfd_segno; /* segment number, from 0 */ - struct _MdfdVec *mdfd_chain; /* next segment, or NULL */ } MdfdVec; static MemoryContext MdCxt; /* context for all MdfdVec objects */ @@ -189,7 +185,9 @@ static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior); static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); static void register_unlink(RelFileNodeBackend rnode); -static MdfdVec *_fdvec_alloc(void); +static void _fdvec_resize(SMgrRelation reln, + ForkNumber forknum, + int nseg); static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno); static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno, @@ -298,13 +296,14 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) void mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo) { + MdfdVec*mdfd; char *path; File fd; - if (isRedo && reln->md_fd[forkNum] != NULL) + if (isRedo && reln->md_num_open_segs[forkNum] > 0) return; /* created and opened already... */ - Assert(reln->md_fd[forkNum] == NULL); + Assert(reln->md_num_open_segs[forkNum] == 0); path = relpath(reln->smgr_rnode, forkNum); @@ -334,11 +333,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo) pfree(path); - reln->md_fd[forkNum] = _fdvec_alloc(); - - reln->md_fd[forkNum]->mdfd_vfd = fd; - reln->md_fd[forkNum]->mdfd_segno = 0; - reln->md_fd[forkNum]->mdfd_chain = NULL; + _fdvec_resize(reln, forkNum, 1); + mdfd = >md_seg_fds[forkNum][0]; + mdfd->mdfd_vfd = fd; + mdfd->mdfd_segno = 0; } /* @@ -583,8 +581,8 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior) File fd; /* No work if already open */ - if (reln->md_fd[forknum])
Re: [HACKERS] _mdfd_getseg can be expensive
On Thu, Jun 30, 2016 at 7:08 PM, Andres Freund wrote:
> If you have a big enough index (maybe ~150GB+), sure. Before that,
> probably not.
>
> It's usually pretty easy to see in cpu profiles whether this issue
> exists.

I think that this is a contributing factor to why merging in parallel
CREATE INDEX becomes much more CPU bound when building such very large
indexes, which Corey Huinker has benchmarked using an advanced copy of
the patch. He has shown cases that are sped up by 3.6x when 8 parallel
workers are used (compared to a serial CREATE INDEX), but a several
hundred gigabyte index case only sees a speedup of about 1.5x. (This
bottleneck affects serial CREATE INDEX merging just as much as
parallel, since that part isn't parallelized, but it's far more
noticeable with parallel CREATE INDEX simply because merging in the
leader becomes a huge bottleneck.)

Those two cases were not exactly comparable in perhaps several other
ways, but even still my sense is that this can be at least partially
explained by md.c bottlenecks. This is something that we'll need to
confirm through profiling. Hopefully it's just this one bottleneck.

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On 2016-06-30 18:34:20 -0700, Peter Geoghegan wrote:
> On Thu, Jun 30, 2016 at 6:23 PM, Andres Freund wrote:
> > I plan to, once the tree opens again. Likely needs some considerable
> > updates for recent changes.
>
> Offhand, do you think that CREATE INDEX calls to smgrextend() could be
> appreciably affected by this bottleneck? If that's a very involved or
> difficult question, then no need to answer now.

If you have a big enough index (maybe ~150GB+), sure. Before that,
probably not.

It's usually pretty easy to see in cpu profiles whether this issue
exists.

Andres
Re: [HACKERS] _mdfd_getseg can be expensive
On Thu, Jun 30, 2016 at 6:23 PM, Andres Freund wrote:
> I plan to, once the tree opens again. Likely needs some considerable
> updates for recent changes.

Offhand, do you think that CREATE INDEX calls to smgrextend() could be
appreciably affected by this bottleneck? If that's a very involved or
difficult question, then no need to answer now.

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On 2016-06-30 18:14:15 -0700, Peter Geoghegan wrote:
> On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund wrote:
> > Took a while. But here we go. The attached version is a significantly
> > revised version of my earlier patch. Notably I've pretty much entirely
> > revised the code in _mdfd_getseg() to be more similar to the state in
> > master. Also some comment policing.
>
> Are you planning to pick this up again, Andres?

I plan to, once the tree opens again. Likely needs some considerable
updates for recent changes.

Andres
Re: [HACKERS] _mdfd_getseg can be expensive
On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund wrote:
> Took a while. But here we go. The attached version is a significantly
> revised version of my earlier patch. Notably I've pretty much entirely
> revised the code in _mdfd_getseg() to be more similar to the state in
> master. Also some comment policing.

Are you planning to pick this up again, Andres?

--
Peter Geoghegan
Re: [HACKERS] _mdfd_getseg can be expensive
On 2014-11-01 18:23:47 +0100, Andres Freund wrote: > On 2014-11-01 12:57:40 -0400, Tom Lane wrote: > > Andres Freundwrites: > > > On 2014-10-31 18:48:45 -0400, Tom Lane wrote: > > >> While the basic idea is sound, this particular implementation seems > > >> pretty bizarre. What's with the "md_seg_no" stuff, and why is that > > >> array typed size_t? > > > > > It stores the length of the array of _MdfdVec entries. > > > > Oh. "seg_no" seems like not a very good choice of name then. > > Perhaps "md_seg_count" or something like that would be more intelligible. > > That's fine with me. Went with md_num_open_segs - reading the code that was easier to understand. > So I'll repost a version with those fixes. Took a while. But here we go. The attached version is a significantly revised version of my earlier patch. Notably I've pretty much entirely revised the code in _mdfd_getseg() to be more similar to the state in master. Also some comment policing. As it's more than a bit painful to test with large numbers of 1GB segments, I've rebuilt my local postgres with 100kb segments. Running a pgbench -s 300 with 128MB shared buffers clearly shows the efficiency differnces: 320tps vs 4900 in an assertion enabled build. Obviously that's kind of an artificial situation... But I've also verified this with a) fake relations built out of sparse files, b) actually large relations. The latter shows performance benefits as well, but my patience limits the amount of testing I was willing to do... Kevin, Robert: I've CCed you because Robert commented in the recent mdnblocks() blocks thread that Kevin lamented the O(n)'ness of md.c... Andres >From a0b30154714df3e13bb668fa9590f7be40e4de35 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 15 Dec 2015 19:01:36 +0100 Subject: [PATCH 1/2] Faster PageIsVerified() for the all zeroes case. 
--- src/backend/storage/page/bufpage.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index df77bb2..de9800f 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -81,7 +81,7 @@ bool PageIsVerified(Page page, BlockNumber blkno) { PageHeader p = (PageHeader) page; - char *pagebytes; + size_t *pagebytes; int i; bool checksum_failure = false; bool header_sane = false; @@ -118,10 +118,17 @@ PageIsVerified(Page page, BlockNumber blkno) return true; } - /* Check all-zeroes case */ + /* + * Check all-zeroes case. Luckily BLCKSZ is guaranteed to always be a + * multiple of size_t - and it's much faster compare memory using the + * processor's native word size. + */ + StaticAssertStmt(BLCKSZ == (BLCKSZ / sizeof(size_t)) * sizeof(size_t), + "BLCKSZ has to be a multiple of sizeof(size_t)"); + all_zeroes = true; - pagebytes = (char *) page; - for (i = 0; i < BLCKSZ; i++) + pagebytes = (size_t *) page; + for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) { if (pagebytes[i] != 0) { -- 2.6.0.rc3 >From 15150d1ba7f17d3a241be3073a03e6bb9a36a937 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 15 Dec 2015 19:01:36 +0100 Subject: [PATCH 2/2] Improve scalability of md.c for large relations. Previously several routines in md.c were O(#segments) - a problem these days, where it's not uncommon to have tables in the multi TB range. Replace the linked list of segments hanging of SMgrRelationData with an array of opened segments. That allows O(1) access to individual segments, if they've previously been opened. 
--- src/backend/storage/smgr/md.c | 337 +++- src/backend/storage/smgr/smgr.c | 4 +- src/include/storage/smgr.h | 8 +- 3 files changed, 200 insertions(+), 149 deletions(-) diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 42a43bb..cb163f4 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -92,27 +92,23 @@ * out to an unlinked old copy of a segment file that will eventually * disappear. * - * The file descriptor pointer (md_fd field) stored in the SMgrRelation - * cache is, therefore, just the head of a list of MdfdVec objects, one - * per segment. But note the md_fd pointer can be NULL, indicating - * relation not open. + * File descriptors are stored in md_seg_fds arrays inside SMgrRelation. The + * length of these arrays is stored in md_num_open_segs. Note that + * md_num_open_segs having a specific value does not necessarily mean the + * relation doesn't have additional segments; we may just not have opened the + * next segment yet. (We could not have "all segments are in the array" as + * an invariant anyway, since another backend could extend the relation while + * we aren't looking.) We do not have entries for inactive segments, + * however; as soon as we find a partial segment, we assume that any + *
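[Editor's note: the structural change this patch makes can be sketched in miniature. The names below (FakeSegChain, chain_lookup_steps, array_lookup_steps, make_chain) are illustrative stand-ins, not the real md.c code: the old code reached segment N by walking a singly linked chain from the head on every smgrread/smgrwrite (O(N) per access, O(N²) over a sequential scan), while the patch keeps open segments in a flat array, so segment N is one index away (O(1)):]

```c
#include <assert.h>
#include <stdlib.h>

/* Old scheme: each open segment points at the next (cf. mdfd_chain). */
typedef struct FakeSegChain
{
	int					segno;
	struct FakeSegChain *next;
} FakeSegChain;

/* Build a chain of segments 0..nsegs-1, for demonstration. */
static FakeSegChain *
make_chain(int nsegs)
{
	FakeSegChain *head = NULL;

	for (int s = nsegs - 1; s >= 0; s--)
	{
		FakeSegChain *v = malloc(sizeof(FakeSegChain));

		v->segno = s;
		v->next = head;
		head = v;
	}
	return head;
}

/* Old lookup: walk from the head; returns how many links were traversed. */
static int
chain_lookup_steps(const FakeSegChain *head, int target_segno)
{
	int			steps = 0;

	for (const FakeSegChain *v = head; v != NULL; v = v->next, steps++)
	{
		if (v->segno == target_segno)
			return steps;
	}
	return -1;
}

/* New lookup: segments live in an array indexed by segment number. */
static int
array_lookup_steps(const int *segnos, int nsegs, int target_segno)
{
	if (target_segno >= nsegs || segnos[target_segno] != target_segno)
		return -1;				/* not opened yet */
	return 0;					/* one step, regardless of segment count */
}
```

With 1GB segments, a 2TB relation has ~2000 segments, which is why the chain walk became visible at the top of profiles.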
Re: [HACKERS] _mdfd_getseg can be expensive
Andres Freund and...@2ndquadrant.com writes:
> On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
>> While the basic idea is sound, this particular implementation seems
>> pretty bizarre. What's with the "md_seg_no" stuff, and why is that
>> array typed size_t?
>
> It stores the length of the array of _MdfdVec entries.

Oh. "seg_no" seems like not a very good choice of name then.
Perhaps "md_seg_count" or something like that would be more
intelligible. And personally I'd have made it an int, because we are
certainly not doing segment-number arithmetic in anything wider than
int anywhere else. Introducing size_t into the mix won't do anything
except create a risk of signed-vs-unsigned logic bugs.

			regards, tom lane
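[Editor's note: a sketch of the signed-vs-unsigned hazard Tom alludes to, with hypothetical helper names. The classic instance is a countdown loop over a size_t length: with a signed index the loop terminates normally, but `i >= 0` is vacuously true for an unsigned index, which wraps around instead of going negative:]

```c
#include <stddef.h>

/* Signed countdown: i goes 4, 3, 2, 1, 0, then -1 ends the loop. */
static int
count_down_signed(int n)
{
	int			iterations = 0;

	for (int i = n - 1; i >= 0; i--)
		iterations++;
	return iterations;
}

/*
 * With size_t, "i >= 0" would always be true (i wraps to SIZE_MAX), so
 * the loop must be restructured, e.g. with the decrement in the
 * condition so the body sees n-1 .. 0 and the test fires before wrap.
 */
static int
count_down_unsigned(size_t n)
{
	int			iterations = 0;

	for (size_t i = n; i-- > 0;)
		iterations++;
	return iterations;
}
```

Mixing int and size_t in comparisons also triggers implicit conversions to unsigned, which is the kind of subtle bug Tom is warning about when a codebase otherwise does all its arithmetic in int.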
Re: [HACKERS] _mdfd_getseg can be expensive
On 2014-11-01 12:57:40 -0400, Tom Lane wrote:
> Andres Freund and...@2ndquadrant.com writes:
> > On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> >> While the basic idea is sound, this particular implementation seems
> >> pretty bizarre. What's with the "md_seg_no" stuff, and why is that
> >> array typed size_t?
> >
> > It stores the length of the array of _MdfdVec entries.
>
> Oh. "seg_no" seems like not a very good choice of name then.
> Perhaps "md_seg_count" or something like that would be more
> intelligible.

That's fine with me.

> And personally I'd have made it an int, because we are certainly not
> doing segment-number arithmetic in anything wider than int anywhere
> else.

Fine with me too. I picked size_t by habit, because there are projects
that don't allow anything else to be used for lengths of memory...

I've, during testing, also noticed it has accidentally introduced a
vfd/memory leak... So I'll repost a version with those fixes.

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] _mdfd_getseg can be expensive
Hi,

On 2014-03-31 12:10:01 +0200, Andres Freund wrote:
> I recently have seen some perf profiles in which _mdfd_getseg() was in
> the top #3 when VACUUMing large (~200GB) relations. Called by
> mdread(), mdwrite(). Looking at its implementation, I am not
> surprised. It iterates over all segment entries a relation has, for
> every read or write. That's not painful for smaller relations, but at
> a couple of hundred GB it starts to be annoying. Especially if kernel
> readahead has already read in all data from disk.
>
> I don't have a good idea what to do about this yet, but it seems like
> something that should be fixed mid-term.
>
> The best I can come up with is caching the last mdvec used, but that's
> fairly ugly. Alternatively it might be a good idea to not store
> MdfdVec as a linked list, but as a densely allocated array.

I've seen this a couple times more since. On larger relations it gets
even more pronounced. When sequentially scanning a 2TB relation,
_mdfd_getseg() gets up to 80% proportionate CPU time towards the end
of the scan.

I wrote the attached patch that gets rid of that essentially quadratic
behaviour, by replacing the mdfd chain/singly linked list with an
array. Since we seldomly grow files by a whole segment, I can't see
the slightly bigger memory reallocations mattering significantly. In
pretty much every other case the array is bound to be a winner.

Does anybody have fundamental arguments against that idea?

With some additional work we could save a bit more memory by getting
rid of mdfd_segno, as it's essentially redundant - but that's not
entirely trivial, and I'm unsure if it's worth it.

I've also attached a second patch that makes PageIsVerified()
noticeably faster when the page is new. That's helpful and related,
because it makes it easier to test the correctness of the md.c rewrite
by faking full 1GB segments. It's also pretty simple, so there's imo
little reason not to do it.
Greetings,

Andres Freund

PS: The research leading to these results has received funding from
the European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services

From f7bf48137d5c3c4268e0fb6231ca6996d9fe Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Thu, 30 Oct 2014 13:24:20 +0100
Subject: [PATCH 1/2] Faster PageIsVerified() for the all zeroes case.

---
 src/backend/storage/page/bufpage.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..c175ed0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -81,7 +81,7 @@ bool
 PageIsVerified(Page page, BlockNumber blkno)
 {
 	PageHeader	p = (PageHeader) page;
-	char	   *pagebytes;
+	size_t	   *pagebytes;
 	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
@@ -118,10 +118,17 @@ PageIsVerified(Page page, BlockNumber blkno)
 		return true;
 	}

-	/* Check all-zeroes case */
+	/*
+	 * Check all-zeroes case. Luckily BLCKSZ is guaranteed to always be a
+	 * multiple of size_t - and it's much faster to compare memory using the
+	 * processor's native word size.
+	 */
+	StaticAssertStmt(BLCKSZ == (BLCKSZ / sizeof(size_t)) * sizeof(size_t),
+					 "BLCKSZ has to be a multiple of sizeof(size_t)");
+
 	all_zeroes = true;
-	pagebytes = (char *) page;
-	for (i = 0; i < BLCKSZ; i++)
+	pagebytes = (size_t *) page;
+	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
 	{
 		if (pagebytes[i] != 0)
 		{
--
2.0.0.rc2.4.g1dc51c6.dirty

From 5bc40c11081a0d034f2857b45cb3d45c5b63fa58 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Thu, 30 Oct 2014 13:24:52 +0100
Subject: [PATCH 2/2] Much faster/O(1) mfdvec.
--- src/backend/storage/smgr/md.c | 307 +--- src/backend/storage/smgr/smgr.c | 4 +- src/include/storage/smgr.h | 4 +- 3 files changed, 166 insertions(+), 149 deletions(-) diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 167d61c..6a7afa8 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -92,27 +92,22 @@ * out to an unlinked old copy of a segment file that will eventually * disappear. * - * The file descriptor pointer (md_fd field) stored in the SMgrRelation - * cache is, therefore, just the head of a list of MdfdVec objects, one - * per segment. But note the md_fd pointer can be NULL, indicating - * relation not open. + * File descriptors are stored in the md_seg_fds arrays inside + * SMgrRelation. The length of these arrays is stored in md_seg_no. Note + * that md_nfd having a specific value does not necessarily mean the relation + * doesn't have additional segments; we may just not have opened the next + * segment yet. (We could not have all segments are in the array as an + *
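[Editor's note: the word-at-a-time technique of the PageIsVerified() patch above can be demonstrated standalone. The names below (FAKE_BLCKSZ, page_is_all_zeroes_bytes, page_is_all_zeroes_words) are hypothetical stand-ins for this sketch, not PostgreSQL symbols:]

```c
#include <stddef.h>

#define FAKE_BLCKSZ 8192		/* stand-in for PostgreSQL's BLCKSZ */

/* Byte-at-a-time scan, as PageIsVerified() did before the patch. */
static int
page_is_all_zeroes_bytes(const char *page)
{
	for (int i = 0; i < FAKE_BLCKSZ; i++)
	{
		if (page[i] != 0)
			return 0;
	}
	return 1;
}

/*
 * Word-at-a-time scan, the patch's approach: since the block size is a
 * multiple of sizeof(size_t), the page can be walked in native words,
 * doing 8x fewer iterations on 64-bit platforms.  The caller must pass
 * a buffer that is suitably aligned for size_t (page buffers are).
 */
static int
page_is_all_zeroes_words(const char *page)
{
	const size_t *words = (const size_t *) page;

	for (size_t i = 0; i < FAKE_BLCKSZ / sizeof(size_t); i++)
	{
		if (words[i] != 0)
			return 0;
	}
	return 1;
}
```

Both functions return the same answer for any page; only the stride differs, which is what made faking full 1GB segments of zero pages cheap enough to use for testing the md.c rewrite.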
Re: [HACKERS] _mdfd_getseg can be expensive
Andres Freund and...@2ndquadrant.com writes:
> I wrote the attached patch that gets rid of that essentially quadratic
> behaviour, by replacing the mdfd chain/singly linked list with an
> array. Since we seldomly grow files by a whole segment, I can't see
> the slightly bigger memory reallocations mattering significantly. In
> pretty much every other case the array is bound to be a winner.
>
> Does anybody have fundamental arguments against that idea?

While the basic idea is sound, this particular implementation seems
pretty bizarre. What's with the "md_seg_no" stuff, and why is that
array typed size_t? IOW, why didn't you *just* replace the linked
list with an array? This patch seems to be making some other changes
that you've failed to explain.

			regards, tom lane
Re: [HACKERS] _mdfd_getseg can be expensive
On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> While the basic idea is sound, this particular implementation seems
> pretty bizarre. What's with the "md_seg_no" stuff, and why is that
> array typed size_t? IOW, why didn't you *just* replace the linked
> list with an array?

It stores the length of the array of _MdfdVec entries. To know whether
it's safe to access some element we first need to check whether we've
loaded that many entries.

It's size_t[] because that seemed to be the most appropriate type for
the length of an array.

It's an array because md.c had already chosen to represent relation
forks via an array indexed by the fork. So

	size_t		md_seg_no[MAX_FORKNUM + 1];

contains the length of the _MdfdVec array for each fork. These arrays
are stored in:

	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];

> This patch seems to be making some other changes that you've failed
> to explain.

I'm not aware of any that aren't just a consequence of not iterating
through the linked list anymore. What change are you thinking of?

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services