Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On 01/31/2017 01:51 AM, Peter Geoghegan wrote: * If you're writing out any old bit pattern, I suggest using 0x7F bytes rather than nul bytes (I do agree that it does seem worth making this some particular bit pattern, so as to not have to bother with Valgrind suppressions and so on). That pattern immediately gives you a hint on where to look for more information when there is a problem. To me, it suggests "this is something weird". We don't want this to happen very often. Somehow writing zeros still feels more natural to me. That's what you'd get if you just seeked past the end of file, too, for example. I understand your point of using 0x7F to catch bugs, but an all-zeros page is a tell-tale too. So I went with all-zeros, anyway. * I think that you should put the new code into a new function, called ltsAllocBlocksUntil(), or similar -- this can do the BufFileWrite() stuff itself, with a suitable custom defensive elog(ERROR) message. That way, the only new branch needed in ltsWriteBlock() is "if (blocknum > lts->nBlocksWritten)" (maybe use unlikely() too?), and you can make it clear that ltsAllocBlocksUntil() is a rarely needed thing, which seems appropriate. I started to refactor this with something like ltsAllocBlocksUntil(), but in the end, it just seemed like more almost identical code with ltsWriteBlock. We really don't want ltsAllocBlocksUntil() logic to be called very often, and certainly not to write more than 1 or 2 blocks at a time (no more than 1 with current usage). After all, that would mean writing to the same position twice or more, for no reason at all. Maybe note in comments that it's only called at the end of a run in practice, or something to that effect. Keeping writes sequential is very important, to keep logtape block reclamation effective. Added a comment explaining that. Pushed with those little changes. Thanks! - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On Mon, Jan 30, 2017 at 10:01 AM, Peter Geoghegan wrote: > Let me take another look at this later today before proceeding. I want > to run it against a custom test suite I've developed. I've done so. Some more thoughts: * I don't think that this is really any less efficient than my patch. It's just that the write of a block happens within ltsWriteBlock() automatically, rather than being an explicit thing that comes at the tail-end of a run. This is more general, but otherwise very similar. The downside is that it will work when something that we might want to break happens, but I do concede that that's worth it. * If you're writing out any old bit pattern, I suggest using 0x7F bytes rather than nul bytes (I do agree that it does seem worth making this some particular bit pattern, so as to not have to bother with Valgrind suppressions and so on). That pattern immediately gives you a hint on where to look for more information when there is a problem. To me, it suggests "this is something weird". We don't want this to happen very often. * I think that the name lts->nBlocksAllocated works better than lts->nBlocksReserved. After all, you are still writing stuff out in a way that respects BufFile limitations around holes, and still reporting that to the user via trace_sort as the number of blocks actually written. * I think that you should put the new code into a new function, called ltsAllocBlocksUntil(), or similar -- this can do the BufFileWrite() stuff itself, with a suitable custom defensive elog(ERROR) message. That way, the only new branch needed in ltsWriteBlock() is "if (blocknum > lts->nBlocksWritten)" (maybe use unlikely() too?), and you can make it clear that ltsAllocBlocksUntil() is a rarely needed thing, which seems appropriate. We really don't want ltsAllocBlocksUntil() logic to be called very often, and certainly not to write more than 1 or 2 blocks at a time (no more than 1 with current usage). After all, that would mean writing to the same position twice or more, for no reason at all. Maybe note in comments that it's only called at the end of a run in practice, or something to that effect. Keeping writes sequential is very important, to keep logtape block reclamation effective. I was actually planning on introducing a distinction between blocks allocated and blocks written anyway, for the benefit of logtape.c parallel sort, where holes can be left behind post-unification (the start of each workers space needs to be aligned to MAX_PHYSICAL_FILESIZE, so multi-block ranges of holes will be common). So, your approach makes sense to me, and I now slightly prefer your patch to my original. Thanks -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On Mon, Jan 30, 2017 at 9:55 AM, Heikki Linnakangas wrote: > It won't make any difference for correctness what bit pattern you write to > "fill the hole", but you have to write something, and zeros seems like a > natural choice. Ah, okay. That's probably fine, then. Let me take another look at this later today before proceeding. I want to run it against a custom test suite I've developed. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On 01/30/2017 07:45 PM, Peter Geoghegan wrote: On Mon, Jan 30, 2017 at 4:04 AM, Heikki Linnakangas wrote: Alternatively, we could fix this with a small change in ltsWriteBlock(), see attached patch. When we're about to create a hole in the file, write all-zero blocks to avoid the creating hole, before the block itself. That's not quite as efficient as writing the actual block contents into the hole, which avoids having to write it later, but probably won't make any measurable difference in practice. I'm inclined to do that, because it relaxes the rules on what you're allowed to do, in what order, which makes this more robust in general. Why write an all-zero block, rather than arbitrary garbage, which is all that BufFile really requires? I think it's because sizeof(int) nul bytes represent the end of a run for tuplesort's purposes. That makes me doubtful that this is any more robust or general than what I proposed. So, I don't have a problem with the performance implications of doing this, which should be minor, but I'm concerned that it appears to be more general than it actually is. It won't make any difference for correctness what bit pattern you write to "fill the hole", but you have to write something, and zeros seems like a natural choice. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On Mon, Jan 30, 2017 at 4:04 AM, Heikki Linnakangas wrote: > Hmm. The LogicalTapeEndWriting() function is very similar to the > LogicalTapePause() function I had in the pause/resume patch. They both flush > the last block to disk. The difference is that LogicalTapePause() also > free'd the buffer, and read back the last block from the disk if you > continued writing, while LogicalTapeEndWriting() keeps a copy of it in > memory. Right. > With the proposed fix (or with the pause/resume), you can only write to a > single tape at a time. Not a problem at the moment, but something to > consider. At least, would need more comments to make that more clear, and an > Assert would be nice. Agreed on that. > Alternatively, we could fix this with a small change in ltsWriteBlock(), see > attached patch. When we're about to create a hole in the file, write > all-zero blocks to avoid the creating hole, before the block itself. That's > not quite as efficient as writing the actual block contents into the hole, > which avoids having to write it later, but probably won't make any > measurable difference in practice. I'm inclined to do that, because it > relaxes the rules on what you're allowed to do, in what order, which makes > this more robust in general. Why write an all-zero block, rather than arbitrary garbage, which is all that BufFile really requires? I think it's because sizeof(int) nul bytes represent the end of a run for tuplesort's purposes. That makes me doubtful that this is any more robust or general than what I proposed. So, I don't have a problem with the performance implications of doing this, which should be minor, but I'm concerned that it appears to be more general than it actually is. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On 01/07/2017 12:33 AM, Peter Geoghegan wrote: Previously, all calls to ltsGetFreeBlock() were immediately followed by a corresponding call to ltsWriteBlock(); we wrote out the newly-allocated-in-first-pass block there and then. It's a good idea for a corresponding ltsWriteBlock() call to happen immediately, and it's *essential* for it to happen before any later block is written during the first write pass (when tuples are initially dumped out to form runs), since, as noted above ltsWriteBlock(): /* * Write a block-sized buffer to the specified block of the underlying file. * * NB: should not attempt to write beyond current end of file (ie, create * "holes" in file), since BufFile doesn't allow that. The first write pass * must write blocks sequentially. ... I completely missed that rule :-(. However, a "write beyond current end of file" now seems to actually be attempted at times, resulting in an arcane error during sorting. This is more or less because LogicalTapeWrite() doesn't heed the warning from ltsGetFreeBlock(), as seen here: while (size > 0) { if (lt->pos >= TapeBlockPayloadSize) { ... /* * First allocate the next block, so that we can store it in the * 'next' pointer of this block. */ nextBlockNumber = ltsGetFreeBlock(lts); /* set the next-pointer and dump the current block. */ TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber; ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer); ... } ... } Note that LogicalTapeWrite() now allocates each block (calls ltsGetFreeBlock()) one block in advance of current block, immediately before *current* block is written (not the next block/block just allocated). So, the corresponding ltsWriteBlock() call *must* happen at an unspecified time in the future, typically the next time control ends up at exactly the same point, where the block that was "next" becomes "current". I'm not about to argue that we should go back to following this ltsGetFreeBlock() rule, though; I can see why Heikki refactored LogicalTapeWrite() to allocate blocks a block in advance. Still, we need to be more careful in avoiding the underlying problem that the ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-* has logtape.c be sure to write out a tape's buffer every time tuplesort ends a run. Hmm. The LogicalTapeEndWriting() function is very similar to the LogicalTapePause() function I had in the pause/resume patch. They both flush the last block to disk. The difference is that LogicalTapePause() also free'd the buffer, and read back the last block from the disk if you continued writing, while LogicalTapeEndWriting() keeps a copy of it in memory. With the proposed fix (or with the pause/resume), you can only write to a single tape at a time. Not a problem at the moment, but something to consider. At least, would need more comments to make that more clear, and an Assert would be nice. Alternatively, we could fix this with a small change in ltsWriteBlock(), see attached patch. When we're about to create a hole in the file, write all-zero blocks to avoid the creating hole, before the block itself. That's not quite as efficient as writing the actual block contents into the hole, which avoids having to write it later, but probably won't make any measurable difference in practice. I'm inclined to do that, because it relaxes the rules on what you're allowed to do, in what order, which makes this more robust in general. We coul *also* have something like LogicalTapeEndWriting() or LogicalTapePause(), for efficiency, but it doesn't seem that important. I have a test case. > ... Thanks for the analysis! - Heikki diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c index 83e9424537..4a64c53af5 100644 --- a/src/backend/utils/sort/logtape.c +++ b/src/backend/utils/sort/logtape.c @@ -152,7 +152,17 @@ typedef struct LogicalTape struct LogicalTapeSet { BufFile*pfile; /* underlying file for whole tape set */ - longnFileBlocks;/* # of blocks used in underlying file */ + + /* +* File size tracking. nBlocksWritten is the size of the underlying file, +* in BLCKSZ blocks. nBlocksReserved is the number of blocks allocated by +* ltsGetFreeBlock(), and it is always greater than or equal to +* nBlocksWritten. Blocks between nBlocksReserved and nBlocksWritten are +* blocks that have been allocated for a tape, but have not been written +* to the underlying file yet. +*/ + longnBlocksReserved;/* # of blocks allocated by ltsGetFreeBlock */ + longnBlocksWritten; /* # of blocks used in underlying file */ /* * We store the numbers of recycled-and-available bloc
[HACKERS] Subtle bug in "Simplify tape block format" commit
It seems that commit 01ec2563 has a subtle bug, which stems from the fact that logtape.c no longer follows the rule described above ltsGetFreeBlock(): /* * Select a currently unused block for writing to. * * NB: should only be called when writer is ready to write immediately, * to ensure that first write pass is sequential. */ static long ltsGetFreeBlock(LogicalTapeSet *lts) { ... } Previously, all calls to ltsGetFreeBlock() were immediately followed by a corresponding call to ltsWriteBlock(); we wrote out the newly-allocated-in-first-pass block there and then. It's a good idea for a corresponding ltsWriteBlock() call to happen immediately, and it's *essential* for it to happen before any later block is written during the first write pass (when tuples are initially dumped out to form runs), since, as noted above ltsWriteBlock(): /* * Write a block-sized buffer to the specified block of the underlying file. * * NB: should not attempt to write beyond current end of file (ie, create * "holes" in file), since BufFile doesn't allow that. The first write pass * must write blocks sequentially. ... However, a "write beyond current end of file" now seems to actually be attempted at times, resulting in an arcane error during sorting. This is more or less because LogicalTapeWrite() doesn't heed the warning from ltsGetFreeBlock(), as seen here: while (size > 0) { if (lt->pos >= TapeBlockPayloadSize) { ... /* * First allocate the next block, so that we can store it in the * 'next' pointer of this block. */ nextBlockNumber = ltsGetFreeBlock(lts); /* set the next-pointer and dump the current block. */ TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber; ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer); ... } ... } Note that LogicalTapeWrite() now allocates each block (calls ltsGetFreeBlock()) one block in advance of current block, immediately before *current* block is written (not the next block/block just allocated). So, the corresponding ltsWriteBlock() call *must* happen at an unspecified time in the future, typically the next time control ends up at exactly the same point, where the block that was "next" becomes "current". I'm not about to argue that we should go back to following this ltsGetFreeBlock() rule, though; I can see why Heikki refactored LogicalTapeWrite() to allocate blocks a block in advance. Still, we need to be more careful in avoiding the underlying problem that the ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-* has logtape.c be sure to write out a tape's buffer every time tuplesort ends a run. I have a test case. Build Postgres with only 0002-* applied, which makes MAX_PHYSICAL_FILESIZE far smaller than its current value (current value is 2^30, while this reduces it to BLCKSZ). Then: postgres=# set replacement_sort_tuples = 0; set work_mem = 64; SET SET postgres=# with a as (select repeat('a', 7000) i from generate_series(1, 7)) select * from a order by i; ERROR: XX000: could not write block 5 of temporary file: Success LOCATION: ltsWriteBlock, logtape.c:204 This "block 5" corresponds to an fd.c file/BufFile segment that does not in fact exist when this error is raised. Since BufFiles multiplex BLCKSZ-sized segment files in this build of Postgres, it's quite likely, in general, that writes marking the end of a run will straddle "segment boundaries", which is what it takes for buffile.c/logtape.c to throw this error. (The BufFile contract does not allow "holes" in files, but segment boundaries must be crossed as the critical point (just before a tape switch) to actually see this error -- you'd have to be very unlucky to have things line up in exactly the wrong way with 1GiB BufFile segments, but it can still happen.) -- Peter Geoghegan From d4a611e94a3b4504f034fdb31a8ab4a72955b887 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Thu, 5 Jan 2017 11:29:53 -0800 Subject: [PATCH 2/2] Reduce MAX_PHYSICAL_FILESIZE value to BLCKSZ This constant controls the size of underlying fd.c managed files that the BufFile abstraction multiplexes for its callers. A very low value for the constant can be useful for flushing out bugs. --- src/backend/storage/file/buffile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index 7ebd636..ac3e400 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -47,7 +47,7 @@ * The reason is that we'd like large temporary BufFiles to be spread across * multiple tablespaces when available. */ -#define MAX_PHYSICAL_FILESIZE 0x4000 +#define MAX_PHYSICAL_FILESIZE BLCKSZ #define BUFFILE_SEG_SIZE (MAX_PHYSICAL_FILESIZE / BLCKSZ) /* -- 2.7.4 From 389859e1a31fab7fe906176d33660583c12e4277 Mon Sep 17 00:00:00 200