Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On 01/31/2017 01:51 AM, Peter Geoghegan wrote: * If you're writing out any old bit pattern, I suggest using 0x7F bytes rather than nul bytes (I do agree that it does seem worth making this some particular bit pattern, so as to not have to bother with Valgrind suppressions and so on). That pattern immediately gives you a hint on where to look for more information when there is a problem. To me, it suggests "this is something weird". We don't want this to happen very often. Somehow writing zeros still feels more natural to me. That's what you'd get if you just seeked past the end of file, too, for example. I understand your point of using 0x7F to catch bugs, but an all-zeros page is a tell-tale too. So I went with all-zeros, anyway. * I think that you should put the new code into a new function, called ltsAllocBlocksUntil(), or similar -- this can do the BufFileWrite() stuff itself, with a suitable custom defensive elog(ERROR) message. That way, the only new branch needed in ltsWriteBlock() is "if (blocknum > lts->nBlocksWritten)" (maybe use unlikely() too?), and you can make it clear that ltsAllocBlocksUntil() is a rarely needed thing, which seems appropriate. I started to refactor this with something like ltsAllocBlocksUntil(), but in the end, it just seemed like more almost identical code with ltsWriteBlock. We really don't want ltsAllocBlocksUntil() logic to be called very often, and certainly not to write more than 1 or 2 blocks at a time (no more than 1 with current usage). After all, that would mean writing to the same position twice or more, for no reason at all. Maybe note in comments that it's only called at the end of a run in practice, or something to that effect. Keeping writes sequential is very important, to keep logtape block reclamation effective. Added a comment explaining that. Pushed with those little changes. Thanks! - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On Mon, Jan 30, 2017 at 10:01 AM, Peter Geogheganwrote: > Let me take another look at this later today before proceeding. I want > to run it against a custom test suite I've developed. I've done so. Some more thoughts: * I don't think that this is really any less efficient than my patch. It's just that the write of a block happens within ltsWriteBlock() automatically, rather than being an explicit thing that comes at the tail-end of a run. This is more general, but otherwise very similar. The downside is that it will work when something that we might want to break happens, but I do concede that that's worth it. * If you're writing out any old bit pattern, I suggest using 0x7F bytes rather than nul bytes (I do agree that it does seem worth making this some particular bit pattern, so as to not have to bother with Valgrind suppressions and so on). That pattern immediately gives you a hint on where to look for more information when there is a problem. To me, it suggests "this is something weird". We don't want this to happen very often. * I think that the name lts->nBlocksAllocated works better than lts->nBlocksReserved. After all, you are still writing stuff out in a way that respects BufFile limitations around holes, and still reporting that to the user via trace_sort as the number of blocks actually written. * I think that you should put the new code into a new function, called ltsAllocBlocksUntil(), or similar -- this can do the BufFileWrite() stuff itself, with a suitable custom defensive elog(ERROR) message. That way, the only new branch needed in ltsWriteBlock() is "if (blocknum > lts->nBlocksWritten)" (maybe use unlikely() too?), and you can make it clear that ltsAllocBlocksUntil() is a rarely needed thing, which seems appropriate. We really don't want ltsAllocBlocksUntil() logic to be called very often, and certainly not to write more than 1 or 2 blocks at a time (no more than 1 with current usage). After all, that would mean writing to the same position twice or more, for no reason at all. Maybe note in comments that it's only called at the end of a run in practice, or something to that effect. Keeping writes sequential is very important, to keep logtape block reclamation effective. I was actually planning on introducing a distinction between blocks allocated and blocks written anyway, for the benefit of logtape.c parallel sort, where holes can be left behind post-unification (the start of each workers space needs to be aligned to MAX_PHYSICAL_FILESIZE, so multi-block ranges of holes will be common). So, your approach makes sense to me, and I now slightly prefer your patch to my original. Thanks -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On Mon, Jan 30, 2017 at 9:55 AM, Heikki Linnakangaswrote: > It won't make any difference for correctness what bit pattern you write to > "fill the hole", but you have to write something, and zeros seems like a > natural choice. Ah, okay. That's probably fine, then. Let me take another look at this later today before proceeding. I want to run it against a custom test suite I've developed. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On 01/30/2017 07:45 PM, Peter Geoghegan wrote: On Mon, Jan 30, 2017 at 4:04 AM, Heikki Linnakangaswrote: Alternatively, we could fix this with a small change in ltsWriteBlock(), see attached patch. When we're about to create a hole in the file, write all-zero blocks to avoid the creating hole, before the block itself. That's not quite as efficient as writing the actual block contents into the hole, which avoids having to write it later, but probably won't make any measurable difference in practice. I'm inclined to do that, because it relaxes the rules on what you're allowed to do, in what order, which makes this more robust in general. Why write an all-zero block, rather than arbitrary garbage, which is all that BufFile really requires? I think it's because sizeof(int) nul bytes represent the end of a run for tuplesort's purposes. That makes me doubtful that this is any more robust or general than what I proposed. So, I don't have a problem with the performance implications of doing this, which should be minor, but I'm concerned that it appears to be more general than it actually is. It won't make any difference for correctness what bit pattern you write to "fill the hole", but you have to write something, and zeros seems like a natural choice. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On Mon, Jan 30, 2017 at 4:04 AM, Heikki Linnakangaswrote: > Hmm. The LogicalTapeEndWriting() function is very similar to the > LogicalTapePause() function I had in the pause/resume patch. They both flush > the last block to disk. The difference is that LogicalTapePause() also > free'd the buffer, and read back the last block from the disk if you > continued writing, while LogicalTapeEndWriting() keeps a copy of it in > memory. Right. > With the proposed fix (or with the pause/resume), you can only write to a > single tape at a time. Not a problem at the moment, but something to > consider. At least, would need more comments to make that more clear, and an > Assert would be nice. Agreed on that. > Alternatively, we could fix this with a small change in ltsWriteBlock(), see > attached patch. When we're about to create a hole in the file, write > all-zero blocks to avoid the creating hole, before the block itself. That's > not quite as efficient as writing the actual block contents into the hole, > which avoids having to write it later, but probably won't make any > measurable difference in practice. I'm inclined to do that, because it > relaxes the rules on what you're allowed to do, in what order, which makes > this more robust in general. Why write an all-zero block, rather than arbitrary garbage, which is all that BufFile really requires? I think it's because sizeof(int) nul bytes represent the end of a run for tuplesort's purposes. That makes me doubtful that this is any more robust or general than what I proposed. So, I don't have a problem with the performance implications of doing this, which should be minor, but I'm concerned that it appears to be more general than it actually is. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
On 01/07/2017 12:33 AM, Peter Geoghegan wrote: Previously, all calls to ltsGetFreeBlock() were immediately followed by a corresponding call to ltsWriteBlock(); we wrote out the newly-allocated-in-first-pass block there and then. It's a good idea for a corresponding ltsWriteBlock() call to happen immediately, and it's *essential* for it to happen before any later block is written during the first write pass (when tuples are initially dumped out to form runs), since, as noted above ltsWriteBlock(): /* * Write a block-sized buffer to the specified block of the underlying file. * * NB: should not attempt to write beyond current end of file (ie, create * "holes" in file), since BufFile doesn't allow that. The first write pass * must write blocks sequentially. ... I completely missed that rule :-(. However, a "write beyond current end of file" now seems to actually be attempted at times, resulting in an arcane error during sorting. This is more or less because LogicalTapeWrite() doesn't heed the warning from ltsGetFreeBlock(), as seen here: while (size > 0) { if (lt->pos >= TapeBlockPayloadSize) { ... /* * First allocate the next block, so that we can store it in the * 'next' pointer of this block. */ nextBlockNumber = ltsGetFreeBlock(lts); /* set the next-pointer and dump the current block. */ TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber; ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer); ... } ... } Note that LogicalTapeWrite() now allocates each block (calls ltsGetFreeBlock()) one block in advance of current block, immediately before *current* block is written (not the next block/block just allocated). So, the corresponding ltsWriteBlock() call *must* happen at an unspecified time in the future, typically the next time control ends up at exactly the same point, where the block that was "next" becomes "current". I'm not about to argue that we should go back to following this ltsGetFreeBlock() rule, though; I can see why Heikki refactored LogicalTapeWrite() to allocate blocks a block in advance. Still, we need to be more careful in avoiding the underlying problem that the ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-* has logtape.c be sure to write out a tape's buffer every time tuplesort ends a run. Hmm. The LogicalTapeEndWriting() function is very similar to the LogicalTapePause() function I had in the pause/resume patch. They both flush the last block to disk. The difference is that LogicalTapePause() also free'd the buffer, and read back the last block from the disk if you continued writing, while LogicalTapeEndWriting() keeps a copy of it in memory. With the proposed fix (or with the pause/resume), you can only write to a single tape at a time. Not a problem at the moment, but something to consider. At least, would need more comments to make that more clear, and an Assert would be nice. Alternatively, we could fix this with a small change in ltsWriteBlock(), see attached patch. When we're about to create a hole in the file, write all-zero blocks to avoid the creating hole, before the block itself. That's not quite as efficient as writing the actual block contents into the hole, which avoids having to write it later, but probably won't make any measurable difference in practice. I'm inclined to do that, because it relaxes the rules on what you're allowed to do, in what order, which makes this more robust in general. We coul *also* have something like LogicalTapeEndWriting() or LogicalTapePause(), for efficiency, but it doesn't seem that important. I have a test case. > ... Thanks for the analysis! - Heikki diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c index 83e9424537..4a64c53af5 100644 --- a/src/backend/utils/sort/logtape.c +++ b/src/backend/utils/sort/logtape.c @@ -152,7 +152,17 @@ typedef struct LogicalTape struct LogicalTapeSet { BufFile*pfile; /* underlying file for whole tape set */ - longnFileBlocks;/* # of blocks used in underlying file */ + + /* +* File size tracking. nBlocksWritten is the size of the underlying file, +* in BLCKSZ blocks. nBlocksReserved is the number of blocks allocated by +* ltsGetFreeBlock(), and it is always greater than or equal to +* nBlocksWritten. Blocks between nBlocksReserved and nBlocksWritten are +* blocks that have been allocated for a tape, but have not been written +* to the underlying file yet. +*/ + longnBlocksReserved;/* # of blocks allocated by ltsGetFreeBlock */ + longnBlocksWritten; /* # of blocks used in underlying file */ /* * We store the numbers of recycled-and-available