Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

2017-02-01 Thread Heikki Linnakangas

On 01/31/2017 01:51 AM, Peter Geoghegan wrote:

* If you're writing out any old bit pattern, I suggest using 0x7F
bytes rather than nul bytes (I do agree that it does seem worth making
this some particular bit pattern, so as to not have to bother with
Valgrind suppressions and so on). That pattern immediately gives you a
hint on where to look for more information when there is a problem. To
me, it suggests "this is something weird". We don't want this to
happen very often.


Somehow writing zeros still feels more natural to me. That's what you'd 
get if you just seeked past the end of file, too, for example. I 
understand your point of using 0x7F to catch bugs, but an all-zeros page 
is a tell-tale too. So I went with all-zeros, anyway.



* I think that you should put the new code into a new function, called
ltsAllocBlocksUntil(), or similar -- this can do the BufFileWrite()
stuff itself, with a suitable custom defensive elog(ERROR) message.
That way, the only new branch needed in ltsWriteBlock() is "if
(blocknum > lts->nBlocksWritten)" (maybe use unlikely() too?), and you
can make it clear that ltsAllocBlocksUntil() is a rarely needed thing,
which seems appropriate.


I started to refactor this with something like ltsAllocBlocksUntil(), 
but in the end, it just seemed like more almost identical code with 
ltsWriteBlock.



We really don't want ltsAllocBlocksUntil() logic to be called very
often, and certainly not to write more than 1 or 2 blocks at a time
(no more than 1 with current usage). After all, that would mean
writing to the same position twice or more, for no reason at all.
Maybe note in comments that it's only called at the end of a run in
practice, or something to that effect. Keeping writes sequential is
very important, to keep logtape block reclamation effective.


Added a comment explaining that.

Pushed with those little changes. Thanks!

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

2017-01-30 Thread Peter Geoghegan
On Mon, Jan 30, 2017 at 10:01 AM, Peter Geoghegan  wrote:
> Let me take another look at this later today before proceeding. I want
> to run it against a custom test suite I've developed.

I've done so. Some more thoughts:

* I don't think that this is really any less efficient than my patch.
It's just that the write of a block happens within ltsWriteBlock()
automatically, rather than being an explicit thing that comes at the
tail-end of a run. This is more general, but otherwise very similar.
The downside is that it will work when something that we might want to
break happens, but I do concede that that's worth it.

* If you're writing out any old bit pattern, I suggest using 0x7F
bytes rather than nul bytes (I do agree that it does seem worth making
this some particular bit pattern, so as to not have to bother with
Valgrind suppressions and so on). That pattern immediately gives you a
hint on where to look for more information when there is a problem. To
me, it suggests "this is something weird". We don't want this to
happen very often.

* I think that the name lts->nBlocksAllocated works better than
lts->nBlocksReserved. After all, you are still writing stuff out in a
way that respects BufFile limitations around holes, and still
reporting that to the user via trace_sort as the number of blocks
actually written.

* I think that you should put the new code into a new function, called
ltsAllocBlocksUntil(), or similar -- this can do the BufFileWrite()
stuff itself, with a suitable custom defensive elog(ERROR) message.
That way, the only new branch needed in ltsWriteBlock() is "if
(blocknum > lts->nBlocksWritten)" (maybe use unlikely() too?), and you
can make it clear that ltsAllocBlocksUntil() is a rarely needed thing,
which seems appropriate.

We really don't want ltsAllocBlocksUntil() logic to be called very
often, and certainly not to write more than 1 or 2 blocks at a time
(no more than 1 with current usage). After all, that would mean
writing to the same position twice or more, for no reason at all.
Maybe note in comments that it's only called at the end of a run in
practice, or something to that effect. Keeping writes sequential is
very important, to keep logtape block reclamation effective.

I was actually planning on introducing a distinction between blocks
allocated and blocks written anyway, for the benefit of logtape.c
parallel sort, where holes can be left behind post-unification (the
start of each workers space needs to be aligned to
MAX_PHYSICAL_FILESIZE, so multi-block ranges of holes will be common).
So, your approach makes sense to me, and I now slightly prefer your
patch to my original.

Thanks
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

2017-01-30 Thread Peter Geoghegan
On Mon, Jan 30, 2017 at 9:55 AM, Heikki Linnakangas  wrote:
> It won't make any difference for correctness what bit pattern you write to
> "fill the hole", but you have to write something, and zeros seems like a
> natural choice.

Ah, okay. That's probably fine, then.

Let me take another look at this later today before proceeding. I want
to run it against a custom test suite I've developed.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

2017-01-30 Thread Heikki Linnakangas

On 01/30/2017 07:45 PM, Peter Geoghegan wrote:

On Mon, Jan 30, 2017 at 4:04 AM, Heikki Linnakangas  wrote:

Alternatively, we could fix this with a small change in ltsWriteBlock(), see
attached patch. When we're about to create a hole in the file, write
all-zero blocks to avoid the creating hole, before the block itself. That's
not quite as efficient as writing the actual block contents into the hole,
which avoids having to write it later, but probably won't make any
measurable difference in practice. I'm inclined to do that, because it
relaxes the rules on what you're allowed to do, in what order, which makes
this more robust in general.


Why write an all-zero block, rather than arbitrary garbage, which is
all that BufFile really requires? I think it's because sizeof(int) nul
bytes represent the end of a run for tuplesort's purposes. That makes
me doubtful that this is any more robust or general than what I
proposed. So, I don't have a problem with the performance implications
of doing this, which should be minor, but I'm concerned that it
appears to be more general than it actually is.


It won't make any difference for correctness what bit pattern you write 
to "fill the hole", but you have to write something, and zeros seems 
like a natural choice.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

2017-01-30 Thread Peter Geoghegan
On Mon, Jan 30, 2017 at 4:04 AM, Heikki Linnakangas  wrote:
> Hmm. The LogicalTapeEndWriting() function is very similar to the
> LogicalTapePause() function I had in the pause/resume patch. They both flush
> the last block to disk. The difference is that LogicalTapePause() also
> free'd the buffer, and read back the last block from the disk if you
> continued writing, while LogicalTapeEndWriting() keeps a copy of it in
> memory.

Right.

> With the proposed fix (or with the pause/resume), you can only write to a
> single tape at a time. Not a problem at the moment, but something to
> consider. At least, would need more comments to make that more clear, and an
> Assert would be nice.

Agreed on that.

> Alternatively, we could fix this with a small change in ltsWriteBlock(), see
> attached patch. When we're about to create a hole in the file, write
> all-zero blocks to avoid the creating hole, before the block itself. That's
> not quite as efficient as writing the actual block contents into the hole,
> which avoids having to write it later, but probably won't make any
> measurable difference in practice. I'm inclined to do that, because it
> relaxes the rules on what you're allowed to do, in what order, which makes
> this more robust in general.

Why write an all-zero block, rather than arbitrary garbage, which is
all that BufFile really requires? I think it's because sizeof(int) nul
bytes represent the end of a run for tuplesort's purposes. That makes
me doubtful that this is any more robust or general than what I
proposed. So, I don't have a problem with the performance implications
of doing this, which should be minor, but I'm concerned that it
appears to be more general than it actually is.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

2017-01-30 Thread Heikki Linnakangas

On 01/07/2017 12:33 AM, Peter Geoghegan wrote:

Previously, all calls to ltsGetFreeBlock() were immediately followed
by a corresponding call to ltsWriteBlock(); we wrote out the
newly-allocated-in-first-pass block there and then. It's a good idea
for a corresponding ltsWriteBlock() call to happen immediately, and
it's *essential* for it to happen before any later block is written
during the first write pass (when tuples are initially dumped out to
form runs), since, as noted above ltsWriteBlock():

/*
 * Write a block-sized buffer to the specified block of the underlying file.
 *
 * NB: should not attempt to write beyond current end of file (ie, create
 * "holes" in file), since BufFile doesn't allow that.  The first write pass
 * must write blocks sequentially.
...


I completely missed that rule :-(.


However, a "write beyond current end of file" now seems to actually be
attempted at times, resulting in an arcane error during sorting. This
is more or less because LogicalTapeWrite() doesn't heed the warning
from ltsGetFreeBlock(), as seen here:

while (size > 0)
{
if (lt->pos >= TapeBlockPayloadSize)
{
...

/*
 * First allocate the next block, so that we can store it in the
 * 'next' pointer of this block.
 */
nextBlockNumber = ltsGetFreeBlock(lts);

/* set the next-pointer and dump the current block. */
TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
...
}
...

}

Note that LogicalTapeWrite() now allocates each block (calls
ltsGetFreeBlock()) one block in advance of current block, immediately
before *current* block is written (not the next block/block just
allocated). So, the corresponding ltsWriteBlock() call *must* happen
at an unspecified time in the future, typically the next time control
ends up at exactly the same point, where the block that was "next"
becomes "current".

I'm not about to argue that we should go back to following this
ltsGetFreeBlock() rule, though; I can see why Heikki refactored
LogicalTapeWrite() to allocate blocks a block in advance. Still, we
need to be more careful in avoiding the underlying problem that the
ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
has logtape.c be sure to write out a tape's buffer every time
tuplesort ends a run.


Hmm. The LogicalTapeEndWriting() function is very similar to the 
LogicalTapePause() function I had in the pause/resume patch. They both 
flush the last block to disk. The difference is that LogicalTapePause() 
also free'd the buffer, and read back the last block from the disk if 
you continued writing, while LogicalTapeEndWriting() keeps a copy of it 
in memory.


With the proposed fix (or with the pause/resume), you can only write to 
a single tape at a time. Not a problem at the moment, but something to 
consider. At least, would need more comments to make that more clear, 
and an Assert would be nice.


Alternatively, we could fix this with a small change in ltsWriteBlock(), 
see attached patch. When we're about to create a hole in the file, write 
all-zero blocks to avoid the creating hole, before the block itself. 
That's not quite as efficient as writing the actual block contents into 
the hole, which avoids having to write it later, but probably won't make 
any measurable difference in practice. I'm inclined to do that, because 
it relaxes the rules on what you're allowed to do, in what order, which 
makes this more robust in general. We coul *also* have something like 
LogicalTapeEndWriting() or LogicalTapePause(), for efficiency, but it 
doesn't seem that important.



I have a test case.

> ...

Thanks for the analysis!

- Heikki

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 83e9424537..4a64c53af5 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -152,7 +152,17 @@ typedef struct LogicalTape
 struct LogicalTapeSet
 {
BufFile*pfile;  /* underlying file for whole 
tape set */
-   longnFileBlocks;/* # of blocks used in underlying file 
*/
+
+   /*
+* File size tracking. nBlocksWritten is the size of the underlying 
file,
+* in BLCKSZ blocks. nBlocksReserved is the number of blocks allocated 
by
+* ltsGetFreeBlock(), and it is always greater than or equal to
+* nBlocksWritten. Blocks between nBlocksReserved and nBlocksWritten are
+* blocks that have been allocated for a tape, but have not been written
+* to the underlying file yet.
+*/
+   longnBlocksReserved;/* # of blocks allocated by 
ltsGetFreeBlock */
+   longnBlocksWritten; /* # of blocks used in 
underlying file */
 
/*
 * We store the numbers of recycled-and-available