Re: Unlogged relation copy is not fsync'd

2024-01-05 Thread Robert Haas
On Fri, Sep 15, 2023 at 7:47 AM Heikki Linnakangas  wrote:
> Thinking about this some more, I think this is still not 100% correct,
> even with the patch I posted earlier:

This is marked as needing review, but that doesn't appear to be
correct, because there's this comment, indicating that the patch
requires re-work, and there's also two emails from Noah on the thread
providing further feedback. So it seems this has been waiting for
Heikki or someone else to have time to work it for the last 3 months.

Hence, marking RwF for now; if someone gets back to it, please reopen.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Unlogged relation copy is not fsync'd

2023-10-07 Thread Noah Misch
On Wed, Sep 20, 2023 at 11:22:10PM -0700, Noah Misch wrote:
> On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> > On 05/09/2023 21:20, Robert Haas wrote:
> 
> > Thinking about this some more, I think this is still not 100% correct, even
> > with the patch I posted earlier:
> > 
> > >   /*
> > >* When we WAL-logged rel pages, we must nonetheless fsync them.  The
> > >* reason is that since we're copying outside shared buffers, a 
> > > CHECKPOINT
> > >* occurring during the copy has no way to flush the previously written
> > >* data to disk (indeed it won't know the new rel even exists).  A crash
> > >* later on would replay WAL from the checkpoint, therefore it wouldn't
> > >* replay our earlier WAL entries. If we do not fsync those pages here,
> > >* they might still not be on disk when the crash occurs.
> > >*/
> > >   if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > >   smgrimmedsync(dst, forkNum);
> > 
> > Let's walk through each case one by one:
> > 
> > 1. Temporary table. No fsync() needed. This is correct.
> > 
> > 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> > also because we bypassed the buffer manager. Correct.
> 
> Agreed.
> 
> > 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> > it, because we bypassed the buffer manager. Like the comment explains. This
> > is now correct with the patch.
> 
> This has two subtypes:
> 
> 3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
> you wrote.
> 
> 3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
> fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
> AddPendingSync().  (RelationCreateStorage() could start calling
> AddPendingSync() for this case.  I think we chose not to do that because there
> will never be additional writes to the init fork, and smgrDoPendingSyncs()
> would send the fork to FlushRelationsAllBuffers() even though the fork will
> never appear in shared buffers.  On the other hand, grouping the sync with the
> other end-of-xact syncs could improve efficiency for some filesystems.  Also,
> the number of distinguishable cases is unpleasantly high.)
> 
> > 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> > WAL-logged it, because we bypassed the buffer manager. Like the comment
> > explains. Correct.
> > 
> > 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> > it's a new relation, so we have a sync request pending for it. (An assertion
> > for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> > we will either fsync it or we will WAL-log all the pages if it was a small
> > relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> > WAL-log the pages, we have the same race condition that's explained in the
> > comment, because we bypassed the buffer manager:
> > 
> > 1. RelationCopyStorage() copies some of the pages.
> > 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> > dirty segment when the relation was created)
> > 3. RelationCopyStorage() copies the rest of the pages.
> > 4. smgrDoPendingSyncs() WAL-logs all the pages.
> 
> smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
> loads each page into the buffer manager and calls MarkBufferDirty() on each.
> Hence, ...
> 
> > 5. Another checkpoint happens. It does *not* fsync the relation.
> 
> ... I think this will fsync the relation.  No?
> 
> > 6. Crash.

While we're cataloging gaps, I think the middle sentence is incorrect in the
following heapam_relation_set_new_filelocator() comment:

/*
 * If required, set up an init fork for an unlogged table so that it can
 * be correctly reinitialized on restart.  Recovery may remove it while
 * replaying, for example, an XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
 * record.  Therefore, logging is necessary even if wal_level=minimal.
 */
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
   rel->rd_rel->relkind == RELKIND_MATVIEW ||
   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrlocator, INIT_FORKNUM);
}

XLOG_DBASE_CREATE_FILE_COPY last had that problem before fbcbc5d (2005-06)
made it issue a checkpoint.  XLOG_DBASE_CREATE_WAL_LOG never had that problem.
XLOG_TBLSPC_CREATE last had that problem before 97ddda8a82 (2021-08).  In
general, if we reintroduced such a bug, it would affect all new rels under
wal_level=minimal, not just init forks.  Having said all that,
log_smgrcreate() calls are never conditional on wal_level=minimal; the above
code is correct.




Re: Unlogged relation copy is not fsync'd

2023-09-21 Thread Noah Misch
On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> On 05/09/2023 21:20, Robert Haas wrote:

> Thinking about this some more, I think this is still not 100% correct, even
> with the patch I posted earlier:
> 
> > /*
> >  * When we WAL-logged rel pages, we must nonetheless fsync them.  The
> >  * reason is that since we're copying outside shared buffers, a 
> > CHECKPOINT
> >  * occurring during the copy has no way to flush the previously written
> >  * data to disk (indeed it won't know the new rel even exists).  A crash
> >  * later on would replay WAL from the checkpoint, therefore it wouldn't
> >  * replay our earlier WAL entries. If we do not fsync those pages here,
> >  * they might still not be on disk when the crash occurs.
> >  */
> > if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > smgrimmedsync(dst, forkNum);
> 
> Let's walk through each case one by one:
> 
> 1. Temporary table. No fsync() needed. This is correct.
> 
> 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> also because we bypassed the buffer manager. Correct.

Agreed.

> 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> it, because we bypassed the buffer manager. Like the comment explains. This
> is now correct with the patch.

This has two subtypes:

3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
you wrote.

3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
AddPendingSync().  (RelationCreateStorage() could start calling
AddPendingSync() for this case.  I think we chose not to do that because there
will never be additional writes to the init fork, and smgrDoPendingSyncs()
would send the fork to FlushRelationsAllBuffers() even though the fork will
never appear in shared buffers.  On the other hand, grouping the sync with the
other end-of-xact syncs could improve efficiency for some filesystems.  Also,
the number of distinguishable cases is unpleasantly high.)

> 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> WAL-logged it, because we bypassed the buffer manager. Like the comment
> explains. Correct.
> 
> 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> it's a new relation, so we have a sync request pending for it. (An assertion
> for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> we will either fsync it or we will WAL-log all the pages if it was a small
> relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> WAL-log the pages, we have the same race condition that's explained in the
> comment, because we bypassed the buffer manager:
> 
> 1. RelationCopyStorage() copies some of the pages.
> 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> dirty segment when the relation was created)
> 3. RelationCopyStorage() copies the rest of the pages.
> 4. smgrDoPendingSyncs() WAL-logs all the pages.

smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
loads each page into the buffer manager and calls MarkBufferDirty() on each.
Hence, ...

> 5. Another checkpoint happens. It does *not* fsync the relation.

... I think this will fsync the relation.  No?

> 6. Crash.




Re: Unlogged relation copy is not fsync'd

2023-09-15 Thread Heikki Linnakangas

On 05/09/2023 21:20, Robert Haas wrote:

In other words, somehow it feels like we ought to be trying to defer
the fsync here until a clean shutdown actually occurs, instead of
performing it immediately.


+1


Admittedly, the bookkeeping seems like a problem, so maybe this is
the best we can do, but it's clearly worse than what we do in other
cases.
I think we can do it, I have been working on a patch along those lines 
on the side. But I want to focus on a smaller, backpatchable fix in this 
thread.



Thinking about this some more, I think this is still not 100% correct, 
even with the patch I posted earlier:



/*
 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
 * reason is that since we're copying outside shared buffers, a 
CHECKPOINT
 * occurring during the copy has no way to flush the previously written
 * data to disk (indeed it won't know the new rel even exists).  A crash
 * later on would replay WAL from the checkpoint, therefore it wouldn't
 * replay our earlier WAL entries. If we do not fsync those pages here,
 * they might still not be on disk when the crash occurs.
 */
if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
smgrimmedsync(dst, forkNum);


Let's walk through each case one by one:

1. Temporary table. No fsync() needed. This is correct.

2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, 
and also because we bypassed the buffer manager. Correct.


3. Unlogged rel, init fork. Needs to be fsync'd, even though we 
WAL-logged it, because we bypassed the buffer manager. Like the comment 
explains. This is now correct with the patch.


4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we 
WAL-logged it, because we bypassed the buffer manager. Like the comment 
explains. Correct.


5. Permanent rel, use_wal == false. We skip fsync, because it means that 
it's a new relation, so we have a sync request pending for it. (An 
assertion for that would be nice). At the end of transaction, in 
smgrDoPendingSyncs(), we will either fsync it or we will WAL-log all the 
pages if it was a small relation. I believe this is *wrong*. If 
smgrDoPendingSyncs() decides to WAL-log the pages, we have the same race 
condition that's explained in the comment, because we bypassed the 
buffer manager:


1. RelationCopyStorage() copies some of the pages.
2. Checkpoint happens, which fsyncs the relation (smgrcreate() 
registered a dirty segment when the relation was created)

3. RelationCopyStorage() copies the rest of the pages.
4. smgrDoPendingSyncs() WAL-logs all the pages.
5. Another checkpoint happens. It does *not* fsync the relation.
6. Crash.

WAL replay will not see the WAL-logged pages, because they were 
WAL-logged before the last checkpoint. And the contents were not fsync'd 
either.


In other words, we must do smgrimmedsync() here for permanent relations, 
even if use_wal==false, because we bypassed the buffer manager. Same 
reason we need to do it with use_wal==true.


For a backportable fix, I think we should change this to only exempt 
temporary tables, and call smgrimmedsync() for all other cases. Maybe 
would be safe to skip it in some cases, but it feels too dangerous to be 
clever here. The other similar callers of smgrimmedsync() in nbtsort.c, 
gistbuild.c, and rewriteheap.c, need similar treatment.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Unlogged relation copy is not fsync'd

2023-09-15 Thread Heikki Linnakangas

On 04/09/2023 16:59, Melanie Plageman wrote:

The patch looks reasonable to me. Is this [1] case in hash index build
that I reported but didn't take the time to reproduce similar?

[1] 
https://www.postgresql.org/message-id/CAAKRu_bPc81M121pOEU7W%3D%2BwSWEebiLnrie4NpaFC%2BkWATFtSA%40mail.gmail.com


Yes, I think you're right. Any caller of smgrwrite() must either:

a) Call smgrimmedsync(), after smgrwrite() and before the relation is 
visible to other transactions. Regardless of the 'skipFsync' parameter! 
I don't think this is ever completely safe unless it's a new relation. 
Like you pointed out with the hash index case. Or:


b) Hold the buffer lock, so that if a checkpoint happens, it cannot 
"race past" the page without seeing the sync request.


The comment on smgwrite() doesn't make this clear. It talks about 
'skipFsync', and gives the impression that as long as you pass 
'skipFsync=false', you don't need to worry about fsyncing. But that is 
not true. Even in these bulk loading cases where we currently call 
smgrimmedsync(), we would *still* need to call smgrimmedsync() if we 
used 'skipFsync=false'.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Unlogged relation copy is not fsync'd

2023-09-06 Thread Michael Paquier
On Tue, Sep 05, 2023 at 02:20:18PM -0400, Robert Haas wrote:
> The general rule throughout the system is that the init-fork of an
> unlogged relation is treated the same as a permanent relation: it is
> WAL-logged and fsyncd. But the other forks of an unlogged relation are
> neither WAL-logged nor fsync'd ... except in the case of a clean
> shutdown, when we fsync even that data.
> 
> In other words, somehow it feels like we ought to be trying to defer
> the fsync here until a clean shutdown actually occurs, instead of
> performing it immediately. Admittedly, the bookkeeping seems like a
> problem, so maybe this is the best we can do, but it's clearly worse
> than what we do in other cases.

That's where we usually rely more on RegisterSyncRequest() and
register_dirty_segment() so as the flush of the dirty segments can
happen when they should, but we don't go through the shared buffers
when copying all the forks of a relation file across tablespaces..
--
Michael


signature.asc
Description: PGP signature


Re: Unlogged relation copy is not fsync'd

2023-09-05 Thread Robert Haas
On Fri, Aug 25, 2023 at 8:47 AM Heikki Linnakangas  wrote:
> 1. Create an unlogged table
> 2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls
> RelationCopyStorage
> 3. a checkpoint happens while the command is running
> 4. After the ALTER TABLE has finished, shut down postgres cleanly.
> 5. Lose power
>
> When you recover, the unlogged table is not reset, because it was a
> clean postgres shutdown. But the part of the file that was copied after
> the checkpoint at step 3 was never fsync'd, so part of the file is lost.
> I was able to reproduce with a VM that I kill to simulate step 5.
>
> This is the same scenario that the smgrimmedsync() call above protects
> from for WAL-logged relations. But we got the condition wrong: instead
> of worrying about the init fork, we need to call smgrimmedsync() for all
> *other* forks of an unlogged relation.
>
> Fortunately the fix is trivial, see attached.

The general rule throughout the system is that the init-fork of an
unlogged relation is treated the same as a permanent relation: it is
WAL-logged and fsyncd. But the other forks of an unlogged relation are
neither WAL-logged nor fsync'd ... except in the case of a clean
shutdown, when we fsync even that data.

In other words, somehow it feels like we ought to be trying to defer
the fsync here until a clean shutdown actually occurs, instead of
performing it immediately. Admittedly, the bookkeeping seems like a
problem, so maybe this is the best we can do, but it's clearly worse
than what we do in other cases.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Unlogged relation copy is not fsync'd

2023-09-04 Thread Melanie Plageman
On Fri, Aug 25, 2023 at 8:47 AM Heikki Linnakangas  wrote:
>
> I noticed another missing fsync() with unlogged tables, similar to the
> one at [1].
>
> RelationCopyStorage does this:
>
> >   /*
> >* When we WAL-logged rel pages, we must nonetheless fsync them.  The
> >* reason is that since we're copying outside shared buffers, a 
> > CHECKPOINT
> >* occurring during the copy has no way to flush the previously 
> > written
> >* data to disk (indeed it won't know the new rel even exists).  A 
> > crash
> >* later on would replay WAL from the checkpoint, therefore it 
> > wouldn't
> >* replay our earlier WAL entries. If we do not fsync those pages 
> > here,
> >* they might still not be on disk when the crash occurs.
> >*/
> >   if (use_wal ||  copying_initfork)
> >   smgrimmedsync(dst, forkNum);
>
> That 'copying_initfork' condition is wrong. The first hint of that is
> that 'use_wal' is always true for an init fork. I believe this was meant
> to check for 'relpersistence == RELPERSISTENCE_UNLOGGED'. Otherwise,
> this bad thing can happen:
>
> 1. Create an unlogged table
> 2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls
> RelationCopyStorage
> 3. a checkpoint happens while the command is running
> 4. After the ALTER TABLE has finished, shut down postgres cleanly.
> 5. Lose power
>
> When you recover, the unlogged table is not reset, because it was a
> clean postgres shutdown. But the part of the file that was copied after
> the checkpoint at step 3 was never fsync'd, so part of the file is lost.
> I was able to reproduce with a VM that I kill to simulate step 5.
>
> This is the same scenario that the smgrimmedsync() call above protects
> from for WAL-logged relations. But we got the condition wrong: instead
> of worrying about the init fork, we need to call smgrimmedsync() for all
> *other* forks of an unlogged relation.
>
> Fortunately the fix is trivial, see attached. I suspect we have similar
> problems in a few other places, though. end_heap_rewrite(), _bt_load(),
> and gist_indexsortbuild have a similar-looking smgrimmedsync() calls,
> with no consideration for unlogged relations at all. I haven't tested
> those, but they look wrong to me.

The patch looks reasonable to me. Is this [1] case in hash index build
that I reported but didn't take the time to reproduce similar?

- Melanie

[1] 
https://www.postgresql.org/message-id/CAAKRu_bPc81M121pOEU7W%3D%2BwSWEebiLnrie4NpaFC%2BkWATFtSA%40mail.gmail.com




Re: Unlogged relation copy is not fsync'd

2023-09-04 Thread Michael Paquier
On Fri, Aug 25, 2023 at 03:47:27PM +0300, Heikki Linnakangas wrote:
> That 'copying_initfork' condition is wrong. The first hint of that is that
> 'use_wal' is always true for an init fork. I believe this was meant to check
> for 'relpersistence == RELPERSISTENCE_UNLOGGED'. Otherwise, this bad thing
> can happen:
> 
> 1. Create an unlogged table
> 2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls
> RelationCopyStorage
> 3. a checkpoint happens while the command is running
> 4. After the ALTER TABLE has finished, shut down postgres cleanly.
> 5. Lose power
> 
> When you recover, the unlogged table is not reset, because it was a clean
> postgres shutdown. But the part of the file that was copied after the
> checkpoint at step 3 was never fsync'd, so part of the file is lost. I was
> able to reproduce with a VM that I kill to simulate step 5.

Oops.

The comment at the top of smgrimmedsync() looks incorrect now to me
now.  When copying the data from something else than an init fork, the
relation pages are not WAL-logged for an unlogged relation.

> Fortunately the fix is trivial, see attached. I suspect we have similar
> problems in a few other places, though. end_heap_rewrite(), _bt_load(), and
> gist_indexsortbuild have a similar-looking smgrimmedsync() calls, with no
> consideration for unlogged relations at all. I haven't tested those, but
> they look wrong to me.

Oops ^ N.  And end_heap_rewrite() mentions directly
RelationCopyStorage().. 
--
Michael


signature.asc
Description: PGP signature


Unlogged relation copy is not fsync'd

2023-08-25 Thread Heikki Linnakangas
I noticed another missing fsync() with unlogged tables, similar to the 
one at [1].


RelationCopyStorage does this:


/*
 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
 * reason is that since we're copying outside shared buffers, a 
CHECKPOINT
 * occurring during the copy has no way to flush the previously written
 * data to disk (indeed it won't know the new rel even exists).  A crash
 * later on would replay WAL from the checkpoint, therefore it wouldn't
 * replay our earlier WAL entries. If we do not fsync those pages here,
 * they might still not be on disk when the crash occurs.
 */
if (use_wal ||  copying_initfork)
smgrimmedsync(dst, forkNum);


That 'copying_initfork' condition is wrong. The first hint of that is 
that 'use_wal' is always true for an init fork. I believe this was meant 
to check for 'relpersistence == RELPERSISTENCE_UNLOGGED'. Otherwise, 
this bad thing can happen:


1. Create an unlogged table
2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls 
RelationCopyStorage

3. a checkpoint happens while the command is running
4. After the ALTER TABLE has finished, shut down postgres cleanly.
5. Lose power

When you recover, the unlogged table is not reset, because it was a 
clean postgres shutdown. But the part of the file that was copied after 
the checkpoint at step 3 was never fsync'd, so part of the file is lost. 
I was able to reproduce with a VM that I kill to simulate step 5.


This is the same scenario that the smgrimmedsync() call above protects 
from for WAL-logged relations. But we got the condition wrong: instead 
of worrying about the init fork, we need to call smgrimmedsync() for all 
*other* forks of an unlogged relation.


Fortunately the fix is trivial, see attached. I suspect we have similar 
problems in a few other places, though. end_heap_rewrite(), _bt_load(), 
and gist_indexsortbuild have a similar-looking smgrimmedsync() calls, 
with no consideration for unlogged relations at all. I haven't tested 
those, but they look wrong to me.


I'm also attaching the scripts I used to reproduce this, although they 
will require some manual fiddling to run.


[1] 
https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi


--
Heikki Linnakangas
Neon (https://neon.tech)diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2add053489..ebb80fa822 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,7 +533,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	 * replay our earlier WAL entries. If we do not fsync those pages here,
 	 * they might still not be on disk when the crash occurs.
 	 */
-	if (use_wal || copying_initfork)
+	if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
 		smgrimmedsync(dst, forkNum);
 }
 


unlogged-fsynctest.sh
Description: application/shellscript


unlogged-verify.sh
Description: application/shellscript