On Wed, Nov 26, 2014 at 4:13 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:

> On Tue, Nov 25, 2014 at 9:30 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> Using both 9.2.9 and 9_2_STABLE 9b468bcec15f1ea7433d4, I get a fairly
>> reproducible startup failure.
>>
>> What I was doing is restore a database from a base backup and roll it
>> forward with a recovery.conf until it completes and starts up.  Then I
>> truncate an unlogged table and start repopulating it with a large slow
>> 'insert into ...select from' query.
>>
>> While that is running, if I reboot the server, when it restarts it does
>> not come back up initially.
>>
>> This is what I get in the log from the attempted restart:
>>
>> PST LOG:  database system was interrupted; last known up at 2014-11-25
>> 15:40:33 PST
>> PST LOG:  database system was not properly shut down; automatic recovery
>> in progress
>> PST LOG:  redo starts at 84/EF000080
>> PST LOG:  record with zero length at 84/EF09AE18
>> PST LOG:  redo done at 84/EF09AD28
>> PST LOG:  last completed transaction was at log time 2014-11-25
>> 15:42:09.173599-08
>> PST LOG:  checkpoint starting: end-of-recovery immediate
>> PST LOG:  checkpoint complete: wrote 103 buffers (0.2%); 0 transaction
>> log file(s) added, 246 removed, 7 recycled; write=0.002 s, sync=0.020 s,
>> total=0.526 s; sync files=51, longest=0.003 s, average=0.000 s
>> PST FATAL:  could not create file "base/16416/59288": File exists
>> PST LOG:  startup process (PID 2472) exited with exit code 1
>> PST LOG:  aborting startup due to startup process failure
>>
>> oid2name doesn't show me any 59288, so I think it is the new copy of the
>> unlogged table which is being created at the moment of the reboot.
>>
>
> 59288 is not the new (uncommitted) unlogged table itself, but is the new
> (uncommitted) toast table associated with that unlogged table.
>
> immediately before the 'sudo /sbin/reboot', while the 'truncate
> unlogged_table; insert into unlogged_table from select...' is running, the
> two files below exist and have zero size.
>
> base/16416/59288_init
> base/16416/59288
>
> Immediately after the system has come back up, the file
> base/16416/59288_init no longer exists.
>
> I can't reproduce this behavior without the reboot.  It seems to depend on
> the sequencing and timing of the signals which the kernel delivers to the
> running processes during the reboot.
>
> If I do a pg_ctl stop -mf, then both files go away.  If I do a pg_ctl stop
> -mi, then neither goes away.  It is only with the /sbin/reboot that I get
> the fatal combination of _init being gone but the other still present.
>
>
> Then once the system is back and I restart postmaster, the first pass
> through ResetUnloggedRelationsInDbspaceDir doesn't remove base/16416/59288,
> because base/16416/59288_init doesn't exist to trigger the deletion.  On
> the 2nd pass through, base/16416/59288_init exists because it was created
> by WAL replay, and the copy fails because it is expecting base/16416/59288
> to have already been cleaned up by the first pass.
>
> Should ResetUnloggedRelationsInDbspaceDir on the second pass attempt to
> unlink the target immediately before the copy (line 338, in 9.2) to
> accommodate cases like this?
>

I still haven't figured out what sequence of signals causes the init fork
to be removed but the main fork to remain, but I think the startup process
should be robust to this situation.  The attached patch adds an
unconditional and unchecked unlink before the copy.  It should probably
have a comment, but I don't really know what to say.

This resolves the situation for me.  It is expected that the unlink will
usually fail with non-existent files in the vast majority of cases.  Should
it check that the unlink either succeeds, or fails for non-existence, and
throw an error in cases other than those?  Would it be a performance
concern to call unlink potentially thousands of times?  I don't think so,
because it only calls the unlink when it is about to copy a file anyway,
which is a much heavier operation.

My understanding of the reason that it was done in two passes in the first
place is that during hot standby, it is OK for the main fork to be missing,
but not OK for it to be gibberish.  But then upon promotion to writable, it
is no longer OK for it to be missing, so they get copied over at that time
(and not earlier as then init forks created during replay wouldn't get
copied).  Given that reasoning, I think that unconditionally unlinking the
destination file before the copy would not be a problem. The gibberish
master fork would not be a problem during hot standby queries as the tables
creation has not been committed and so can't be seen.

(I've added Robert to CC because he wrote the original code.)

Cheers,

Jeff

Attachment: reinit_unlink.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to