Re: [HACKERS] Orphaned files in base/[oid]

2017-08-16 Thread Chris Travers
On Wed, Aug 16, 2017 at 7:15 PM, Andres Freund  wrote:

>
>
> I think this entirely is the wrong approach.  We shouldn't add weird
> check commands that require locks on pg_class, we should avoid leaving
> the orphaned files in the first place.  I've upthread outlined
> approached how to do so.
>

And in the mean time I suppose there is no reason that the approach I have
outlined cannot be used by an external program.I think it is sane to
draw the line at having the db responsible for cleaning up after itself
when something goes wrong and leave external programs to do after-the-fact
review and cleanup.

>
> Greetings,
>
> Andres Freund
>



-- 
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-16 Thread Andres Freund
On 2017-08-16 14:20:02 +0200, Chris Travers wrote:
> So having throught about this a bit more, and having had some real-world
> experience with the script now, I have an idea that might work and some
> questions to make it succeed.
> 
> My thinking is to add a new form of vacuum called VACUUM FSCK.
> 
> This would:
> 1. lock pg_class in exclusive mode (or do I need exclusive access?), as
> this is needed to solve the race conditions.  As I see, this seems to bring
> the database to a screeching halt concurrency-wise (but unlike my script
> would allow other databases to be accessed normally).
> 2. read the files where the name consists of only digits out of the
> filesystem and compare with oids from pg_class and relfilenodes
> 3.  Any file not found in that list would then unlink it, as well as any
> files with the patter followed by an underscore or period.
> 
> This would mean that the following cases would not be handled:
> 
> If you have the first extent gone but later extents are present we check on
> the first extant, and so would not see the later ones.  Same goes for
> visibility maps and other helper files.
> 
> If you add a file in the directory which has a name like 34F3A222BC, that
> would never get cleaned up because it contains non-digits.
> 
> So this leads to the following questions:
> 
> 1.  Is locking pg_class enough to avoid race conditions?  Is exclusive mode
> sufficient or do I need exclusive access mode?
> 2.  would it be preferable to move the file to a directory rather than
> unlinking it?
> 3.  Should I perform any sort of check on the tables at the end to make
> sure everything is ok?

I think this entirely is the wrong approach.  We shouldn't add weird
check commands that require locks on pg_class, we should avoid leaving
the orphaned files in the first place.  I've upthread outlined
approached how to do so.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-16 Thread Robert Haas
On Mon, Aug 14, 2017 at 2:56 PM, Andres Freund  wrote:
> I think there are some possibilities to close the gap here. We could
> e.g. have .delete_on_crash marker files that get installed
> when creating a new persistent relfilenode. If we set up things so they
> get deleted post commit, but inside the critical section, we could rely
> on them being present in case of crash, but consistently removed during
> WAL replay. At the end of recovery, iterate over the whole datadir and
> nuke all relations with marker files present.

At the risk of being predictable, I think we should add an undo
subsystem instead of continuing to create ad-hoc solutions to problems
like this.  (Of course, that's being worked on by Thomas, Amit, and
others.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-16 Thread Chris Travers
So having throught about this a bit more, and having had some real-world
experience with the script now, I have an idea that might work and some
questions to make it succeed.

My thinking is to add a new form of vacuum called VACUUM FSCK.

This would:
1. lock pg_class in exclusive mode (or do I need exclusive access?), as
this is needed to solve the race conditions.  As I see, this seems to bring
the database to a screeching halt concurrency-wise (but unlike my script
would allow other databases to be accessed normally).
2. read the files where the name consists of only digits out of the
filesystem and compare with oids from pg_class and relfilenodes
3.  Any file not found in that list would then unlink it, as well as any
files with the patter followed by an underscore or period.

This would mean that the following cases would not be handled:

If you have the first extent gone but later extents are present we check on
the first extant, and so would not see the later ones.  Same goes for
visibility maps and other helper files.

If you add a file in the directory which has a name like 34F3A222BC, that
would never get cleaned up because it contains non-digits.

So this leads to the following questions:

1.  Is locking pg_class enough to avoid race conditions?  Is exclusive mode
sufficient or do I need exclusive access mode?
2.  would it be preferable to move the file to a directory rather than
unlinking it?
3.  Should I perform any sort of check on the tables at the end to make
sure everything is ok?

-- 
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-15 Thread Chris Travers
On Tue, Aug 15, 2017 at 3:32 PM, Tom Lane  wrote:

> Chris Travers  writes:
> > I wonder about a different solution.  Would it be possible to special
> case
> > vacuum to check for and remove (or just move to where they can be
> removed)
> > files when vacuuming pg_class?  At the point we are vacuuming pg_class,
> we
> > ought to be able to know that a relfilenode shouldn't be used anymore,
> > right?
>
> I don't think so.  It's not clear to me whether you have in mind "scan
> pg_class, collect relfilenodes from all live tuples, then zap all files
> not in that set" or "when removing a dead tuple, zap the relfilenode
> it mentions".  But neither one works.  The first case has a race condition
> against new pg_class entries.  As for the second, the existence of a dead
> tuple bearing relfilenode N isn't evidence that some other live tuple
> can't have relfilenode N.
>

Ah because if the file never made it on to disk the number could be
re-used.

>
> Another problem for the second solution is that in the case you're worried
> about (ie, PANIC due to out-of-WAL-space during relation's creating
> transaction), there's no very good reason to expect that the relation's
> pg_class tuple ever made it to disk at all.
>
> A traditional low-tech answer to this has been to keep the WAL on a
> separate volume from the main data store, so that it's protected from
> out-of-space conditions in the main store and temp areas.  The space
> needs for WAL proper are generally much more predictable than the main
> store, so it's easier to keep the dedicated space from overflowing.
> (Stalled replication/archiving processes can be hazardous to your
> health in this scenario, though, if they prevent prompt recycling of
> WAL files.)
>

Yeah, most of our dbs here have wal on a separate volume but not this
system.  This system is also unusual in that disk usage varies wildly (and
I am not 100% sure that this is the only case which causes it though I can
reproduce it consistently in the case of the wal writer running out of disk
space with symptoms exactly what I found).

So for now that leaves my fallback approach as a way to fix it when I see
it.

I have written a shell script which does as follows:
1.  starts Postgres in single user mode with a data directory or dies
(won't run if Postgres seems to be already running)
2.  gets the old of the current database
3.  lists all files consisting of only digits in the  base/[dboid] directory
4. asks Postgres (In single user mode again) for all relfilenodes and oids
of tables (In my testing both were required because there were some cases
where relfilenodes were not set in some system
5.  Loops through the file nodes gathered, checks against the relfilenode
entries, and zaps $f, $f_*, and $f.*.  Currently for testing "zaps" has
been to move to a lostnfound folder for inspection following the script.
The logic here is not perfect and is very slightly under inclusive, but
better that than the other way.

Then we can start Postgres again.  I cannot find a better way to avoid race
conditions, I guess. At any rate it sounds like preventing the problem more
generally may be something beyond what I would feel comfortable trying to
do as a patch at my current level of familiarity with he source code.

The full script is included inline below my signature in case it is of
interest to anyone on the list.


> regards, tom lane
>



-- 
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com

Saarbrücker Straße 37a, 10405 Berlin

---

#!/bin/bash

datadir=$1
database=$2

pg_ctl -D $datadir stop

dboidfile="$PWD/cleanupdb.oid"
reloidfile="$PWD/refilenodes.list"

echo "COPY (select oid from pg_database where datname = current_database())
TO '$dboidfile'" | postgres --single -D $datadir $database > /dev/null


if (($?))
then
   echo "FATAL: Could not start Postgres in single user mode"
   exit 1
fi

dboid=`cat $dboidfile`
filenodes=`(cd test/base/$dboid; ls [0-9]*[0-9] | grep -v '\.' | sort -n)`
#echo $filenodes

echo "COPY (select relfilenode from pg_class union select oid as
relfilenode from pg_class) TO '$reloidfile'" | postgres --single -D
$datadir $database > /dev/null
relfilenodes=`cat $reloidfile`
#echo $relfilenodes
if [[ -z relfilenodes ]]
then
   echo "FATAL: did not get any relfilenodes"
   exit 2
fi

mkdir lostnfound;
for f in $filenodes
do
  if [[ -z `echo $relfilenodes | grep -w $f` ]]
  then
  echo moving $f to lostnfound
  mv $datadir/base/$dboid/$f lostnfound
  mv $datadir/base/$dboid/${f}_* lostnfound 2> /dev/null
  mv $datadir/base/$dboid/${f}.* lostnfound 2> /dev/null
  fi
done
rm $dboidfile
rm $reloidfile


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-15 Thread Tom Lane
Chris Travers  writes:
> I wonder about a different solution.  Would it be possible to special case
> vacuum to check for and remove (or just move to where they can be removed)
> files when vacuuming pg_class?  At the point we are vacuuming pg_class, we
> ought to be able to know that a relfilenode shouldn't be used anymore,
> right?

I don't think so.  It's not clear to me whether you have in mind "scan
pg_class, collect relfilenodes from all live tuples, then zap all files
not in that set" or "when removing a dead tuple, zap the relfilenode
it mentions".  But neither one works.  The first case has a race condition
against new pg_class entries.  As for the second, the existence of a dead
tuple bearing relfilenode N isn't evidence that some other live tuple
can't have relfilenode N.

Another problem for the second solution is that in the case you're worried
about (ie, PANIC due to out-of-WAL-space during relation's creating
transaction), there's no very good reason to expect that the relation's
pg_class tuple ever made it to disk at all.

A traditional low-tech answer to this has been to keep the WAL on a
separate volume from the main data store, so that it's protected from
out-of-space conditions in the main store and temp areas.  The space
needs for WAL proper are generally much more predictable than the main
store, so it's easier to keep the dedicated space from overflowing.
(Stalled replication/archiving processes can be hazardous to your
health in this scenario, though, if they prevent prompt recycling of
WAL files.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-15 Thread Chris Travers
There's another side to this and that I am not sure it is a backend crash.

Here is what I did to reproduce:

2 virtual disk images:  100mb for main data, 40 MB for WAL.  work_mem set
to 256MB. The idea is to test different out of space conditions.

Create table as ...; drop table; select
pg_size_pretty(pg_current_xlog_location() - '0/0');

I played around with parameters to determine how different kinds of out of
space errors were handled.

1.  running out of temp space was cleaned up without a server restart
needed.
2.  A relation running out of disk space *seemed* to get cleaned up.
3.  Running out of WAL space left *both* temp and non-temp files.

I wonder about a different solution.  Would it be possible to special case
vacuum to check for and remove (or just move to where they can be removed)
files when vacuuming pg_class?  At the point we are vacuuming pg_class, we
ought to be able to know that a relfilenode shouldn't be used anymore,
right?

-- 
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-14 Thread Michael Paquier
On Tue, Aug 15, 2017 at 3:56 AM, Andres Freund  wrote:
> I think there are some possibilities to close the gap here. We could
> e.g. have .delete_on_crash marker files that get installed
> when creating a new persistent relfilenode. If we set up things so they
> get deleted post commit, but inside the critical section, we could rely
> on them being present in case of crash, but consistently removed during
> WAL replay. At the end of recovery, iterate over the whole datadir and
> nuke all relations with marker files present.

I agree that an approach like that has value for the problem defined
in this thread.

> I first thought that'd cost an additional fsync per relation
> created. But I think we actually can delay that to a pre-commit phase,
> if we have XLOG_SMGR_CREATE create those markers via a flag, and fsync
> them just before checkpoint (via the usual delayed fsync mechanism).
> That'd still require an XLogFlush(), but that seems hard to avoid unless
> we just don't create relations on FS level until buffers are
> evicted and/or BufferSync().

Yeah, that should work as well.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-14 Thread Chris Travers
On Mon, Aug 14, 2017 at 8:40 PM, Tom Lane  wrote:

>
>
> It would be possible to have orphaned non-temp tables if you'd suffered
> a crash during the transaction that created those tables.  Ordinarily
> a newly-created table file wouldn't be that large, but if your workflow
> created tables and shoved boatloads of data into them in the same
> transaction, it's not so hard to see this becoming an issue.
>

I think the working theory is that these were very like a number of very
large (multi-hundred-GB materialised views).

>
> The core problem with zapping non-temp table files is that you can't
> do that unless you're sure you have consistent, up-to-date pg_class
> data that nobody else is busy adding to.  It's hard to see an external
> application being able to do that safely.  You certainly can't do it
> at the point in the postmaster startup cycle where we currently do
> the other things --- for those, we rely only on filesystem naming
> conventions to identify what to zap.


Yeah that occurred to me. At this point I would settle for something I
could run with Postgres in single user mode.  Although that is very far
from ideal.  So what I wonder is if at least a short-term solution might be
a utility that starts Postgres in single user mode and we insist that
PostgreSQL is otherwise not running before the run.

I am certainly not feeling qualified at present for more advanced solutions
but that I might be able to do.

>
> regards, tom lane
>



-- 
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-14 Thread Andres Freund
On 2017-08-14 14:40:46 -0400, Tom Lane wrote:
> The core problem with zapping non-temp table files is that you can't
> do that unless you're sure you have consistent, up-to-date pg_class
> data that nobody else is busy adding to.  It's hard to see an external
> application being able to do that safely.  You certainly can't do it
> at the point in the postmaster startup cycle where we currently do
> the other things --- for those, we rely only on filesystem naming
> conventions to identify what to zap.

I think there are some possibilities to close the gap here. We could
e.g. have .delete_on_crash marker files that get installed
when creating a new persistent relfilenode. If we set up things so they
get deleted post commit, but inside the critical section, we could rely
on them being present in case of crash, but consistently removed during
WAL replay. At the end of recovery, iterate over the whole datadir and
nuke all relations with marker files present.

I first thought that'd cost an additional fsync per relation
created. But I think we actually can delay that to a pre-commit phase,
if we have XLOG_SMGR_CREATE create those markers via a flag, and fsync
them just before checkpoint (via the usual delayed fsync mechanism).
That'd still require an XLogFlush(), but that seems hard to avoid unless
we just don't create relations on FS level until buffers are
evicted and/or BufferSync().


Alternatively we could do something without marker files, with some
added complexity: Keep track of all "uncommitted new files" in memory,
and log them every checkpoint. Commit/abort records clear elements of
that list. Since we always start replay at the beginning of a
checkpoint, we'd always reach a moment with such an up2date list of
pending-action files before reaching end-of-recovery. At end-of-recovery
we can delete all unconfirmed files.  To avoid out-of-memory due to too
many tracked relations, we'd possibly still have to have marker files...

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-14 Thread Tom Lane
Chris Travers  writes:
> On Mon, Aug 14, 2017 at 6:33 PM, Andres Freund  wrote:
>> I think the fix here is to call RemovePgTempFiles() during
>> crash-restarts, instead of just full starts. The previously stated need
>> to be able to inspect temp files after a crash can be less impactfully
>> fulfilled with restart_after_crash = false.

> But that only clears temp files right?
> I am less concerned about the temp files because a restart clears them.

It will clear temp files, and also temp tables.

> The bigger issue I see are with the orphaned base files.

It would be possible to have orphaned non-temp tables if you'd suffered
a crash during the transaction that created those tables.  Ordinarily
a newly-created table file wouldn't be that large, but if your workflow
created tables and shoved boatloads of data into them in the same
transaction, it's not so hard to see this becoming an issue.

The core problem with zapping non-temp table files is that you can't
do that unless you're sure you have consistent, up-to-date pg_class
data that nobody else is busy adding to.  It's hard to see an external
application being able to do that safely.  You certainly can't do it
at the point in the postmaster startup cycle where we currently do
the other things --- for those, we rely only on filesystem naming
conventions to identify what to zap.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-14 Thread Chris Travers
On Mon, Aug 14, 2017 at 6:33 PM, Andres Freund  wrote:

> Hi,
>
> On 2017-08-14 14:12:22 +0200, Chris Travers wrote:
> > Problem:
> > The system this came up on is PostgreSQL 9.6.3 and has had repeated
> trouble
> > with disk space.  Querying pg_database_size, as well as du on the
> > subdirectory of base/ show total usage to be around 3.8TB.  Summing up
> the
> > size of the relations in pg_class though shows around 2.1TB.
> >
> > Initial troubleshooting found around 150 GB of space in pg_temp which had
> > never been cleared and was at least several days old.  Restarting the
> > server cleared these up.
> >
> > Poking around the base/[oid] directory, I found a large number of files
> > which did not correspond with a pg_class entry.  One of the apparent
> > relations was nearly 1TB in size.
> >
> > What I think happened:
> > I think various pg_temp/* and orphaned relation files (In base/[oid])
> were
> > created when PostgreSQL crashed due to running out of space in various
> > operations including creating materialised views.
> >
> > So my question is if there is a way we can safely clean these up on
> server
> > restart?  If not does it make sense to try to create a utility that can
> > connect to PostgreSQL, seek out valid files, and delete the rest?
>
> I think the fix here is to call RemovePgTempFiles() during
> crash-restarts, instead of just full starts. The previously stated need
> to be able to inspect temp files after a crash can be less impactfully
> fulfilled with restart_after_crash = false.
>
> But that only clears temp files right?

I am less concerned about the temp files because a restart clears them.

The bigger issue I see are with the orphaned base files.  It looks like
files in base/[oid] don't get cleaned up either if I read my output
correctly and it would explain why we saw 1.7TB of discrepancy between
relations and database size.  Safety-wise it seems like the best way out of
that is a dump/restore but doing that with a 2.1TB database is annoying.


> Greetings,
>
> Andres Freund
>



-- 
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-14 Thread Andres Freund
Hi,

On 2017-08-14 14:12:22 +0200, Chris Travers wrote:
> Problem:
> The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble
> with disk space.  Querying pg_database_size, as well as du on the
> subdirectory of base/ show total usage to be around 3.8TB.  Summing up the
> size of the relations in pg_class though shows around 2.1TB.
> 
> Initial troubleshooting found around 150 GB of space in pg_temp which had
> never been cleared and was at least several days old.  Restarting the
> server cleared these up.
> 
> Poking around the base/[oid] directory, I found a large number of files
> which did not correspond with a pg_class entry.  One of the apparent
> relations was nearly 1TB in size.
> 
> What I think happened:
> I think various pg_temp/* and orphaned relation files (In base/[oid]) were
> created when PostgreSQL crashed due to running out of space in various
> operations including creating materialised views.
> 
> So my question is if there is a way we can safely clean these up on server
> restart?  If not does it make sense to try to create a utility that can
> connect to PostgreSQL, seek out valid files, and delete the rest?

I think the fix here is to call RemovePgTempFiles() during
crash-restarts, instead of just full starts. The previously stated need
to be able to inspect temp files after a crash can be less impactfully
fulfilled with restart_after_crash = false.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Orphaned files in base/[oid]

2017-08-14 Thread Chris Travers
On Aug 14, 2017 14:12, "Chris Travers"  wrote:

Hi all;

I am trying to track down a problem we are seeing that looks very similar
to bug #12050, and would certainly consider trying to contribute a fix if
we agree on one.  (I am not sure we can, so absent that, the next question
is whether it makes sense to create a utility to fix the problem when it
comes up so that a dump/restore is not needed).

The system:
PostgreSQL 9.6.3
Gentoo Linux.

Problem:
The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble
with disk space.  Querying pg_database_size, as well as du on the
subdirectory of base/ show total usage to be around 3.8TB.  Summing up the
size of the relations in pg_class though shows around 2.1TB.

Initial troubleshooting found around 150 GB of space in pg_temp which had
never been cleared and was at least several days old.  Restarting the
server cleared these up.

Poking around the base/[oid] directory, I found a large number of files
which did not correspond with a pg_class entry.  One of the apparent
relations was nearly 1TB in size.

What I think happened:
I think various pg_temp/* and orphaned relation files (In base/[oid]) were
created when PostgreSQL crashed due to running out of space in various
operations including creating materialised views.

So my question is if there is a way we can safely clean these up on server
restart?  If not does it make sense to try to create a utility that can
connect to PostgreSQL, seek out valid files, and delete the rest?


Ok I  have identified one case where symptoms I am seeing can be
reproduced.  I am currently working on a Mac so there may be quirks in my
repro.  However

When the WAL writer runs out of disk space no cleanup is done.

So I  will be looking at possible solutions next.


-- 
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 <+49%20162%209037210> | Skype: einhverfr |
www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin