Re: [HACKERS] emergency outage requiring database restart

2017-08-10 Thread Merlin Moncure
On Thu, Aug 10, 2017 at 12:01 PM, Ants Aasma  wrote:
> On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure  wrote:
>> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma  wrote:
>>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure  wrote:
 Still getting checksum failures.   Over the last 30 days, I see the
 following.  Since enabling checksums FWICT none of the damage is
 permanent and rolls back with the transaction.   So creepy!
>>>
>>> The checksums still only differ in least significant digits which
>>> pretty much means that there is a block number mismatch. So if you
>>> rule out filesystem not doing its job correctly and transposing
>>> blocks, it could be something else that is resulting in blocks getting
>>> read from a location that happens to differ by a small multiple of
>>> page size. Maybe somebody is racily mucking with table fd's between
>>> seeking and reading. That would explain the issue disappearing after a
>>> retry.
>>>
>>> Maybe you can arrange for the RelFileNode and block number to be
>>> logged for the checksum failures and check what the actual checksums
>>> are in data files surrounding the failed page. If the requested block
>>> number contains something completely else, but the page that follows
>>> contains the expected checksum value, then it would support this
>>> theory.
>>
>> Will do.  The main challenge is getting a hand-compiled server to swap in
>> so that $libdir continues to work.  Getting access to the server is
>> difficult, as is getting a maintenance window.  I'll post back ASAP.
>
> As a new datapoint, we just had a customer with an issue that I think
> might be related. The issue was reasonably repeatable by running a
> report on the standby system. Issue manifested itself by first "could
> not open relation" and/or "column is not in index" errors, followed a
> few minutes later by a PANIC from startup process due to "specified
> item offset is too large", "invalid max offset number" or "page X of
> relation base/16384/1259 is uninitialized". I took a look at the xlog
> dump and it was completely fine. For instance in the "specified item
> offset is too large" case there was an INSERT_LEAF redo record
> inserting the preceding offset just a couple hundred kilobytes back.
> Restarting the server sometimes successfully applied the offending
> WAL, sometimes it failed with other corruption errors. The offending
> relations were always pg_class or pg_class_oid_index. Replacing plsh
> functions with dummy plpgsql functions made the problem go away,
> reintroducing plsh functions made it reappear.

Fantastic.  I was never able to attempt to apply the O_CLOEXEC patch (see
upthread) because access to the system is highly limited
and compiling a replacement binary was a bit of a headache.  IIRC this
was the best theory on the table as to the underlying cause, and we
ought to try that first, right?

Reminder: I was able to completely eliminate all damage (but had to
handle the occasional unexpected rollback) by enabling checksums.

merlin




Re: [HACKERS] emergency outage requiring database restart

2017-08-10 Thread Ants Aasma
On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure  wrote:
> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma  wrote:
>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure  wrote:
>>> Still getting checksum failures.   Over the last 30 days, I see the
>>> following.  Since enabling checksums FWICT none of the damage is
>>> permanent and rolls back with the transaction.   So creepy!
>>
>> The checksums still only differ in least significant digits which
>> pretty much means that there is a block number mismatch. So if you
>> rule out filesystem not doing its job correctly and transposing
>> blocks, it could be something else that is resulting in blocks getting
>> read from a location that happens to differ by a small multiple of
>> page size. Maybe somebody is racily mucking with table fd's between
>> seeking and reading. That would explain the issue disappearing after a
>> retry.
>>
>> Maybe you can arrange for the RelFileNode and block number to be
>> logged for the checksum failures and check what the actual checksums
>> are in data files surrounding the failed page. If the requested block
>> number contains something completely else, but the page that follows
>> contains the expected checksum value, then it would support this
>> theory.
>
> Will do.  The main challenge is getting a hand-compiled server to swap in
> so that $libdir continues to work.  Getting access to the server is
> difficult, as is getting a maintenance window.  I'll post back ASAP.

As a new datapoint, we just had a customer with an issue that I think
might be related. The issue was reasonably repeatable by running a
report on the standby system. Issue manifested itself by first "could
not open relation" and/or "column is not in index" errors, followed a
few minutes later by a PANIC from startup process due to "specified
item offset is too large", "invalid max offset number" or "page X of
relation base/16384/1259 is uninitialized". I took a look at the xlog
dump and it was completely fine. For instance in the "specified item
offset is too large" case there was an INSERT_LEAF redo record
inserting the preceding offset just a couple hundred kilobytes back.
Restarting the server sometimes successfully applied the offending
WAL, sometimes it failed with other corruption errors. The offending
relations were always pg_class or pg_class_oid_index. Replacing plsh
functions with dummy plpgsql functions made the problem go away,
reintroducing plsh functions made it reappear.

The only thing I came up with that is consistent with the symptoms is
that a page got thrown out of shared_buffers between the two xlog
records referencing it (shared_buffers was default 128MB), and then
read back by a backend process, where in the time between FileSeek and
FileRead calls in mdread a subprocess mucked with the fd's offset so
that a different page than intended got read in. Or basically the same
race condition, but on the write side. Maybe somebody else has a
better imagination than me...
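
For illustration, a minimal C sketch of that race (not PostgreSQL's actual
FileSeek/FileRead code; the file name and block number are placeholders),
and of why an explicit-offset pread() sidesteps it:

#define _XOPEN_SOURCE 700
/* Sketch of the seek/read race on a file descriptor whose offset is shared
 * with a subprocess; illustrative only, not md.c. */
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(void)
{
    char    buf[BLCKSZ];
    int     fd = open("relation_segment", O_RDONLY);  /* hypothetical data file */
    off_t   target = (off_t) 42 * BLCKSZ;              /* block we intend to read */

    if (fd < 0)
        return 1;

    /* Racy pattern: the file position lives in the shared descriptor, so a
     * subprocess that inherited the fd can move it between these two calls,
     * and the read silently returns a different block. */
    lseek(fd, target, SEEK_SET);
    read(fd, buf, BLCKSZ);

    /* Safe pattern: pread() takes the offset explicitly and never touches
     * the shared file position. */
    pread(fd, buf, BLCKSZ, target);

    close(fd);
    return 0;
}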

Regards,
Ants Aasma




Re: [HACKERS] emergency outage requiring database restart

2017-01-30 Thread Michael Paquier
On Wed, Jan 4, 2017 at 4:17 AM, Peter Eisentraut
 wrote:
> It seems like everyone was generally in favor of this.  I looked around
> the internet for caveats but everyone was basically saying, you should
> definitely do this.
>
> Why not for EXEC_BACKEND?
>
> O_CLOEXEC is a newer interface.  There are older systems that don't have
> it but have FD_CLOEXEC for fcntl().  We should use that as a fallback.
>
> Have you gone through the code and checked for other ways file
> descriptors might get opened?  Here is a blog post that lists some
> candidates: http://udrepper.livejournal.com/20407.html
>
> Ideally, we would have a test case that exec's something that lists the
> open file descriptors, and we check that there are only those we expect.
>
> The comment "We don't expect execve() calls inside the postgres code" is
> not quite correct, as we do things like archive_command and COPY to
> program (see OpenPipeStream()).

Oskari, are you planning to respond to this review?  As the thread died
3 weeks ago, I am marking this as returned with feedback.  Don't
hesitate to change the status of the patch if you have a new version.
-- 
Michael




Re: [HACKERS] emergency outage requiring database restart

2017-01-18 Thread Merlin Moncure
On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma  wrote:
> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure  wrote:
>> Still getting checksum failures.   Over the last 30 days, I see the
>> following.  Since enabling checksums FWICT none of the damage is
>> permanent and rolls back with the transaction.   So creepy!
>
> The checksums still only differ in least significant digits which
> pretty much means that there is a block number mismatch. So if you
> rule out filesystem not doing its job correctly and transposing
> blocks, it could be something else that is resulting in blocks getting
> read from a location that happens to differ by a small multiple of
> page size. Maybe somebody is racily mucking with table fd's between
> seeking and reading. That would explain the issue disappearing after a
> retry.
>
> Maybe you can arrange for the RelFileNode and block number to be
> logged for the checksum failures and check what the actual checksums
> are in data files surrounding the failed page. If the requested block
> number contains something completely else, but the page that follows
> contains the expected checksum value, then it would support this
> theory.

Will do.  The main challenge is getting a hand-compiled server to swap in
so that $libdir continues to work.  Getting access to the server is
difficult, as is getting a maintenance window.  I'll post back ASAP.

merlin




Re: [HACKERS] emergency outage requiring database restart

2017-01-18 Thread Ants Aasma
On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure  wrote:
> Still getting checksum failures.   Over the last 30 days, I see the
> following.  Since enabling checksums FWICT none of the damage is
> permanent and rolls back with the transaction.   So creepy!

The checksums still only differ in least significant digits which
pretty much means that there is a block number mismatch. So if you
rule out filesystem not doing its job correctly and transposing
blocks, it could be something else that is resulting in blocks getting
read from a location that happens to differ by a small multiple of
page size. Maybe somebody is racily mucking with table fd's between
seeking and reading. That would explain the issue disappearing after a
retry.

Maybe you can arrange for the RelFileNode and block number to be
logged for the checksum failures and check what the actual checksums
are in data files surrounding the failed page. If the requested block
number contains something completely else, but the page that follows
contains the expected checksum value, then it would support this
theory.

Regards,
Ants Aasma




Re: [HACKERS] emergency outage requiring database restart

2017-01-04 Thread Merlin Moncure
On Tue, Jan 3, 2017 at 1:05 PM, Peter Eisentraut
 wrote:
> On 11/7/16 5:31 PM, Merlin Moncure wrote:
>> Regardless, it seems like you might be on to something, and I'm
>> inclined to patch your change, test it, and roll it out to production.
>> If it helps or at least narrows the problem down, we ought to give it
>> consideration for inclusion (unless someone else can think of a good
>> reason not to do that, heh!).
>
> Any results yet?

Not yet, unfortunately.  I compiled the server with the change, but was
not able to get $libdir working so that I could just do a binary swap
over my PGDG-compiled package.  If anyone has some pointers on how to
do that, I'd appreciate it.

Still getting checksum failures.   Over the last 30 days, I see the
following.  Since enabling checksums FWICT none of the damage is
permanent and rolls back with the transaction.   So creepy!

[root@rcdylsdbmpf001 pg_log]# cat *.log | grep "page verification failed"
2016-12-05 10:17:48 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 61797 but expected 61798
2016-12-05 11:15:31 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 37750 but expected 37749
2016-12-05 11:15:58 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 44483 but expected 44482
2016-12-05 11:16:33 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 58926 but expected 58925
2016-12-05 11:17:08 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 38527 but expected 38528
2016-12-05 11:18:34 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 61932 but expected 61933
2016-12-05 11:18:55 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 23757 but expected 23758
2016-12-05 12:13:48 CST [rms@mpf2]: WARNING:  page verification
failed, calculated checksum 44192 but expected 44225 at character 417
2016-12-08 14:18:37 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 36083 but expected 36082
2016-12-08 15:52:31 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 63414 but expected 63415 at
character 1096
2016-12-09 09:12:21 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 25781 but expected 25780
2016-12-09 09:13:20 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 63043 but expected 63044 at
character 4230
2016-12-12 08:57:45 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 31775 but expected 31771
2016-12-13 09:47:11 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 40802 but expected 40806
2016-12-15 12:49:04 CST [rms@mpf2]: WARNING:  page verification
failed, calculated checksum 11625 but expected 11592 at character 417
2016-12-15 12:51:08 CST [rms@mpf2]: WARNING:  page verification
failed, calculated checksum 51017 but expected 51018
2016-12-15 12:52:36 CST [rms@mpf2]: WARNING:  page verification
failed, calculated checksum 51017 but expected 51018 at character 417
2016-12-16 12:16:31 CST [rms@mpf2]: WARNING:  page verification
failed, calculated checksum 23580 but expected 23576
2016-12-20 13:59:33 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 45273 but expected 45285
2016-12-20 14:00:22 CST [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 10524 but expected 10525

Note the second database, 'mpf2'.  This is a new development.  An example
of a query that is jacking things up is this:
2016-12-15 12:51:08 CST [rms@mpf2]: WARNING:  page verification
failed, calculated checksum 51017 but expected 51018
2016-12-15 12:51:08 CST [rms@mpf2]: CONTEXT:  SQL statement "
  COPY (
SELECT 'DELETE FROM tblNAptCommonSample WHERE ReportPeriod = 201612;'
UNION ALL
SELECT format(
'INSERT INTO tblNAptCommonSample('
  'ReportPeriod, Period, AdjustmentType, PlanType, MSA, MSASubMkt, '
  'Sample, Occupancy, OccupancyChange, AverageRent,
AverageRentChange, RentSF, '
  'RentSFChange)'
'VALUES('
  '%s, %s, ''%s'', ''%s'', ''%s'', %s,'
  '%s, %s, %s, %s, %s, %s,'
  '%s)',
  ReportPeriod, Period, AdjustmentType, PlanType, MSA, MSASubMkt,
  c(Sample), c(Occupancy), c(OccupancyChange), c(AverageRent),
c(AverageRentChange), c(RentSF),
  c(RentSFChange))
FROM tblNAptCommonSample
  WHERE Period = 201612
  AND MSA != '5610'
UNION ALL
  SELECT 'go'
  ) TO '/tmp/tblnaptcommonsample.sql';
"
PL/pgSQL function writempf1history(integer) line 75 at EXECUTE


or this:
2016-12-15 12:52:36 CST [rms@mpf2]: WARNING:  page verification
failed, calculated checksum 51017 but expected 51018 at character 417
2016-12-15 12:52:36 CST [rms@mpf2]: QUERY:
  COPY (
SELECT 'DELETE FROM tbltwrexistingunits WHERE ReportPeriod = 201612;'
UNION ALL
SELECT

Re: [HACKERS] emergency outage requiring database restart

2017-01-03 Thread Peter Eisentraut
On 11/2/16 11:45 AM, Oskari Saarenmaa wrote:
> 26.10.2016, 21:34, Andres Freund kirjoitti:
>> Any chance that plsh or the script it executes does anything with the file 
>> descriptors it inherits? That'd certainly be one way to get into odd corruption
>> issues.
>>
>> We really should use O_CLOEXEC for the majority of our file handles.
> 
> Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not 
> using EXEC_BACKEND.  It'd be nice to not expose all fds to most 
> pl-languages either, but I guess there's no easy solution to that 
> without forcibly closing all fds whenever any functions are called.

It seems like everyone was generally in favor of this.  I looked around
the internet for caveats but everyone was basically saying, you should
definitely do this.

Why not for EXEC_BACKEND?

O_CLOEXEC is a newer interface.  There are older systems that don't have
it but have FD_CLOEXEC for fcntl().  We should use that as a fallback.
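
A rough sketch of that fallback shape, using only standard POSIX calls (the
wrapper name and error handling here are placeholders, not proposed patch
code):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Prefer O_CLOEXEC at open() time; fall back to fcntl(FD_CLOEXEC) on older
 * systems.  Note the fallback is not atomic: another thread could fork+exec
 * between open() and fcntl(), which is exactly why O_CLOEXEC exists. */
static int
open_cloexec(const char *path, int flags, mode_t mode)
{
#ifdef O_CLOEXEC
    return open(path, flags | O_CLOEXEC, mode);
#else
    int     fd = open(path, flags, mode);

    if (fd >= 0 && fcntl(fd, F_SETFD, FD_CLOEXEC) == -1)
    {
        close(fd);
        return -1;
    }
    return fd;
#endif
}

int main(void)
{
    return (open_cloexec("/dev/null", O_RDONLY, 0) >= 0) ? 0 : 1;
}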

Have you gone through the code and checked for other ways file
descriptors might get opened?  Here is a blog post that lists some
candidates: http://udrepper.livejournal.com/20407.html

Ideally, we would have a test case that exec's something that lists the
open file descriptors, and we check that there are only those we expect.
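
For the test-case idea, a minimal, Linux-only sketch of a descriptor-listing
helper that could be exec'd, assuming /proc/self/fd is available (not an
actual PostgreSQL test):

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    DIR            *dir = opendir("/proc/self/fd");
    struct dirent  *de;
    char            path[PATH_MAX];
    char            target[PATH_MAX];

    if (dir == NULL)
        return 1;

    /* Print each inherited descriptor and what it points at (one entry will
     * be the directory stream opened above). */
    while ((de = readdir(dir)) != NULL)
    {
        ssize_t len;

        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "/proc/self/fd/%s", de->d_name);
        len = readlink(path, target, sizeof(target) - 1);
        if (len >= 0)
        {
            target[len] = '\0';
            printf("fd %s -> %s\n", de->d_name, target);
        }
    }
    closedir(dir);
    return 0;
}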

The comment "We don't expect execve() calls inside the postgres code" is
not quite correct, as we do things like archive_command and COPY to
program (see OpenPipeStream()).

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] emergency outage requiring database restart

2017-01-03 Thread Peter Eisentraut
On 11/7/16 5:31 PM, Merlin Moncure wrote:
> Regardless, it seems like you might be on to something, and I'm
> inclined to patch your change, test it, and roll it out to production.
> If it helps or at least narrows the problem down, we ought to give it
> consideration for inclusion (unless someone else can think of a good
> reason not to do that, heh!).

Any results yet?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] emergency outage requiring database restart

2016-11-07 Thread Merlin Moncure
On Wed, Nov 2, 2016 at 10:45 AM, Oskari Saarenmaa  wrote:
> 26.10.2016, 21:34, Andres Freund kirjoitti:
>>
>> Any chance that plsh or the script it executes does anything with the file
>> descriptors it inherits? That'd certainly be one way to get into odd corruption
>> issues.
>>
>> We really should use O_CLOEXEC for the majority of our file
>> handles.
>
>
> Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not using
> EXEC_BACKEND.  It'd be nice to not expose all fds to most pl-languages
> either, but I guess there's no easy solution to that without forcibly
> closing all fds whenever any functions are called.

FYI, this is not my first run-in with strange behavior; see this thread
(not necessarily worth reading):
https://www.postgresql.org/message-id/CAHyXU0x5mW-SbSuUBEshzumOaN7JPUWa7Ejza68HE-KY0Nq7Kg%40mail.gmail.com

I had a similar set of starting conditions that resulted in very
strange behavior (but not data corruption, AFAICT) -- the problem
mysteriously disappeared when I fixed some bugs that would cause the
routine to concurrently do the same operation.  I would like to point
out that I use both pl/sh (and, via it, sqsh) very heavily, so these
problems are not necessarily the norm.

Regardless, it seems like you might be on to something, and I'm
inclined to patch your change, test it, and roll it out to production.
If it helps or at least narrows the problem down, we ought to give it
consideration for inclusion (unless someone else can think of a good
reason not to do that, heh!).

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-11-02 Thread Oskari Saarenmaa

26.10.2016, 21:34, Andres Freund kirjoitti:

Any chance that plsh or the script it executes does anything with the file 
descriptors it inherits? That'd certainly be one way to get into odd corruption
issues.

We really should use O_CLOEXEC for the majority of our file handles.


Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not 
using EXEC_BACKEND.  It'd be nice to not expose all fds to most 
pl-languages either, but I guess there's no easy solution to that 
without forcibly closing all fds whenever any functions are called.


/ Oskari
From 50d7410b895a1aae26c7001f11bd0d71a200ef96 Mon Sep 17 00:00:00 2001
From: Oskari Saarenmaa 
Date: Wed, 2 Nov 2016 16:42:36 +0200
Subject: [PATCH] BasicOpenFile: always use O_CLOEXEC if it is available

---
 src/backend/storage/file/fd.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git src/backend/storage/file/fd.c src/backend/storage/file/fd.c
index b7ff5ef..6cbe378 100644
--- src/backend/storage/file/fd.c
+++ src/backend/storage/file/fd.c
@@ -894,7 +894,19 @@ BasicOpenFile(FileName fileName, int fileFlags, int fileMode)
 	int			fd;
 
 tryAgain:
-	fd = open(fileName, fileFlags, fileMode);
+	fd = open(fileName, fileFlags
+#if defined(O_CLOEXEC) && !defined(EXEC_BACKEND)
+	/*
+	 * We don't want exec'd processes to inherit our file handles unless
+	 * EXEC_BACKEND is used.  We don't expect execve() calls inside the
+	 * postgres code, but extensions and pl-languages may spawn new
+	 * processes that either don't work due to having no usable file
+	 * descriptors or write garbage in the files previously opened by
+	 * us.
+	 */
+	| O_CLOEXEC
+#endif
+		, fileMode);
 
 	if (fd >= 0)
 		return fd;/* success! */
-- 
2.5.5




Re: [HACKERS] emergency outage requiring database restart

2016-11-01 Thread Merlin Moncure
On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa  wrote:
>>> Your production system's postgres backends probably have a lot more open
>>> files associated with them than the simple test case does.  Since Postgres
>>> likes to keep files open as long as possible and only closes them when you
>>> need to free up fds to open new files, it's possible that your production
>>> backends have almost all allowed fds used when you execute your pl/sh
>>> function.
>>>
>>> If that's the case, the sqsh process that's executed may not have enough fds
>>> to do what it wanted to do and because of busted error handling could end up
>>> writing to fds that were opened by Postgres and point to $PGDATA files.
>
>> Does that apply?  the mechanics are a sqsh function that basically does:
>> cat foo.sql  | sqsh 
>> pipe redirection opens a new process, right?
>
> Yeah, but I doubt that either level of the shell would attempt to close
> inherited file handles.
>
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether.  That seems kind of
> unlikely.  Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.
>
> Maybe the issue is not with open() failures, but with the resulting
> FD numbers being much larger than sqsh is expecting.  It would be
> weird to try to store an FD in something narrower than int, but
> I could see a use of select() being unprepared for large FDs.
> Still, it's hard to translate that idea into scribbling on the
> wrong file...

Looking at the sqsh code, nothing really stands out.  It's highly
developed and all obvious errors are checked.  There certainly could
be a freak bug in there (or in libfreetds which sqsh links to) doing
the damage, though.  In the meantime I'll continue trying to work up a
reliable reproduction.  This particular routine only gets called in
batches on a quarterly basis, so things have settled down.

Just a thought: could COPY be tricked into writing into the wrong file
descriptor?  For example, if a file was killed with rm -rf and the
fd-pressured backend reopened the fd immediately?

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-11-01 Thread Andres Freund
On 2016-11-01 09:56:45 -0400, Tom Lane wrote:
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether.  That seems kind of
> unlikely.  Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.

I've seen shell code akin to
exec 16> somefile  # assume fd 16 is unused
more than once :(




Re: [HACKERS] emergency outage requiring database restart

2016-11-01 Thread Tom Lane
Merlin Moncure  writes:
> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa  wrote:
>> Your production system's postgres backends probably have a lot more open
>> files associated with them than the simple test case does.  Since Postgres
>> likes to keep files open as long as possible and only closes them when you
>> need to free up fds to open new files, it's possible that your production
>> backends have almost all allowed fds used when you execute your pl/sh
>> function.
>> 
>> If that's the case, the sqsh process that's executed may not have enough fds
>> to do what it wanted to do and because of busted error handling could end up
>> writing to fds that were opened by Postgres and point to $PGDATA files.

> Does that apply?  the mechanics are a sqsh function that basically does:
> cat foo.sql  | sqsh 
> pipe redirection opens a new process, right?

Yeah, but I doubt that either level of the shell would attempt to close
inherited file handles.

The real problem with Oskari's theory is that it requires not merely
busted, but positively brain-dead error handling in the shell and/or
sqsh, ie ignoring open() failures altogether.  That seems kind of
unlikely.  Still, I suspect he might be onto something --- there must
be some reason you can reproduce the issue in production and not in
your test bed, and number-of-open-files is as good a theory as I've
heard.

Maybe the issue is not with open() failures, but with the resulting
FD numbers being much larger than sqsh is expecting.  It would be
weird to try to store an FD in something narrower than int, but
I could see a use of select() being unprepared for large FDs.
Still, it's hard to translate that idea into scribbling on the
wrong file...
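
For illustration, a small C sketch of that select() hazard with a
hypothetical descriptor number above FD_SETSIZE (this says nothing about
what sqsh actually does):

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set  readfds;
    int     fd = 1500;          /* hypothetical fd handed to a select() loop */

    FD_ZERO(&readfds);

    /* fd_set is a fixed-size bitmap; FD_SET() with fd >= FD_SETSIZE
     * (typically 1024) writes past it -- silent memory scribbling in the
     * client, though turning that into writes to the wrong *file* still
     * takes more imagination. */
    if (fd >= FD_SETSIZE)
        fprintf(stderr, "fd %d >= FD_SETSIZE (%d): FD_SET would be undefined behavior\n",
                fd, (int) FD_SETSIZE);
    else
        FD_SET(fd, &readfds);

    return 0;
}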

regards, tom lane




Re: [HACKERS] emergency outage requiring database restart

2016-11-01 Thread Merlin Moncure
On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa  wrote:
> 27.10.2016, 21:53, Merlin Moncure kirjoitti:
>>
>> As noted earlier, I was not able to reproduce the issue with
>> crashme.sh, which was:
>>
>> NUM_FORKS=16
>> do_parallel psql -p 5432  -c"select PushMarketSample('1740')"
>> castaging_test
>> do_parallel psql -p 5432  -c"select PushMarketSample('4400')"
>> castaging_test
>> do_parallel psql -p 5432  -c"select PushMarketSample('2160')"
>> castaging_test
>> do_parallel psql -p 5432  -c"select PushMarketSample('6680')"
>> castaging_test
>> 
>>
>> (do_parallel is simple wrapper to executing the command in parallel up
>> to NUM_FORKS).   This is on the same server and cluster as above.
>> This kind of suggests that either
>> A) there is some concurrent activity from another process that is
>> tripping the issue
>> or
>> B) there is something particular to the session invoking the function
>> that is participating in the problem.  As the application is
>> structured, a single threaded node.js app is issuing the query that is
>> high traffic and long lived.  It's still running in fact and I'm kind
>> of tempted to find some downtime to see if I can still reproduce via
>> the UI.
>
> Your production system's postgres backends probably have a lot more open
> files associated with them than the simple test case does.  Since Postgres
> likes to keep files open as long as possible and only closes them when you
> need to free up fds to open new files, it's possible that your production
> backends have almost all allowed fds used when you execute your pl/sh
> function.
>
> If that's the case, the sqsh process that's executed may not have enough fds
> to do what it wanted to do and because of busted error handling could end up
> writing to fds that were opened by Postgres and point to $PGDATA files.

Does that apply?  the mechanics are a sqsh function that basically does:
cat foo.sql  | sqsh 

pipe redirection opens a new process, right?

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-31 Thread Oskari Saarenmaa

27.10.2016, 21:53, Merlin Moncure kirjoitti:

As noted earlier, I was not able to reproduce the issue with
crashme.sh, which was:

NUM_FORKS=16
do_parallel psql -p 5432  -c"select PushMarketSample('1740')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('4400')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('2160')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('6680')" castaging_test


(do_parallel is simple wrapper to executing the command in parallel up
to NUM_FORKS).   This is on the same server and cluster as above.
This kind of suggests that either
A) there is some concurrent activity from another process that is
tripping the issue
or
B) there is something particular to the session invoking the function
that is participating in the problem.  As the application is
structured, a single threaded node.js app is issuing the query that is
high traffic and long lived.  It's still running in fact and I'm kind
of tempted to find some downtime to see if I can still reproduce via
the UI.


Your production system's postgres backends probably have a lot more open 
files associated with them than the simple test case does.  Since 
Postgres likes to keep files open as long as possible and only closes 
them when you need to free up fds to open new files, it's possible that 
your production backends have almost all allowed fds used when you 
execute your pl/sh function.


If that's the case, the sqsh process that's executed may not have enough 
fds to do what it wanted to do and because of busted error handling 
could end up writing to fds that were opened by Postgres and point to 
$PGDATA files.
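
A small C sketch of that scenario, with /dev/null standing in for the
backend's many cached relation files (illustrative only):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit   rl;
    long            opened = 0;

    getrlimit(RLIMIT_NOFILE, &rl);
    printf("RLIMIT_NOFILE (soft): %ld\n", (long) rl.rlim_cur);

    /* Simulate a backend whose fd table is nearly full.  A child spawned at
     * this point inherits the same crowded table, so its own open() calls
     * fail with EMFILE -- and sloppy error handling downstream could then
     * write to descriptor numbers it does not actually own. */
    while (open("/dev/null", O_RDONLY) >= 0)
        opened++;

    printf("opened %ld extra fds before open() failed: %s\n",
           opened, strerror(errno));
    return 0;
}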


/ Oskari




Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Merlin Moncure
On Fri, Oct 28, 2016 at 3:16 PM, Jim Nasby  wrote:
> On 10/28/16 8:23 AM, Merlin Moncure wrote:
>>
>> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark  wrote:
>>>
>>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure 
>>> wrote:

 I think we can rule out faulty storage
>>>
>>>
>>> Nobody ever expects the faulty storage
>
>
> LOL
>
>> Believe me, I know.  But the evidence points elsewhere in this case;
>> this is clearly application driven.
>
>
> FWIW, just because it's triggered by specific application behavior doesn't
> mean there isn't a storage bug. That's what makes data corruption bugs such
> a joy to figure out.
>
> BTW, if you haven't already, I would reset all your storage related options
> and GUCs to safe defaults... plain old FSYNC, no cute journal / FS / mount
> options, etc. Maybe this is related to the app, but the most helpful thing
> right now is to find some kind of safe config so you can start bisecting.

upthread, you might have noticed that I already did that.   Here is
the other evidence:
*) server running fine for 5+ years
*) other database on same cluster not impacted with 10x write activity
*) no interesting logs reported in  /var/log/messages, dmesg, etc
*) san fabric turns over petabytes/day with no corruption. 100+
postgres clusters, 1000+ sql server clusters (and that's not
production)
*) storage/network teams have been through everything. nothing
interesting/unusual to report
*) we have an infrequently run routine (posted upthread) that, when run,
crashed the database within minutes
*) after turning on checksums, 30% of invocations of routine resulted
in checksum error
*) problem re-occurred after dump-restore and full cluster rebuild
*) checksum error caused routine rollback.  FWICT this prevented the damage
*) everything is fine now that routine is not being run anymore

You can come up with your own conclusion; I've come up with mine.  The
only frustrating thing here is that I can't reproduce this outside of the
production environment.  If this database goes down I have 30 people
sitting around, so I can't take downtime lightly.

> I would also consider alternatives to plsh, just to rule it out if nothing
> else. I'd certainly look at some way to get sqsh out of the loop (again,
> just to get something that doesn't crash). First idea that comes to mind is
> a stand-alone shell script that watches a named pipe for a filename; when it
> gets that file it runs it with sqsh and does something to signal completion.

I do a lot of etl to/from sql server and it's all sqsh based.  If I
can figure out how to reproduce in a better way, I'll zero in on the
problem in about 10 minutes.

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Jim Nasby

On 10/28/16 8:23 AM, Merlin Moncure wrote:

On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark  wrote:

On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure  wrote:

I think we can rule out faulty storage


Nobody ever expects the faulty storage


LOL


Believe me, I know.  But the evidence points elsewhere in this case;
this is clearly application driven.


FWIW, just because it's triggered by specific application behavior 
doesn't mean there isn't a storage bug. That's what makes data 
corruption bugs such a joy to figure out.


BTW, if you haven't already, I would reset all your storage related 
options and GUCs to safe defaults... plain old FSYNC, no cute journal / 
FS / mount options, etc. Maybe this is related to the app, but the most 
helpful thing right now is to find some kind of safe config so you can 
start bisecting.


I would also consider alternatives to plsh, just to rule it out if 
nothing else. I'd certainly look at some way to get sqsh out of the loop 
(again, just to get something that doesn't crash). First idea that comes 
to mind is a stand-alone shell script that watches a named pipe for a 
filename; when it gets that file it runs it with sqsh and does something 
to signal completion.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Merlin Moncure
On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark  wrote:
> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure  wrote:
>> I think we can rule out faulty storage
>
> Nobody ever expects the faulty storage

Believe me, I know.  But the evidence points elsewhere in this case;
this is clearly application driven.

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-27 Thread Greg Stark
On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure  wrote:
> I think we can rule out faulty storage

Nobody ever expects the faulty storage



-- 
greg




Re: [HACKERS] emergency outage requiring database restart

2016-10-27 Thread Merlin Moncure
On Thu, Oct 27, 2016 at 2:31 AM, Ants Aasma  wrote:
> On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure  wrote:
>> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page
>> verification"
>> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 37251 but expected 37244
>> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 37249 but expected 37244
>> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 44363 but expected 44364
>> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 49525 but expected 49539
>> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 37345 but expected 37340
>
> The checksum values are improbably close. The checksum algorithm has
> decently good mixing of all bits in the page. Having the first byte
> match in 5 checksums makes this 1:2^40 improbable. What is not mixed
> in properly is the block number, it only gets xor'ed before packing
> the value into 16 bits using modulo 0xFFFF. So I'm pretty sure
> different block numbers were used for writing out and reading in the
> page. Either the blocknum gets corrupted between calculating the
> checksum and writing the page out (unlikely given the proximity), or
> the pages are somehow getting transposed in the storage.

I think we can rule out faulty storage, and I'm reasonably sure
nothing is writing to the database cluster except postgres itself.

So far I have not executed an emergency dump/restore on this database,
nor have I received any other log entries suggesting problems of any
kind.  Meaning, FWICT, the database is completely healthy
notwithstanding the reported errors.

This suggests (but does not prove) that the syscatalog-damaging
activity happened during execution of the attached function and
healed itself during the rollback.  Prior to turning on checksums,
the transaction did *not* roll back, and the damage was allowed to stand
and probably accumulated.

As noted earlier, I was not able to reproduce the issue with
crashme.sh, which was:

NUM_FORKS=16
do_parallel psql -p 5432  -c"select PushMarketSample('1740')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('4400')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('2160')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('6680')" castaging_test


(do_parallel is simple wrapper to executing the command in parallel up
to NUM_FORKS).   This is on the same server and cluster as above.
This kind of suggests that either
A) there is some concurrent activity from another process that is
tripping the issue
or
B) there is something particular to the session invoking the function
that is participating in the problem.  As the application is
structured, a single threaded node.js app is issuing the query that is
high traffic and long lived.  It's still running in fact and I'm kind
of tempted to find some downtime to see if I can still reproduce via
the UI.

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-27 Thread Ants Aasma
On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure  wrote:
> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page
> verification"
> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 37251 but expected 37244
> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 37249 but expected 37244
> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 44363 but expected 44364
> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 49525 but expected 49539
> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 37345 but expected 37340

The checksum values are improbably close. The checksum algorithm has
decently good mixing of all bits in the page. Having the first byte
match in 5 checksums makes this 1:2^40 improbable. What is not mixed
in properly is the block number, it only gets xor'ed before packing
the value into 16 bits using modulo 0xFFFF. So I'm pretty sure
different block numbers were used for writing out and reading in the
page. Either the blocknum gets corrupted between calculating the
checksum and writing the page out (unlikely given the proximity), or
the pages are somehow getting transposed in the storage.
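
For illustration, a minimal C sketch of that final packing step.  The 32-bit
content value below is an arbitrary stand-in for the real page-content mix
(pg_checksum_block()); only the xor-and-modulo packing mirrors the
description above:

#include <stdint.h>
#include <stdio.h>

/* Pack a 32-bit content mix and a block number into the 16-bit checksum as
 * described above: xor in blkno, then reduce modulo 0xFFFF (the +1 keeps
 * the stored value nonzero). */
static uint16_t
pack_checksum(uint32_t content_hash, uint32_t blkno)
{
    return (uint16_t) ((content_hash ^ blkno) % 65535 + 1);
}

int main(void)
{
    uint32_t    content_hash = 0x9E3779B9u;     /* pretend page-content mix */
    uint32_t    blkno;

    /* The same page contents checksummed as block 12 but verified as block 13
     * yield adjacent 16-bit values -- the "calculated 37251 but expected
     * 37244"-style pattern in the warnings above. */
    for (blkno = 12; blkno <= 13; blkno++)
        printf("blkno %u -> checksum %u\n",
               (unsigned) blkno, (unsigned) pack_checksum(content_hash, blkno));

    return 0;
}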

Regards,
Ants Aasma




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Andres Freund
On 2016-10-26 15:06:34 -0500, Jim Nasby wrote:
> Removing the exec might "solve" the problem here, assuming that the
> forked process doesn't still inherit all open FH's.

Unless you explicitly close fds or use FD_CLOEXEC when opening them,
they'll be inherited forever.
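
A minimal C demonstration of that inheritance, with a placeholder file name
and a shell one-liner standing in for whatever the exec'd program might do
(this is the generic hazard, not a claim about plsh or sqsh):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Opened without O_CLOEXEC/FD_CLOEXEC, so it survives fork() and exec(). */
    int     fd = open("scratch_datafile.bin", O_RDWR | O_CREAT, 0600);
    pid_t   pid;
    char    cmd[64];

    if (fd < 0)
        return 1;

    pid = fork();
    if (pid == 0)
    {
        /* Child: exec a shell command that writes to the inherited descriptor
         * by number (cf. the "2>&17" style redirections discussed in this
         * thread).  The write lands in the parent's open file, at the shared
         * file offset. */
        snprintf(cmd, sizeof(cmd), "echo scribble >&%d", fd);
        execlp("sh", "sh", "-c", cmd, (char *) NULL);
        _exit(127);
    }

    waitpid(pid, NULL, 0);
    close(fd);
    return 0;
}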




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 3:06 PM, Jim Nasby  wrote:
> On 10/26/16 2:25 PM, Merlin Moncure wrote:
>>
>> I don't think that's the case.  sqsh is a psql-like utility.  it
>> writes to stdout  and stderr only which is captured by plsh and sent.
>> In this context shexec only wraps rm -f 'file' where 'file' is a file
>> previously created with COPY in the same transaction.
>
> How do you know that? It could potentially be doing anything it wanted with
> file handles. Removing the exec might "solve" the problem here, assuming
> that the forked process doesn't still inherit all open FH's.
>
> In a nutshell, running arbitrary shell code (let alone arbitrary
> executables) in a pl that's running inside the Postgres backend sounds
> insanely risky to me.

Well, it is insanely risky, but the code in this context is not
arbitrary; it's only:
PERFORM shexec('rm -f ' || _OutputFile);

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Jim Nasby

On 10/26/16 2:25 PM, Merlin Moncure wrote:

I don't think that's the case.  sqsh is a psql-like utility.  it
writes to stdout  and stderr only which is captured by plsh and sent.
In this context shexec only wraps rm -f 'file' where 'file' is a file
previously created with COPY in the same transaction.


How do you know that? It could potentially be doing anything it wanted 
with file handles. Removing the exec might "solve" the problem here, 
assuming that the forked process doesn't still inherit all open FH's.


In a nutshell, running arbitrary shell code (let alone arbitrary 
executables) in a pl that's running inside the Postgres backend sounds 
insanely risky to me.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 2:12 PM, Andres Freund  wrote:
> On 2016-10-26 13:49:12 -0500, Merlin Moncure wrote:
>> On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund  wrote:
>> >
>> >
>> > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure 
>> >  wrote:
>> >>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund 
>> >>wrote:
>> >>> Any chance that plsh or the script it executes does anything with the
>> >>file descriptors it inherits? That'd certainly be one way to get into odd
>> >>corruption issues.
>> >>
>> >>not sure.  it's pretty small -- see
>> >>https://github.com/petere/plsh/blob/master/plsh.c
>> >
>> > Afaics that could also be in your script, not just plsh. The latter doesn't
>> > seem to close all file handles above stderr, which means that all handles
>> > for relations etc will be open in your script.  If you e.g. do any unusual
>> > redirections (2>&17 or such), that could end badly.   But I'm just on my 
>> > phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.
>>
>> gotcha :-).  see above:
>> *) sqshf:
>> #!/bin/bash
>> cat \
>>   $2 \
>>   | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"
>>
>> echo "Success"
>>
>> *) shexec:
>> #!/bin/bash
>>
>> eval $1
>>
>> FWICT that's all that's happening here with respect to pl/sh.
>
> My point is that that doesn't mean anything. Whatever sqsh is, or
> whatever $1 eval's to (and $2 for that matter), could access the
> filehandles the backend has opened.

I don't think that's the case.  sqsh is a psql-like utility.  it
writes to stdout  and stderr only which is captured by plsh and sent.
In this context shexec only wraps rm -f 'file' where 'file' is a file
previously created with COPY in the same transaction.

In a test environment, spamming this routine with 16 threads, I was not
able to reproduce the problem.  Still hunting...

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Andres Freund
On 2016-10-26 13:49:12 -0500, Merlin Moncure wrote:
> On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund  wrote:
> >
> >
> > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure 
> >  wrote:
> >>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund 
> >>wrote:
> >>> Any chance that plsh or the script it executes does anything with the
> >>file descriptors it inherits? That'd certainly be one way to get into odd
> >>corruption issues.
> >>
> >>not sure.  it's pretty small -- see
> >>https://github.com/petere/plsh/blob/master/plsh.c
> >
> > Afaics that could also be in your script, not just plsh. The latter doesn't
> > seem to close all file handles above stderr, which means that all handles
> > for relations etc will be open in your script.  If you e.g. do any unusual
> > redirections (2>&17 or such), that could end badly.   But I'm just on my 
> > phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.
> 
> gotcha :-).  see above:
> *) sqshf:
> #!/bin/bash
> cat \
>   $2 \
>   | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"
> 
> echo "Success"
> 
> *) shexec:
> #!/bin/bash
> 
> eval $1
> 
> FWICT that's all that's happening here with respect to pl/sh.

My point is that that doesn't mean anything. Whatever sqsh is, or
whatever $1 eval's to (and $2 for that matter), could access the
filehandles the backend has opened.

- Andres




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund  wrote:
>
>
> On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure  
> wrote:
>>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund 
>>wrote:
>>> Any chance that plsh or the script it executes does anything with the
>>file descriptors it inherits? That'd certainly be one way to get into odd
>>corruption issues.
>>
>>not sure.  it's pretty small -- see
>>https://github.com/petere/plsh/blob/master/plsh.c
>
> Afaics that could also be in your script, not just plsh. The latter doesn't
> seem to close all file handles above stderr, which means that all handles for
> relations etc will be open in your script.  If you e.g. do any unusual
> redirections (2>&17 or such), that could end badly.   But I'm just on my 
> phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.

gotcha :-).  see above:
*) sqshf:
#!/bin/bash
cat \
  $2 \
  | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"

echo "Success"

*) shexec:
#!/bin/bash

eval $1

FWICT that's all that's happening here with respect to pl/sh.  I'm
almost done with the reproduction environment, and if I get the issue I
should be able to zero in on the problem.  One final thing is that
concurrency is a contributing factor.

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Andres Freund


On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure  
wrote:
>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund 
>wrote:
>> Any chance that plsh or the script it executes does anything with the
>file descriptors it inherits? That'd certainly be one way to get into odd
>corruption issues.
>
>not sure.  it's pretty small -- see
>https://github.com/petere/plsh/blob/master/plsh.c

Afaics that could also be in your script, not just plsh. The latter doesn't seem
to close all file handles above stderr, which means that all handles for
relations etc will be open in your script.  If you e.g. do any unusual
redirections (2>&17 or such), that could end badly.   But I'm just on my phone, 
in a taxi without seatbelts, at 60mph, so I didn't look carefully.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund  wrote:
> Any chance that plsh or the script it executes does anything with the file 
> descriptors it inherits? That'd certainly be one way to get into odd corruption
> issues.

not sure.  it's pretty small -- see
https://github.com/petere/plsh/blob/master/plsh.c

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Andres Freund


On October 26, 2016 8:57:22 PM GMT+03:00, Merlin Moncure  
wrote:
>On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure 
>wrote:
>> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure 
>wrote:
>>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure 
>wrote:
 Confirmation of problem re-occurrence will come in a few days.   I'm
 much more likely to believe 6+sigma occurrence (storage, freak bug,
 etc) should it prove the problem goes away post rebuild.
>>>
>>> ok, no major reported outage yet, but just got:
>>>
>>> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in
>>> block 12 of relation base/203883/1259
>
>*) I've now strongly correlated this routine with the damage.
>
>[root@rcdylsdbmpf001 ~]# cat
>/var/lib/pgsql/9.5/data/pg_log/postgresql-26.log  | grep -i
>pushmarketsample | head -5
>2016-10-26 11:26:27 CDT [postgres@castaging]: LOG:  execute :
>SELECT PushMarketSample($1::TEXT) AS published
>2016-10-26 11:26:40 CDT [postgres@castaging]: LOG:  execute :
>SELECT PushMarketSample($1::TEXT) AS published
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL
>statement
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL
>statement
>2016-10-26 11:26:42 CDT [postgres@castaging]: STATEMENT:  SELECT
>PushMarketSample($1::TEXT) AS published
>
>*) First invocation was 11:26:27 CDT
>
>*) Second invocation was 11:26:40 and gave checksum error (as noted
>earlier 11:26:42)
>
>*) Routine attached (if interested)
>
>My next step is to set up test environment and jam this routine
>aggressively to see what happens.

Any chance that plsh or the script it executes does anything with the file 
descriptors it inherits? That'd certainly be one way to get into odd corruption
issues.

We really should use O_CLOEXEC for the majority of our file handles.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 1:09 PM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> *) I've now strongly correlated this routine with the damage.
>
> Hmm.  Do you have any way to replace the non-core calls with something
> else?  The "shexec('rm -f ' || _OutputFile)" bits could presumably be
> converted to use contrib/adminpack's pg_file_unlink(), or an equivalent
> one-liner in plperlu or plpythonu.  I don't know what the sqshf
> incantation does, though.
>
> This wouldn't be about fixing it so much as narrowing down where the
> problem is.

Will do. Before you spend a lot of time, let me get full confirmation
that this is the source of the issue and I'll bake a reproduction
script (there is still a possibility that something *else* caused it
and by happenstance we caught it here).

For posterity,  sqshf does:

CREATE OR REPLACE FUNCTION sqshf(
  ConnectionString TEXT,
  FileName TEXT) RETURNS TEXT AS
$$#!/bin/bash

cat \
  $2 \
  | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"

echo "Success"

$$ LANGUAGE plsh;

sqsh being the FreeTDS-based Linux console for SQL Server.  So,
basically, not much.

shexec does:

CREATE OR REPLACE FUNCTION shexec(_cmd TEXT) RETURNS TEXT AS
$$#!/bin/bash

eval $1

$$ LANGUAGE plsh;

:-)

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Tom Lane
Merlin Moncure  writes:
> *) I've now strongly correlated this routine with the damage.

Hmm.  Do you have any way to replace the non-core calls with something
else?  The "shexec('rm -f ' || _OutputFile)" bits could presumably be
converted to use contrib/adminpack's pg_file_unlink(), or an equivalent
one-liner in plperlu or plpythonu.  I don't know what the sqshf
incantation does, though.

This wouldn't be about fixing it so much as narrowing down where the
problem is.

regards, tom lane




Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure  wrote:
> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure  wrote:
>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure  wrote:
>>> Confirmation of problem re-occurrence will come in a few days.  I'm
>>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>>> etc) should it prove the problem goes away post rebuild.
>>
>> ok, no major reported outage yet, but just got:
>>
>> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in
>> block 12 of relation base/203883/1259

*) I've now strongly correlated this routine with the damage.

[root@rcdylsdbmpf001 ~]# cat
/var/lib/pgsql/9.5/data/pg_log/postgresql-26.log  | grep -i
pushmarketsample | head -5
2016-10-26 11:26:27 CDT [postgres@castaging]: LOG:  execute :
SELECT PushMarketSample($1::TEXT) AS published
2016-10-26 11:26:40 CDT [postgres@castaging]: LOG:  execute :
SELECT PushMarketSample($1::TEXT) AS published
PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
2016-10-26 11:26:42 CDT [postgres@castaging]: STATEMENT:  SELECT
PushMarketSample($1::TEXT) AS published

*) First invocation was 11:26:27 CDT

*) Second invocation was 11:26:40 and gave checksum error (as noted
earlier 11:26:42)

*) Routine attached (if interested)

My next step is to set up a test environment and jam this routine
aggressively to see what happens.

merlin
/*
 * Send sample for a market to the pubs database.  It will delete the existing
 * sample and replace it with whatever is determined to be the best available
 * sample for each apartment in the market.
 *
 * XXX: assume 'asof' defines the period for which we are replacing data. 
 * Should that assumption be invalidated, we will need to accept an argument
 * overriding the period.
 *
 * XXX: AsOf controls the extraction date, but if a property's 'good' sample
 * is after asof, no sample is pulled for the property.
 */
SELECT DropFunction('PushMarketSample');

CREATE OR REPLACE FUNCTION PushMarketSample(
  _MarketID TEXT,
  _AsOf DATE DEFAULT now()::DATE,
  _MaxSampleDays INT DEFAULT 90,
  Result OUT TEXT) RETURNS TEXT AS
$$
DECLARE
  _OutputFile TEXT;
BEGIN
  _OutputFile := format('/tmp/apartment_sample_%s_%s.sql',
_MarketId,
to_char(_AsOf, 'MMDD'));

  DROP TABLE IF EXISTS AptSample;

  CREATE TEMP TABLE AptSample ON COMMIT DROP AS 
SELECT 
  ApartmentId,
  FloorplanNumber,
  msasubmkt AS SubmarketId,
  (SELECT Period FROM DataEntryPeriod) AS Sampled,
  Vacant,
  DiscountedLowRent,
  DiscountedHighRent,
  LowRent,
  HighRent,
  LowDiscountRate,
  HighDiscountRate,
  Remarks, /* Is there a spot for remarks in MPF? */
  TotalUnits,
  UnitCount,
  IncludeInAggregations,
  Rent,
  Occupancy
FROM vw_ApartmentSample aq
WHERE 
  Current
  AND Sampled::DATE BETWEEN (_AsOf - _MaxSampleDays) AND _AsOf
  AND _MarketID = msa
  AND DataSource = 'Survey';

  /* XXX: write out the sql script because we are stupidly prohibited from 
   * using pl/pgsql variables embedded into sql statements when using COPY.
   */
  DROP TABLE IF EXISTS Script;
  CREATE TEMP TABLE Script ON COMMIT DROP AS 
SELECT 'begin transaction'
UNION ALL SELECT 
  format(
'DELETE tblAptDet '
'FROM tblAptDet d '
'JOIN tblApt a ON a.id = d.id '
'WHERE '
'  a.msa = %s '
'  AND d.Period = %s',
quote_literal(_MarketID),
(SELECT Period FROM DataEntryPeriod))
UNION ALL SELECT format(
  'INSERT INTO tblaptdet('
'Id, PlanNumber, Period, Vacant, DLoRate, DHiRate, '
'FLoRate, FHiRate, LoDiscount, HiDiscount, Remarks, Quantity, '
'UpDtoHistory) '
  'VALUES('
'%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
quote_nullable(ApartmentId),
quote_nullable(FloorplanNumber),
quote_nullable(Sampled),
quote_nullable(Vacant),
quote_nullable(DiscountedLowRent),
quote_nullable(DiscountedHighRent),
quote_nullable(LowRent),
quote_nullable(HighRent),
quote_nullable(LowDiscountRate),
quote_nullable(HighDiscountRate),
quote_nullable(Remarks),
quote_nullable(UnitCount),
CASE WHEN IncludeInAggregations THEN 1 ELSE 0 END)
FROM AptSample
UNION ALL SELECT 'commit transaction'
UNION ALL SELECT 'go';

  EXECUTE 'COPY Script TO ' || quote_literal(_OutputFile);

  /* push in the new sample! */
  SELECT sqshf(ConnectTo('MPFUI'), _OutputFile) INTO Result;

  PERFORM shexec('rm -f ' || _OutputFile);

  DELETE FROM MarketPublishTracking WHERE 
MarketId = _MarketID
AND Period =(SELECT Period FROM DataEntryPeriod);

  DELETE FROM SubmarketPublishTracking WHERE 
MarketId = _MarketID
AND Period = (SELECT Period FROM DataEntryPeriod);

  DROP TABLE IF EXISTS ApartmentPublish;
  CREATE

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure  wrote:
> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure  wrote:
>> Confirmation of problem re-occurrence will come in a few days.I'm
>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>> etc) should it prove the problem goes away post rebuild.
>
> ok, no major reported outage yet, but just got:
>
> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in
> block 12 of relation base/203883/1259


*) still on 9.5.2 :(  not in control of the package update process

*) getting more of these:
[root@rcdylsdbmpf001 data]# cat
/var/lib/pgsql/9.5/data_checksum/pg_log/postgresql-26.log  | grep
"invalid page"
2016-10-26 11:26:42 CDT [postgres@castaging]: ERROR:  invalid page in
block 10 of relation base/203883/1259
2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in
block 12 of relation base/203883/1259
2016-10-26 12:16:44 CDT [postgres@castaging]: ERROR:  invalid page in
block 13 of relation base/203883/1259
2016-10-26 12:18:58 CDT [postgres@castaging]: ERROR:  invalid page in
block 15 of relation base/203883/1259
2016-10-26 12:19:12 CDT [postgres@castaging]: ERROR:  invalid page in
block 7 of relation base/203883/2662

castaging=# select relname from pg_class where oid in(1259, 2662);
      relname
--------------------
 pg_class_oid_index
 pg_class

*) only one database in the cluster is damaged.   I do not suspect
storage at this point

*) Currently I can execute pg_dump with no problems.   So far impact
is not severe but soon I have to do dump/restore

*) age(xmin) from pg_class gives all reasonable values (approx 757k)
that aren't 2147483647

*) All cases of this error come from statements executing a pl/pgsql
routine that does the following:
BEGIN
...
  

  DROP TABLE IF EXISTS foo;
  CREATE TEMP TABLE foo ON COMMIT DROP AS ...
  SELECT  ... INTO Result;

  DROP TABLE IF EXISTS bar;
  CREATE TEMP TABLE bar ON COMMIT DROP AS ...

 PERFORM  ;

EXCEPTION
  WHEN OTHERS THEN
 PERFORM  ;
 RAISE;
END;

*) the calculated and expected checksums in each page verification
failure are numerically close
[root@rcdylsdbmpf001 ~]# cat
/var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page
verification"
2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 37251 but expected 37244
2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 37249 but expected 37244
2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 44363 but expected 44364
2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 49525 but expected 49539
2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 37345 but expected 37340

*) sample log leading into the first error (all queries are logged)
2016-10-26 11:26:40 CDT [postgres@castaging]: LOG:  execute :
SELECT PushMarketSample($1::TEXT) AS published
2016-10-26 11:26:40 CDT [postgres@castaging]: DETAIL:  parameters: $1 = '8840'
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: SAVEPOINT s2
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  execute :
DECLARE c1 CURSOR FOR
SELECT period FROM public.dataentryperiod
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c1
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: CLOSE c1
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  execute :
DECLARE c2 CURSOR FOR
SELECT period FROM public.dataentryperiod
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c2
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: CLOSE c2
2016-10-26 11:26:41 CDT [rms@mpf2]: ERROR:  relation "tblagent" does
not exist at character 15
2016-10-26 11:26:41 CDT [rms@mpf2]: STATEMENT:  select * from tblagent
2016-10-26 11:26:41 CDT [rms@mpf2]: LOG:  execute : SELECT
CURRENT_SCHEMA()
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  execute :
DECLARE c3 CURSOR FOR
SELECT period FROM public.dataentryperiod
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c3
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: CLOSE c3
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  execute :
DECLARE c4 CURSOR FOR
SELECT period FROM public.dataentryperiod
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c4
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: CLOSE c4
2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page
verification failed, calculated checksum 37251 but expected 37244
2016-10-26 11:26:42 CDT [postgres@castaging]: CONTEXT:  SQL statement
"CREATE TEMP TABLE ApartmentPublishBySubmkt ON COMMIT DROP AS
  SELECT
ApartmentId,
TotalUnits,
SubmarketId,
SUM(Rent * UnitCount) / NULLIF(SUM(UnitCount),

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure  wrote:
> Confirmation of problem re-occurrence will come in a few days.I'm
> much more likely to believe 6+sigma occurrence (storage, freak bug,
> etc) should it prove the problem goes away post rebuild.

ok, no major reported outage yet, but just got:

2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in
block 12 of relation base/203883/1259

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Jim Nasby

On 10/22/16 12:38 PM, Tom Lane wrote:
> Jim Nasby  writes:
>> On 10/21/16 7:43 PM, Tom Lane wrote:
>>> Alvaro Herrera  writes:
>>>> Agreed.  The problem is how to install it without breaking pg_upgrade.

>> It can't look up relation names?

> It can't shove 64 bytes into a page that has < 64 bytes free space.

Yeah, that would need to be a special case, but unless the page pointers
are horked you'd be able to tell if there was a name in there.

>>> Well, that's the first problem.  The second problem is how to cope with
>>> RENAME TABLE.

>> If the name was only encoded in the first block of the relation I'm not
>> sure how bad this would be; are there any facilities to change the name
>> back on a rollback?

> No.  How would that work in crash situations?  (And keep in mind that the

Well, if FPI's were enabled you'd get the old page back. Even without
that recovery could remember rename records it's seen that didn't commit.

> argument for this hinges entirely on its being 100% trustworthy after a
> crash.)

I don't think this needs to be 100% reliable to be useful, especially if
we went with Andres suggestion to store both the old and new name (and
storing the OID as well isn't a bad idea). If you're ever at the point
of looking at this info you're already in deep trouble and should
already be taking everything with a grain of salt anyway.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Merlin Moncure
On Tue, Oct 25, 2016 at 2:31 PM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> What if the subsequent dataloss was in fact a symptom of the first
>> outage?  Is in theory possible for data to appear visible but then be
>> eaten up as the transactions making the data visible get voided out by
>> some other mechanic?  I had to pull a quick restart the first time and
>> everything looked ok -- or so I thought.   What I think was actually
>> happening is that data started to slip into the void.   It's like
>> randomly sys catalogs were dropping off. I bet other data was, too.  I
>> can pull older backups and verify that.  It's as if some creeping xmin
>> was snuffing everything out.
>
> Might be interesting to look at age(xmin) in a few different system
> catalogs.  I think you can ignore entries with age = 2147483647;
> those should be frozen rows.  But if you see entries with very large
> ages that are not that, it'd be suspicious.

nothing really stands out.

The damage did re-occur after a dump/restore -- not sure about a
cluster level rebuild.  No problems previous to that.  This suggests
that if this theory holds the damage would have had to have been under
the database level -- perhaps in clog.  Maybe hint bits and clog did
not agree as to commit or delete status for example.  clog has plenty
of history leading past the problem barrier:
-rwx------ 1 postgres postgres 256K Jul 10 16:21 0000
-rwx------ 1 postgres postgres 256K Jul 21 12:39 0001
-rwx------ 1 postgres postgres 256K Jul 21 13:19 0002
-rwx------ 1 postgres postgres 256K Jul 21 13:59 0003


Confirmation of problem re-occurrence will come in a few days.I'm
much more likely to believe 6+sigma occurrence (storage, freak bug,
etc) should it prove the problem goes away post rebuild.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Tom Lane
Merlin Moncure  writes:
> What if the subsequent dataloss was in fact a symptom of the first
> outage?  Is in theory possible for data to appear visible but then be
> eaten up as the transactions making the data visible get voided out by
> some other mechanic?  I had to pull a quick restart the first time and
> everything looked ok -- or so I thought.   What I think was actually
> happening is that data started to slip into the void.   It's like
> randomly sys catalogs were dropping off. I bet other data was, too.  I
> can pull older backups and verify that.  It's as if some creeping xmin
> was snuffing everything out.

Might be interesting to look at age(xmin) in a few different system
catalogs.  I think you can ignore entries with age = 2147483647;
those should be frozen rows.  But if you see entries with very large
ages that are not that, it'd be suspicious.

regards, tom lane
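
As an illustrative sketch of that check (not from the thread), a query along
these lines would surface the oldest non-frozen rows in a couple of catalogs;
the table choices and the LIMIT are arbitrary.

-- rows reporting age(xmin) = 2147483647 are frozen and can be ignored,
-- per the note above; anything else with a very large age is suspicious
SELECT 'pg_class' AS catalog, relname AS name, age(xmin) AS xmin_age
  FROM pg_class
UNION ALL
SELECT 'pg_proc', proname, age(xmin)
  FROM pg_proc
ORDER BY xmin_age DESC
LIMIT 20;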


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Merlin Moncure
On Tue, Oct 25, 2016 at 12:57 PM, Alvaro Herrera
 wrote:
> Merlin Moncure wrote:
>
>> After last night, I rebuilt the cluster, turning on checksums, turning
>> on synchronous commit (it was off) and added a standby replica.  This
>> should help narrow the problem down should it re-occur; if storage is
>> bad (note, other database on same machine is doing 10x write activity
>> and is fine) or something is scribbling on shared memory (my guess
>> here)  then checksums should be popped, right?
>
> Not really sure about that.  As I recall we compute the CRC on the
> buffer's way out, based on the then-current contents, so if something
> scribbles on the buffer while it's waiting to be evicted, the CRC
> computation would include the new (corrupted) bytes rather than the
> original ones -- see FlushBuffer.

Huh. I have a new theory on this.  Dealing with the reconstituted
database, I'm finding more things -- functions and such, that are
simply gone and had to be rebuilt -- they escaped notice as they were
not in primary code paths.  Recall that the original outage came
manifested as queries getting stuck, possibly on spinlock (we don't
know for sure).  After that, things started to randomly disappear,
possibly from system catalogs (but now need to go back and verify
older data, I think).  There were three autovac processes running.

What if the subsequent data loss was in fact a symptom of the first
outage?  Is it in theory possible for data to appear visible but then be
eaten up as the transactions making the data visible get voided out by
some other mechanism?  I had to pull a quick restart the first time and
everything looked ok -- or so I thought.   What I think was actually
happening is that data started to slip into the void.   It's like
randomly sys catalogs were dropping off. I bet other data was, too.  I
can pull older backups and verify that.  It's as if some creeping xmin
was snuffing everything out.

The confirmation of this should be obvious -- if that's indeed the
case, the backup and restored cluster should no longer present data
loss. Given that I was getting that every 1-2 days, we should be able
to figure that out pretty soon.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Alvaro Herrera
Merlin Moncure wrote:

> After last night, I rebuilt the cluster, turning on checksums, turning
> on synchronous commit (it was off) and added a standby replica.  This
> should help narrow the problem down should it re-occur; if storage is
> bad (note, other database on same machine is doing 10x write activity
> and is fine) or something is scribbling on shared memory (my guess
> here)  then checksums should be popped, right?

Not really sure about that.  As I recall we compute the CRC on the
buffer's way out, based on the then-current contents, so if something
scribbles on the buffer while it's waiting to be evicted, the CRC
computation would include the new (corrupted) bytes rather than the
original ones -- see FlushBuffer.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Merlin Moncure
On Mon, Oct 24, 2016 at 9:18 PM, Alvaro Herrera
 wrote:
> Merlin Moncure wrote:
>> On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure  wrote:
>
>> > Corruption struck again.
>> > This time got another case of view busted -- attempting to create
>> > gives missing 'type' error.
>>
>> Call it a hunch -- I think the problem is in pl/sh.
>
> I've heard that before.

well, yeah, previously I had an issue where the database crashed
during a heavy concurrent pl/sh based load.   However the problems
went away when I refactored the code.   Anyways, I looked at the code
and couldn't see anything obviously wrong so who knows?  All I know is
my production database is exploding continuously and I'm looking for
answers.  The only other extension in heavy use on this server is
postgres_fdw.

The other database on the cluster is fine, which kind of suggests we
are not facing clog or WAL type problems.

After last night, I rebuilt the cluster, turning on checksums, turning
on synchronous commit (it was off) and added a standby replica.  This
should help narrow the problem down should it re-occur; if storage is
bad (note, other database on same machine is doing 10x write activity
and is fine) or something is scribbling on shared memory (my guess
here)  then checksums should be popped, right?

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-24 Thread Alvaro Herrera
Merlin Moncure wrote:
> On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure  wrote:

> > Corruption struck again.
> > This time got another case of view busted -- attempting to create
> > gives missing 'type' error.
> 
> Call it a hunch -- I think the problem is in pl/sh.

I've heard that before.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-24 Thread Merlin Moncure
On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure  wrote:
> On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure  wrote:
>> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure  wrote:
>>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian  wrote:
 On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
> > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
> > to be to the system catalogs with at least two critical tables dropped
> > or inaccessible in some fashion.  A lot of the OIDs seemed to be
> > pointing at the wrong thing.  Couple more datapoints here.
> >
> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> > *) Another database on the same cluster was not impacted.  However
> > it's more olap style and may not have been written to during the
> > outage
> >
> > Now, this infrastructure running this system is running maybe 100ish
> > postgres clusters and maybe 1000ish sql server instances with
> > approximately zero unexplained data corruption issues in the 5 years
> > I've been here.  Having said that, this definitely smells and feels
> > like something on the infrastructure side.  I'll follow up if I have
> > any useful info.
>
> After a thorough investigation I now have credible evidence the source
> of the damage did not originate from the database itself.
> Specifically, this database is mounted on the same volume as the
> operating system (I know, I know) and something non database driven
> sucked up disk space very rapidly and exhausted the volume -- fast
> enough that sar didn't pick it up.  Oh well :-) -- thanks for the help

 However, disk space exhaustion should not lead to corruption unless the
 underlying layers lied in some way.
>>>
>>> I agree -- however I'm sufficiently separated from the things doing
>>> the things that I can't verify that in any real way.   In the meantime
>>> I'm going to take standard precautions (enable checksums/dedicated
>>> volume/replication).  Low disk space also does not explain the bizarre
>>> outage I had last friday.
>>
>> ok, data corruption struck again.  This time disk space is ruled out,
>> and access to the database is completely denied:
>> postgres=# \c castaging
>> WARNING:  leaking still-referenced relcache entry for
>> "pg_index_indexrelid_index"
>
> Corruption struck again.
> This time got another case of view busted -- attempting to create
> gives missing 'type' error.

Call it a hunch -- I think the problem is in pl/sh.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-24 Thread Merlin Moncure
On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure  wrote:
> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure  wrote:
>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian  wrote:
>>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
 > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
 > to be to the system catalogs with at least two critical tables dropped
 > or inaccessible in some fashion.  A lot of the OIDs seemed to be
 > pointing at the wrong thing.  Couple more datapoints here.
 >
 > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
 > *) Another database on the same cluster was not impacted.  However
 > it's more olap style and may not have been written to during the
 > outage
 >
 > Now, this infrastructure running this system is running maybe 100ish
 > postgres clusters and maybe 1000ish sql server instances with
 > approximately zero unexplained data corruption issues in the 5 years
 > I've been here.  Having said that, this definitely smells and feels
 > like something on the infrastructure side.  I'll follow up if I have
 > any useful info.

 After a thorough investigation I now have credible evidence the source
 of the damage did not originate from the database itself.
 Specifically, this database is mounted on the same volume as the
 operating system (I know, I know) and something non database driven
 sucked up disk space very rapidly and exhausted the volume -- fast
 enough that sar didn't pick it up.  Oh well :-) -- thanks for the help
>>>
>>> However, disk space exhaustion should not lead to corruption unless the
>>> underlying layers lied in some way.
>>
>> I agree -- however I'm sufficiently separated from the things doing
>> the things that I can't verify that in any real way.   In the meantime
>> I'm going to take standard precautions (enable checksums/dedicated
>> volume/replication).  Low disk space also does not explain the bizarre
>> outage I had last friday.
>
> ok, data corruption struck again.  This time disk space is ruled out,
> and access to the database is completely denied:
> postgres=# \c castaging
> WARNING:  leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"

Corruption struck again.
This time got another case of view busted -- attempting to create
gives missing 'type' error.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Andres Freund


On October 22, 2016 11:59:15 AM PDT, Tom Lane  wrote:
>Alvaro Herrera  writes:
>> Uh, sorry.  My proposal a couple of years back was to put the
>> relfilenode, not the name.  I didn't notice that it was the name
>being
>> proposed here.  However, now I notice that this idea doesn't solve
>the
>> problem for mapped relations.
>
>Well, as long as a catalog lookup would be required anyway, what about
>putting in the table OID?

How about storing two table names? The old and what the relation is being 
renamed to? We wouldn't be able to tell after a crash which is which, but 
that'll usually still be helpful.   Every rename would clear out the old/wrong 
one, and add the target filename.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Tom Lane
Alvaro Herrera  writes:
> Uh, sorry.  My proposal a couple of years back was to put the
> relfilenode, not the name.  I didn't notice that it was the name being
> proposed here.  However, now I notice that this idea doesn't solve the
> problem for mapped relations.

Well, as long as a catalog lookup would be required anyway, what about
putting in the table OID?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera  writes:
> > Jim Nasby wrote:
> >> It occurs to me that it might be worth embedding the relation name in the
> >> free space of the first block. Most people would never notice the missing 
> >> 64
> >> bytes, but it would be incredibly helpful in cases like this...
> 
> > Agreed.  The problem is how to install it without breaking pg_upgrade.
> 
> Well, that's the first problem.  The second problem is how to cope with
> RENAME TABLE.

Uh, sorry.  My proposal a couple of years back was to put the
relfilenode, not the name.  I didn't notice that it was the name being
proposed here.  However, now I notice that this idea doesn't solve the
problem for mapped relations.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Tom Lane
Jim Nasby  writes:
> On 10/21/16 7:43 PM, Tom Lane wrote:
>> Alvaro Herrera  writes:
>>> Agreed.  The problem is how to install it without breaking pg_upgrade.

> It can't look up relation names?

It can't shove 64 bytes into a page that has < 64 bytes free space.

>> Well, that's the first problem.  The second problem is how to cope with
>> RENAME TABLE.

> If the name was only encoded in the first block of the relation I'm not 
> sure how bad this would be; are there any facilities to change the name 
> back on a rollback?

No.  How would that work in crash situations?  (And keep in mind that the
argument for this hinges entirely on its being 100% trustworthy after a
crash.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Jim Nasby

On 10/21/16 7:43 PM, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Jim Nasby wrote:
>>> It occurs to me that it might be worth embedding the relation name in the
>>> free space of the first block. Most people would never notice the missing 64
>>> bytes, but it would be incredibly helpful in cases like this...

>> Agreed.  The problem is how to install it without breaking pg_upgrade.

It can't look up relation names?

> Well, that's the first problem.  The second problem is how to cope with
> RENAME TABLE.

If the name was only encoded in the first block of the relation I'm not
sure how bad this would be; are there any facilities to change the name
back on a rollback?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Tom Lane
Alvaro Herrera  writes:
> Jim Nasby wrote:
>> It occurs to me that it might be worth embedding the relation name in the
>> free space of the first block. Most people would never notice the missing 64
>> bytes, but it would be incredibly helpful in cases like this...

> Agreed.  The problem is how to install it without breaking pg_upgrade.

Well, that's the first problem.  The second problem is how to cope with
RENAME TABLE.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Alvaro Herrera
Jim Nasby wrote:
> On 10/21/16 2:02 PM, Alvaro Herrera wrote:
> > Merlin Moncure wrote:
> > 
> > > OK, I have some good (very- in the specific case of yours truly) news
> > > to report.  Doing a filesystem level copy to a test server I was able
> > > to relfilenode swap one of the critical tables over the place of the
> > > refilenode of the stored backup.  Not being able know the file to copy
> > > from, I figured out the source node by judging the size and using
> > > 'strings'  utility.  Data recovery for that table at least appears to
> > > be 100%.
> > 
> > FWIW you can use pg_filedump and match based on the number of columns.
> > I suppose you could also use the pageinspect extension, by 'dd'ing a
> > page from the file into the database and feeding into heap_page_items as
> > bytea.
> 
> It occurs to me that it might be worth embedding the relation name in the
> free space of the first block. Most people would never notice the missing 64
> bytes, but it would be incredibly helpful in cases like this...

Agreed.  The problem is how to install it without breaking pg_upgrade.

Note that for DR purposes it is better to put that data in the first
block of *each segment*.  Otherwise if you have many >1GB tables, you
have to go segment by segment ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Jim Nasby

On 10/21/16 2:02 PM, Alvaro Herrera wrote:
> Merlin Moncure wrote:
>
>> OK, I have some good (very- in the specific case of yours truly) news
>> to report.  Doing a filesystem level copy to a test server I was able
>> to relfilenode swap one of the critical tables over the place of the
>> refilenode of the stored backup.  Not being able know the file to copy
>> from, I figured out the source node by judging the size and using
>> 'strings'  utility.  Data recovery for that table at least appears to
>> be 100%.
>
> FWIW you can use pg_filedump and match based on the number of columns.
> I suppose you could also use the pageinspect extension, by 'dd'ing a
> page from the file into the database and feeding into heap_page_items as
> bytea.

It occurs to me that it might be worth embedding the relation name in
the free space of the first block. Most people would never notice the
missing 64 bytes, but it would be incredibly helpful in cases like this...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Alvaro Herrera
Merlin Moncure wrote:

> OK, I have some good (very- in the specific case of yours truly) news
> to report.  Doing a filesystem level copy to a test server I was able
> to relfilenode swap one of the critical tables over the place of the
> refilenode of the stored backup.  Not being able know the file to copy
> from, I figured out the source node by judging the size and using
> 'strings'  utility.  Data recovery for that table at least appears to
> be 100%.

FWIW you can use pg_filedump and match based on the number of columns.
I suppose you could also use the pageinspect extension, by 'dd'ing a
page from the file into the database and feeding into heap_page_items as
bytea.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
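
A rough sketch of that pageinspect route (illustrative only; the paths and
block number are made up): dd one 8kB block out of the copied relation file,
drop the result somewhere pg_read_binary_file() can reach (on 9.5 it is
restricted to the data and log directories and requires superuser), and feed
the bytes to heap_page_items().

-- on the test server, outside the database (hypothetical paths):
--   dd if=/copy/base/203883/1259 of=$PGDATA/blk10.page bs=8192 skip=10 count=1
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- then inspect the raw line pointers and tuple headers of that block
SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
  FROM heap_page_items(pg_read_binary_file('blk10.page'));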


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Merlin Moncure
On Fri, Oct 21, 2016 at 1:37 PM, Merlin Moncure  wrote:
> On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner  wrote:
>> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure  wrote:
>>
>>> Most or all the damage seemed to be to the system catalogs with
>>> at least two critical tables dropped or inaccessible in some
>>> fashion.  A lot of the OIDs seemed to be pointing at the wrong
>>> thing.
>>
>> While the oid in pg_class often matches the filename, that is not
>> true after some operations (like CLUSTER or VACUUM FULL).  It is
>> the relfilenode column that is the definitive link to the file.
>
> no such operations happened.  In the first instance at least one table
> dropped from the system catalogs.   I have a hunch that the heap is
> fine (supported by the size of the database on disk).   At this
> precise moment I'm restoring the database to another fileserver in
> order to do some forensic analysis, also in the hopes of getting the
> second database online in order to expedite recovery.
>
> ah -- done. :-)  deleting the init file didn't help, but starting up
> single user allowed the start up to gracefully fail with a FATAL cache
> lookup.

OK, I have some good (very- in the specific case of yours truly) news
to report.  Doing a filesystem level copy to a test server I was able
to relfilenode swap one of the critical tables over the place of the
relfilenode of the stored backup.  Not being able to know the file to copy
from, I figured out the source node by judging the size and using the
'strings' utility.  Data recovery for that table at least appears to
be 100%.

For those following along, this simple process is only likely to work
easily if the table contains only system types; no user types, enums,
composites, etc, since those have a unique ID for each data restore.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Merlin Moncure
On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner  wrote:
> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure  wrote:
>
>> Most or all the damage seemed to be to the system catalogs with
>> at least two critical tables dropped or inaccessible in some
>> fashion.  A lot of the OIDs seemed to be pointing at the wrong
>> thing.
>
> While the oid in pg_class often matches the filename, that is not
> true after some operations (like CLUSTER or VACUUM FULL).  It is
> the relfilenode column that is the definitive link to the file.

no such operations happened.  In the first instance at least one table
dropped from the system catalogs.   I have a hunch that the heap is
fine (supported by the size of the database on disk).   At this
precise moment I'm restoring the database to another fileserver in
order to do some forensic analysis, also in the hopes of getting the
second database online in order to expedite recovery.

ah -- done. :-)  Deleting the init file didn't help, but starting up
single user allowed the startup to gracefully fail with a FATAL cache
lookup error.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Kevin Grittner
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure  wrote:

> Most or all the damage seemed to be to the system catalogs with
> at least two critical tables dropped or inaccessible in some
> fashion.  A lot of the OIDs seemed to be pointing at the wrong
> thing.

While the oid in pg_class often matches the filename, that is not
true after some operations (like CLUSTER or VACUUM FULL).  It is
the relfilenode column that is the definitive link to the file.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
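
For example (an illustrative query, with the table name borrowed from earlier
in the thread), the definitive on-disk path can be read straight from the
catalog rather than guessed from the oid:

-- relfilenode, not oid, names the file on disk; pg_relation_filepath()
-- resolves it, including for mapped relations such as pg_class itself
SELECT oid, relfilenode, pg_relation_filepath(oid::regclass)
  FROM pg_class
 WHERE relname IN ('apartment', 'pg_class');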


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Merlin Moncure
On Thu, Oct 20, 2016 at 3:16 PM, Alvaro Herrera
 wrote:
> Merlin Moncure wrote:
>
>> single user mode dumps core :(
>>
>> bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
>> LOG:  0: could not change directory to "/root": Permission denied
>> LOCATION:  resolve_symlinks, exec.c:293
>> Segmentation fault (core dumped)
>>
>> Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data 
>> castaging'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x00797d6f in ?? ()
>> Missing separate debuginfos, use: debuginfo-install
>> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
>> (gdb) bt
>> #0  0x00797d6f in ?? ()
>> #1  0x0079acf1 in RelationCacheInitializePhase3 ()
>> #2  0x007b35c5 in InitPostgres ()
>> #3  0x006b9b53 in PostgresMain ()
>> #4  0x005f30fb in main ()
>
> Maybe
>   rm global/pg_internal.init
> and try again?

Will do when I can do that had to do emergency restore + some unfun
data reconstruction from the query log.

Notably there is a much larger database in the same cluster which is
undamaged.  This server is new to production usage, maybe 2 months.

Here is contents of pg_extension
 plpgsql
 dblink
 hstore
 postgres_fdw
 plsh * not used
 pg_trgm * not used
 plr * not used
 tablefunc * not used
 adminpack * not used
 plpythonu * not used
 postgis * not used
 postgis_topology * not used

Short term plan is to separate the database into its own cluster,
install replication and checksums.  All queries to this database are
logged.  Here are the contents of the log leading up to and after
the crash:

oct 17 crash:
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL:  parameters: $1 =
'21121', $2 = '8', $3 = '2016-10-13', $4 = NULL, $5 = NULL, $6 = NULL,
$7 = NULL, $8 = NULL, $9 = NULL, $10 = NULL, $11 = 't', $12
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute :
SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname
AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute :
SELECT n.nspname,p.proname,p.prorettype,p.proargtypes,
t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes
, p.o
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute :
select * from checkin($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11,
$12) as result
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL:  parameters: $1 =
'114333', $2 = 'rrosillo', $3 = 'CALLER', $4 = 'Survey', $5 = 'Happy',
$6 = 'Callback', $7 = 'OTHER', $8 = '2016-10-18 01:05:00',
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute S_3: COMMIT
2016-10-17 12:12:25 CDT [@]: ERROR:  could not open relation with OID
1203933 <-- first sign of damage
2016-10-17 12:12:25 CDT [@]: CONTEXT:  automatic analyze of table
"castaging.public.apartment"

oct 20 crash:
2016-10-20 12:46:38 CDT [postgres@castaging]: LOG:  statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:40 CDT [postgres@castaging]: LOG:  statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG:  statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG:  statement: SELECT
MarketOverviewJSON() AS market
2016-10-20 12:46:42 CDT [postgres@castaging]: LOG:  execute :
SELECT SubMarketOverviewJSON($1::TEXT) AS submkt
2016-10-20 12:46:42 CDT [postgres@castaging]: DETAIL:  parameters: $1 = '640'
2016-10-20 12:46:44 CDT [postgres@castaging]: LOG:  statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:46 CDT [postgres@castaging]: LOG:  statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:47 CDT [postgres@castaging]: LOG:  statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: ERROR:
"pg_description_o_c_o_index" is an index <-- first sign of damage
2016-10-20 12:46:47 CDT [postgres@castaging]: CONTEXT:  SQL function
"callcenterusertrackingjson" during startup
2016-10-20 12:46:47 CDT [postgres@castaging]: STATEMENT:  SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: WARNING:  leaking
still-referenced relcache entry for "pg_class_oid_index"

CallCenterUserTrackingJSON() and friends are not particularly
interesting except that they are making use of json_agg().  They
were also called basically all day long in 5 second intervals.   I
guess this isn't saying very much, but I'm starting to smell a rat
here.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Alvaro Herrera
Merlin Moncure wrote:

> single user mode dumps core :(
> 
> bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
> LOG:  0: could not change directory to "/root": Permission denied
> LOCATION:  resolve_symlinks, exec.c:293
> Segmentation fault (core dumped)
> 
> Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data 
> castaging'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00797d6f in ?? ()
> Missing separate debuginfos, use: debuginfo-install
> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
> (gdb) bt
> #0  0x00797d6f in ?? ()
> #1  0x0079acf1 in RelationCacheInitializePhase3 ()
> #2  0x007b35c5 in InitPostgres ()
> #3  0x006b9b53 in PostgresMain ()
> #4  0x005f30fb in main ()

Maybe
  rm global/pg_internal.init
and try again?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Merlin Moncure
On Thu, Oct 20, 2016 at 2:07 PM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> single user mode dumps core :(
>
> You've got a mess there :-(
>
>> Missing separate debuginfos, use: debuginfo-install
>> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
>
> This backtrace would likely be much more informative if you did the above.

can't; don't have the package unfortunately.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Tom Lane
Merlin Moncure  writes:
> single user mode dumps core :(

You've got a mess there :-(

> Missing separate debuginfos, use: debuginfo-install
> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64

This backtrace would likely be much more informative if you did the above.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Merlin Moncure
On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure  wrote:
> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure  wrote:
>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian  wrote:
>>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
 > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
 > to be to the system catalogs with at least two critical tables dropped
 > or inaccessible in some fashion.  A lot of the OIDs seemed to be
 > pointing at the wrong thing.  Couple more datapoints here.
 >
 > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
 > *) Another database on the same cluster was not impacted.  However
 > it's more olap style and may not have been written to during the
 > outage
 >
 > Now, this infrastructure running this system is running maybe 100ish
 > postgres clusters and maybe 1000ish sql server instances with
 > approximately zero unexplained data corruption issues in the 5 years
 > I've been here.  Having said that, this definitely smells and feels
 > like something on the infrastructure side.  I'll follow up if I have
 > any useful info.

 After a thorough investigation I now have credible evidence the source
 of the damage did not originate from the database itself.
 Specifically, this database is mounted on the same volume as the
 operating system (I know, I know) and something non database driven
 sucked up disk space very rapidly and exhausted the volume -- fast
 enough that sar didn't pick it up.  Oh well :-) -- thanks for the help
>>>
>>> However, disk space exhaustion should not lead to corruption unless the
>>> underlying layers lied in some way.
>>
>> I agree -- however I'm sufficiently separated from the things doing
>> the things that I can't verify that in any real way.   In the meantime
>> I'm going to take standard precautions (enable checksums/dedicated
>> volume/replication).  Low disk space also does not explain the bizarre
>> outage I had last friday.
>
> ok, data corruption struck again.  This time disk space is ruled out,
> and access to the database is completely denied:
> postgres=# \c castaging
> WARNING:  leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG:  0: could not change directory to "/root": Permission denied
LOCATION:  resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0  0x00797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0  0x00797d6f in ?? ()
#1  0x0079acf1 in RelationCacheInitializePhase3 ()
#2  0x007b35c5 in InitPostgres ()
#3  0x006b9b53 in PostgresMain ()
#4  0x005f30fb in main ()
(gdb)


merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Merlin Moncure
On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure  wrote:
> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian  wrote:
>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>>> > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
>>> > to be to the system catalogs with at least two critical tables dropped
>>> > or inaccessible in some fashion.  A lot of the OIDs seemed to be
>>> > pointing at the wrong thing.  Couple more datapoints here.
>>> >
>>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>>> > *) Another database on the same cluster was not impacted.  However
>>> > it's more olap style and may not have been written to during the
>>> > outage
>>> >
>>> > Now, this infrastructure running this system is running maybe 100ish
>>> > postgres clusters and maybe 1000ish sql server instances with
>>> > approximately zero unexplained data corruption issues in the 5 years
>>> > I've been here.  Having said that, this definitely smells and feels
>>> > like something on the infrastructure side.  I'll follow up if I have
>>> > any useful info.
>>>
>>> After a thorough investigation I now have credible evidence the source
>>> of the damage did not originate from the database itself.
>>> Specifically, this database is mounted on the same volume as the
>>> operating system (I know, I know) and something non database driven
>>> sucked up disk space very rapidly and exhausted the volume -- fast
>>> enough that sar didn't pick it up.  Oh well :-) -- thanks for the help
>>
>> However, disk space exhaustion should not lead to corruption unless the
>> underlying layers lied in some way.
>
> I agree -- however I'm sufficiently separated from the things doing
> the things that I can't verify that in any real way.   In the meantime
> I'm going to take standard precautions (enable checksums/dedicated
> volume/replication).  Low disk space also does not explain the bizarre
> outage I had last friday.

ok, data corruption struck again.  This time disk space is ruled out,
and access to the database is completely denied:
postgres=# \c castaging
WARNING:  leaking still-referenced relcache entry for
"pg_index_indexrelid_index"

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Merlin Moncure
On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian  wrote:
> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>> > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
>> > to be to the system catalogs with at least two critical tables dropped
>> > or inaccessible in some fashion.  A lot of the OIDs seemed to be
>> > pointing at the wrong thing.  Couple more datapoints here.
>> >
>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>> > *) Another database on the same cluster was not impacted.  However
>> > it's more olap style and may not have been written to during the
>> > outage
>> >
>> > Now, this infrastructure running this system is running maybe 100ish
>> > postgres clusters and maybe 1000ish sql server instances with
>> > approximately zero unexplained data corruption issues in the 5 years
>> > I've been here.  Having said that, this definitely smells and feels
>> > like something on the infrastructure side.  I'll follow up if I have
>> > any useful info.
>>
>> After a thorough investigation I now have credible evidence the source
>> of the damage did not originate from the database itself.
>> Specifically, this database is mounted on the same volume as the
>> operating system (I know, I know) and something non database driven
>> sucked up disk space very rapidly and exhausted the volume -- fast
>> enough that sar didn't pick it up.  Oh well :-) -- thanks for the help
>
> However, disk space exhaustion should not lead to corruption unless the
> underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the things doing
the things that I can't verify that in any real way.   In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication).  Low disk space also does not explain the bizarre
outage I had last friday.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
> > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
> > to be to the system catalogs with at least two critical tables dropped
> > or inaccessible in some fashion.  A lot of the OIDs seemed to be
> > pointing at the wrong thing.  Couple more datapoints here.
> >
> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> > *) Another database on the same cluster was not impacted.  However
> > it's more olap style and may not have been written to during the
> > outage
> >
> > Now, this infrastructure running this system is running maybe 100ish
> > postgres clusters and maybe 1000ish sql server instances with
> > approximately zero unexplained data corruption issues in the 5 years
> > I've been here.  Having said that, this definitely smells and feels
> > like something on the infrastructure side.  I'll follow up if I have
> > any useful info.
> 
> After a thorough investigation I now have credible evidence the source
> of the damage did not originate from the database itself.
> Specifically, this database is mounted on the same volume as the
> operating system (I know, I know) and something non database driven
> sucked up disk space very rapidly and exhausted the volume -- fast
> enough that sar didn't pick it up.  Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Merlin Moncure
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure  wrote:
> On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
>  wrote:
>> Merlin Moncure wrote:
>>
>>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>>> castaging-#   SELECT ...
>>> ERROR:  42809: "pg_cast_oid_index" is an index
>>> LINE 11:   FROM ApartmentSample s
>>> ^
>>> LOCATION:  heap_openrv_extended, heapam.c:1304
>>>
>>> should I be restoring from backups?
>>
>> It's pretty clear to me that you've got catalog corruption here.  You
>> can try to fix things manually as they emerge, but that sounds like a
>> fool's errand.
>
> Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
> to be to the system catalogs with at least two critical tables dropped
> or inaccessible in some fashion.  A lot of the OIDs seemed to be
> pointing at the wrong thing.  Couple more datapoints here.
>
> *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> *) Another database on the same cluster was not impacted.  However
> it's more olap style and may not have been written to during the
> outage
>
> Now, the infrastructure running this system hosts maybe 100ish
> postgres clusters and maybe 1000ish sql server instances with
> approximately zero unexplained data corruption issues in the 5 years
> I've been here.  Having said that, this definitely smells and feels
> like something on the infrastructure side.  I'll follow up if I have
> any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up.  Oh well :-) -- thanks for the help

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-18 Thread Merlin Moncure
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
 wrote:
> Merlin Moncure wrote:
>
>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>> castaging-#   SELECT ...
>> ERROR:  42809: "pg_cast_oid_index" is an index
>> LINE 11:   FROM ApartmentSample s
>> ^
>> LOCATION:  heap_openrv_extended, heapam.c:1304
>>
>> should I be restoring from backups?
>
> It's pretty clear to me that you've got catalog corruption here.  You
> can try to fix things manually as they emerge, but that sounds like a
> fool's errand.

Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion.  A lot of the OIDs seemed to be
pointing at the wrong thing.  Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted.  However
it's more olap style and may not have been written to during the
outage

Now, the infrastructure running this system hosts maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here.  Having said that, this definitely smells and feels
like something on the infrastructure side.  I'll follow up if I have
any useful info.

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Gavin Flower

On 18/10/16 14:12, Michael Paquier wrote:
> On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera
>  wrote:
>> Merlin Moncure wrote:
>>
>>> We had several good backups since the previous outage so it's not
>>> clear the events are related but after months of smooth operation I
>>> find that coincidence highly suspicious. As always, we need to suspect
>>> hardware problems but I'm highly abstracted from them -- using esx +
>>> san.
>>
>> Ah, so you're subject not only to hardware flaws but also to
>> virtualization layer bugs :-)
>
> Wait a couple more years, and we'll get more complaints about
> Postgres running in containers running in VMs. Even more fun waiting
> ahead.

that started life on different hardware with a different O/S






Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Michael Paquier
On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera
 wrote:
> Merlin Moncure wrote:
>
>> We had several good backups since the previous outage so it's not
>> clear the events are related but after months of smooth operation I
>> find that coincidence highly suspicious. As always, we need to suspect
>> hardware problems but I'm highly abstracted from them -- using esx +
>> san.
>
> Ah, so you're subject not only to hardware flaws but also to
> virtualization layer bugs :-)

Wait a couple more years, and we'll get more complaints about
Postgres running in containers running in VMs. Even more fun waiting
ahead.
-- 
Michael




Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Alvaro Herrera
Merlin Moncure wrote:

> We had several good backups since the previous outage so it's not
> clear the events are related but after months of smooth operation I
> find that coincidence highly suspicious. As always, we need to suspect
> hardware problems but I'm highly abstracted from them -- using esx +
> san.

Ah, so you're subject not only to hardware flaws but also to
virtualization layer bugs :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Merlin Moncure
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
 wrote:
> Merlin Moncure wrote:
>
>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>> castaging-#   SELECT ...
>> ERROR:  42809: "pg_cast_oid_index" is an index
>> LINE 11:   FROM ApartmentSample s
>> ^
>> LOCATION:  heap_openrv_extended, heapam.c:1304
>>
>> should I be restoring from backups?
>
> It's pretty clear to me that you've got catalog corruption here.  You
> can try to fix things manually as they emerge, but that sounds like a
> fool's errand.

Agreed.  The current plan is to restore from backups and recover as much
data as I can.  I'm also applying the latest bugfix release and going to
enable checksums.

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Alvaro Herrera
Merlin Moncure wrote:

> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
> castaging-#   SELECT ...
> ERROR:  42809: "pg_cast_oid_index" is an index
> LINE 11:   FROM ApartmentSample s
> ^
> LOCATION:  heap_openrv_extended, heapam.c:1304
> 
> should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here.  You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Merlin Moncure
On Mon, Oct 17, 2016 at 1:39 PM, Merlin Moncure  wrote:
> On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane  wrote:
>> Merlin Moncure  writes:
>>> Today I had an emergency production outage on a server.
>>> ...
>>> Adding all this up it smells like processes were getting stuck on a 
>>> spinlock.
>>
>> Maybe.  If it happens again, probably the most useful debug data would
>> be stack traces from some of the busy processes.
>
> Another odd datapoint on this server. Things were running pretty good
> but an application crashed on a missing view.  Trying to recreate the
> view, I got:
>
> CREATE OR REPLACE VIEW vw_ApartmentQueueLastGood AS
>   SELECT ...
>
> ERROR:  type "vw_apartmentqueuelastgood" already exists
> HINT:  A relation has an associated type of the same name, so you must
> use a name that doesn't conflict with any existing type.
>
> ...which was pretty strange.  I had to manually delete the pg_type
> record in order to create the view.   I'm getting more reports of
> 'could not open relation with oid=X' errors so I could be facing data
> corruption :(.

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-#   SELECT ...
ERROR:  42809: "pg_cast_oid_index" is an index
LINE 11:   FROM ApartmentSample s
^
LOCATION:  heap_openrv_extended, heapam.c:1304

should I be restoring from backups?
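
For what it's worth, a rough way to see where that name lookup is
landing (an illustrative query, not output captured from the affected
system; 'apartmentsample' is simply the lower-cased relation name from
the failing statement):

  SELECT c.oid, c.relname, c.relkind, n.nspname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
   WHERE c.relname IN ('apartmentsample', 'pg_cast_oid_index');

If the user table's row is missing, or carries an unexpected relkind or
OID while pg_cast_oid_index shows up as relkind 'i' as it should, that
points at scrambled pg_class entries rather than anything in the view
definition itself.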

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Merlin Moncure
On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> Today I had an emergency production outage on a server.
>> ...
>> Adding all this up it smells like processes were getting stuck on a spinlock.
>
> Maybe.  If it happens again, probably the most useful debug data would
> be stack traces from some of the busy processes.

Another odd datapoint on this server. Things were running pretty good
but an application crashed on a missing view.  Trying to recreate the
view, I got:

CREATE OR REPLACE VIEW vw_ApartmentQueueLastGood AS
  SELECT ...

ERROR:  type "vw_apartmentqueuelastgood" already exists
HINT:  A relation has an associated type of the same name, so you must
use a name that doesn't conflict with any existing type.

...which was pretty strange.  I had to manually delete the pg_type
record in order to create the view.   I'm getting more reports of
'could not open relation with oid=X' errors so I could be facing data
corruption :(.
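
To make that manual step concrete (an illustrative sketch only --
hand-editing the catalogs is a last resort, and the type name below is
just the one from the error above): the leftover row can be spotted by
looking for a composite type whose backing relation no longer exists,
e.g.

  SELECT t.oid, t.typname, t.typrelid
    FROM pg_type t
    LEFT JOIN pg_class c ON c.oid = t.typrelid
   WHERE t.typname = 'vw_apartmentqueuelastgood'
     AND c.oid IS NULL;           -- the relation it points at is gone

  -- only once it is clearly orphaned:
  -- DELETE FROM pg_type WHERE typname = 'vw_apartmentqueuelastgood';

A matching array type (typically '_vw_apartmentqueuelastgood') may need
the same treatment.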

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-13 Thread Tom Lane
Merlin Moncure  writes:
> Today I had an emergency production outage on a server.
> ...
> Adding all this up it smells like processes were getting stuck on a spinlock.

Maybe.  If it happens again, probably the most useful debug data would
be stack traces from some of the busy processes.
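
To pick out which PIDs are worth attaching to, something along these
lines works on 9.2 and later (illustrative, nothing specific to this
incident):

  SELECT pid, state, now() - query_start AS runtime, left(query, 60) AS query
    FROM pg_stat_activity
   WHERE state <> 'idle'
   ORDER BY runtime DESC;

The stack traces themselves then come from attaching gdb to those PIDs
on the host.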

regards, tom lane

