Re: [HACKERS] emergency outage requiring database restart

2017-08-10 Thread Merlin Moncure
On Thu, Aug 10, 2017 at 12:01 PM, Ants Aasma wrote: > On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure wrote: >> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma wrote: >>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure

Re: [HACKERS] emergency outage requiring database restart

2017-08-10 Thread Ants Aasma
On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure wrote: > On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma wrote: >> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure wrote: >>> Still getting checksum failures. Over the last 30 days, I see

Re: [HACKERS] emergency outage requiring database restart

2017-01-30 Thread Michael Paquier
On Wed, Jan 4, 2017 at 4:17 AM, Peter Eisentraut wrote: > It seems like everyone was generally in favor of this. I looked around > the internet for caveats but everyone was basically saying, you should > definitely do this. > > Why not for EXEC_BACKEND? > >

Re: [HACKERS] emergency outage requiring database restart

2017-01-18 Thread Merlin Moncure
On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma wrote: > On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure wrote: >> Still getting checksum failures. Over the last 30 days, I see the >> following. Since enabling checksums FWICT none of the damage is >>

Re: [HACKERS] emergency outage requiring database restart

2017-01-18 Thread Ants Aasma
On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure wrote: > Still getting checksum failures. Over the last 30 days, I see the > following. Since enabling checksums FWICT none of the damage is > permanent and rolls back with the transaction. So creepy! The checksums still

Re: [HACKERS] emergency outage requiring database restart

2017-01-04 Thread Merlin Moncure
On Tue, Jan 3, 2017 at 1:05 PM, Peter Eisentraut wrote: > On 11/7/16 5:31 PM, Merlin Moncure wrote: >> Regardless, it seems like you might be on to something, and I'm >> inclined to patch your change, test it, and roll it out to production. >> If it helps or at

Re: [HACKERS] emergency outage requiring database restart

2017-01-03 Thread Peter Eisentraut
On 11/2/16 11:45 AM, Oskari Saarenmaa wrote: > 26.10.2016, 21:34, Andres Freund wrote: >> Any chance that plsh or the script it executes does anything with the file >> descriptors it inherits? That'd certainly be one way to get into odd corruption >> issues. >> >> We really should use

Re: [HACKERS] emergency outage requiring database restart

2017-01-03 Thread Peter Eisentraut
On 11/7/16 5:31 PM, Merlin Moncure wrote: > Regardless, it seems like you might be on to something, and I'm > inclined to patch your change, test it, and roll it out to production. > If it helps or at least narrows the problem down, we ought to give it > consideration for inclusion (unless someone

Re: [HACKERS] emergency outage requiring database restart

2016-11-07 Thread Merlin Moncure
On Wed, Nov 2, 2016 at 10:45 AM, Oskari Saarenmaa wrote: > 26.10.2016, 21:34, Andres Freund wrote: >> >> Any chance that plsh or the script it executes does anything with the file >> descriptors it inherits? That'd certainly be one way to get into odd corruption >> issues. >> >> We

Re: [HACKERS] emergency outage requiring database restart

2016-11-02 Thread Oskari Saarenmaa
26.10.2016, 21:34, Andres Freund wrote: Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues. We really should use O_CLOEXEC for the majority of our file handles. Attached

Re: [HACKERS] emergency outage requiring database restart

2016-11-01 Thread Merlin Moncure
On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane wrote: > Merlin Moncure writes: >> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa wrote: >>> Your production system's postgres backends probably have a lot more open >>> files associated with them

Re: [HACKERS] emergency outage requiring database restart

2016-11-01 Thread Andres Freund
On 2016-11-01 09:56:45 -0400, Tom Lane wrote: > The real problem with Oskari's theory is that it requires not merely > busted, but positively brain-dead error handling in the shell and/or > sqsh, ie ignoring open() failures altogether. That seems kind of > unlikely. Still, I suspect he might be

Re: [HACKERS] emergency outage requiring database restart

2016-11-01 Thread Tom Lane
Merlin Moncure writes: > On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa wrote: >> Your production system's postgres backends probably have a lot more open >> files associated with them than the simple test case does. Since Postgres >> likes to keep files

Re: [HACKERS] emergency outage requiring database restart

2016-11-01 Thread Merlin Moncure
On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa wrote: > 27.10.2016, 21:53, Merlin Moncure wrote: >> >> As noted earlier, I was not able to reproduce the issue with >> crashme.sh, which was: >> >> NUM_FORKS=16 >> do_parallel psql -p 5432 -c"select PushMarketSample('1740')"

Re: [HACKERS] emergency outage requiring database restart

2016-10-31 Thread Oskari Saarenmaa
27.10.2016, 21:53, Merlin Moncure wrote: As noted earlier, I was not able to reproduce the issue with crashme.sh, which was: NUM_FORKS=16 do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test do_parallel psql -p 5432 -c"select PushMarketSample('4400')" castaging_test

Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Merlin Moncure
On Fri, Oct 28, 2016 at 3:16 PM, Jim Nasby wrote: > On 10/28/16 8:23 AM, Merlin Moncure wrote: >> >> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark wrote: >>> >>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure >>> wrote: I

Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Jim Nasby
On 10/28/16 8:23 AM, Merlin Moncure wrote: On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark wrote: On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure wrote: I think we can rule out faulty storage Nobody ever expects the faulty storage LOL Believe me, I know.

Re: [HACKERS] emergency outage requiring database restart

2016-10-28 Thread Merlin Moncure
On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark wrote: > On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure wrote: >> I think we can rule out faulty storage > > Nobody ever expects the faulty storage Believe me, I know. But the evidence points elsewhere in this

Re: [HACKERS] emergency outage requiring database restart

2016-10-27 Thread Greg Stark
On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure wrote: > I think we can rule out faulty storage Nobody ever expects the faulty storage -- greg

Re: [HACKERS] emergency outage requiring database restart

2016-10-27 Thread Merlin Moncure
On Thu, Oct 27, 2016 at 2:31 AM, Ants Aasma wrote: > On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure wrote: >> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page >> verification" >> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING: page

Re: [HACKERS] emergency outage requiring database restart

2016-10-27 Thread Ants Aasma
On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure wrote: > /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page > verification" > 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING: page > verification failed, calculated checksum 37251 but expected 37244 >
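The grep for "page verification" quoted above can also be done with a small script that pulls out the calculated and expected checksum from each warning. This is only a sketch: the regular expression follows the warning text shown in the quoted log line, the function name `checksum_failures` is made up for illustration, and the second sample line is invented filler rather than real log output.

```python
import re

# Matches the warning text as it appears in the quoted server log line.
PATTERN = re.compile(
    r"page verification failed, calculated checksum (\d+) but expected (\d+)"
)


def checksum_failures(lines):
    """Yield (calculated, expected) for every checksum warning in `lines`."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2))


# First line is taken from the thread; the second is hypothetical filler.
sample = [
    "2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING: page "
    "verification failed, calculated checksum 37251 but expected 37244",
    "2016-10-26 11:26:43 CDT [postgres@castaging]: LOG: some unrelated line",
]

if __name__ == "__main__":
    for calculated, expected in checksum_failures(sample):
        print(f"calculated={calculated} expected={expected}")
```

To scan a real log, pass an open file object instead of `sample`; the function only needs an iterable of lines.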

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Andres Freund
On 2016-10-26 15:06:34 -0500, Jim Nasby wrote: > Removing the exec might "solve" the problem here, assuming that the > forked process doesn't still inherit all open FH's. Unless you explicitly close fds or use FD_CLOEXEC when opening fds they'll be inherited forever.
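The inheritance behaviour described here can be seen with a small experiment. The following is a rough sketch, not anything taken from PostgreSQL or plsh: it uses Python on a POSIX system, and the helper names `set_cloexec` and `child_sees_fd` are made up for illustration. It opens a temporary file, spawns a child process, and checks whether the descriptor is still open in the child with and without FD_CLOEXEC set.

```python
import fcntl
import subprocess
import sys
import tempfile


def set_cloexec(fd: int, enabled: bool) -> None:
    """Set or clear the FD_CLOEXEC flag on an open descriptor."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)


def child_sees_fd(fd: int) -> bool:
    """Exec a child process and report whether `fd` is open in it."""
    # The child calls fstat() on the descriptor number it is given;
    # that raises OSError (non-zero exit) if the descriptor is closed.
    probe = "import os, sys; os.fstat(int(sys.argv[1]))"
    result = subprocess.run(
        [sys.executable, "-c", probe, str(fd)],
        close_fds=False,  # do not let subprocess close descriptors for us
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        set_cloexec(fd, False)  # plain descriptor, inherited across exec
        print("without FD_CLOEXEC, child sees fd:", child_sees_fd(fd))
        set_cloexec(fd, True)   # closed automatically at exec time
        print("with FD_CLOEXEC, child sees fd:", child_sees_fd(fd))
```

Note that Python marks descriptors it opens as close-on-exec by default, which is why the sketch clears the flag explicitly to imitate a descriptor opened without O_CLOEXEC.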

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 3:06 PM, Jim Nasby wrote: > On 10/26/16 2:25 PM, Merlin Moncure wrote: >> >> I don't think that's the case. sqsh is a psql-like utility. it >> writes to stdout and stderr only which is captured by plsh and sent. >> In this context shexec only

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Jim Nasby
On 10/26/16 2:25 PM, Merlin Moncure wrote: I don't think that's the case. sqsh is a psql-like utility. it writes to stdout and stderr only which is captured by plsh and sent. In this context shexec only wraps rm -f 'file' where 'file' is a file previously created with COPY in the same

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 2:12 PM, Andres Freund wrote: > On 2016-10-26 13:49:12 -0500, Merlin Moncure wrote: >> On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund wrote: >> > >> > >> > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure >> >

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Andres Freund
On 2016-10-26 13:49:12 -0500, Merlin Moncure wrote: > On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund wrote: > > > > > > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure > > wrote: > >>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund wrote: > > > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure > wrote: >>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund >>wrote: >>> Any chance that plsh or the script it executes

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Andres Freund
On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure wrote: >On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund >wrote: >> Any chance that plsh or the script it executes does anything with the >file descriptors it inherits? That'd certainly be one way to

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund wrote: > Any chance that plsh or the script it executes does anything with the file > descriptors it inherits? That'd certainly be one way to get into odd corruption > issues. not sure. it's pretty small -- see

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Andres Freund
On October 26, 2016 8:57:22 PM GMT+03:00, Merlin Moncure wrote: >On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure >wrote: >> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure >wrote: >>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 1:09 PM, Tom Lane wrote: > Merlin Moncure writes: >> *) I've now strongly correlated this routine with the damage. > > Hmm. Do you have any way to replace the non-core calls with something > else? The "shexec('rm -f ' ||

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Tom Lane
Merlin Moncure writes: > *) I've now strongly correlated this routine with the damage. Hmm. Do you have any way to replace the non-core calls with something else? The "shexec('rm -f ' || _OutputFile)" bits could presumably be converted to use contrib/adminpack's

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure wrote: > On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure wrote: >> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure wrote: >>> Confirmation of problem re-occurrence will come in a few

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure wrote: > On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure wrote: >> Confirmation of problem re-occurrence will come in a few days. I'm >> much more likely to believe 6+sigma occurrence (storage, freak bug,

Re: [HACKERS] emergency outage requiring database restart

2016-10-26 Thread Merlin Moncure
On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure wrote: > Confirmation of problem re-occurrence will come in a few days. I'm > much more likely to believe 6+sigma occurrence (storage, freak bug, > etc) should it prove the problem goes away post rebuild. ok, no major

Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Jim Nasby
On 10/22/16 12:38 PM, Tom Lane wrote: Jim Nasby writes: > On 10/21/16 7:43 PM, Tom Lane wrote: >> Alvaro Herrera writes: >>> Agreed. The problem is how to install it without breaking pg_upgrade. > It can't look up relation names? It

Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Merlin Moncure
On Tue, Oct 25, 2016 at 2:31 PM, Tom Lane wrote: > Merlin Moncure writes: >> What if the subsequent dataloss was in fact a symptom of the first >> outage? Is it in theory possible for data to appear visible but then be >> eaten up as the transactions making

Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Tom Lane
Merlin Moncure writes: > What if the subsequent dataloss was in fact a symptom of the first > outage? Is it in theory possible for data to appear visible but then be > eaten up as the transactions making the data visible get voided out by > some other mechanic? I had to pull a

Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Merlin Moncure
On Tue, Oct 25, 2016 at 12:57 PM, Alvaro Herrera wrote: > Merlin Moncure wrote: > >> After last night, I rebuilt the cluster, turning on checksums, turning >> on synchronous commit (it was off) and added a standby replica. This >> should help narrow the problem down

Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Alvaro Herrera
Merlin Moncure wrote: > After last night, I rebuilt the cluster, turning on checksums, turning > on synchronous commit (it was off) and added a standby replica. This > should help narrow the problem down should it re-occur; if storage is > bad (note, other database on same machine is doing 10x

Re: [HACKERS] emergency outage requiring database restart

2016-10-25 Thread Merlin Moncure
On Mon, Oct 24, 2016 at 9:18 PM, Alvaro Herrera wrote: > Merlin Moncure wrote: >> On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure wrote: > >> > Corruption struck again. >> > This time got another case of view busted -- attempting to create >> > gives

Re: [HACKERS] emergency outage requiring database restart

2016-10-24 Thread Alvaro Herrera
Merlin Moncure wrote: > On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure wrote: > > Corruption struck again. > > This time got another case of view busted -- attempting to create > > gives missing 'type' error. > > Call it a hunch -- I think the problem is in pl/sh. I've

Re: [HACKERS] emergency outage requiring database restart

2016-10-24 Thread Merlin Moncure
On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure wrote: > On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure wrote: >> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure wrote: >>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian

Re: [HACKERS] emergency outage requiring database restart

2016-10-24 Thread Merlin Moncure
On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure wrote: > On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure wrote: >> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian wrote: >>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Andres Freund
On October 22, 2016 11:59:15 AM PDT, Tom Lane wrote: >Alvaro Herrera writes: >> Uh, sorry. My proposal a couple of years back was to put the >> relfilenode, not the name. I didn't notice that it was the name >being >> proposed here. However, now

Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Tom Lane
Alvaro Herrera writes: > Uh, sorry. My proposal a couple of years back was to put the > relfilenode, not the name. I didn't notice that it was the name being > proposed here. However, now I notice that this idea doesn't solve the > problem for mapped relations. Well,

Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Alvaro Herrera
Tom Lane wrote: > Alvaro Herrera writes: > > Jim Nasby wrote: > >> It occurs to me that it might be worth embedding the relation name in the > >> free space of the first block. Most people would never notice the missing > >> 64 > >> bytes, but it would be incredibly

Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Tom Lane
Jim Nasby writes: > On 10/21/16 7:43 PM, Tom Lane wrote: >> Alvaro Herrera writes: >>> Agreed. The problem is how to install it without breaking pg_upgrade. > It can't look up relation names? It can't shove 64 bytes into a page that has < 64

Re: [HACKERS] emergency outage requiring database restart

2016-10-22 Thread Jim Nasby
On 10/21/16 7:43 PM, Tom Lane wrote: Alvaro Herrera writes: Jim Nasby wrote: It occurs to me that it might be worth embedding the relation name in the free space of the first block. Most people would never notice the missing 64 bytes, but it would be incredibly

Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Tom Lane
Alvaro Herrera writes: > Jim Nasby wrote: >> It occurs to me that it might be worth embedding the relation name in the >> free space of the first block. Most people would never notice the missing 64 >> bytes, but it would be incredibly helpful in cases like this... >

Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Alvaro Herrera
Jim Nasby wrote: > On 10/21/16 2:02 PM, Alvaro Herrera wrote: > > Merlin Moncure wrote: > > > > > OK, I have some good (very- in the specific case of yours truly) news > > > to report. Doing a filesystem level copy to a test server I was able > > > to relfilenode swap one of the critical tables

Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Jim Nasby
On 10/21/16 2:02 PM, Alvaro Herrera wrote: Merlin Moncure wrote: OK, I have some good (very- in the specific case of yours truly) news to report. Doing a filesystem level copy to a test server I was able to relfilenode swap one of the critical tables over the place of the relfilenode of the

Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Alvaro Herrera
Merlin Moncure wrote: > OK, I have some good (very- in the specific case of yours truly) news > to report. Doing a filesystem level copy to a test server I was able > to relfilenode swap one of the critical tables over the place of the > relfilenode of the stored backup. Not being able to know the

Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Merlin Moncure
On Fri, Oct 21, 2016 at 1:37 PM, Merlin Moncure wrote: > On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner wrote: >> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure wrote: >> >>> Most or all the damage seemed to be to the system catalogs

Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Merlin Moncure
On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner wrote: > On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure wrote: > >> Most or all the damage seemed to be to the system catalogs with >> at least two critical tables dropped or inaccessible in some >> fashion.

Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Kevin Grittner
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure wrote: > Most or all the damage seemed to be to the system catalogs with > at least two critical tables dropped or inaccessible in some > fashion. A lot of the OIDs seemed to be pointing at the wrong > thing. While the oid in

Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Merlin Moncure
On Thu, Oct 20, 2016 at 3:16 PM, Alvaro Herrera wrote: > Merlin Moncure wrote: > >> single user mode dumps core :( >> >> bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging >> LOG: 0: could not change directory to "/root": Permission denied >> LOCATION:

Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Alvaro Herrera
Merlin Moncure wrote: > single user mode dumps core :( > > bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging > LOG: 0: could not change directory to "/root": Permission denied > LOCATION: resolve_symlinks, exec.c:293 > Segmentation fault (core dumped) > > Core was generated

Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Merlin Moncure
On Thu, Oct 20, 2016 at 2:07 PM, Tom Lane wrote: > Merlin Moncure writes: >> single user mode dumps core :( > > You've got a mess there :-( > >> Missing separate debuginfos, use: debuginfo-install >> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64 > > This

Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Tom Lane
Merlin Moncure writes: > single user mode dumps core :( You've got a mess there :-( > Missing separate debuginfos, use: debuginfo-install > postgresql95-server-9.5.2-1PGDG.rhel6.x86_64 This backtrace would likely be much more informative if you did the above.

Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Merlin Moncure
On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure wrote: > On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure wrote: >> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian wrote: >>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Re: [HACKERS] emergency outage requiring database restart

2016-10-20 Thread Merlin Moncure
On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure wrote: > On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian wrote: >> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote: >>> > Yeah. Believe me -- I know the drill. Most or all the damage seemed >>>

Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Merlin Moncure
On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian wrote: > On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote: >> > Yeah. Believe me -- I know the drill. Most or all the damage seemed >> > to be to the system catalogs with at least two critical tables dropped >> > or

Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Bruce Momjian
On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote: > > Yeah. Believe me -- I know the drill. Most or all the damage seemed > > to be to the system catalogs with at least two critical tables dropped > > or inaccessible in some fashion. A lot of the OIDs seemed to be > > pointing at

Re: [HACKERS] emergency outage requiring database restart

2016-10-19 Thread Merlin Moncure
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure wrote: > On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera > wrote: >> Merlin Moncure wrote: >> >>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS >>> castaging-# SELECT ... >>> ERROR: 42809:

Re: [HACKERS] emergency outage requiring database restart

2016-10-18 Thread Merlin Moncure
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera wrote: > Merlin Moncure wrote: > >> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS >> castaging-# SELECT ... >> ERROR: 42809: "pg_cast_oid_index" is an index >> LINE 11: FROM ApartmentSample s >>

Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Gavin Flower
On 18/10/16 14:12, Michael Paquier wrote: On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera wrote: Merlin Moncure wrote: We had several good backups since the previous outage so it's not clear the events are related but after months of smooth operation I find that

Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Michael Paquier
On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera wrote: > Merlin Moncure wrote: > >> We had several good backups since the previous outage so it's not >> clear the events are related but after months of smooth operation I >> find that coincidence highly suspicious. As

Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Alvaro Herrera
Merlin Moncure wrote: > We had several good backups since the previous outage so it's not > clear the events are related but after months of smooth operation I > find that coincidence highly suspicious. As always, we need to suspect > hardware problems but I'm highly abstracted from them -- using

Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Merlin Moncure
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera wrote: > Merlin Moncure wrote: > >> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS >> castaging-# SELECT ... >> ERROR: 42809: "pg_cast_oid_index" is an index >> LINE 11: FROM ApartmentSample s >>

Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Alvaro Herrera
Merlin Moncure wrote: > castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS > castaging-# SELECT ... > ERROR: 42809: "pg_cast_oid_index" is an index > LINE 11: FROM ApartmentSample s > ^ > LOCATION: heap_openrv_extended, heapam.c:1304 > > should I be restoring from

Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Merlin Moncure
On Mon, Oct 17, 2016 at 1:39 PM, Merlin Moncure wrote: > On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane wrote: >> Merlin Moncure writes: >>> Today I had an emergency production outage on a server. >>> ... >>> Adding all this up it smells

Re: [HACKERS] emergency outage requiring database restart

2016-10-17 Thread Merlin Moncure
On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane wrote: > Merlin Moncure writes: >> Today I had an emergency production outage on a server. >> ... >> Adding all this up it smells like processes were getting stuck on a spinlock. > > Maybe. If it happens again,

Re: [HACKERS] emergency outage requiring database restart

2016-10-13 Thread Tom Lane
Merlin Moncure writes: > Today I had an emergency production outage on a server. > ... > Adding all this up it smells like processes were getting stuck on a spinlock. Maybe. If it happens again, probably the most useful debug data would be stack traces from some of the busy

[HACKERS] emergency outage requiring database restart

2016-10-13 Thread Merlin Moncure
Today I had an emergency production outage on a server. This particular server was running 9.5.2. The symptoms were interesting so I thought I'd report. Here is what I saw: *) User CPU was pegged 100% *) Queries reading data would block and not respond to cancel or terminate *)