Re: [GENERAL] Index corruption & broken clog

2017-10-13 Thread Benoit Lobréau
I seem that a failed vmotion occured in the same time frame.


Re: [GENERAL] Index corruption & broken clog

2017-10-13 Thread Benoit Lobréau
Thanks for the help.



> Shut down the corrupted database and make a physical backup of it.
>

I did it before trying anything.



> Check the disks and memory on the machine where the problem occurred.
>

We are using virtual machines (VMWare). I will ask then to investigate
anyway.


> You can try to salvage data from the corrupted tables, either
> using "zero_damaged_pages" (but that will only work if the page
> header is corrupted), or by excluding problematic tuples by their
> "ctid".
>

Thanks will try.


Re: [GENERAL] Index corruption & broken clog

2017-10-13 Thread Laurenz Albe
Benoit Lobréau wrote:
> One of my PostgreSQL server crashed badly yesterday. A process was killed 
> (see dmesg below) and postgres was stuck with theses process:
> 
> postgres  2083     1  0 Oct08 ?        00:19:02 
> /usr/lib/postgresql/9.5/bin/postgres -D /home/postgres/data/i090/systeme
> postgres  2221  2083  0 Oct08 ?        00:02:03 postgres: i090: logger 
> process
> postgres 14068  2083  8 Oct09 ?        02:50:22 [postgres]
> postgres 19668 19665  0 09:40 pts/0    00:00:00 -su
> 
> I couldn't stop it with "pg_ctl stop immediate". We had to kill them and 
> release the shared memory block.
> 
> dmesg showed this.
> 
> [Tue Oct 10 07:45:29 2017] postgres[25506]: segfault at 7f2253ecb000 ip 
> 7f225aead994 sp 7ffc2a9c0c28 error 4 in 
> libc-2.23.so[7f225ae0e000+1c]
> [Tue Oct 10 07:45:30 2017] BUG: Bad rss-counter state mm:8800516a7c00 
> idx:0 val:178
> [Tue Oct 10 07:45:31 2017] BUG: Bad page map in process postgres  
> pte:480090248c8d pmd:2c3de067 
> 
> When I checked the postgresql log I found lots of these :
> 
>  could not read block 76638 in file 
> "pg_tblspc/16395/PG_9.5_201510051/16396/20082": read only 0 of 8192 bytes
>  could not read block 76669 in file 
> "pg_tblspc/16395/PG_9.5_201510051/16396/19993": read only 0 of 8192 bytes
>  index "degrade_pkey" contains corrupted page at block 60392
>  index "degrade_pkey" contains unexpected zero page at block 60392
>  index "idx_coor_brute_geometrie" contains corrupted page at block 53061
>  index "idx_coor_brute_geometrie" contains corrupted page at block 56109
>  index "idx_coor_brute_geometrie" contains corrupted page at block 58867
>  index "idx_coor_brute_geometrie" contains corrupted page at block 59003
>  index "idx_coor_brute_geometrie" contains corrupted page at block 60138
>  index "idx_coor_brute_geometrie" contains corrupted page at block 71974
>  index "idx_coor_brute_geometrie" contains corrupted page at block 80968
>  index "idx_coor_brute_geometrie" contains corrupted page at block 86777
>  index "idx_coor_brute_geometrie" contains unexpected zero page at block 78586
>  index "idx_coor_proj_ligne_geometrie" contains corrupted page at block 64349
>  index "idx_numero_course" contains corrupted page at block 3435
>  index "idx_numero_course" contains corrupted page at block 7176
>  index "idx_numero_engin" contains corrupted page at block 30996
>  index "idx_utc_envoi_composant_amont" contains corrupted page at block 76497
>  index "idx_utc_envoi_reseau" contains corrupted page at block 76524
>  right sibling's left-link doesn't match: block 102923 links to 49947 instead 
> of expected 1 in index "idx_utc_reception_fournisseur"
>  right sibling's left-link doesn't match: block 103014 links to 51087 instead 
> of expected 1 in index "idx_utc_reception_composant_amont"
>  right sibling's left-link doesn't match: block 76978 links to 59148 instead 
> of expected 1 in index "idx_utc_envoi_reseau"
>  right sibling's left-link doesn't match: block 77073 links to 82204 instead 
> of expected 1 in index "idx_utc_envoi_composant_amont"
>  
> The errors started to show up on inserts 8 hours before the crash. 
> I dont understand because when I look into theses pages with pageinspect they 
> are not empty / dont look bad.
> For exemple, this  "SELECT * FROM bt_page_items('degrade_pkey',60392);" 
> showed a normal page but the error messages claims its zeroed.
> 
> I checked the tables for duplicates on the primary key. Checksum are enabled 
> and I didn't get any error message when I scanned the tables.
> I found 77 duplicates on one table and this:
> 
> ERROR:  could not access status of transaction 3443523584
> DETAIL:  Could not open file "pg_clog/0CD4": No such file or directory.
> 
> I dont think this transaction ever existed given the files I found in the 
> clog directory.
> 
> postgres@ulbdgomp01:~/data/igomp090/systeme/pg_clog$ ls -al
> total 30292
> drwx--  2 postgres postgres   4096 Oct  9 22:50 .
> drwx-- 18 postgres postgres   4096 Oct 12 02:32 ..
> -rw---  1 postgres postgres 262144 Aug 16 22:50 
> -rw---  1 postgres postgres 262144 Aug 17 16:20 0001
> -rw---  1 postgres postgres 262144 Aug 17 22:25 0002
> ...
> -rw---  1 postgres postgres 262144 Oct  7 19:31 0072
> -rw---  1 postgres postgres 262144 Oct  9 22:50 0073
> -rw---  1 postgres postgres 122880 Oct 12 11:56 0074
> 
> In your opinion :
> * am I missing a (lots of) clog file(s) or is it a error in the heap page 
> giving an erroneous txid ?
> * Is there a way to find which row has the transaction number 3443523584 (in 
> t_xmin t_xmax I suppose)? 
> * Why am I not seeing any obiously broken page when I check with pageinspect 
> ? 
> 
> We are using:
> * Ubuntu 16.04 LTS
> * PotgreSQL 9.5.8 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
> 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit

PostgreSQL did not complain about an index being zeroed, it just
says it can't find a certain page (which probably never 

[GENERAL] Index corruption & broken clog

2017-10-12 Thread Benoit Lobréau
Hi,

One of my PostgreSQL server crashed badly yesterday. A process was killed
(see dmesg below) and postgres was stuck with theses process:

postgres  2083 1  0 Oct08 ?00:19:02
/usr/lib/postgresql/9.5/bin/postgres -D /home/postgres/data/i090/systeme
postgres  2221  2083  0 Oct08 ?00:02:03 postgres: i090: logger
process
postgres 14068  2083  8 Oct09 ?02:50:22 [postgres]
postgres 19668 19665  0 09:40 pts/000:00:00 -su

I couldn't stop it with "pg_ctl stop immediate". We had to kill them and
release the shared memory block.

dmesg showed this.

[Tue Oct 10 07:45:29 2017] postgres[25506]: segfault at 7f2253ecb000 ip
7f225aead994 sp 7ffc2a9c0c28 error 4 in libc-2.23.so
[7f225ae0e000+1c]
[Tue Oct 10 07:45:30 2017] BUG: Bad rss-counter state mm:8800516a7c00
idx:0 val:178
[Tue Oct 10 07:45:31 2017] BUG: Bad page map in process postgres
pte:480090248c8d pmd:2c3de067

When I checked the postgresql log I found lots of these :

 could not read block 76638 in file
"pg_tblspc/16395/PG_9.5_201510051/16396/20082": read only 0 of 8192 bytes
 could not read block 76669 in file
"pg_tblspc/16395/PG_9.5_201510051/16396/19993": read only 0 of 8192 bytes
 index "degrade_pkey" contains corrupted page at block 60392
 index "degrade_pkey" contains unexpected zero page at block 60392
 index "idx_coor_brute_geometrie" contains corrupted page at block 53061
 index "idx_coor_brute_geometrie" contains corrupted page at block 56109
 index "idx_coor_brute_geometrie" contains corrupted page at block 58867
 index "idx_coor_brute_geometrie" contains corrupted page at block 59003
 index "idx_coor_brute_geometrie" contains corrupted page at block 60138
 index "idx_coor_brute_geometrie" contains corrupted page at block 71974
 index "idx_coor_brute_geometrie" contains corrupted page at block 80968
 index "idx_coor_brute_geometrie" contains corrupted page at block 86777
 index "idx_coor_brute_geometrie" contains unexpected zero page at block
78586
 index "idx_coor_proj_ligne_geometrie" contains corrupted page at block
64349
 index "idx_numero_course" contains corrupted page at block 3435
 index "idx_numero_course" contains corrupted page at block 7176
 index "idx_numero_engin" contains corrupted page at block 30996
 index "idx_utc_envoi_composant_amont" contains corrupted page at block
76497
 index "idx_utc_envoi_reseau" contains corrupted page at block 76524
 right sibling's left-link doesn't match: block 102923 links to 49947
instead of expected 1 in index "idx_utc_reception_fournisseur"
 right sibling's left-link doesn't match: block 103014 links to 51087
instead of expected 1 in index "idx_utc_reception_composant_amont"
 right sibling's left-link doesn't match: block 76978 links to 59148
instead of expected 1 in index "idx_utc_envoi_reseau"
 right sibling's left-link doesn't match: block 77073 links to 82204
instead of expected 1 in index "idx_utc_envoi_composant_amont"

The errors started to show up on inserts 8 hours before the crash.
I dont understand because when I look into theses pages with pageinspect
they are not empty / dont look bad.
For exemple, this  "SELECT * FROM bt_page_items('degrade_pkey',60392);"
showed a normal page but the error messages claims its zeroed.

I checked the tables for duplicates on the primary key. Checksum are
enabled and I didn't get any error message when I scanned the tables.
I found 77 duplicates on one table and this:

ERROR:  could not access status of transaction 3443523584
DETAIL:  Could not open file "pg_clog/0CD4": No such file or directory.

I dont think this transaction ever existed given the files I found in the
clog directory.

postgres@ulbdgomp01:~/data/igomp090/systeme/pg_clog$ ls -al
total 30292
drwx--  2 postgres postgres   4096 Oct  9 22:50 .
drwx-- 18 postgres postgres   4096 Oct 12 02:32 ..
-rw---  1 postgres postgres 262144 Aug 16 22:50 
-rw---  1 postgres postgres 262144 Aug 17 16:20 0001
-rw---  1 postgres postgres 262144 Aug 17 22:25 0002
...
-rw---  1 postgres postgres 262144 Oct  7 19:31 0072
-rw---  1 postgres postgres 262144 Oct  9 22:50 0073
-rw---  1 postgres postgres 122880 Oct 12 11:56 0074

In your opinion :
* am I missing a (lots of) clog file(s) or is it a error in the heap page
giving an erroneous txid ?
* Is there a way to find which row has the transaction number 3443523584
(in t_xmin t_xmax I suppose)?
* Why am I not seeing any obiously broken page when I check with
pageinspect ?

We are using:
* Ubuntu 16.04 LTS
* PotgreSQL 9.5.8 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit

Thx a lot.

Benoit.