Re: [GENERAL] Index corruption & broken clog
I seem that a failed vmotion occured in the same time frame.
Re: [GENERAL] Index corruption & broken clog
Thanks for the help. > Shut down the corrupted database and make a physical backup of it. > I did it before trying anything. > Check the disks and memory on the machine where the problem occurred. > We are using virtual machines (VMWare). I will ask then to investigate anyway. > You can try to salvage data from the corrupted tables, either > using "zero_damaged_pages" (but that will only work if the page > header is corrupted), or by excluding problematic tuples by their > "ctid". > Thanks will try.
Re: [GENERAL] Index corruption & broken clog
Benoit Lobréau wrote: > One of my PostgreSQL server crashed badly yesterday. A process was killed > (see dmesg below) and postgres was stuck with theses process: > > postgres 2083 1 0 Oct08 ? 00:19:02 > /usr/lib/postgresql/9.5/bin/postgres -D /home/postgres/data/i090/systeme > postgres 2221 2083 0 Oct08 ? 00:02:03 postgres: i090: logger > process > postgres 14068 2083 8 Oct09 ? 02:50:22 [postgres] > postgres 19668 19665 0 09:40 pts/0 00:00:00 -su > > I couldn't stop it with "pg_ctl stop immediate". We had to kill them and > release the shared memory block. > > dmesg showed this. > > [Tue Oct 10 07:45:29 2017] postgres[25506]: segfault at 7f2253ecb000 ip > 7f225aead994 sp 7ffc2a9c0c28 error 4 in > libc-2.23.so[7f225ae0e000+1c] > [Tue Oct 10 07:45:30 2017] BUG: Bad rss-counter state mm:8800516a7c00 > idx:0 val:178 > [Tue Oct 10 07:45:31 2017] BUG: Bad page map in process postgres > pte:480090248c8d pmd:2c3de067 > > When I checked the postgresql log I found lots of these : > > could not read block 76638 in file > "pg_tblspc/16395/PG_9.5_201510051/16396/20082": read only 0 of 8192 bytes > could not read block 76669 in file > "pg_tblspc/16395/PG_9.5_201510051/16396/19993": read only 0 of 8192 bytes > index "degrade_pkey" contains corrupted page at block 60392 > index "degrade_pkey" contains unexpected zero page at block 60392 > index "idx_coor_brute_geometrie" contains corrupted page at block 53061 > index "idx_coor_brute_geometrie" contains corrupted page at block 56109 > index "idx_coor_brute_geometrie" contains corrupted page at block 58867 > index "idx_coor_brute_geometrie" contains corrupted page at block 59003 > index "idx_coor_brute_geometrie" contains corrupted page at block 60138 > index "idx_coor_brute_geometrie" contains corrupted page at block 71974 > index "idx_coor_brute_geometrie" contains corrupted page at block 80968 > index "idx_coor_brute_geometrie" contains corrupted page at block 86777 > index "idx_coor_brute_geometrie" contains unexpected zero page at block 78586 > index "idx_coor_proj_ligne_geometrie" contains corrupted page at block 64349 > index "idx_numero_course" contains corrupted page at block 3435 > index "idx_numero_course" contains corrupted page at block 7176 > index "idx_numero_engin" contains corrupted page at block 30996 > index "idx_utc_envoi_composant_amont" contains corrupted page at block 76497 > index "idx_utc_envoi_reseau" contains corrupted page at block 76524 > right sibling's left-link doesn't match: block 102923 links to 49947 instead > of expected 1 in index "idx_utc_reception_fournisseur" > right sibling's left-link doesn't match: block 103014 links to 51087 instead > of expected 1 in index "idx_utc_reception_composant_amont" > right sibling's left-link doesn't match: block 76978 links to 59148 instead > of expected 1 in index "idx_utc_envoi_reseau" > right sibling's left-link doesn't match: block 77073 links to 82204 instead > of expected 1 in index "idx_utc_envoi_composant_amont" > > The errors started to show up on inserts 8 hours before the crash. > I dont understand because when I look into theses pages with pageinspect they > are not empty / dont look bad. > For exemple, this "SELECT * FROM bt_page_items('degrade_pkey',60392);" > showed a normal page but the error messages claims its zeroed. > > I checked the tables for duplicates on the primary key. Checksum are enabled > and I didn't get any error message when I scanned the tables. > I found 77 duplicates on one table and this: > > ERROR: could not access status of transaction 3443523584 > DETAIL: Could not open file "pg_clog/0CD4": No such file or directory. > > I dont think this transaction ever existed given the files I found in the > clog directory. > > postgres@ulbdgomp01:~/data/igomp090/systeme/pg_clog$ ls -al > total 30292 > drwx-- 2 postgres postgres 4096 Oct 9 22:50 . > drwx-- 18 postgres postgres 4096 Oct 12 02:32 .. > -rw--- 1 postgres postgres 262144 Aug 16 22:50 > -rw--- 1 postgres postgres 262144 Aug 17 16:20 0001 > -rw--- 1 postgres postgres 262144 Aug 17 22:25 0002 > ... > -rw--- 1 postgres postgres 262144 Oct 7 19:31 0072 > -rw--- 1 postgres postgres 262144 Oct 9 22:50 0073 > -rw--- 1 postgres postgres 122880 Oct 12 11:56 0074 > > In your opinion : > * am I missing a (lots of) clog file(s) or is it a error in the heap page > giving an erroneous txid ? > * Is there a way to find which row has the transaction number 3443523584 (in > t_xmin t_xmax I suppose)? > * Why am I not seeing any obiously broken page when I check with pageinspect > ? > > We are using: > * Ubuntu 16.04 LTS > * PotgreSQL 9.5.8 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu > 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit PostgreSQL did not complain about an index being zeroed, it just says it can't find a certain page (which probably never
[GENERAL] Index corruption & broken clog
Hi, One of my PostgreSQL server crashed badly yesterday. A process was killed (see dmesg below) and postgres was stuck with theses process: postgres 2083 1 0 Oct08 ?00:19:02 /usr/lib/postgresql/9.5/bin/postgres -D /home/postgres/data/i090/systeme postgres 2221 2083 0 Oct08 ?00:02:03 postgres: i090: logger process postgres 14068 2083 8 Oct09 ?02:50:22 [postgres] postgres 19668 19665 0 09:40 pts/000:00:00 -su I couldn't stop it with "pg_ctl stop immediate". We had to kill them and release the shared memory block. dmesg showed this. [Tue Oct 10 07:45:29 2017] postgres[25506]: segfault at 7f2253ecb000 ip 7f225aead994 sp 7ffc2a9c0c28 error 4 in libc-2.23.so [7f225ae0e000+1c] [Tue Oct 10 07:45:30 2017] BUG: Bad rss-counter state mm:8800516a7c00 idx:0 val:178 [Tue Oct 10 07:45:31 2017] BUG: Bad page map in process postgres pte:480090248c8d pmd:2c3de067 When I checked the postgresql log I found lots of these : could not read block 76638 in file "pg_tblspc/16395/PG_9.5_201510051/16396/20082": read only 0 of 8192 bytes could not read block 76669 in file "pg_tblspc/16395/PG_9.5_201510051/16396/19993": read only 0 of 8192 bytes index "degrade_pkey" contains corrupted page at block 60392 index "degrade_pkey" contains unexpected zero page at block 60392 index "idx_coor_brute_geometrie" contains corrupted page at block 53061 index "idx_coor_brute_geometrie" contains corrupted page at block 56109 index "idx_coor_brute_geometrie" contains corrupted page at block 58867 index "idx_coor_brute_geometrie" contains corrupted page at block 59003 index "idx_coor_brute_geometrie" contains corrupted page at block 60138 index "idx_coor_brute_geometrie" contains corrupted page at block 71974 index "idx_coor_brute_geometrie" contains corrupted page at block 80968 index "idx_coor_brute_geometrie" contains corrupted page at block 86777 index "idx_coor_brute_geometrie" contains unexpected zero page at block 78586 index "idx_coor_proj_ligne_geometrie" contains corrupted page at block 64349 index "idx_numero_course" contains corrupted page at block 3435 index "idx_numero_course" contains corrupted page at block 7176 index "idx_numero_engin" contains corrupted page at block 30996 index "idx_utc_envoi_composant_amont" contains corrupted page at block 76497 index "idx_utc_envoi_reseau" contains corrupted page at block 76524 right sibling's left-link doesn't match: block 102923 links to 49947 instead of expected 1 in index "idx_utc_reception_fournisseur" right sibling's left-link doesn't match: block 103014 links to 51087 instead of expected 1 in index "idx_utc_reception_composant_amont" right sibling's left-link doesn't match: block 76978 links to 59148 instead of expected 1 in index "idx_utc_envoi_reseau" right sibling's left-link doesn't match: block 77073 links to 82204 instead of expected 1 in index "idx_utc_envoi_composant_amont" The errors started to show up on inserts 8 hours before the crash. I dont understand because when I look into theses pages with pageinspect they are not empty / dont look bad. For exemple, this "SELECT * FROM bt_page_items('degrade_pkey',60392);" showed a normal page but the error messages claims its zeroed. I checked the tables for duplicates on the primary key. Checksum are enabled and I didn't get any error message when I scanned the tables. I found 77 duplicates on one table and this: ERROR: could not access status of transaction 3443523584 DETAIL: Could not open file "pg_clog/0CD4": No such file or directory. I dont think this transaction ever existed given the files I found in the clog directory. postgres@ulbdgomp01:~/data/igomp090/systeme/pg_clog$ ls -al total 30292 drwx-- 2 postgres postgres 4096 Oct 9 22:50 . drwx-- 18 postgres postgres 4096 Oct 12 02:32 .. -rw--- 1 postgres postgres 262144 Aug 16 22:50 -rw--- 1 postgres postgres 262144 Aug 17 16:20 0001 -rw--- 1 postgres postgres 262144 Aug 17 22:25 0002 ... -rw--- 1 postgres postgres 262144 Oct 7 19:31 0072 -rw--- 1 postgres postgres 262144 Oct 9 22:50 0073 -rw--- 1 postgres postgres 122880 Oct 12 11:56 0074 In your opinion : * am I missing a (lots of) clog file(s) or is it a error in the heap page giving an erroneous txid ? * Is there a way to find which row has the transaction number 3443523584 (in t_xmin t_xmax I suppose)? * Why am I not seeing any obiously broken page when I check with pageinspect ? We are using: * Ubuntu 16.04 LTS * PotgreSQL 9.5.8 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit Thx a lot. Benoit.