Re: [GENERAL] Index corruption & broken clog

2017-10-13 Thread Benoit Lobréau
I seem that a failed vmotion occured in the same time frame.


Re: [GENERAL] Index corruption & broken clog

2017-10-13 Thread Benoit Lobréau
Thanks for the help.



> Shut down the corrupted database and make a physical backup of it.
>

I did it before trying anything.



> Check the disks and memory on the machine where the problem occurred.
>

We are using virtual machines (VMWare). I will ask then to investigate
anyway.


> You can try to salvage data from the corrupted tables, either
> using "zero_damaged_pages" (but that will only work if the page
> header is corrupted), or by excluding problematic tuples by their
> "ctid".
>

Thanks will try.


[GENERAL] Index corruption & broken clog

2017-10-12 Thread Benoit Lobréau
Hi,

One of my PostgreSQL server crashed badly yesterday. A process was killed
(see dmesg below) and postgres was stuck with theses process:

postgres  2083 1  0 Oct08 ?00:19:02
/usr/lib/postgresql/9.5/bin/postgres -D /home/postgres/data/i090/systeme
postgres  2221  2083  0 Oct08 ?00:02:03 postgres: i090: logger
process
postgres 14068  2083  8 Oct09 ?02:50:22 [postgres]
postgres 19668 19665  0 09:40 pts/000:00:00 -su

I couldn't stop it with "pg_ctl stop immediate". We had to kill them and
release the shared memory block.

dmesg showed this.

[Tue Oct 10 07:45:29 2017] postgres[25506]: segfault at 7f2253ecb000 ip
7f225aead994 sp 7ffc2a9c0c28 error 4 in libc-2.23.so
[7f225ae0e000+1c]
[Tue Oct 10 07:45:30 2017] BUG: Bad rss-counter state mm:8800516a7c00
idx:0 val:178
[Tue Oct 10 07:45:31 2017] BUG: Bad page map in process postgres
pte:480090248c8d pmd:2c3de067

When I checked the postgresql log I found lots of these :

 could not read block 76638 in file
"pg_tblspc/16395/PG_9.5_201510051/16396/20082": read only 0 of 8192 bytes
 could not read block 76669 in file
"pg_tblspc/16395/PG_9.5_201510051/16396/19993": read only 0 of 8192 bytes
 index "degrade_pkey" contains corrupted page at block 60392
 index "degrade_pkey" contains unexpected zero page at block 60392
 index "idx_coor_brute_geometrie" contains corrupted page at block 53061
 index "idx_coor_brute_geometrie" contains corrupted page at block 56109
 index "idx_coor_brute_geometrie" contains corrupted page at block 58867
 index "idx_coor_brute_geometrie" contains corrupted page at block 59003
 index "idx_coor_brute_geometrie" contains corrupted page at block 60138
 index "idx_coor_brute_geometrie" contains corrupted page at block 71974
 index "idx_coor_brute_geometrie" contains corrupted page at block 80968
 index "idx_coor_brute_geometrie" contains corrupted page at block 86777
 index "idx_coor_brute_geometrie" contains unexpected zero page at block
78586
 index "idx_coor_proj_ligne_geometrie" contains corrupted page at block
64349
 index "idx_numero_course" contains corrupted page at block 3435
 index "idx_numero_course" contains corrupted page at block 7176
 index "idx_numero_engin" contains corrupted page at block 30996
 index "idx_utc_envoi_composant_amont" contains corrupted page at block
76497
 index "idx_utc_envoi_reseau" contains corrupted page at block 76524
 right sibling's left-link doesn't match: block 102923 links to 49947
instead of expected 1 in index "idx_utc_reception_fournisseur"
 right sibling's left-link doesn't match: block 103014 links to 51087
instead of expected 1 in index "idx_utc_reception_composant_amont"
 right sibling's left-link doesn't match: block 76978 links to 59148
instead of expected 1 in index "idx_utc_envoi_reseau"
 right sibling's left-link doesn't match: block 77073 links to 82204
instead of expected 1 in index "idx_utc_envoi_composant_amont"

The errors started to show up on inserts 8 hours before the crash.
I dont understand because when I look into theses pages with pageinspect
they are not empty / dont look bad.
For exemple, this  "SELECT * FROM bt_page_items('degrade_pkey',60392);"
showed a normal page but the error messages claims its zeroed.

I checked the tables for duplicates on the primary key. Checksum are
enabled and I didn't get any error message when I scanned the tables.
I found 77 duplicates on one table and this:

ERROR:  could not access status of transaction 3443523584
DETAIL:  Could not open file "pg_clog/0CD4": No such file or directory.

I dont think this transaction ever existed given the files I found in the
clog directory.

postgres@ulbdgomp01:~/data/igomp090/systeme/pg_clog$ ls -al
total 30292
drwx--  2 postgres postgres   4096 Oct  9 22:50 .
drwx-- 18 postgres postgres   4096 Oct 12 02:32 ..
-rw---  1 postgres postgres 262144 Aug 16 22:50 
-rw---  1 postgres postgres 262144 Aug 17 16:20 0001
-rw---  1 postgres postgres 262144 Aug 17 22:25 0002
...
-rw---  1 postgres postgres 262144 Oct  7 19:31 0072
-rw---  1 postgres postgres 262144 Oct  9 22:50 0073
-rw---  1 postgres postgres 122880 Oct 12 11:56 0074

In your opinion :
* am I missing a (lots of) clog file(s) or is it a error in the heap page
giving an erroneous txid ?
* Is there a way to find which row has the transaction number 3443523584
(in t_xmin t_xmax I suppose)?
* Why am I not seeing any obiously broken page when I check with
pageinspect ?

We are using:
* Ubuntu 16.04 LTS
* PotgreSQL 9.5.8 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit

Thx a lot.

Benoit.


[GENERAL] Explain analyse and toasted data.

2017-08-24 Thread Benoit Lobréau
Hi,

In this video (PostgreSQL at 10 TB and beyond: https://youtu.be/8mKpfutwD0U
at 6mn05), the speaker says that on very big databases explain analyse can
give unrealistic results because it doesn't read toasted data.

I suppose it's because of this: "The big values of TOASTed attributes will
only be pulled out (if selected at all) at the time the result set is sent
to the client." But I dont know if explain analyse does something to
simulate this (if it's possible).

Could you give me a hand to verify/understand this information please ?

Thanks in advance.

Benoit.


Re: [GENERAL] NOTIFY command impact

2017-02-23 Thread Benoit Lobréau
Hi !

Indeed :)
Thanks for taking the time to explain !

Benoit.


Re: [GENERAL] NOTIFY command impact

2017-02-23 Thread Benoit Lobréau
Hi,

I might have missed something.
This should be enough to solve the problem no ?

SELECT
CASE
WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())::INTEGER
END
AS replication_lag;

Benoit.


Re: [GENERAL] clarification about standby promotion

2017-02-08 Thread Benoit Lobréau
hi,

Tank you for the confirmation !

For the second part, I understand your explanation but I fail to see how
checking what we have replayed against what we have received will confirm
we have received everything (unless we are in sync replication).

Have a good day !

Benoit.


[GENERAL] clarification about standby promotion

2017-02-08 Thread Benoit Lobréau
Hi,


I would like to clarify something about standby promotion. From the
sentence below. I understand that, during the promotion process, postgres
will replay all the available wals (from the archive or pg_xlog).


https://www.postgresql.org/docs/9.5/static/warm-standby.html#STREAMING-REPLICATION

25.2.2. Standby Server Operation
...
Standby mode is exited and the server switches to normal operation when
pg_ctl promote is run or a trigger file is found (trigger_file). Before
failover, any WAL immediately available in the archive or in pg_xlog will
be restored, but no attempt is made to connect to the master.

I have seen several articles like this one (
https://www.enterprisedb.com/switchoverswitchback-postgresql-93) where they
say that pg_last_xlog_receive_location() and pg_last_xlog_replay_location()
should be checked before promotion. I don't understand why since they will
be replayed anyway. Did something changed since 9.3 about this ?

Thanks for you help. please excuse my poor english.

Benoit


[GENERAL] Re: [GENERAL] [GENERAL] Replication between différent versions of the same OS.

2016-12-01 Thread Benoit Lobréau
Thanks a lot. That s what I was looking for ;)

Yes, I was trying to avoid logical replication. I guess it s time for me to
delve into it...


[GENERAL] [GENERAL] Replication between différent versions of the same OS.

2016-12-01 Thread Benoit Lobréau
Hi,

Is it possible to use the built in replication to replicate between two
PostgreSQL in the same version but in different version of the same OS (Say
Pg 9.1 Ubuntu 12 to Pg 9.1 Ubuntu 14)

I think I read in Hackers that since PostgreSQL uses the OS libraries for
encoding. It could cause silent corruption because the encoding might be
different between versions of the OS. But I cant find the email again so I
can't find the exact context ... maybe I dreamed it ..

We would like to replicate to the other server then upgrade to a newer
version. (the import is too long)


Thanks
Benoit


[GENERAL] PGDATA / data_directory

2016-09-02 Thread Benoit Lobréau
Hi,

My company is using PGDATA to store configuration files and the guc
data_directory to give the path to the instance directory.

They would use it like this:

pg_ctl start -D  -w

with this directory setup:

/CONFDIR => postgresql.conf pg_hba.conf pg_ident.conf
/SYSTEM => All the normal stuff in the postgres instance
directory + recovery.conf recovery.done etc...

Is it commonly used ?

Thanks for your help / remarks / feedback.

Benoit.