Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 10:20 PM, Justin Pryzby  wrote:
> I think you must have compared these:

Yes, I did. My mistake.

> On Tue, Oct 24, 2017 at 03:11:44PM -0500, Justin Pryzby wrote:
>> ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
>>
>> itemoffset | 48
>> ctid   | (1,37)
>> itemlen| 32
>> nulls  | f
>> vars   | t
>> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 
>> 00 00 00
> ...
>> itemoffset | 37
>> ctid   | (0,97)
>> itemlen| 24
>> nulls  | f
>> vars   | t
>> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00
>
> ..but note those are both items in sites_idx (48 and 37, which I seem to have
> pasted out of order)..  I included both because I'm not confident I know what
> the "index tid=(1,37)" referred to, but I gather it means item at offset=37
> (and not item with ctid=(1,37).)
>
> | [pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT 
> bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
> | ERROR:  high key invariant violated for index "sites_idx"
> | DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

This means that the item at (1,37) in the index is not <= the high
key, which is located at (1,1). (The high key comes first, despite
being an upper bound rather than a lower bound, per pageinspect docs.)

I find it suspicious that the page lsn is 0 here, btw.

> ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
>  lsn | checksum | flags | lower | upper | special | pagesize | version | 
> prune_xid
> -+--+---+---+---+-+--+-+---
>  0/0 |0 | 0 |   872 |  1696 |8176 | 8192 |   4 |  
>0
>
> Here is its heap page:
>
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 0)) WHERE lp=97;
>  lp | lp_off | lp_flags | lp_len | t_xmin |  t_xmax  | t_field3 | t_ctid | 
> t_infomask2 | t_infomask | t_hoff |  t_bits  | t_oid |
>t_data
> ++--+++--+--++-+++--+---+
>  97 |968 |1 | 52 |  21269 | 33567444 |0 | (3,27) |
> 8204 |   2307 | 32 | 11101001 |   | 
> \x70001b4352434c4d542d43454d5330030303
>
> Which I see ends with 0303 vs ..

Looks like I was accidentally right, then -- the heap and index do differ.

You might try this tool I published recently, to get a better sense of
details like this:

https://github.com/petergeoghegan/pg_hexedit

(Don't do so with a data directory that you cannot afford to corrupt
again, though!)

> Maybe this is relevant ?
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 3)) WHERE lp=27;
>  lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | 
> t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
> ++--++++--++-++++---+
>  27 |  0 |0 |  0 |||  ||  
>||||   |

This looks like an LP_DEAD item.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 02:57:47PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby  wrote:
> > ..which I gather just verifies that the index is corrupt, not sure if 
> > there's
> > anything else to do with it?  Note, we've already removed the duplicate 
> > rows.
> 
> Yes, the index itself is definitely corrupt -- this failed before the
> new "heapallindexed" check even started. Though it looks like that
> would have failed too, if you got that far, since the index points to
> a row that does not contain the same data. (I only know this because
> you dumped the heap tuple and the index tuple.)

I think you must have compared these:

On Tue, Oct 24, 2017 at 03:11:44PM -0500, Justin Pryzby wrote:
> ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
> 
> itemoffset | 48
> ctid   | (1,37)
> itemlen| 32
> nulls  | f
> vars   | t
> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 
> 00 00 00
...
> itemoffset | 37
> ctid   | (0,97)
> itemlen| 24
> nulls  | f
> vars   | t
> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00

..but note those are both items in sites_idx (48 and 37, which I seem to have
pasted out of order)..  I included both because I'm not confident I know what
the "index tid=(1,37)" referred to, but I gather it means item at offset=37
(and not item with ctid=(1,37).)

| [pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT 
bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
| ERROR:  high key invariant violated for index "sites_idx"
| DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
 lsn | checksum | flags | lower | upper | special | pagesize | version | 
prune_xid 
-+--+---+---+---+-+--+-+---
 0/0 |0 | 0 |   872 |  1696 |8176 | 8192 |   4 |
 0

Here is its heap page:

ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 0)) WHERE lp=97;
 lp | lp_off | lp_flags | lp_len | t_xmin |  t_xmax  | t_field3 | t_ctid | 
t_infomask2 | t_infomask | t_hoff |  t_bits  | t_oid |  
 t_data   
++--+++--+--++-+++--+---+
 97 |968 |1 | 52 |  21269 | 33567444 |0 | (3,27) |  
  8204 |   2307 | 32 | 11101001 |   | 
\x70001b4352434c4d542d43454d5330030303

Which I see ends with 0303 vs ..

t_infomask=2307=2048+256+3 =>
#define HEAP_HASNULL0x0001  /* has null attribute(s) */
#define HEAP_HASVARWIDTH0x0002  /* has variable-width 
attribute(s) */
#define HEAP_XMIN_COMMITTED 0x0100  /* t_xmin committed */
#define HEAP_XMAX_INVALID   0x0800  /* t_xmax invalid/aborted */

t_infomask2=8204 => 8192+12 =>
#define HEAP_KEYS_UPDATED   0x2000  /* tuple was updated and key 
cols modified, or tuple deleted */

Maybe this is relevant ?
ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 3)) WHERE lp=27;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | 
t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data 
++--++++--++-++++---+
 27 |  0 |0 |  0 |||  ||
 ||||   | 

Justin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 02:57:47PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby  wrote:
> > ..which I gather just verifies that the index is corrupt, not sure if 
> > there's
> > anything else to do with it?  Note, we've already removed the duplicate 
> > rows.
> Maybe you could try verifying a different index on the same table with
> "heapallindexed", too. Perhaps that would fail in a more interesting
> way.

The only other index seems fine:

[pryzbyj@database ~]$ psql --port 5678 ts -txc "SELECT 
bt_index_parent_check('sites_pkey'::regclass::oid, heapallindexed=>True)"
bt_index_parent_check | 

[pryzbyj@database ~]$ psql --port 5678 ts -txc "SELECT 
bt_index_check('sites_pkey'::regclass::oid, heapallindexed=>True)"
bt_index_check | 

> You're using LVM snapshots -- I hope that you're aware that they're
> not guaranteed to be consistent across logical volumes. There are a
> few different ways that they could cause corruption like this if you
> weren't careful. (In general, I wouldn't recommend using LVM snapshots
> as any kind of backup solution.)

Right, I asked a coworker to create the snapshot before trying to fix the
immediate problem, as a forensic measure.  We have postgres on just one LVM LV.

Justin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 01:48:55PM -0500, Kenneth Marshall wrote:
> I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> the exact same error. It seemed to caused by a tuple visibility issue that
> allowed the "working" unique index to be built, even though a duplicate row
> existed. Then the next pg_repack would fail with the error you got.

FTR, I was able to run the repack script several times without issue, hitting
that table each time.

Justin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby  wrote:
> ..which I gather just verifies that the index is corrupt, not sure if there's
> anything else to do with it?  Note, we've already removed the duplicate rows.

Yes, the index itself is definitely corrupt -- this failed before the
new "heapallindexed" check even started. Though it looks like that
would have failed too, if you got that far, since the index points to
a row that does not contain the same data. (I only know this because
you dumped the heap tuple and the index tuple.)

Maybe you could try verifying a different index on the same table with
"heapallindexed", too. Perhaps that would fail in a more interesting
way.

I don't know how pg_repack works in any detail, but I have a hard time
imagining it causing corruption like this, where a single B-Tree page
is corrupt (high key invariant fails), possibly because of a torn page
(possibly due to recovery not replaying all the WAL needed, for
whatever reason).

You're using LVM snapshots -- I hope that you're aware that they're
not guaranteed to be consistent across logical volumes. There are a
few different ways that they could cause corruption like this if you
weren't careful. (In general, I wouldn't recommend using LVM snapshots
as any kind of backup solution.)

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 12:31:49PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 24, 2017 at 11:48 AM, Kenneth Marshall  wrote:
> >> Really ?  pg_repack "found" and was victim to the duplicate keys, and 
> >> rolled
> >> back its work.  The CSV logs clearly show that our application INSERTed 
> >> rows
> >> which are duplicates.
> >>
> >> [pryzbyj@database ~]$ rpm -qa pg_repack10
> >> pg_repack10-1.4.2-1.rhel6.x86_64
> >>
> >> Justin
> >
> > Hi Justin,
> >
> > I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> > the exact same error. It seemed to caused by a tuple visibility issue that
> > allowed the "working" unique index to be built, even though a duplicate row
> > existed. Then the next pg_repack would fail with the error you got. In our
> > case I needed the locality of reference to keep the DB performance 
> > acceptable
> > and it was not a critical issue if there was a duplicate. We would remove 
> > the
> > duplicates if we had a failure. We never had a problem with the NO pg_repack
> > scenarios.
> 
> A new, enhanced version of the corruption detection tool amcheck is
> now available, and has both apt + yum packages available:
> 
> https://github.com/petergeoghegan/amcheck
> 
> Unlike the version in Postgres 10, this enhanced version (V1.2) has
> "heapallindexed" verification, which is really what you want here. If
> you install it, and run it against the unique index in question (with
> "heapallindexed" verification), that could help. It might provide a
> more useful diagnostic message.
> 
> This is very new, so do let me know how you get on if you try it out.

I started an instance connected to a copy of the LVM snapshot I saved:
[pryzbyj@database ~]$ sudo -u postgres /usr/pgsql-10/bin/postmaster -c 
port=5678 -D /mnt/10/data

[pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT 
bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
ERROR:  high key invariant violated for index "sites_idx"
DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.
[pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT 
bt_index_parent_check('sites_idx'::regclass::oid, heapallindexed=>True)"
ERROR:  high key invariant violated for index "sites_idx"
DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
lsn   | 0/0
checksum  | 0
flags | 0
lower | 872
upper | 1696
special   | 8176
pagesize  | 8192
version   | 4
prune_xid | 0

ts=# SELECT * FROM page_header(get_raw_page('sites', 0));
lsn   | 1FB/AC5A4908
checksum  | 0
flags | 5
lower | 436
upper | 464
special   | 8192
pagesize  | 8192
version   | 4
prune_xid | 0

ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));

itemoffset | 48
ctid   | (1,37)
itemlen| 32
nulls  | f
vars   | t
data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 00 
00 00

itemoffset | 37
ctid   | (0,97)
itemlen| 24
nulls  | f
vars   | t
data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00

..which I gather just verifies that the index is corrupt, not sure if there's
anything else to do with it?  Note, we've already removed the duplicate rows.

Justin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 11:48 AM, Kenneth Marshall  wrote:
>> Really ?  pg_repack "found" and was victim to the duplicate keys, and rolled
>> back its work.  The CSV logs clearly show that our application INSERTed rows
>> which are duplicates.
>>
>> [pryzbyj@database ~]$ rpm -qa pg_repack10
>> pg_repack10-1.4.2-1.rhel6.x86_64
>>
>> Justin
>
> Hi Justin,
>
> I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> the exact same error. It seemed to caused by a tuple visibility issue that
> allowed the "working" unique index to be built, even though a duplicate row
> existed. Then the next pg_repack would fail with the error you got. In our
> case I needed the locality of reference to keep the DB performance acceptable
> and it was not a critical issue if there was a duplicate. We would remove the
> duplicates if we had a failure. We never had a problem with the NO pg_repack
> scenarios.

A new, enhanced version of the corruption detection tool amcheck is
now available, and has both apt + yum packages available:

https://github.com/petergeoghegan/amcheck

Unlike the version in Postgres 10, this enhanced version (V1.2) has
"heapallindexed" verification, which is really what you want here. If
you install it, and run it against the unique index in question (with
"heapallindexed" verification), that could help. It might provide a
more useful diagnostic message.

This is very new, so do let me know how you get on if you try it out.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Kenneth Marshall
On Tue, Oct 24, 2017 at 01:30:19PM -0500, Justin Pryzby wrote:
> On Tue, Oct 24, 2017 at 01:27:14PM -0500, Kenneth Marshall wrote:
> > On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote:
> 
> > > Note:
> > > I run a script which does various combinations of ANALYZE/VACUUM 
> > > (FULL/ANALYZE)
> > > following the upgrade, and a script runs nightly with REINDEX and 
> > > pg_repack
> > > (and a couple of CLUSTER), so you should assume that any combination of 
> > > those
> > > maintenance commands have been run.
> > > 
> > > In our reindex/repack log I found the first error due to duplicates:
> > > Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
> > > WARNING: Error creating index "public"."index_61764": ERROR:  could not 
> > > create unique index "index_61764"
> > > DETAIL:  Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is 
> > > duplicated.
> > > WARNING: Skipping index swapping for "sites", since no new indexes built
> > > WARNING: repack failed for "sites_idx"
> > > reindex: warning, dropping invalid/unswapped index: index_61764
> > > 
> > 
> > Hi Justin,
> > 
> > This sounds like a pg_repack bug and not a PostgreSQL bug. What version are
> > you running?
> 
> Really ?  pg_repack "found" and was victim to the duplicate keys, and rolled
> back its work.  The CSV logs clearly show that our application INSERTed rows
> which are duplicates.
> 
> [pryzbyj@database ~]$ rpm -qa pg_repack10
> pg_repack10-1.4.2-1.rhel6.x86_64
> 
> Justin

Hi Justin,

I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
the exact same error. It seemed to caused by a tuple visibility issue that
allowed the "working" unique index to be built, even though a duplicate row
existed. Then the next pg_repack would fail with the error you got. In our
case I needed the locality of reference to keep the DB performance acceptable
and it was not a critical issue if there was a duplicate. We would remove the
duplicates if we had a failure. We never had a problem with the NO pg_repack
scenarios.

Regards,
Ken 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 01:27:14PM -0500, Kenneth Marshall wrote:
> On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote:

> > Note:
> > I run a script which does various combinations of ANALYZE/VACUUM 
> > (FULL/ANALYZE)
> > following the upgrade, and a script runs nightly with REINDEX and pg_repack
> > (and a couple of CLUSTER), so you should assume that any combination of 
> > those
> > maintenance commands have been run.
> > 
> > In our reindex/repack log I found the first error due to duplicates:
> > Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
> > WARNING: Error creating index "public"."index_61764": ERROR:  could not 
> > create unique index "index_61764"
> > DETAIL:  Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is 
> > duplicated.
> > WARNING: Skipping index swapping for "sites", since no new indexes built
> > WARNING: repack failed for "sites_idx"
> > reindex: warning, dropping invalid/unswapped index: index_61764
> > 
> 
> Hi Justin,
> 
> This sounds like a pg_repack bug and not a PostgreSQL bug. What version are
> you running?

Really ?  pg_repack "found" and was victim to the duplicate keys, and rolled
back its work.  The CSV logs clearly show that our application INSERTed rows
which are duplicates.

[pryzbyj@database ~]$ rpm -qa pg_repack10
pg_repack10-1.4.2-1.rhel6.x86_64

Justin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Kenneth Marshall
On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote:
> I upgrade another instance to PG10 yesterday and this AM found unique key
> violations.
> 
> Our application is SELECTing FROM sites WHERE site_location=$1, and if it
> doesn't find one, INSERTs one (I know that's racy and not ideal).  We ended up
> with duplicate sites, despite a unique index.  We removed the duplicate rows
> and reindexed fine.  This is just a heads up with all the detail I can fit in 
> a
> mail (but there's more if needed).
> ...
> Note:
> I run a script which does various combinations of ANALYZE/VACUUM 
> (FULL/ANALYZE)
> following the upgrade, and a script runs nightly with REINDEX and pg_repack
> (and a couple of CLUSTER), so you should assume that any combination of those
> maintenance commands have been run.
> 
> In our reindex/repack log I found the first error due to duplicates:
> Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
> WARNING: Error creating index "public"."index_61764": ERROR:  could not 
> create unique index "index_61764"
> DETAIL:  Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated.
> WARNING: Skipping index swapping for "sites", since no new indexes built
> WARNING: repack failed for "sites_idx"
> reindex: warning, dropping invalid/unswapped index: index_61764
> 

Hi Justin,

This sounds like a pg_repack bug and not a PostgreSQL bug. What version are
you running?

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers