Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Andres Freund
On 2017-10-25 07:33:46 +0200, Robert Haas wrote:
> On Tue, Oct 24, 2017 at 9:28 PM, Tom Lane  wrote:
> > I don't like changing well-defined, user-visible query behavior for
> > no other reason than a performance gain (of a size that hasn't even
> > been shown to be interesting, btw).  Will we change it back in another
> > ten years if the performance tradeoff changes?

That part of the argument seems unconvincing. It's not like the overflow
check is likely to ever have been beneficial performancewise, nor is it
remotely likely for that to ever be the case.


> > Also, if I recall the old discussion properly, one concern was getting
> > uniform behavior across different platforms.  I'm worried that if we do
> > what Andres suggests, we'll get behavior that is not only different but
> > platform-specific.  Now, to the extent that you believe that every modern
> > platform implements edge-case IEEE float behavior the same way, that worry
> > may be obsolete.  But I don't think I believe that.
> 
> Yeah, those are reasonable concerns.

I agree. I'm not really sure what the right approach is here. I do, however,
think it's worth discussing ways to address the performance penalty of
the overflow checks, and one obvious way to do so is to not perform them at all.

It'd be interesting to write the overflow-checking addition in x86
inline asm, and see how much better that gets - just so we know the
maximum we can reach with that approach. The problem with the C99 stuff seems
to be the external function calls.  With either approach, one problem would be
that we'd have to reset the overflow flag before doing the math, which isn't
free either - otherwise some external function could have left it set.

Greetings,

Andres Freund
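
For reference, the "C99 stuff" above refers to checking the floating-point
status flags via <fenv.h>. A minimal, self-contained sketch of that style of
check (illustrative names only, not the actual float8pl() code in float.c)
looks like the following; note both the external feclearexcept()/fetestexcept()
calls and the need to clear the sticky flag before the operation, which are
exactly the costs being discussed:

#include <fenv.h>
#include <stdio.h>

/*
 * Sketch of float addition with overflow detection via the C99 floating
 * point environment.  FE_OVERFLOW is sticky, so it must be cleared before
 * the operation in case earlier code left it set.  Strictly speaking,
 * "#pragma STDC FENV_ACCESS ON" is required for full conformance.
 */
static double
checked_add(double a, double b, int *overflow)
{
	double		result;

	feclearexcept(FE_OVERFLOW); /* reset the sticky flag first */
	result = a + b;
	*overflow = (fetestexcept(FE_OVERFLOW) != 0);
	return result;
}

int
main(void)
{
	int			overflow;
	double		r = checked_add(1.5e308, 1.5e308, &overflow);

	printf("result=%g overflow=%d\n", r, overflow);
	return 0;
}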




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 10:20 PM, Justin Pryzby  wrote:
> I think you must have compared these:

Yes, I did. My mistake.

> On Tue, Oct 24, 2017 at 03:11:44PM -0500, Justin Pryzby wrote:
>> ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
>>
>> itemoffset | 48
>> ctid   | (1,37)
>> itemlen| 32
>> nulls  | f
>> vars   | t
>> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 00 00 00
> ...
>> itemoffset | 37
>> ctid   | (0,97)
>> itemlen| 24
>> nulls  | f
>> vars   | t
>> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00
>
> ..but note those are both items in sites_idx (48 and 37, which I seem to have
> pasted out of order)..  I included both because I'm not confident I know what
> the "index tid=(1,37)" referred to, but I gather it means item at offset=37
> (and not item with ctid=(1,37).)
>
> | [pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
> | ERROR:  high key invariant violated for index "sites_idx"
> | DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

This means that the item at (1,37) in the index is not <= the high
key, which is located at (1,1). (The high key comes first, despite
being an upper bound rather than a lower bound, per pageinspect docs.)

I find it suspicious that the page lsn is 0 here, btw.

> ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
>  lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> -----+----------+-------+-------+-------+---------+----------+---------+-----------
>  0/0 |        0 |     0 |   872 |  1696 |    8176 |     8192 |       4 |         0
>
> Here is its heap page:
>
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 0)) WHERE lp=97;
>  lp | lp_off | lp_flags | lp_len | t_xmin |  t_xmax  | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |  t_bits  | t_oid |                 t_data
> ----+--------+----------+--------+--------+----------+----------+--------+-------------+------------+--------+----------+-------+----------------------------------------
>  97 |    968 |        1 |     52 |  21269 | 33567444 |        0 | (3,27) |        8204 |       2307 |     32 | 11101001 |       | \x70001b4352434c4d542d43454d5330030303
>
> Which I see ends with 0303 vs ..

Looks like I was accidentally right, then -- the heap and index do differ.

You might try this tool I published recently, to get a better sense of
details like this:

https://github.com/petergeoghegan/pg_hexedit

(Don't do so with a data directory that you cannot afford to corrupt
again, though!)

> Maybe this is relevant ?
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 3)) WHERE lp=27;
>  lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
> ----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
>  27 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       | 

This looks like an LP_DEAD item.

-- 
Peter Geoghegan




Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Robert Haas
On Tue, Oct 24, 2017 at 9:28 PM, Tom Lane  wrote:
> I don't like changing well-defined, user-visible query behavior for
> no other reason than a performance gain (of a size that hasn't even
> been shown to be interesting, btw).  Will we change it back in another
> ten years if the performance tradeoff changes?
>
> Also, if I recall the old discussion properly, one concern was getting
> uniform behavior across different platforms.  I'm worried that if we do
> what Andres suggests, we'll get behavior that is not only different but
> platform-specific.  Now, to the extent that you believe that every modern
> platform implements edge-case IEEE float behavior the same way, that worry
> may be obsolete.  But I don't think I believe that.

Yeah, those are reasonable concerns.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 02:57:47PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby  wrote:
> > ..which I gather just verifies that the index is corrupt, not sure if 
> > there's
> > anything else to do with it?  Note, we've already removed the duplicate 
> > rows.
> 
> Yes, the index itself is definitely corrupt -- this failed before the
> new "heapallindexed" check even started. Though it looks like that
> would have failed too, if you got that far, since the index points to
> a row that does not contain the same data. (I only know this because
> you dumped the heap tuple and the index tuple.)

I think you must have compared these:

On Tue, Oct 24, 2017 at 03:11:44PM -0500, Justin Pryzby wrote:
> ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
> 
> itemoffset | 48
> ctid   | (1,37)
> itemlen| 32
> nulls  | f
> vars   | t
> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 00 00 00
...
> itemoffset | 37
> ctid   | (0,97)
> itemlen| 24
> nulls  | f
> vars   | t
> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00

..but note those are both items in sites_idx (48 and 37, which I seem to have
pasted out of order)..  I included both because I'm not confident I know what
the "index tid=(1,37)" referred to, but I gather it means item at offset=37
(and not item with ctid=(1,37).)

| [pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
| ERROR:  high key invariant violated for index "sites_idx"
| DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
 lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----+----------+-------+-------+-------+---------+----------+---------+-----------
 0/0 |        0 |     0 |   872 |  1696 |    8176 |     8192 |       4 |         0

Here is its heap page:

ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 0)) WHERE lp=97;
 lp | lp_off | lp_flags | lp_len | t_xmin |  t_xmax  | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |  t_bits  | t_oid |                 t_data
----+--------+----------+--------+--------+----------+----------+--------+-------------+------------+--------+----------+-------+----------------------------------------
 97 |    968 |        1 |     52 |  21269 | 33567444 |        0 | (3,27) |        8204 |       2307 |     32 | 11101001 |       | \x70001b4352434c4d542d43454d5330030303

Which I see ends with 0303 vs ..

t_infomask=2307=2048+256+3 =>
#define HEAP_HASNULL            0x0001  /* has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */

t_infomask2=8204 => 8192+12 =>
#define HEAP_KEYS_UPDATED       0x2000  /* tuple was updated and key cols modified, or tuple deleted */

Maybe this is relevant ?
ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 3)) WHERE lp=27;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
 27 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       | 

Justin




Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Michael Paquier
On Tue, Oct 24, 2017 at 7:24 PM, Tsunakawa, Takayuki
 wrote:
> (3)
> Should we change the default value of max_wal_size from 1 GB to a smaller 
> size?  I vote for "no" for performance.

The default has just changed in v10, and changing it again could be
confusing, so I agree with your position.
-- 
Michael




Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Simon Riggs
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

I welcome this patch.  I was wondering why PostgreSQL retains the previous 
checkpoint.  (Although unrelated to this, I've also been wondering why 
PostgreSQL flushes WAL to disk when writing a page in the shared buffer, 
because PostgreSQL doesn't use WAL for undo.)

Here are my review comments.

(1)
-* a) we keep WAL for two checkpoint cycles, back to the "prev" checkpoint.
+* a) we keep WAL for only one checkpoint cycle (prior to PG11 we kept
+*WAL for two checkpoint cycles to allow us to recover from the
+*secondary checkpoint if the first checkpoint failed, though we
+*only did this on the master anyway, not on standby. Keeping just
+*one checkpoint simplifies processing and reduces disk space in
+*many smaller databases.)

/*
-* The last valid checkpoint record required for a streaming
-* recovery exists in neither standby nor the primary.
+* We used to attempt to go back to a secondary checkpoint
+* record here, but only when not in standby_mode. We now
+* just fail if we can't read the last checkpoint because
+* this allows us to simplify processing around checkpoints.
 */


if (fast_promote)
{
-   checkPointLoc = ControlFile->prevCheckPoint;
+   /*
+* It appears to be a bug that we used to use prevCheckpoint here
+*/
+   checkPointLoc = ControlFile->checkPoint;

-   XLByteToSeg(PriorRedoPtr, _logSegNo, wal_segment_size);
+   /* Trim from the last checkpoint, not the last - 1 */
+   XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
 
I suggest removing references to the secondary checkpoint (the old behavior)
from the comments.  I'm afraid those could confuse new developers.



(2)
@@ -8110,10 +8100,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
ereport(LOG,
(errmsg("invalid primary checkpoint link in control file")));
break;

@@ -8135,10 +8121,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
ereport(LOG,
(errmsg("invalid primary checkpoint record")));
break;

@@ -8154,10 +8136,6 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
ereport(LOG,
(errmsg("invalid resource manager ID in primary checkpoint record")));
break;

etc, etc.

"primary checkpoint" should be just "checkpoint", now that the 
primary/secondary concept will disappear.


(3)
Should we change the default value of max_wal_size from 1 GB to a smaller size?  I vote for "no" for performance.

Regards
Takayuki Tsunakawa







Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier
> On Tue, Oct 24, 2017 at 5:58 PM, Tsunakawa, Takayuki
>  wrote:
> > If the latest checkpoint record is unreadable (the WAL
> segment/block/record is corrupt?), recovery from the previous checkpoint
> would also stop at the latest checkpoint.  And we don't need to replay the
> WAL records between the previous checkpoint and the latest one, because
> their changes are already persisted when the latest checkpoint was taken.
> So, the user should just do pg_resetxlog and start the database server when
> the recovery cannot find the latest checkpoint record and PANICs?
> 
> Not necessarily. If a failure is detected when reading the last checkpoint
> then, as you say, recovery would begin at the checkpoint prior to that and
> would stop when reading the record of the last checkpoint. Still, one could
> use a recovery.conf with restore_command to fetch segments from a different
> source, like a static archive, as only the local segment may be corrupted.

Oh, you are right.  "If the crash recovery fails, perform recovery from backup."

Anyway, I agree that the secondary checkpoint isn't necessary.

Regards
Takayuki Tsunakawa





Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Michael Paquier
On Tue, Oct 24, 2017 at 5:58 PM, Tsunakawa, Takayuki
 wrote:
> If the latest checkpoint record is unreadable (the WAL segment/block/record 
> is corrupt?), recovery from the previous checkpoint would also stop at the 
> latest checkpoint.  And we don't need to replay the WAL records between the 
> previous checkpoint and the latest one, because their changes are already 
> persisted when the latest checkpoint was taken.  So, the user should just do 
> pg_resetxlog and start the database server when the recovery cannot find the 
> latest checkpoint record and PANICs?

Not necessarily. If a failure is detected when reading the last
checkpoint then, as you say, recovery would begin at the checkpoint
prior to that and would stop when reading the record of the last
checkpoint. Still, one could use a recovery.conf with restore_command
to fetch segments from a different source, like a static archive, as
only the local segment may be corrupted.
-- 
Michael




Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Doesn't it also make crash recovery less robust?  The whole point
> of that mechanism is to be able to cope if the latest checkpoint
> record is unreadable.

If the latest checkpoint record is unreadable (the WAL segment/block/record is 
corrupt?), recovery from the previous checkpoint would also stop at the latest 
checkpoint.  And we don't need to replay the WAL records between the previous 
checkpoint and the latest one, because their changes are already persisted when 
the latest checkpoint was taken.  So, the user should just do pg_resetxlog and 
start the database server when the recovery cannot find the latest checkpoint 
record and PANICs?

Regards
Takayuki Tsunakawa








[HACKERS] Implementing pg_receivewal --no-sync

2017-10-24 Thread Michael Paquier
Hi all,

After thinking a bit on the subject, I have decided to submit a patch
to do $subject. This makes pg_receivewal more consistent with
pg_basebackup. This option is mainly useful for testing, something
that becomes way more doable since support for --endpos has been
added.

Unsurprisingly, --synchronous and --no-sync are incompatible options.
Thanks,
-- 
Michael


pg_receivewal_nosync.patch
Description: Binary data



Re: [HACKERS] per-sesson errors after interrupting CLUSTER pg_attrdef

2017-10-24 Thread Michael Paquier
On Fri, Oct 20, 2017 at 9:01 AM, Justin Pryzby  wrote:
> This was briefly scary but seems to have been limited to my psql session (no
> other errors logged).  Issue with catcache (?)
>
> I realized that the backup job I'd kicked off was precluding the CLUSTER from
> running, but that CLUSTER was still holding lock and stalling everything else
> under the sun.
>
> ts=# \df qci_add*
> ERROR:  could not read block 8 in file "base/16400/999225102": read only 0 of 8192 bytes
> ts=# \dt+ pg_att
>
> ts=# \dt+ pg_attrdef
> ERROR:  could not read block 8 in file "base/16400/999225102": read only 0 of 8192 bytes

Perhaps that's too late, but which relation is this relfilenode
actually referring to?
-- 
Michael




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 02:57:47PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby  wrote:
> > ..which I gather just verifies that the index is corrupt, not sure if 
> > there's
> > anything else to do with it?  Note, we've already removed the duplicate 
> > rows.
> Maybe you could try verifying a different index on the same table with
> "heapallindexed", too. Perhaps that would fail in a more interesting
> way.

The only other index seems fine:

[pryzbyj@database ~]$ psql --port 5678 ts -txc "SELECT bt_index_parent_check('sites_pkey'::regclass::oid, heapallindexed=>True)"
bt_index_parent_check | 

[pryzbyj@database ~]$ psql --port 5678 ts -txc "SELECT bt_index_check('sites_pkey'::regclass::oid, heapallindexed=>True)"
bt_index_check | 

> You're using LVM snapshots -- I hope that you're aware that they're
> not guaranteed to be consistent across logical volumes. There are a
> few different ways that they could cause corruption like this if you
> weren't careful. (In general, I wouldn't recommend using LVM snapshots
> as any kind of backup solution.)

Right, I asked a coworker to create the snapshot before trying to fix the
immediate problem, as a forensic measure.  We have postgres on just one LVM LV.

Justin




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 01:48:55PM -0500, Kenneth Marshall wrote:
> I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> the exact same error. It seemed to be caused by a tuple visibility issue that
> allowed the "working" unique index to be built, even though a duplicate row
> existed. Then the next pg_repack would fail with the error you got.

FTR, I was able to run the repack script several times without issue, hitting
that table each time.

Justin




Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Michael Paquier
On Mon, Oct 23, 2017 at 11:20 PM, Simon Riggs  wrote:
> Remove the code that maintained two checkpoint's WAL files and all
> associated stuff.
>
> Try to avoid breaking anything else
>
> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

In short, yeah! I had a close look at the patch and noticed a couple of issues.

+else
 /*
- * The last valid checkpoint record required for a streaming
- * recovery exists in neither standby nor the primary.
+ * We used to attempt to go back to a secondary checkpoint
+ * record here, but only when not in standby_mode. We now
+ * just fail if we can't read the last checkpoint because
+ * this allows us to simplify processing around checkpoints.
  */
 ereport(PANIC,
 (errmsg("could not locate a valid checkpoint record")));
-}
Using brackets in this case is more elegant style IMO. OK, there is
one line, but the comment is long so the block becomes
confusingly-shaped.

 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION1003
+#define PG_CONTROL_VERSION1100
Yes, this had to happen anyway in this release cycle.

backup.sgml describes the following:
to higher segment numbers.  It's assumed that segment files whose
contents precede the checkpoint-before-last are no longer of
interest and can be recycled.
However, this is not true anymore with this patch.

The documentation of pg_control_checkpoint() is not updated. func.sgml
lists the tuple fields returned for this function.

You forgot to update pg_control_checkpoint in pg_proc.h: prior_lsn is
still listed among the returned arguments. And you need to update the
list of output types as well.

  * whichChkpt identifies the checkpoint (merely for reporting purposes).
- * 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
+ * 1 for "primary", 0 for "other" (backup_label)
+ * 2 for "secondary" is no longer supported
  */
I would suggest just removing the reference to "secondary".

-* Delete old log files (those no longer needed even for previous
-* checkpoint or the standbys in XLOG streaming).
+* Delete old log files and recycle them
 */
Here that should rather be "Delete or recycle old log files". Recycling a WAL
segment means renaming it into a newer segment.

You forgot to update this formula in xlog.c:
distance = (2.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
/* add 10% for good measure. */
distance *= 1.10;
You can guess easily where the mistake is.

-   checkPointLoc = ControlFile->prevCheckPoint;
+   /*
+* It appears to be a bug that we used to use prevCheckpoint here
+*/
+   checkPointLoc = ControlFile->checkPoint;
Er, no. This is correct because we expect the prior checkpoint to
still be present in the event of a failure to read the latest
checkpoint.
-- 
Michael
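
Regarding the xlog.c distance formula flagged above: a self-contained sketch of
what it presumably becomes once WAL is retained for only a single checkpoint
cycle follows. The 2.0 dropping to 1.0 is an assumption about the intended fix,
not something taken from the posted patch, and the values are stand-ins:

#include <stdio.h>

/* Hypothetical stand-in values for the GUC and the running estimate. */
static double CheckPointCompletionTarget = 0.5;
static double CheckPointDistanceEstimate = 16 * 1024 * 1024;

int
main(void)
{
	/*
	 * Assumed adjustment: with WAL kept for only one checkpoint cycle,
	 * the multiplier would drop from (2.0 + CheckPointCompletionTarget)
	 * to (1.0 + CheckPointCompletionTarget).
	 */
	double		distance = (1.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;

	/* add 10% for good measure. */
	distance *= 1.10;
	printf("estimated recycle distance: %.0f bytes\n", distance);
	return 0;
}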




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby  wrote:
> ..which I gather just verifies that the index is corrupt, not sure if there's
> anything else to do with it?  Note, we've already removed the duplicate rows.

Yes, the index itself is definitely corrupt -- this failed before the
new "heapallindexed" check even started. Though it looks like that
would have failed too, if you got that far, since the index points to
a row that does not contain the same data. (I only know this because
you dumped the heap tuple and the index tuple.)

Maybe you could try verifying a different index on the same table with
"heapallindexed", too. Perhaps that would fail in a more interesting
way.

I don't know how pg_repack works in any detail, but I have a hard time
imagining it causing corruption like this, where a single B-Tree page
is corrupt (high key invariant fails), possibly because of a torn page
(possibly due to recovery not replaying all the WAL needed, for
whatever reason).

You're using LVM snapshots -- I hope that you're aware that they're
not guaranteed to be consistent across logical volumes. There are a
few different ways that they could cause corruption like this if you
weren't careful. (In general, I wouldn't recommend using LVM snapshots
as any kind of backup solution.)

-- 
Peter Geoghegan




Re: [HACKERS] Domains and arrays and composites, oh my

2017-10-24 Thread Tom Lane
I wrote:
> Anyway, PFA an updated patch that also fixes some conflicts with the
> already-committed arrays-of-domains patch.

I realized that the pending patch for jsonb_build_object doesn't
actually have any conflict with what I needed to touch here, so
I went ahead and fixed the JSON functions that needed fixing,
along with hstore's populate_record.  I ended up rewriting the
argument-metadata-collection portions of populate_record_worker
and populate_recordset_worker rather heavily, because I didn't
like them at all: aside from not working for domains over composite,
they were pretty inefficient (redoing a lot of work on each call
for no good reason) and they were randomly different from each
other, resulting in json{b}_populate_recordset rejecting some cases
that worked in json{b}_populate_record.

I've also updated the documentation.

I think that this patch version is done so far as the core code
and contrib are concerned.  The PLs need varying amounts of work,
but as I said earlier, I think it would be better to tackle those
in separate patches instead of continuing to enlarge the footprint
of the core patch.  So, barring objection, I'd like to go ahead
and commit this.

regards, tom lane

diff --git a/contrib/hstore/hstore_io.c b/contrib/hstore/hstore_io.c
index d828401..e999a8e 100644
*** a/contrib/hstore/hstore_io.c
--- b/contrib/hstore/hstore_io.c
*** typedef struct RecordIOData
*** 752,757 
--- 752,759 
  {
  	Oid			record_type;
  	int32		record_typmod;
+ 	/* this field is used only if target type is domain over composite: */
+ 	void	   *domain_info;	/* opaque cache for domain checks */
  	int			ncolumns;
  	ColumnIOData columns[FLEXIBLE_ARRAY_MEMBER];
  } RecordIOData;
*** hstore_from_record(PG_FUNCTION_ARGS)
*** 780,788 
  		Oid			argtype = get_fn_expr_argtype(fcinfo->flinfo, 0);
  
  		/*
! 		 * have no tuple to look at, so the only source of type info is the
! 		 * argtype. The lookup_rowtype_tupdesc call below will error out if we
! 		 * don't have a known composite type oid here.
  		 */
  		tupType = argtype;
  		tupTypmod = -1;
--- 782,792 
  		Oid			argtype = get_fn_expr_argtype(fcinfo->flinfo, 0);
  
  		/*
! 		 * We have no tuple to look at, so the only source of type info is the
! 		 * argtype --- which might be domain over composite, but we don't care
! 		 * here, since we have no need to be concerned about domain
! 		 * constraints.  The lookup_rowtype_tupdesc_domain call below will
! 		 * error out if we don't have a known composite type oid here.
  		 */
  		tupType = argtype;
  		tupTypmod = -1;
*** hstore_from_record(PG_FUNCTION_ARGS)
*** 793,804 
  	{
  		rec = PG_GETARG_HEAPTUPLEHEADER(0);
  
! 		/* Extract type info from the tuple itself */
  		tupType = HeapTupleHeaderGetTypeId(rec);
  		tupTypmod = HeapTupleHeaderGetTypMod(rec);
  	}
  
! 	tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
  	ncolumns = tupdesc->natts;
  
  	/*
--- 797,811 
  	{
  		rec = PG_GETARG_HEAPTUPLEHEADER(0);
  
! 		/*
! 		 * Extract type info from the tuple itself -- this will work even for
! 		 * anonymous record types.
! 		 */
  		tupType = HeapTupleHeaderGetTypeId(rec);
  		tupTypmod = HeapTupleHeaderGetTypMod(rec);
  	}
  
! 	tupdesc = lookup_rowtype_tupdesc_domain(tupType, tupTypmod, false);
  	ncolumns = tupdesc->natts;
  
  	/*
*** hstore_populate_record(PG_FUNCTION_ARGS)
*** 943,951 
  		rec = NULL;
  
  		/*
! 		 * have no tuple to look at, so the only source of type info is the
! 		 * argtype. The lookup_rowtype_tupdesc call below will error out if we
! 		 * don't have a known composite type oid here.
  		 */
  		tupType = argtype;
  		tupTypmod = -1;
--- 950,958 
  		rec = NULL;
  
  		/*
! 		 * We have no tuple to look at, so the only source of type info is the
! 		 * argtype.  The lookup_rowtype_tupdesc_domain call below will error
! 		 * out if we don't have a known composite type oid here.
  		 */
  		tupType = argtype;
  		tupTypmod = -1;
*** hstore_populate_record(PG_FUNCTION_ARGS)
*** 957,963 
  		if (PG_ARGISNULL(1))
  			PG_RETURN_POINTER(rec);
  
! 		/* Extract type info from the tuple itself */
  		tupType = HeapTupleHeaderGetTypeId(rec);
  		tupTypmod = HeapTupleHeaderGetTypMod(rec);
  	}
--- 964,973 
  		if (PG_ARGISNULL(1))
  			PG_RETURN_POINTER(rec);
  
! 		/*
! 		 * Extract type info from the tuple itself -- this will work even for
! 		 * anonymous record types.
! 		 */
  		tupType = HeapTupleHeaderGetTypeId(rec);
  		tupTypmod = HeapTupleHeaderGetTypMod(rec);
  	}
*** hstore_populate_record(PG_FUNCTION_ARGS)
*** 975,981 
  	if (HS_COUNT(hs) == 0 && rec)
  		PG_RETURN_POINTER(rec);
  
! 	tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
  	ncolumns = tupdesc->natts;
  
  	if (rec)
--- 985,995 
  	if (HS_COUNT(hs) == 0 && rec)
  		PG_RETURN_POINTER(rec);
  
! 	/*
! 	 * Lookup the input record's tupdesc.  For the moment, 

Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 12:31:49PM -0700, Peter Geoghegan wrote:
> On Tue, Oct 24, 2017 at 11:48 AM, Kenneth Marshall  wrote:
> >> Really ?  pg_repack "found" and was victim to the duplicate keys, and 
> >> rolled
> >> back its work.  The CSV logs clearly show that our application INSERTed 
> >> rows
> >> which are duplicates.
> >>
> >> [pryzbyj@database ~]$ rpm -qa pg_repack10
> >> pg_repack10-1.4.2-1.rhel6.x86_64
> >>
> >> Justin
> >
> > Hi Justin,
> >
> > I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> > the exact same error. It seemed to be caused by a tuple visibility issue that
> > allowed the "working" unique index to be built, even though a duplicate row
> > existed. Then the next pg_repack would fail with the error you got. In our
> > case I needed the locality of reference to keep the DB performance 
> > acceptable
> > and it was not a critical issue if there was a duplicate. We would remove 
> > the
> > duplicates if we had a failure. We never had a problem with the NO pg_repack
> > scenarios.
> 
> A new, enhanced version of the corruption detection tool amcheck is
> now available, and has both apt + yum packages available:
> 
> https://github.com/petergeoghegan/amcheck
> 
> Unlike the version in Postgres 10, this enhanced version (V1.2) has
> "heapallindexed" verification, which is really what you want here. If
> you install it, and run it against the unique index in question (with
> "heapallindexed" verification), that could help. It might provide a
> more useful diagnostic message.
> 
> This is very new, so do let me know how you get on if you try it out.

I started an instance connected to a copy of the LVM snapshot I saved:
[pryzbyj@database ~]$ sudo -u postgres /usr/pgsql-10/bin/postmaster -c port=5678 -D /mnt/10/data

[pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
ERROR:  high key invariant violated for index "sites_idx"
DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.
[pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT bt_index_parent_check('sites_idx'::regclass::oid, heapallindexed=>True)"
ERROR:  high key invariant violated for index "sites_idx"
DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
lsn   | 0/0
checksum  | 0
flags | 0
lower | 872
upper | 1696
special   | 8176
pagesize  | 8192
version   | 4
prune_xid | 0

ts=# SELECT * FROM page_header(get_raw_page('sites', 0));
lsn   | 1FB/AC5A4908
checksum  | 0
flags | 5
lower | 436
upper | 464
special   | 8192
pagesize  | 8192
version   | 4
prune_xid | 0

ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));

itemoffset | 48
ctid   | (1,37)
itemlen| 32
nulls  | f
vars   | t
data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 00 00 00

itemoffset | 37
ctid   | (0,97)
itemlen| 24
nulls  | f
vars   | t
data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00

..which I gather just verifies that the index is corrupt, not sure if there's
anything else to do with it?  Note, we've already removed the duplicate rows.

Justin




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 11:48 AM, Kenneth Marshall  wrote:
>> Really ?  pg_repack "found" and was victim to the duplicate keys, and rolled
>> back its work.  The CSV logs clearly show that our application INSERTed rows
>> which are duplicates.
>>
>> [pryzbyj@database ~]$ rpm -qa pg_repack10
>> pg_repack10-1.4.2-1.rhel6.x86_64
>>
>> Justin
>
> Hi Justin,
>
> I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> the exact same error. It seemed to be caused by a tuple visibility issue that
> allowed the "working" unique index to be built, even though a duplicate row
> existed. Then the next pg_repack would fail with the error you got. In our
> case I needed the locality of reference to keep the DB performance acceptable
> and it was not a critical issue if there was a duplicate. We would remove the
> duplicates if we had a failure. We never had a problem with the NO pg_repack
> scenarios.

A new, enhanced version of the corruption detection tool amcheck is
now available, and has both apt + yum packages available:

https://github.com/petergeoghegan/amcheck

Unlike the version in Postgres 10, this enhanced version (V1.2) has
"heapallindexed" verification, which is really what you want here. If
you install it, and run it against the unique index in question (with
"heapallindexed" verification), that could help. It might provide a
more useful diagnostic message.

This is very new, so do let me know how you get on if you try it out.

-- 
Peter Geoghegan




Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Tom Lane
Robert Haas  writes:
> On Tue, Oct 24, 2017 at 4:36 PM, Tom Lane  wrote:
>> Yeah, but I lost the argument.  For better or worse, our expected
>> behavior is now that we throw errors.  You don't get to change that
>> just because it would save a few cycles.

> I don't know that we can consider the results of a discussion in 2006
> to be binding policy for the indefinite future.   A lot of things get
> relitigated more than once per decade on this mailing list, and if we
> know things now that we didn't know then (e.g. that one choice has a
> far more severe performance consequence than the other) that's
> reasonable justification for deciding to change our mind.

I don't like changing well-defined, user-visible query behavior for
no other reason than a performance gain (of a size that hasn't even
been shown to be interesting, btw).  Will we change it back in another
ten years if the performance tradeoff changes?

Also, if I recall the old discussion properly, one concern was getting
uniform behavior across different platforms.  I'm worried that if we do
what Andres suggests, we'll get behavior that is not only different but
platform-specific.  Now, to the extent that you believe that every modern
platform implements edge-case IEEE float behavior the same way, that worry
may be obsolete.  But I don't think I believe that.

regards, tom lane




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Kenneth Marshall
On Tue, Oct 24, 2017 at 01:30:19PM -0500, Justin Pryzby wrote:
> On Tue, Oct 24, 2017 at 01:27:14PM -0500, Kenneth Marshall wrote:
> > On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote:
> 
> > > Note:
> > > I run a script which does various combinations of ANALYZE/VACUUM 
> > > (FULL/ANALYZE)
> > > following the upgrade, and a script runs nightly with REINDEX and 
> > > pg_repack
> > > (and a couple of CLUSTER), so you should assume that any combination of 
> > > those
> > > maintenance commands have been run.
> > > 
> > > In our reindex/repack log I found the first error due to duplicates:
> > > Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
> > > WARNING: Error creating index "public"."index_61764": ERROR:  could not 
> > > create unique index "index_61764"
> > > DETAIL:  Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is 
> > > duplicated.
> > > WARNING: Skipping index swapping for "sites", since no new indexes built
> > > WARNING: repack failed for "sites_idx"
> > > reindex: warning, dropping invalid/unswapped index: index_61764
> > > 
> > 
> > Hi Justin,
> > 
> > This sounds like a pg_repack bug and not a PostgreSQL bug. What version are
> > you running?
> 
> Really ?  pg_repack "found" and was victim to the duplicate keys, and rolled
> back its work.  The CSV logs clearly show that our application INSERTed rows
> which are duplicates.
> 
> [pryzbyj@database ~]$ rpm -qa pg_repack10
> pg_repack10-1.4.2-1.rhel6.x86_64
> 
> Justin

Hi Justin,

I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
the exact same error. It seemed to be caused by a tuple visibility issue that
allowed the "working" unique index to be built, even though a duplicate row
existed. Then the next pg_repack would fail with the error you got. In our
case I needed the locality of reference to keep the DB performance acceptable
and it was not a critical issue if there was a duplicate. We would remove the
duplicates if we had a failure. We never had a problem with the NO pg_repack
scenarios.

Regards,
Ken 




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 01:27:14PM -0500, Kenneth Marshall wrote:
> On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote:

> > Note:
> > I run a script which does various combinations of ANALYZE/VACUUM 
> > (FULL/ANALYZE)
> > following the upgrade, and a script runs nightly with REINDEX and pg_repack
> > (and a couple of CLUSTER), so you should assume that any combination of 
> > those
> > maintenance commands have been run.
> > 
> > In our reindex/repack log I found the first error due to duplicates:
> > Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
> > WARNING: Error creating index "public"."index_61764": ERROR:  could not 
> > create unique index "index_61764"
> > DETAIL:  Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is 
> > duplicated.
> > WARNING: Skipping index swapping for "sites", since no new indexes built
> > WARNING: repack failed for "sites_idx"
> > reindex: warning, dropping invalid/unswapped index: index_61764
> > 
> 
> Hi Justin,
> 
> This sounds like a pg_repack bug and not a PostgreSQL bug. What version are
> you running?

Really ?  pg_repack "found" and was victim to the duplicate keys, and rolled
back its work.  The CSV logs clearly show that our application INSERTed rows
which are duplicates.

[pryzbyj@database ~]$ rpm -qa pg_repack10
pg_repack10-1.4.2-1.rhel6.x86_64

Justin




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Kenneth Marshall
On Tue, Oct 24, 2017 at 01:14:53PM -0500, Justin Pryzby wrote:
> I upgrade another instance to PG10 yesterday and this AM found unique key
> violations.
> 
> Our application is SELECTing FROM sites WHERE site_location=$1, and if it
> doesn't find one, INSERTs one (I know that's racy and not ideal).  We ended up
> with duplicate sites, despite a unique index.  We removed the duplicate rows
> and reindexed fine.  This is just a heads up with all the detail I can fit in 
> a
> mail (but there's more if needed).
> ...
> Note:
> I run a script which does various combinations of ANALYZE/VACUUM 
> (FULL/ANALYZE)
> following the upgrade, and a script runs nightly with REINDEX and pg_repack
> (and a couple of CLUSTER), so you should assume that any combination of those
> maintenance commands have been run.
> 
> In our reindex/repack log I found the first error due to duplicates:
> Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
> WARNING: Error creating index "public"."index_61764": ERROR:  could not 
> create unique index "index_61764"
> DETAIL:  Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated.
> WARNING: Skipping index swapping for "sites", since no new indexes built
> WARNING: repack failed for "sites_idx"
> reindex: warning, dropping invalid/unswapped index: index_61764
> 

Hi Justin,

This sounds like a pg_repack bug and not a PostgreSQL bug. What version are
you running?

Regards,
Ken




Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Robert Haas
On Tue, Oct 24, 2017 at 4:36 PM, Tom Lane  wrote:
>> Does it? In plenty of cases getting infinity rather than an error is
>> just about as useful.
>> This was argued by a certain Tom Lane a few years back ;)
>> http://archives.postgresql.org/message-id/19208.1167246902%40sss.pgh.pa.us
>
> Yeah, but I lost the argument.  For better or worse, our expected
> behavior is now that we throw errors.  You don't get to change that
> just because it would save a few cycles.

I don't know that we can consider the results of a discussion in 2006
to be binding policy for the indefinite future.   A lot of things get
relitigated more than once per decade on this mailing list, and if we
know things now that we didn't know then (e.g. that one choice has a
far more severe performance consequence than the other) that's
reasonable justification for deciding to change our mind.  Also, it's
not like there were a million votes on one side vs. just you on the
other; reading the thread, it's not at all clear that you were in the
minority with that position.

That's not to say I necessarily support Andres's proposal.  Changing
query behavior is a big deal; we can't do it very often without
causing a lot of hassles for users (and maybe damaging our reputation
for stability in the process).  And it's not very clear to me that
someone who does a SUM(a * b) over many rows will be happy to get
infinity rather than an error.  It could be true, but I don't have the
experience to be sure of it -- and I'm a bit worried that if we change
anything, we'll only find out whether users like it after we cut the
release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Justin Pryzby
I upgrade another instance to PG10 yesterday and this AM found unique key
violations.

Our application is SELECTing FROM sites WHERE site_location=$1, and if it
doesn't find one, INSERTs one (I know that's racy and not ideal).  We ended up
with duplicate sites, despite a unique index.  We removed the duplicate rows
and reindexed fine.  This is just a heads up with all the detail I can fit in a
mail (but there's more if needed).

ts=# \d sites
 site_id       | integer |  | not null | nextval('sites_site_id_seq'::regclass)
 site_office   | text    |  |          | 
 site_location | text    |  |          | 
[...]
Indexes:
"sites_pkey" PRIMARY KEY, btree (site_id)
"sites_idx" UNIQUE, btree (site_office, site_location)

ts=# SELECT site_office, site_location, count(*), min(site_id), max(site_id) FROM sites GROUP BY 1,2 HAVING count(*)>1 ORDER BY 1,2;
  site_office   | site_location | count | min | max 
+---+---+-+-
 CRCLMT-DOEMS0  |   | 2 | 165 | 351
 CRCLMT-DOEMS0  | 1101  | 2 | 123 | 343
 CRCLMT-DOEMS0  | 1102  | 2 | 134 | 318
 CRCLMT-DOEMS0  | 1103  | 2 | 145 | 322
 CRCLMT-DOEMS0  | 1104  | 2 | 156 | 329

The differing site_ids on the duplicate rows mean this isn't an issue with row
version/visibility due to XIDs (right?).

ts=# SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location='';
(0 rows)
ts=# SET enable_bitmapscan=off; SET enable_indexscan=off; SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location='';
-[ RECORD 1 ]
?column? | 1

So there's an issue with the index failing to return matching rows (thereby
allowing duplicate rows to be inserted in violation of the unique constraint).

That's the only table/index affected (and the only PG instance thus far
affected).

Note regarding my pg_upgrade history: 3 years ago, this was our first and
smallest customer, whom I upgraded off PG8.4 (to PG9.2 if I recall), and I did
it using pg_dump|pg_restore.  I believe, as a consequence, its postgres
database was in "C" locale and ASCII encoding.  So for the last few upgrades
(9.3, 9.4, 9.5 and 9.6), I believe I manually used initdb --encoding followed
by pg_upgrade (else it fails because the new postgres/template DBs have a
different locale/encoding from the old ones).  For this upgrade, I finally
renamed the postgres DB (which has imported CSV logs and one or two other
things) and pg_dump|pg_restore'd it into a new DB with UTF8 encoding, which
allowed pg_upgrade to run without a special initdb invocation.

I have an LVM snapshot and full CSV logs imported into a table.  I also have a
backup from 22:00 which doesn't have duplicate sites.  Those seem to have been
inserted by our application around 00:30:

These are the session IDs which inserted the duplicate rows:
postgres=# SELECT session_id, max(session_line) FROM postgres_log_2017_10_24_ WHERE message LIKE 'statement: SELECT site_id FROM sites WHERE%' GROUP BY 1;
  session_id   | max  
---+--
 59eedfb1.5cea |  714
 59eedfb5.5cf1 | 1741
(2 rows)

postgres=# SELECT log_time, session_id, session_line, left(message,333) FROM postgres_log WHERE (session_id='59eedfb1.5cea' OR session_id='59eedfb5.5cf1') AND (session_line<6 OR message LIKE '%INSERT INTO site%') ORDER BY 1,2,3;
-[ RECORD 4 ]+--------------------------------------------------------------
log_time     | 2017-10-24 00:37:37.888-06
session_id   | 59eedfb1.5cea
session_line | 4
left         | statement: SELECT site_id FROM sites WHERE                   +
             | site_office = 'CRCLMT-DOEMS0' AND site_location = '1203'
-[ RECORD 5 ]+--------------------------------------------------------------
log_time     | 2017-10-24 00:37:37.89-06
session_id   | 59eedfb1.5cea
session_line | 5
left         | statement: INSERT INTO sites (site_office,site_location,site_alias)  +
             | VALUES ('CRCLMT-DOEMS0', '1203', (SELECT site_id FROM sites  +
             | WHERE site_office = 'CRCLMT-CEMS0' AND site_location = '1203'))

Note:
I run a script which does various combinations of ANALYZE/VACUUM (FULL/ANALYZE)
following the upgrade, and a script runs nightly with REINDEX and pg_repack
(and a couple of CLUSTER), so you should assume that any combination of those
maintenance commands have been run.

In our reindex/repack log I found the first error due to duplicates:
Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
WARNING: Error creating index "public"."index_61764": ERROR:  could not create unique index "index_61764"
DETAIL:  Key (site_office, 

[HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION

2017-10-24 Thread Konstantin Knizhnik
Parallel execution of ALTER SUBSCRIPTION REFRESH PUBLICATION at several 
nodes may cause deadlock:


knizhnik  1480  0.0  0.1 1417532 16496 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16589 
sync 16720    waiting
knizhnik  1481  0.0  0.1 1417668 17668 ?   Ss   20:01   0:00 
postgres: knizhnik postgres 127.0.0.1(36378) idle
knizhnik  1484  0.0  0.1 1406952 16676 ?   Ss   20:01   0:00 
postgres: knizhnik postgres 127.0.0.1(56194) ALTER SUBSCRIPTION waiting
knizhnik  1486  0.0  0.1 1410040 21268 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(36386) idle
knizhnik  1487  0.0  0.1 1417532 16736 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16589 
sync 16774
knizhnik  1489  0.0  0.1 1410040 21336 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(36388) idle
knizhnik  1510  0.0  0.1 1406952 16820 ?   Ss   20:01   0:00 
postgres: knizhnik postgres 127.0.0.1(56228) ALTER SUBSCRIPTION waiting
knizhnik  1530  0.0  0.1 1406952 16736 ?   Ss   20:01   0:00 
postgres: knizhnik postgres 127.0.0.1(56260) ALTER SUBSCRIPTION waiting
knizhnik  1534  0.0  0.0 1417504 14360 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16590
knizhnik  1537  0.0  0.1 1409164 16920 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(45054) idle
knizhnik  1545  0.0  0.0 1417344 14576 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16592
knizhnik  1546  0.0  0.1 1409168 16588 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(56274) idle
knizhnik  1547  0.0  0.0 1417348 14332 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16592 
sync 16606
knizhnik  1549  0.0  0.0 1409004 14512 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(56276) idle in 
transaction waiting
knizhnik  1553  0.0  0.0 1417348 14540 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16592
knizhnik  1554  0.0  0.1 1409168 16596 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(56280) idle
knizhnik  1555  0.0  0.0 1417348 14332 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16592 
sync 16660
knizhnik  1556  0.0  0.0 1409004 14520 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(56282) idle in 
transaction waiting
knizhnik  1558  0.0  0.0 1417348 14460 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16592 
sync 16696
knizhnik  1560  0.0  0.0 1409004 14516 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(56286) idle in 
transaction waiting
knizhnik  1562  0.0  0.0 1417348 14084 ?   Ss   20:01   0:00 
postgres: bgworker: logical replication worker for subscription 16592 
sync 16732
knizhnik  1567  0.0  0.0 1409004 14516 ?   Ss   20:01   0:00 
postgres: wal sender process knizhnik 127.0.0.1(56288) idle in 
transaction waiting


knizhnik@knizhnik:~$ gdb -p 1556
(gdb) bt
#0  0x7f7b991569b3 in __epoll_wait_nocancel () at 
../sysdeps/unix/syscall-template.S:84
#1  0x0087185a in WaitEventSetWaitBlock (set=0x246d1e8, cur_timeout=-1, 
occurred_events=0x73a43580, nevents=1) at latch.c:1048
#2  0x00871716 in WaitEventSetWait (set=0x246d1e8, timeout=-1, 
occurred_events=0x73a43580, nevents=1, wait_event_info=50331652) at 
latch.c:1000
#3  0x00870e6c in WaitLatchOrSocket (latch=0x7f7b975daba4, 
wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331652) at latch.c:385
#4  0x00870d3e in WaitLatch (latch=0x7f7b975daba4, wakeEvents=1, 
timeout=0, wait_event_info=50331652) at latch.c:339
#5  0x0088995c in ProcSleep (locallock=0x24764d0, lockMethodTable=0xc17420 
) at proc.c:1255
#6  0x008828cb in WaitOnLock (locallock=0x24764d0, owner=0x2517c78) at 
lock.c:1702
#7  0x0088157e in LockAcquireExtended (locktag=0x73a439a0, 
lockmode=5, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 
'\001') at lock.c:998
#8  0x00880ba8 in LockAcquire (locktag=0x73a439a0, lockmode=5, 
sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688
#9  0x0087f8dc in XactLockTableWait (xid=624, rel=0x0, ctid=0x0, 
oper=XLTW_None) at lmgr.c:587
#10 0x0082f524 in SnapBuildWaitSnapshot (running=0x246d1b0, cutoff=632) 
at snapbuild.c:1400
#11 0x0082f2f6 in SnapBuildFindSnapshot (builder=0x254b550, 
lsn=24410008, running=0x246d1b0) at snapbuild.c:1311
#12 0x0082ee67 in SnapBuildProcessRunningXacts (builder=0x254b550, 
lsn=24410008, running=0x246d1b0) at snapbuild.c:1106
#13 0x0081c458 in DecodeStandbyOp (ctx=0x25334b0, buf=0x73a43b10) 
at decode.c:314
#14 0x0081bf39 in LogicalDecodingProcessRecord (ctx=0x25334b0, 

Re: [HACKERS] CurTransactionContext freed before transaction COMMIT ???

2017-10-24 Thread Gaddam Sai Ram




> This sounds broken on its face --- if you want stuff to survive to
> top-level commit, you need to keep it in TopTransactionContext.
> CurTransactionContext might be a subtransaction's context that will
> go away at subtransaction commit/abort.

https://github.com/postgres/postgres/blob/master/src/backend/utils/mmgr/README

From the above README:

CurTransactionContext --- this holds data that has to survive until the end
of the current transaction, and in particular will be needed at top-level
transaction commit.  When we are in a top-level transaction this is the same
as TopTransactionContext, but in subtransactions it points to a child context.
It is important to understand that if a subtransaction aborts, its
CurTransactionContext is thrown away after finishing the abort processing;
but a committed subtransaction's CurTransactionContext is kept until top-level
commit (unless of course one of the intermediate levels of subtransaction
aborts).  This ensures that we do not keep data from a failed subtransaction
longer than necessary.  Because of this behavior, you must be careful to clean
up properly during subtransaction abort --- the subtransaction's state must be
delinked from any pointers or lists kept in upper transactions, or you will
have dangling pointers leading to a crash at top-level commit.  An example of
data kept here is pending NOTIFY messages, which are sent at top-level commit,
but only if the generating subtransaction did not abort.

-- So even if a sub-transaction is committed, its CurTransactionContext is
kept until top-level commit.

-- If a sub-transaction is aborted, we handled it (clearing the data) via
RegisterSubXactCallback().

And in our particular use case (which gave the segmentation fault), we didn't
issue any sub-transaction. It was a single-INSERT transaction. Even then it
resulted in some kind of context being freed.

Regards
G. Sai Ram








Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Simon Riggs
On 24 October 2017 at 09:50, Tom Lane  wrote:
> Simon Riggs  writes:
>> Remove the code that maintained two checkpoint's WAL files and all
>> associated stuff.
>
>> Try to avoid breaking anything else
>
>> This
>> * reduces disk space requirements on master
>> * removes a minor bug in fast failover
>> * simplifies code
>
> Doesn't it also make crash recovery less robust?  The whole point
> of that mechanism is to be able to cope if the latest checkpoint
> record is unreadable.  If you want to toss that overboard, I think
> you need to make the case why we don't need it, not just post a
> patch removing it.  *Of course* the code is simpler without it.
> That's utterly irrelevant.  The code would be even simpler with
> no crash recovery at all ... but we're not going there.

Well, the mechanism has already been partially removed since we don't
maintain two checkpoints on a standby. So all I'm proposing is we
remove the other half.

I've not seen a case myself, nor can I find an example online, where
reading the primary checkpoint failed yet the secondary did not also
fail from the same cause.

If that is a real possibility, now that we have pg_waldump we can
easily search for a different checkpoint and start from there instead,
which is a more flexible approach. If you didn't save your WAL and
don't have any other form of backup, relying on the secondary
checkpoint is not exactly a safe bet.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Tom Lane
Andres Freund  writes:
> On 2017-10-24 10:09:09 -0400, Tom Lane wrote:
>> There's an ancient saying that code can be arbitrarily fast if it
>> doesn't have to get the right answer.  I think this proposal falls
>> in that category.

> Does it? In plenty of cases getting infinity rather than an error is
> just about as useful.
> This was argued by a certain Tom Lane a few years back ;)
> http://archives.postgresql.org/message-id/19208.1167246902%40sss.pgh.pa.us

Yeah, but I lost the argument.  For better or worse, our expected
behavior is now that we throw errors.  You don't get to change that
just because it would save a few cycles.

>> SIGFPE isn't going to be easy to recover from, nor portable.

> Hm? A trivial hack implementing the above survives the regression test,
> with the exception of one output change because some functions currently
> do *not* check for overflow.  What's the issue you're concerned about?

The real problem with it is that it's a process-wide setting, and would
for example probably break PL/R, or other libraries that are not expecting
to lose control to overflows.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CurTransactionContext freed before transaction COMMIT ???

2017-10-24 Thread Tom Lane
Gaddam Sai Ram  writes:
> We are implementing in-memory index. As a part of that, during 
> index callbacks, under CurTransactionContext, we cache all the DMLs of a 
> transaction in dlist(postgres's doubly linked list).
> We registered transaction callback, and in transaction pre-commit 
> event(XACT_EVENT_PRE_COMMIT), we iterate through all cached DMLs(dlist) and 
> populate in my in-memory data structure.

This sounds broken on its face --- if you want stuff to survive to
top-level commit, you need to keep it in TopTransactionContext.
CurTransactionContext might be a subtransaction's context that will
go away at subtransaction commit/abort.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Andres Freund
On 2017-10-24 10:09:09 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > There's no comparable overflow handling to the above integer
> > intrinsics. But I think we can still do a lot better. Two very different
> > ways:
> 
> > 1) Just give up on detecting overflows for floats. Generating inf in
> >these cases actually seems entirely reasonable. We already don't
> >detect them in a bunch of cases anyway.  I can't quite parse the
> >standard's language around this.
> 
> There's an ancient saying that code can be arbitrarily fast if it
> doesn't have to get the right answer.  I think this proposal falls
> in that category.

Does it? In plenty of cases getting infinity rather than an error is
just about as useful.

This was argued by a certain Tom Lane a few years back ;)
http://archives.postgresql.org/message-id/19208.1167246902%40sss.pgh.pa.us


> > 2) Use platform specific float exception handling where available. We
> >could at backend start, and in FloatExceptionHandler(), use
> >feenableexcept() (windows has similar) to trigger SIGFPE on float
> >overflow.
> 
> SIGFPE isn't going to be easy to recover from, nor portable.

Hm? A trivial hack implementing the above survives the regression test,
with the exception of one output change because some functions currently
do *not* check for overflow.  What's the issue you're concerned about?

The portability indeed is a problem.


> I think what you actually want to do is *disable* SIGFPE (see
> feholdexcept), and then have individual functions use feclearexcept
> and fetestexcept.  These functions were standardized by C99 so
> they should be pretty widely available ... of course, whether they
> actually are widely portable remains to be seen.  Whether they're
> faster than what we're doing now also remains to be seen.

I tested it, and they're fairly slow on at least gcc-7 + glibc 2.24.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] CurTransactionContext freed before transaction COMMIT ???

2017-10-24 Thread Gaddam Sai Ram


Hello people,



We are implementing an in-memory index. As a part of that, during 
index callbacks, under CurTransactionContext, we cache all the DMLs of a 
transaction in a dlist (postgres's doubly linked list).

We registered a transaction callback, and in the transaction pre-commit 
event (XACT_EVENT_PRE_COMMIT), we iterate through all cached DMLs (dlist) and 
populate our in-memory data structure.



   -- For a detailed explanation of our implementation, please look 
at the thread below.






   
https://www.postgresql.org/message-id/15f4ea99b34.e69a4e1b633.8937474039794097334%40zohocorp.com




   -- It was working fine until I was greeted with a segmentation 
fault while accessing the dlist!

   

   -- On further examination I found that the dlist head is not NULL and 
it is pointing to some address, but the container I extracted is pointing to 
0x7f7f7f7f7f7f7f5f, and I was not able to access any members in my container.



   -- 
https://wiki.postgresql.org/wiki/Developer_FAQ#Why_are_my_variables_full_of_0x7f_bytes.3F

From the above wiki, we came to the conclusion that the memory 
context in which that dlist was present had been freed.


And yes, CLOBBER_FREED_MEMORY is in effect, since I passed --enable-cassert 
to enable asserts in my build.



   -- Normally the memory context inside XACT_EVENT_PRE_COMMIT is 
TopTransactionContext but in this particular case, the memory context was 
'MessageContext'. A little unusual! Not sure why.



   -- So basically CurTransactionContext shouldn't get freed before 
transaction COMMIT.

   -- So what has actually happened here? Kindly help us with 
your insights!



For your reference, a code snippet of our implementation is pasted below:



Sample code snippet:



/*** Code snippet starts ***/

dlist_head DMLInfo = {{NULL, NULL}};

Insert_callback()
{
	old_context = MemoryContextSwitchTo(CurTransactionContext);

	GraphIndex_InsertInfo *current;

	current = (GraphIndex_InsertInfo *) palloc(sizeof(GraphIndex_InsertInfo));
	current->tableOid = tableOid;
	current->subTransactionID = GetCurrentSubTransactionId();
	current->colValue = (long) values[0];	/* varies with column type */
	current->ctid = ctid;

	dlist_push_head(&DMLInfo, &current->list_node);

	MemoryContextSwitchTo(old_context);
	return TRUE;
}

static void
GraphIndex_xactCallback(XactEvent event, void *arg)
{
	GraphIndex_Table *tableInfo;
	GraphIndex_InsertInfo *current;
	dlist_iter	iter;

	switch (event)
	{
		case XACT_EVENT_PRE_PREPARE:
		case XACT_EVENT_PRE_COMMIT:
			PG_TRY();
			{
				if (DMLInfo.head.next)
				{
					dlist_reverse_foreach(iter, &DMLInfo)
					{
						current = dlist_container(GraphIndex_InsertInfo,
												  list_node, iter.cur);
						DMLProcessingFunction(current); /* this is giving me the
														 * ERROR (while accessing
														 * members of current) */
					}
					DMLInfo.head.next = NULL;
				}
			}
			PG_CATCH();
			{
				/* Do cleanup */
			}
			PG_END_TRY();

			break;

		case XACT_EVENT_ABORT:
		case XACT_EVENT_PARALLEL_ABORT:
			/* No cleanup necessary (no structure created) */
			break;

		default:
			return;
	}
}

/*** Code snippet ends ***/



Regards

G. Sai Ram









Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Tom Lane
Andres Freund  writes:
> There's no comparable overflow handling to the above integer
> intrinsics. But I think we can still do a lot better. Two very different
> ways:

> 1) Just give up on detecting overflows for floats. Generating inf in
>these cases actually seems entirely reasonable. We already don't
>detect them in a bunch of cases anyway.  I can't quite parse the
>standard's language around this.

There's an ancient saying that code can be arbitrarily fast if it
doesn't have to get the right answer.  I think this proposal falls
in that category.

> 2) Use platform specific float exception handling where available. We
>could at backend start, and in FloatExceptionHandler(), use
>feenableexcept() (windows has similar) to trigger SIGFPE on float
>overflow.

SIGFPE isn't going to be easy to recover from, nor portable.

I think what you actually want to do is *disable* SIGFPE (see
feholdexcept), and then have individual functions use feclearexcept
and fetestexcept.  These functions were standardized by C99 so
they should be pretty widely available ... of course, whether they
actually are widely portable remains to be seen.  Whether they're
faster than what we're doing now also remains to be seen.
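
(For what it's worth, a standalone sketch of that feclearexcept()/fetestexcept()
pattern, using plain C99 <fenv.h>; the error path below is illustrative and not
PostgreSQL's ereport machinery:)

#include <fenv.h>
#include <stdio.h>
#include <stdlib.h>

#pragma STDC FENV_ACCESS ON		/* tell the compiler we inspect FP flags */

/* Sketch: add two doubles and test the sticky overflow flag afterwards. */
static double
float8pl_sketch(double a, double b)
{
	double		result;

	feclearexcept(FE_OVERFLOW);		/* reset the flag before the operation */
	result = a + b;
	if (fetestexcept(FE_OVERFLOW))	/* set iff the addition overflowed */
	{
		fprintf(stderr, "value out of range: overflow\n");
		exit(EXIT_FAILURE);
	}
	return result;
}

int
main(void)
{
	printf("%g\n", float8pl_sketch(1.7e308, 1.7e308));	/* triggers the error path */
	return 0;
}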

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Andres Freund
On 2017-10-24 09:50:12 -0400, Tom Lane wrote:
> Simon Riggs  writes:
> > Remove the code that maintained two checkpoint's WAL files and all
> > associated stuff.
> 
> > Try to avoid breaking anything else
> 
> > This
> > * reduces disk space requirements on master
> > * removes a minor bug in fast failover
> > * simplifies code
> 
> Doesn't it also make crash recovery less robust?

I think it does the contrary. The current mechanism is, in my opinion,
downright dangerous:
https://www.postgresql.org/message-id/20160201235854.go8...@awork2.anarazel.de

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Help needed regarding DSA based in-memory index!

2017-10-24 Thread Gaddam Sai Ram
Hi Munro,

  Thanks for cautioning us about possible memory leaks (during error cases) 
in case of long-lived DSA segments (have a look at the thread below for more 
details).



  
https://www.postgresql.org/message-id/CAEepm%3D3c4WAtSQG4tAF7Y_VCnO5cKh7KuFYZhpKbwGQOF%3DdZ4A%40mail.gmail.com


  Actually, we are following an approach to avoid these DSA memory leaks. Let 
me explain our implementation; please validate it and correct us in case we 
miss anything.



  Implementation:

  

  Basically, we have to put our index data (index column value vs. ctid), 
which we get in the aminsert callback function, into memory.

  

  Coming to the implementation: in the aminsert callback function,

We switch to CurTransactionContext

Cache the DMLs of a transaction into a dlist (global per process)

Even if different clients work in parallel, it won't be a problem, because every 
client gets one dlist in a separate process and it'll have its own 
CurTransactionContext

We have registered a transaction callback (using the RegisterXactCallback() 
function). During the pre-commit event (XACT_EVENT_PRE_COMMIT), we populate all 
the transaction-specific DMLs (from the dlist) into our in-memory index (DSA), 
obviously inside a PG_TRY/PG_CATCH block.

In case we get some errors (because of dsa_allocate() or something else) while 
processing the dlist (while populating the in-memory index), we clean up, in the 
PG_CATCH block, the DSA memory that was allocated/used up to that point (a 
sketch of this pattern follows below).

During other error cases, the transaction typically gets aborted and the 
PRE_COMMIT event is not called, so we don't touch the DSA at that time. Hence 
there is no need to bother about leaks.

Even the subtransaction case is handled, with subtransaction callbacks.

CurTransactionContext (the dlist, basically) is automatically cleared after that 
particular transaction.
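
(As an aside, a rough sketch of that pre-commit populate-with-cleanup pattern.
dsa_allocate()/dsa_free() are the real DSA calls; the area handle, entry size
and the insertion step are placeholders, not the actual implementation:)

#include "postgres.h"
#include "utils/dsa.h"

/*
 * Sketch: move N cached entries into a long-lived DSA area at pre-commit.
 * If anything errors out part-way (e.g. dsa_allocate() fails), free what
 * was allocated so far so the long-lived segment does not leak.
 */
static void
flush_cache_to_dsa(dsa_area *index_area, int nentries)
{
	dsa_pointer *allocated = palloc(nentries * sizeof(dsa_pointer));
	volatile int nallocated = 0;	/* volatile: read in PG_CATCH after longjmp */
	int			i;

	PG_TRY();
	{
		for (i = 0; i < nentries; i++)
		{
			allocated[i] = dsa_allocate(index_area, 64);	/* size is illustrative */
			nallocated++;
			/* ... copy entry into dsa_get_address(index_area, allocated[i]) ... */
		}
	}
	PG_CATCH();
	{
		/* undo the partial work so the long-lived area doesn't leak */
		for (i = 0; i < nallocated; i++)
			dsa_free(index_area, allocated[i]);
		PG_RE_THROW();
	}
	PG_END_TRY();

	pfree(allocated);
}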



I want to know if this approach is good and works well in all cases. Kindly 
provide your feedback on this.



 


 
Regards

G. Sai Ram








Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Tom Lane
Simon Riggs  writes:
> Remove the code that maintained two checkpoint's WAL files and all
> associated stuff.

> Try to avoid breaking anything else

> This
> * reduces disk space requirements on master
> * removes a minor bug in fast failover
> * simplifies code

Doesn't it also make crash recovery less robust?  The whole point
of that mechanism is to be able to cope if the latest checkpoint
record is unreadable.  If you want to toss that overboard, I think
you need to make the case why we don't need it, not just post a
patch removing it.  *Of course* the code is simpler without it.
That's utterly irrelevant.  The code would be even simpler with
no crash recovery at all ... but we're not going there.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transform for pl/perl

2017-10-24 Thread Anthony Bykov
There are a few points I should mention:
1. {"1":1}::jsonb is transformed into the HV {"1"=>"1"}, while
["1","2"]::jsonb is transformed into the AV ["1", "2"]

2. If a numeric value appears in jsonb, it will be transformed
to SVnv through a string (Numeric->String->SV->SVnv). Not the best
solution, but as far as I understand this is the usual practice in
PostgreSQL for serializing and de-serializing Numerics.

3. SVnv is transformed into jsonb through a string
(SVnv->String->Numeric).

An example may also be helpful for understanding the extension. As an
example, the function "test" transforms incoming jsonb into Perl,
transforms it back into jsonb and returns it.

create extension jsonb_plperl cascade;

create or replace function test(val jsonb)
returns jsonb
transform for type jsonb
language plperl
as $$
return $_[0];
$$;

select test('{"1":1,"example": null}'::jsonb);


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Greg Stark
We already know this integer overflow checking is non-standard and
compilers keep trying to optimize it out.  Our only strategy to
defeat that depends on compiler flags like -fwrapv that vary by
compiler and may or may not be working on less well tested compilers.

So if there's a nice readable and convenient way to portably use cpu
flags, that would be brilliant. And I'm not too concerned if it doesn't
run on VAX.
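
(For reference, an overflow check can also be written entirely in well-defined
C, which no conforming compiler may optimize away; a sketch, independent of any
particular compiler's intrinsics:)

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: test for signed 32-bit addition overflow *before* adding, so the
 * code never executes the undefined-behaviour case and needs no -fwrapv.
 */
static bool
add_s32_checked(int32_t a, int32_t b, int32_t *result)
{
	if ((b > 0 && a > INT32_MAX - b) ||
		(b < 0 && a < INT32_MIN - b))
		return false;			/* would overflow */
	*result = a + b;
	return true;
}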


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

2017-10-24 Thread Alexander Korotkov
On Tue, Oct 24, 2017 at 10:56 AM, Ivan Kartyshov  wrote:

> Hello. I made some bugfixes and rewrote the patch.
>
> Simon Riggs wrote on 2017-09-05 14:44:
>
>> As Alexander says, simply skipping truncation if standby is busy isn't
>> a great plan.
>>
>> If we defer an action on standby replay, when and who will we apply
>> it? What happens if the standby is shutdown or crashes while an action
>> is pending.
>>
>
> After a crash, the standby server will go back to the nearest checkpoint and
> will replay all of its WAL to reach a consistent state, and then open for
> read-only load. (All collected WAL will be replayed in the right order; I
> mean both the WAL that truncates the table and the WAL that makes the table
> grow.)


> --- a/src/backend/storage/lmgr/lock.c
> +++ b/src/backend/storage/lmgr/lock.c
> @@ -49,6 +49,8 @@
>  #include "utils/ps_status.h"
>  #include "utils/resowner_private.h"
>
> +#include 
> +#include 
>
>  /* This configuration variable is used to set the lock table size */
>  int max_locks_per_xact; /* set by guc.c */


Why do you need these includes?

> + FreeFakeRelcacheEntry(rel);
> + } else
> + {
> + XLogFlush(lsn);
> + }


This code violates our coding style.  According to it, "else" shouldn't
share its line with braces.
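
(That is, in PostgreSQL's brace style the quoted hunk would presumably be laid
out like this:)

			FreeFakeRelcacheEntry(rel);
		}
		else
		{
			XLogFlush(lsn);
		}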

Also, patch definitely needs some general comment explaining what's going
on.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [POC] hash partitioning

2017-10-24 Thread amul sul
On Tue, Oct 24, 2017 at 5:00 PM, Andres Freund  wrote:
> On 2017-10-24 12:43:12 +0530, amul sul wrote:
>> I tried to get suggested SMHasher[1] test result for the hash_combine
>> for 32-bit and 64-bit version.
>>
>> SMHasher works on hash keys of the form {0}, {0,1}, {0,1,2}... up to
>> N=255, using 256-N as the seed, for the hash_combine testing we
>> needed two hash value to be combined, for that, I've generated 64
>> and 128-bit hash using cityhash functions[2] for the given smhasher
>> key then split in two part to test 32-bit and 64-bit hash_combine
>> function respectively.   Attached patch for SMHasher code changes &
>> output of 32-bit and 64-bit hash_combine testing. Note that I have
>> skipped speed test this test which is irrelevant here.
>>
>> By referring other hash function results [3], we can see that hash_combine
>> test results are not bad either.
>>
>> Do let me know if current testing is not good enough or if you want me to do
>> more testing, thanks.
>
> This looks very good! Both the tests you did, and the results for
> hash_combineXX. I therefore think we can go ahead with that formulation
> of hash_combine64?
>

Thanks, Andres. Yes we can, I've added your suggested hash_combine64 in
the latest patch[1].

Regards,
Amul

1] 
https://postgr.es/m/CAAJ_b97R2rJinGPAVmZZzpNV%3D-5BgYFxDfY9HYdM1bCYJFGmQw%40mail.gmail.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-10-24 Thread Andres Freund
On 2017-10-24 12:43:12 +0530, amul sul wrote:
> I tried to get suggested SMHasher[1] test result for the hash_combine
> for 32-bit and 64-bit version.
> 
> SMHasher works on hash keys of the form {0}, {0,1}, {0,1,2}... up to
> N=255, using 256-N as the seed, for the hash_combine testing we
> needed two hash value to be combined, for that, I've generated 64
> and 128-bit hash using cityhash functions[2] for the given smhasher
> key then split in two part to test 32-bit and 64-bit hash_combine
> function respectively.   Attached patch for SMHasher code changes &
> output of 32-bit and 64-bit hash_combine testing. Note that I have
> skipped speed test this test which is irrelevant here.
> 
> By referring other hash function results [3], we can see that hash_combine
> test results are not bad either.
> 
> Do let me know if current testing is not good enough or if you want me to do
> more testing, thanks.

This looks very good! Both the tests you did, and the results for
hash_combineXX. I therefore think we can go ahead with that formulation
of hash_combine64?
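
(For readers following along, the Boost-style combiner being evaluated here has
roughly this shape; the exact constants and shift counts below are the
conventional ones and should be treated as an assumption, not a quote from the
patch:)

#include <stdint.h>

/* Sketch of Boost-style hash combiners of the kind discussed above. */
static inline uint32_t
hash_combine32_sketch(uint32_t a, uint32_t b)
{
	a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
	return a;
}

static inline uint64_t
hash_combine64_sketch(uint64_t a, uint64_t b)
{
	a ^= b + UINT64_C(0x49a0f4dd15e5a8e3) + (a << 54) + (a >> 7);
	return a;
}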

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Transform for pl/perl

2017-10-24 Thread anthony
Hello.
Please check out the jsonb transform
(https://www.postgresql.org/docs/9.5/static/sql-createtransform.html)
for the pl/perl language that I've implemented.

diff --git a/contrib/Makefile b/contrib/Makefile
index 8046ca4..53d44fe 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -75,9 +75,9 @@ ALWAYS_SUBDIRS += sepgsql
 endif
 
 ifeq ($(with_perl),yes)
-SUBDIRS += hstore_plperl
+SUBDIRS += hstore_plperl jsonb_plperl
 else
-ALWAYS_SUBDIRS += hstore_plperl
+ALWAYS_SUBDIRS += hstore_plperl jsonb_plperl
 endif
 
 ifeq ($(with_python),yes)
diff --git a/contrib/jsonb_plperl/Makefile b/contrib/jsonb_plperl/Makefile
new file mode 100644
index 000..8c427c5
--- /dev/null
+++ b/contrib/jsonb_plperl/Makefile
@@ -0,0 +1,40 @@
+# contrib/jsonb_plperl/Makefile
+
+MODULE_big = jsonb_plperl
+OBJS = jsonb_plperl.o $(WIN32RES)
+PGFILEDESC = "jsonb_plperl - jsonb transform for plperl"
+
+PG_CPPFLAGS = -I$(top_srcdir)/src/pl/plperl
+
+EXTENSION = jsonb_plperlu jsonb_plperl
+DATA = jsonb_plperlu--1.0.sql jsonb_plperl--1.0.sql
+
+REGRESS = jsonb_plperl jsonb_plperlu
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/jsonb_plperl
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We must link libperl explicitly
+ifeq ($(PORTNAME), win32)
+# these settings are the same as for plperl
+override CPPFLAGS += -DPLPERL_HAVE_UID_GID -Wno-comment
+# ... see silliness in plperl Makefile ...
+SHLIB_LINK += $(sort $(wildcard ../../src/pl/plperl/libperl*.a))
+else
+rpathdir = $(perl_archlibexp)/CORE
+SHLIB_LINK += $(perl_embed_ldflags)
+endif
+
+# As with plperl we need to make sure that the CORE directory is included
+# last, probably because it sometimes contains some header files with names
+# that clash with some of ours, or with some that we include, notably on
+# Windows.
+override CPPFLAGS := $(CPPFLAGS) $(perl_embed_ccflags) -I$(perl_archlibexp)/CORE
diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out
new file mode 100644
index 000..7a85361
--- /dev/null
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out
@@ -0,0 +1,76 @@
+CREATE EXTENSION jsonb_plperl CASCADE;
+NOTICE:  installing required extension "plperl"
+-- test hash -> jsonb
+CREATE FUNCTION testHVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = {a => 1, b => 'boo', c => undef};
+return $val;
+$$;
+SELECT testHVToJsonb();
+  testhvtojsonb  
+-
+ {"a": 1, "b": "boo", "c": null}
+(1 row)
+
+-- test array -> jsonb
+CREATE FUNCTION testAVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = [{a => 1, b => 'boo', c => undef}, {d => 2}];
+return $val;
+$$;
+SELECT testAVToJsonb();
+testavtojsonb
+-
+ [{"a": 1, "b": "boo", "c": null}, {"d": 2}]
+(1 row)
+
+-- test scalar -> jsonb
+CREATE FUNCTION testSVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = 1;
+return $val;
+$$;
+SELECT testAVToJsonb();
+testavtojsonb
+-
+ [{"a": 1, "b": "boo", "c": null}, {"d": 2}]
+(1 row)
+
+-- test jsonb -> scalar -> jsonb
+CREATE FUNCTION testSVToJsonb2(val jsonb) RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+return $_[0];
+$$;
+SELECT testSVToJsonb2('1');
+ testsvtojsonb2 
+
+ 1
+(1 row)
+
+SELECT testSVToJsonb2('[1,2,3]');
+ testsvtojsonb2 
+
+ [1, 2, 3]
+(1 row)
+
+SELECT testSVToJsonb2('{"1":{"2":[3,4,5]},"2":3}');
+ testsvtojsonb2  
+-
+ {"1": {"2": [3, 4, 5]}, "2": 3}
+(1 row)
+
+DROP EXTENSION plperl CASCADE;
+NOTICE:  drop cascades to 5 other objects
+DETAIL:  drop cascades to extension jsonb_plperl
+drop cascades to function testhvtojsonb()
+drop cascades to function testavtojsonb()
+drop cascades to function testsvtojsonb()
+drop cascades to function testsvtojsonb2(jsonb)
diff --git a/contrib/jsonb_plperl/expected/jsonb_plperlu.out b/contrib/jsonb_plperl/expected/jsonb_plperlu.out
new file mode 100644
index 000..6d4be1c
--- /dev/null
+++ b/contrib/jsonb_plperl/expected/jsonb_plperlu.out
@@ -0,0 +1,33 @@
+CREATE EXTENSION jsonb_plperlu CASCADE;
+NOTICE:  installing required extension "plperlu"
+-- test jsonb -> hash
+CREATE FUNCTION testJsonbToHV(val jsonb) RETURNS jsonb
+LANGUAGE plperlu
+TRANSFORM FOR TYPE jsonb
+AS $$
+return $_[0];
+$$;
+SELECT testJsonbToHV('{"aa":"bb", "cc":null, "dd":2}'::jsonb);
+   testjsonbtohv   
+---
+ {"aa": "bb", "cc": null, "dd": 2}
+(1 row)
+
+-- test jsonb -> av
+CREATE FUNCTION testJsonbToAV(val jsonb) RETURNS jsonb
+LANGUAGE plperlu
+TRANSFORM FOR TYPE jsonb
+AS $$
+return $_[0];
+$$;
+SELECT testJsonbToAV('["bb", 

Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-24 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Before pushing, I'll give a look to the regular autovacuum path to see
> if it needs a similar fix.

Reading that one, my conclusion is that it doesn't have the same problem
because the strings are allocated in AutovacMemCxt, which is not reset
by error recovery.  This gave me the idea to use that context instead of
TopTransactionContext to store the strings in workitem processing also.
The patch I propose now is attached.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 013e54c9df54b7a324a30beba22138a86417f34f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Oct 2017 18:55:12 +0200
Subject: [PATCH v2] Fix autovacuum work items

---
 src/backend/postmaster/autovacuum.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 776b1c0a9d..dc76728d71 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2521,9 +2521,10 @@ deleted:
{
			AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i];
 
-   if (!workitem->avw_used)
-   continue;
-   if (workitem->avw_active)
+   /* Item empty, already being processed, not this database? skip it */
+   if (!workitem->avw_used ||
+   workitem->avw_active ||
+   workitem->avw_database != MyDatabaseId)
continue;
 
/* claim this one, and release lock while performing it */
@@ -2531,6 +2532,7 @@ deleted:
LWLockRelease(AutovacuumLock);
 
perform_work_item(workitem);
+   /* we intentionally do not set did_vacuum here */
 
/*
 * Check for config changes before acquiring lock for further
@@ -2601,11 +2603,9 @@ perform_work_item(AutoVacuumWorkItem *workitem)
/*
 * Save the relation name for a possible error message, to avoid a catalog
 * lookup in case of an error.  If any of these return NULL, then the
-* relation has been dropped since last we checked; skip it. Note: they
-* must live in a long-lived memory context because we call vacuum and
-* analyze in different transactions.
+* relation has been dropped since last we checked; skip it.
 */
-
+   MemoryContextSwitchTo(AutovacMemCxt);
cur_relname = get_rel_name(workitem->avw_relation);
	cur_nspname = get_namespace_name(get_rel_namespace(workitem->avw_relation));
cur_datname = get_database_name(MyDatabaseId);
@@ -2615,6 +2615,8 @@ perform_work_item(AutoVacuumWorkItem *workitem)
autovac_report_workitem(workitem, cur_nspname, cur_datname);
 
/*
+* Have at it.
+*
 * We will abort the current work item if something errors out, and
 * continue with the next one; in particular, this happens if we are
 * interrupted with SIGINT.  Note that this means that the work item list
@@ -2622,9 +2624,6 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 */
PG_TRY();
{
-   /* have at it */
-   MemoryContextSwitchTo(TopTransactionContext);
-
switch (workitem->avw_type)
{
case AVW_BRINSummarizeRange:
@@ -2668,10 +2667,8 @@ perform_work_item(AutoVacuumWorkItem *workitem)
}
PG_END_TRY();
 
-   /* We intentionally do not set did_vacuum here */
-
-   /* be tidy */
 deleted2:
+   /* be tidy */
if (cur_datname)
pfree(cur_datname);
if (cur_nspname)
-- 
2.11.0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Andres Freund
Hi,

In analytics queries that involve large amounts of integers and/or
floats (i.e. a large percentage) it's quite easy to see the functions
underlying the operators in profiles. Partially that's the function call
overhead, but even *after* removing most of that via JITing, they're
surprisingly expensive.

Largely that's due to the overflow checks.

For integers we currently do:

#define SAMESIGN(a,b)   (((a) < 0) == ((b) < 0))

/*
 * Overflow check.  If the inputs are of different signs then their sum
 * cannot overflow.  If the inputs are of the same sign, their sum had
 * better be that sign too.
 */
if (SAMESIGN(arg1, arg2) && !SAMESIGN(result, arg1))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
 errmsg("integer out of range")));

which means that we turn a single integer instruction into ~10,
including a bunch of branches.  All that despite the fact that most
architectures have flag registers signalling integer overflow. It's just
that C doesn't easily make that available.

gcc exposes more efficient overflow detection via intrinsics:
https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Integer-Overflow-Builtins.html

Using that turns the non-error path from int4pl from:

   0x00826ec0 <+0>: mov0x20(%rdi),%rcx # arg1
   0x00826ec4 <+4>: mov0x28(%rdi),%rdx # arg2
   0x00826ec8 <+8>: mov%ecx,%esi
   0x00826eca <+10>:lea(%rdx,%rcx,1),%eax # add
   # overflow check
   0x00826ecd <+13>:shr$0x1f,%edx
   0x00826ed0 <+16>:not%esi
   0x00826ed2 <+18>:shr$0x1f,%esi
   0x00826ed5 <+21>:cmp%dl,%sil
   0x00826ed8 <+24>:je 0x826f30 
   0x00826eda <+26>:mov%eax,%edx
   0x00826edc <+28>:shr$0x1f,%ecx
   0x00826edf <+31>:shr$0x1f,%edx
   0x00826ee2 <+34>:cmp%cl,%dl
   0x00826ee4 <+36>:je 0x826f30 
   /* overflow error code */
   0x00826f30 <+112>:   retq

into

   0x00826ec0 <+0>: mov0x28(%rdi),%rax # arg2
   0x00826ec4 <+4>: add0x20(%rdi),%eax # arg1 + arg2
   0x00826ec7 <+7>: jo 0x826ecc  # jump if overflowed
   0x00826ec9 <+9>: mov%eax,%eax # clear high bits
   0x00826ecb <+11>:retq

which, not that surprisingly, is faster. Not to speak of easier to read
;)
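
(In C source form, the intrinsic-based version corresponds to something like
the following sketch; the function name is illustrative, not the eventual
patch:)

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: 32-bit addition via the GCC/Clang overflow intrinsic
 * (GCC 5+ and recent Clang).  The (possibly wrapped) sum is written to
 * *result; the return value is true iff the mathematical sum overflowed.
 */
static inline bool
pg_add_s32_overflow_sketch(int32_t a, int32_t b, int32_t *result)
{
	return __builtin_add_overflow(a, b, result);
}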

Besides the fact that the code is faster, there's also the issue that
the current way to do overflow checks is not actually correct C, and
requires compiler flags like -fwrapv.


For floating point it's even worse.

/*
 * check to see if a float4/8 val has underflowed or overflowed
 */
#define CHECKFLOATVAL(val, inf_is_valid, zero_is_valid)			\
do {															\
	if (isinf(val) && !(inf_is_valid))							\
		ereport(ERROR,											\
				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),	\
				 errmsg("value out of range: overflow")));		\
																\
	if ((val) == 0.0 && !(zero_is_valid))						\
		ereport(ERROR,											\
				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),	\
				 errmsg("value out of range: underflow")));		\
} while(0)

result = arg1 + arg2;

/*
 * There isn't any way to check for underflow of addition/subtraction
 * because numbers near the underflow value have already been rounded to
 * the point where we can't detect that the two values were originally
 * different, e.g. on x86, '1e-45'::float4 == '2e-45'::float4 ==
 * 1.4013e-45.
 */
CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2), true);

The disassembled code for float4pl is:
   0x0043ce90 <+0>: vmovss 0x20(%rdi),%xmm1
   0x0043ce95 <+5>: vmovss 0x28(%rdi),%xmm2
   0x0043ce9a <+10>:vmovss 0x2b6a7e(%rip),%xmm3# 0x6f3920
   0x0043cea2 <+18>:vaddss %xmm1,%xmm2,%xmm0
   0x0043cea6 <+22>:vmovaps %xmm0,%xmm4
   0x0043ceaa <+26>:vandps %xmm3,%xmm4,%xmm4
   0x0043ceae <+30>:vucomiss 0x2b6a4a(%rip),%xmm4# 0x6f3900
   0x0043ceb6 <+38>:jbe0x43ced4 
   0x0043ceb8 <+40>:vandps %xmm3,%xmm1,%xmm1
   0x0043cebc 

Re: [HACKERS] Parallel Hash take II

2017-10-24 Thread Thomas Munro
On Tue, Sep 19, 2017 at 8:06 AM, Robert Haas  wrote:
> On Thu, Sep 14, 2017 at 10:01 AM, Thomas Munro
>  wrote:
>> 3.  Gather Merge and Parallel Hash Join may have a deadlock problem.
>
> [...]
>
> Thomas and I spent about an hour and a half brainstorming about this
> just now.
>
> [...]
>
> First, as long as nbatches == 1, we can use a hash
> table of up to size (participants * work_mem); if we have to switch to
> multiple batches, then just increase the number of batches enough that
> the current memory usage drops below work_mem.  Second, following an
> idea originally by Ashutosh Bapat whose relevance to this issue Thomas
> Munro realized during our discussion, we can make all the batches
> small enough to fit in work_mem (rather than participants * work_mem
> as the current patch does) and spread them across the workers (in the
> style of Parallel Append, including potentially deploying multiple
> workers against the same batch if there are fewer batches than
> workers).

Here is an updated patch set that does that ^.

Assorted details:

1.  To avoid deadlocks, we can only wait at barriers when we know that
all other attached backends have either arrived already or are
actively executing the code preceding the barrier wait, so that
progress is guaranteed.  The rule is that executor nodes can remain
attached to a barrier after they've emitted a tuple, which is useful
for resource management (ie avoids inventing a separate reference
counting scheme), but must never again wait for it.  With that
programming rule there can be no deadlock between executor nodes.

2.  Multiple batches are processed at the same time in parallel,
rather than being processed serially.  Participants try to spread
themselves out over different batches to reduce contention.

3.  I no longer try to handle outer joins.  I have an idea for how to
do that while adhering to the above deadlock-avoidance programming
rule, but I would like to consider that for a later patch.

4.  I moved most of the parallel-aware code into ExecParallelHash*()
functions that exist alongside the private hash table versions.  This
avoids uglifying the existing hash join code and introducing
conditional branches all over the place, as requested by Andres.  This
made some of the earlier refactoring patches unnecessary.

5.  Inner batch repartitioning, if required, is now completed up front
for Parallel Hash.  Rather than waiting until we try to load hash
tables back into memory to discover that they don't fit, this version
tracks the size of hash table contents while writing the batches out.
That change has various pros and cons, but its main purpose is to
remove dependencies between batches.

6.  Outer batch repartitioning is now done up front too, if it's
necessary.  This removes the dependencies that exist between batch 0
and later batches, but means that outer batch 0 is now written to disk
for multi-batch joins.  I don't see any way to avoid that while
adhering to the deadlock avoidance rule stated above.  If we've
already started emitting tuples for batch 0 (by returning control to
higher nodes) then we have no deadlock-free way to wait for the scan
of the outer relation to finish, so we can't safely process later
batches.  Therefore we must buffer batch 0's tuples.  This renders the
skew optimisation useless.

7.  There is now some book-keeping state for each batch.  For typical
cases the total space used is negligible but obviously you can
contrive degenerate cases where it eats a lot of memory (say by
creating a million batches, which is unlikely to work well for other
reasons).  I have some ideas on reducing their size, but on the other
hand I also have some ideas on increasing it profitably... (this is
the perfect place to put the Bloom filters discussed elsewhere that
would  make up for the loss of the skew optimisation, for selective
joins; a subject for another day).

8.  Barrier API extended slightly.  (1) BarrierWait() is renamed to
BarrierArriveAndWait().  (2) BarrierArriveAndDetach() is new.  The new
function is the mechanism required to get from PHJ_BATCH_PROBING to
PHJ_BATCH_DONE without waiting, and corresponds to the operation known
as Phaser.arriveAndDeregister() in Java (and maybe also
barrier::arrive_and_drop() in the C++ concurrency TS and Clock.drop()
in X10, not sure).
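
(A rough sketch of how a participant might use the two calls named above under
the rule from point 1; the surrounding names, the phase test and the wait-event
argument are assumptions for illustration, not the patch's actual code:)

	/* Sketch only: cooperate on building a shared batch, then detach. */
	int			phase = BarrierAttach(&batch_barrier);

	if (phase == PHJ_BATCH_BUILDING)	/* phase name illustrative */
	{
		help_build_hash_table();
		BarrierArriveAndWait(&batch_barrier, 0);	/* safe: no tuples emitted yet */
	}

	/* ... probe and emit tuples; after this point we must never wait ... */

	BarrierArriveAndDetach(&batch_barrier);	/* drops our attachment, never blocks */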

9.  I got rid of PARALLEL_KEY_EXECUTOR_NODE_NTH().  Previously I
wanted more than one reserved shm_toc key per executor node.  Robert
didn't like that.

10.  I got rid of "LeaderGate".  That earlier attempt at deadlock
avoidance clearly missed the mark.  (I think it probably defended
against the Gather + full TupleQueue deadlock but not the
GatherMerge-induced deadlock so it was a useless non-general
solution.)

11.  The previous incarnation of SharedTuplestore had a built-in
concept of partitions, which allowed the number of partitions to be
expanded any time but only allowed one partition to be read back 

Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

2017-10-24 Thread Ivan Kartyshov

Hello. I made some bugfixes and rewrote the patch.

Simon Riggs wrote on 2017-09-05 14:44:

> As Alexander says, simply skipping truncation if standby is busy isn't
> a great plan.
>
> If we defer an action on standby replay, when and who will we apply
> it? What happens if the standby is shutdown or crashes while an action
> is pending.


After a crash, the standby server will go back to the nearest checkpoint and
will replay all of its WAL to reach a consistent state, and then open for
read-only load.

(All collected WAL will be replayed in the right order; I mean both the WAL
that truncates the table and the WAL that makes the table grow.)

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde0..68decab 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -31,6 +31,9 @@
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "storage/lmgr.h"
+#include "storage/procarray.h"
+#include "access/transam.h"
 
 /*
  * We keep a list of all relations (represented as RelFileNode values)
@@ -498,50 +501,56 @@ smgr_redo(XLogReaderState *record)
 
 		reln = smgropen(xlrec->rnode, InvalidBackendId);
 
-		/*
-		 * Forcibly create relation if it doesn't exist (which suggests that
-		 * it was dropped somewhere later in the WAL sequence).  As in
-		 * XLogReadBufferForRedo, we prefer to recreate the rel and replay the
-		 * log as best we can until the drop is seen.
-		 */
-		smgrcreate(reln, MAIN_FORKNUM, true);
-
-		/*
-		 * Before we perform the truncation, update minimum recovery point to
-		 * cover this WAL record. Once the relation is truncated, there's no
-		 * going back. The buffer manager enforces the WAL-first rule for
-		 * normal updates to relation files, so that the minimum recovery
-		 * point is always updated before the corresponding change in the data
-		 * file is flushed to disk. We have to do the same manually here.
-		 *
-		 * Doing this before the truncation means that if the truncation fails
-		 * for some reason, you cannot start up the system even after restart,
-		 * until you fix the underlying situation so that the truncation will
-		 * succeed. Alternatively, we could update the minimum recovery point
-		 * after truncation, but that would leave a small window where the
-		 * WAL-first rule could be violated.
-		 */
-		XLogFlush(lsn);
-
-		if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+		if (!ConditionalLockRelationOid(reln->smgr_rnode.node.relNode, AccessExclusiveLock))
 		{
-			smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
+			/*
+			 * Forcibly create relation if it doesn't exist (which suggests that
+			 * it was dropped somewhere later in the WAL sequence).  As in
+			 * XLogReadBufferForRedo, we prefer to recreate the rel and replay the
+			 * log as best we can until the drop is seen.
+			 */
+			smgrcreate(reln, MAIN_FORKNUM, true);
+
+			/*
+			 * Before we perform the truncation, update minimum recovery point to
+			 * cover this WAL record. Once the relation is truncated, there's no
+			 * going back. The buffer manager enforces the WAL-first rule for
+			 * normal updates to relation files, so that the minimum recovery
+			 * point is always updated before the corresponding change in the data
+			 * file is flushed to disk. We have to do the same manually here.
+			 *
+			 * Doing this before the truncation means that if the truncation fails
+			 * for some reason, you cannot start up the system even after restart,
+			 * until you fix the underlying situation so that the truncation will
+			 * succeed. Alternatively, we could update the minimum recovery point
+			 * after truncation, but that would leave a small window where the
+			 * WAL-first rule could be violated.
+			 */
+			XLogFlush(lsn);
 
-			/* Also tell xlogutils.c about it */
-			XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
-		}
+			if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+			{
+smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
+
+/* Also tell xlogutils.c about it */
+XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
+			}
 
-		/* Truncate FSM and VM too */
-		rel = CreateFakeRelcacheEntry(xlrec->rnode);
+			/* Truncate FSM and VM too */
+			rel = CreateFakeRelcacheEntry(xlrec->rnode);
 
-		if ((xlrec->flags & SMGR_TRUNCATE_FSM) != 0 &&
-			smgrexists(reln, FSM_FORKNUM))
-			FreeSpaceMapTruncateRel(rel, xlrec->blkno);
-		if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0 &&
-			smgrexists(reln, VISIBILITYMAP_FORKNUM))
-			visibilitymap_truncate(rel, xlrec->blkno);
+			if ((xlrec->flags & SMGR_TRUNCATE_FSM) != 0 &&
+smgrexists(reln, FSM_FORKNUM))
+FreeSpaceMapTruncateRel(rel, xlrec->blkno);
+			if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0 &&
+smgrexists(reln, VISIBILITYMAP_FORKNUM))
+visibilitymap_truncate(rel, xlrec->blkno);
 
-		FreeFakeRelcacheEntry(rel);
+			FreeFakeRelcacheEntry(rel);
+		} else
+		{
+			

Re: [HACKERS] [POC] hash partitioning

2017-10-24 Thread amul sul
On Fri, Oct 13, 2017 at 3:00 AM, Andres Freund  wrote:
> On 2017-10-12 17:27:52 -0400, Robert Haas wrote:
>> On Thu, Oct 12, 2017 at 4:20 PM, Andres Freund  wrote:
>> >> In other words, it's not utterly fixed in stone --- we invented
>> >> --load-via-partition-root primarily to cope with circumstances that
>> >> could change hash values --- but we sure don't want to be changing it
>> >> with any regularity, or for a less-than-excellent reason.
>> >
>> > Yea, that's what I expected. It'd probably good for somebody to run
>> > smhasher or such on the output of the combine function (or even better,
>> > on both the 32 and 64 bit variants) in that case.
>>
>> Not sure how that test suite works exactly, but presumably the
>> characteristics in practice will depend the behavior of the hash
>> functions used as input the combine function - so the behavior could
>> be good for an (int, int) key but bad for a (text, date) key, or
>> whatever.
>
> I don't think that's true, unless you have really bad hash functions on
> the the component hashes. A hash combine function can't really do
> anything about badly hashed input, what you want is that it doesn't
> *reduce* the quality of the hash by combining.
>

I tried to get suggested SMHasher[1] test result for the hash_combine
for 32-bit and 64-bit version.

SMHasher works on hash keys of the form {0}, {0,1}, {0,1,2}... up to
N=255, using 256-N as the seed.  For the hash_combine testing we
needed two hash values to be combined; for that, I've generated 64-
and 128-bit hashes using the cityhash functions[2] for the given smhasher
key and then split them in two parts to test the 32-bit and 64-bit
hash_combine functions respectively.  Attached are the patch with the
SMHasher code changes and the output of the 32-bit and 64-bit
hash_combine testing.  Note that I have skipped the speed test, which
is irrelevant here.

By comparing with other hash function results [3], we can see that the
hash_combine test results are not bad either.

Do let me know if current testing is not good enough or if you want me to do
more testing, thanks.

1] https://github.com/aappleby/smhasher
2] https://github.com/aappleby/smhasher/blob/master/src/CityTest.cpp
3] https://github.com/rurban/smhasher/tree/master/doc

Regards,
Amul
[amul.sul@power2 src]$ ./SMHasher hash_combine64
---
--- Testing hash_combine64 (test hash combine 64 pg function.)

[[[ Sanity Tests ]]]   

Verification value 0x3B439A64 : Passed!
Running sanity check 1..PASS
Running sanity check 2..PASS

[[[ Differential Tests ]]]

Testing 8303632 up-to-5-bit differentials in 64-bit keys -> 64 bit hashes.
1000 reps, 8303632000 total tests, expecting 0.00 random collisions..
0 total collisions, of which 0 single collisions were ignored

Testing 11017632 up-to-4-bit differentials in 128-bit keys -> 64 bit hashes.
1000 reps, 11017632000 total tests, expecting 0.00 random collisions..
0 total collisions, of which 0 single collisions were ignored

Testing 2796416 up-to-3-bit differentials in 256-bit keys -> 64 bit hashes.
1000 reps, 2796416000 total tests, expecting 0.00 random collisions..
0 total collisions, of which 0 single collisions were ignored


[[[ Avalanche Tests ]]]

Testing  32-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.742667%
Testing  40-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.684667%
Testing  48-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.570667%
Testing  56-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.726667%
Testing  64-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.67%
Testing  72-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.659333%
Testing  80-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.666000%
Testing  88-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.712000%
Testing  96-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.632667%
Testing 104-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.646000%
Testing 112-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.746000%
Testing 120-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.692667%
Testing 128-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.684000%
Testing 136-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.758667%
Testing 144-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.725333%
Testing 152-bit keys ->  64-bit hashes,   30 reps.. worst bias is 0.776000%

[[[ Keyset 'Cyclic' Tests ]]]

Keyset 'Cyclic' - 8 cycles of 8 bytes - 1000 keys
Testing collisions   - Expected 0.00, actual 0.00 ( 0.00x)
Testing distribution - Worst bias is the  20-bit window at bit   6 - 0.039%

Keyset 'Cyclic' - 8 cycles of 9 bytes 

[HACKERS] Remove secondary checkpoint

2017-10-24 Thread Simon Riggs
Remove the code that maintained two checkpoint's WAL files and all
associated stuff.

Try to avoid breaking anything else

This
* reduces disk space requirements on master
* removes a minor bug in fast failover
* simplifies code

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


remove_secondary_checkpoint.v4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers