Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-29 Thread Jeff Janes
On Mon, Dec 29, 2014 at 9:12 PM, Peter Geoghegan  wrote:

> On Mon, Dec 29, 2014 at 2:29 PM, Jeff Janes  wrote:
> > Using the vallock2 version of V1.8, using the test I previously
> described, I
> > get some all-null rows, which my code should never create.  Also, the
> index
> > and table don't agree, in this example I find 3 all-null rows in the
> table,
> > but only 2 in the index.
>
> Just to be clear: You haven't found any such issue with approach #1 to
> value locking, right?
>

Correct, I haven't seen any problems with approach #1


>
> I'm curious about how long it took you to see the issue with #2. Were
> there any special steps? What were the exact steps involved in turning
> off the hard crash mechanism you mention?


Generally the problem will occur early on in the process, and if not then
it will not occur at all.  I think that is because the table starts out
empty, and so a lot of insertions collide with each other.  Once the table
is more thoroughly populated, most queries take the CONFLICT branch, and
therefore two insertion branches are unlikely to collide.

At its simplest, I just use the count_upsert.pl script and your patch and
forget all the rest of the stuff from my test platform.

So:

pg_ctl stop -D /tmp/data2; rm /tmp/data2 -r;
../torn_bisect/bin/pg_ctl initdb -D /tmp/data2;
../torn_bisect/bin/pg_ctl start -D /tmp/data2 -o "--fsync=off" -w ;
createdb;
perl count_upsert.pl 8 10

A run of count_upsert.pl 8 10 takes about 30 seconds on my machine (8
core), and if it doesn't create a problem then I just destroy the database
and start over.

The fsync=off is not important; I've seen the problem once without it.  I
just include it because otherwise the run takes a lot longer.

I've attached another version of the count_upsert.pl script, with some more
logging targeted to this particular issue.

The problem shows up like this:

init done at count_upsert.pl line 97.
sum is 1036
count is 9720
seq scan doesn't match index scan  1535 == 1535 and 1 == 6 $VAR1 = [
  [
6535,
-21
  ],
.
(Thousands more lines, as it outputs the entire table twice, once
gathered by seq scan, once by bitmap index scan.)

The first three lines are normal, the problem starts with the "seq scan
doesn't match"...

In this case the first problem it ran into was that key 1535 was present
once with a count column of 1 (found by seq scan) and once with a count
column of 6 (found by index scan).  It was also in the seq scan with a
count of 6, but the way the comparison works is that it sorts each
representation of the table by the key column value and then stops at the
first difference; in this case the count-column comparison 1 == 6 failed the
assertion.

If you get some all-NULL rows, then you will also get Perl warnings issued
when the RETURNING clause starts returning NULL where none is expected.

The overall pattern seems to be pretty streaky.  It could go 20 iterations
with no problem, and then it will fail several times in a row.  I've seen
this pattern quite a bit with other race conditions as well; I think that
they may be sensitive to how memory gets laid out between CPUs, and that
might depend on some longer-term characteristic of the state of the machine
that survives an initdb.

By the way, I also got a new error message a few times that I think might
be a manifestation of the same thing:

ERROR:  duplicate key value violates unique constraint "foo_index_idx"
DETAIL:  Key (index)=(6106) already exists.
STATEMENT:  insert into foo (index, count) values ($2,$1) on conflict
(index)
  update set count=TARGET.count + EXCLUDED.count
returning foo.count

Cheers,

Jeff


count_upsert.pl
Description: Binary data



Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-29 Thread Abhijit Menon-Sen
Hi.

I'm re-attaching the two patches as produced by format-patch. I haven't
listed any reviewers. It's either just Andres, or maybe a lot of people.

Is anyone in a position to try out the patches on MSVC and see if they
build and work sensibly, please? (Otherwise it may be better to remove
those bits from the patch for now.)

Thanks.

-- Abhijit
From d5a90d31f9c35fea2636ec6baf6acc66e91fd655 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Tue, 30 Dec 2014 12:41:19 +0530
Subject: Implement slice-by-8 CRC calculation

The COMP_CRC32C macro now calls pg_comp_crc32c(), which processes eight
data bytes at a time. Its output is identical to the byte-at-a-time CRC
code. (This change does not apply to the LEGACY_CRC32 computation.)

Author: Abhijit Menon-Sen
---
 src/include/utils/pg_crc.h|   6 +-
 src/include/utils/pg_crc_tables.h | 578 +-
 src/port/pg_crc.c |  87 ++
 3 files changed, 604 insertions(+), 67 deletions(-)

diff --git a/src/include/utils/pg_crc.h b/src/include/utils/pg_crc.h
index f871cba..55934e5 100644
--- a/src/include/utils/pg_crc.h
+++ b/src/include/utils/pg_crc.h
@@ -41,6 +41,8 @@
 
 typedef uint32 pg_crc32;
 
+extern pg_crc32 pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len);
+
 /*
  * CRC calculation using the CRC-32C (Castagnoli) polynomial.
  *
@@ -51,7 +53,7 @@ typedef uint32 pg_crc32;
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define FIN_CRC32C(crc)	((crc) ^= 0xFFFFFFFF)
 #define COMP_CRC32C(crc, data, len)	\
-	COMP_CRC32_NORMAL_TABLE(crc, data, len, pg_crc32c_table)
+	((crc) = pg_comp_crc32c((crc), (char *) (data), (len)))
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
 
 /*
@@ -115,7 +117,7 @@ do {			  \
 } while (0)
 
 /* Constant tables for CRC-32C and CRC-32 polynomials */
-extern CRCDLLIMPORT const uint32 pg_crc32c_table[];
+extern CRCDLLIMPORT const uint32 pg_crc32c_table[8][256];
 extern CRCDLLIMPORT const uint32 pg_crc32_table[];
 
 #endif   /* PG_CRC_H */
diff --git a/src/include/utils/pg_crc_tables.h b/src/include/utils/pg_crc_tables.h
index cb6b470..f814c91 100644
--- a/src/include/utils/pg_crc_tables.h
+++ b/src/include/utils/pg_crc_tables.h
@@ -28,71 +28,519 @@
  * This table is based on the so-called Castagnoli polynomial (the same
  * that is used e.g. in iSCSI).
  */
-const uint32 pg_crc32c_table[256] = {
-	0x, 0xF26B8303, 0xE13B70F7, 0x1350F3F4,
-	0xC79A971F, 0x35F1141C, 0x26A1E7E8, 0xD4CA64EB,
-	0x8AD958CF, 0x78B2DBCC, 0x6BE22838, 0x9989AB3B,
-	0x4D43CFD0, 0xBF284CD3, 0xAC78BF27, 0x5E133C24,
-	0x105EC76F, 0xE235446C, 0xF165B798, 0x030E349B,
-	0xD7C45070, 0x25AFD373, 0x36FF2087, 0xC494A384,
-	0x9A879FA0, 0x68EC1CA3, 0x7BBCEF57, 0x89D76C54,
-	0x5D1D08BF, 0xAF768BBC, 0xBC267848, 0x4E4DFB4B,
-	0x20BD8EDE, 0xD2D60DDD, 0xC186FE29, 0x33ED7D2A,
-	0xE72719C1, 0x154C9AC2, 0x061C6936, 0xF477EA35,
-	0xAA64D611, 0x580F5512, 0x4B5FA6E6, 0xB93425E5,
-	0x6DFE410E, 0x9F95C20D, 0x8CC531F9, 0x7EAEB2FA,
-	0x30E349B1, 0xC288CAB2, 0xD1D83946, 0x23B3BA45,
-	0xF779DEAE, 0x05125DAD, 0x1642AE59, 0xE4292D5A,
-	0xBA3A117E, 0x4851927D, 0x5B016189, 0xA96AE28A,
-	0x7DA08661, 0x8FCB0562, 0x9C9BF696, 0x6EF07595,
-	0x417B1DBC, 0xB3109EBF, 0xA0406D4B, 0x522BEE48,
-	0x86E18AA3, 0x748A09A0, 0x67DAFA54, 0x95B17957,
-	0xCBA24573, 0x39C9C670, 0x2A993584, 0xD8F2B687,
-	0x0C38D26C, 0xFE53516F, 0xED03A29B, 0x1F682198,
-	0x5125DAD3, 0xA34E59D0, 0xB01EAA24, 0x42752927,
-	0x96BF4DCC, 0x64D4CECF, 0x77843D3B, 0x85EFBE38,
-	0xDBFC821C, 0x2997011F, 0x3AC7F2EB, 0xC8AC71E8,
-	0x1C661503, 0xEE0D9600, 0xFD5D65F4, 0x0F36E6F7,
-	0x61C69362, 0x93AD1061, 0x80FDE395, 0x72966096,
-	0xA65C047D, 0x5437877E, 0x4767748A, 0xB50CF789,
-	0xEB1FCBAD, 0x197448AE, 0x0A24BB5A, 0xF84F3859,
-	0x2C855CB2, 0xDEEEDFB1, 0xCDBE2C45, 0x3FD5AF46,
-	0x7198540D, 0x83F3D70E, 0x90A324FA, 0x62C8A7F9,
-	0xB602C312, 0x44694011, 0x5739B3E5, 0xA55230E6,
-	0xFB410CC2, 0x092A8FC1, 0x1A7A7C35, 0xE811FF36,
-	0x3CDB9BDD, 0xCEB018DE, 0xDDE0EB2A, 0x2F8B6829,
-	0x82F63B78, 0x709DB87B, 0x63CD4B8F, 0x91A6C88C,
-	0x456CAC67, 0xB7072F64, 0xA457DC90, 0x563C5F93,
-	0x082F63B7, 0xFA44E0B4, 0xE9141340, 0x1B7F9043,
-	0xCFB5F4A8, 0x3DDE77AB, 0x2E8E845F, 0xDCE5075C,
-	0x92A8FC17, 0x60C37F14, 0x73938CE0, 0x81F80FE3,
-	0x55326B08, 0xA759E80B, 0xB4091BFF, 0x466298FC,
-	0x1871A4D8, 0xEA1A27DB, 0xF94AD42F, 0x0B21572C,
-	0xDFEB33C7, 0x2D80B0C4, 0x3ED04330, 0xCCBBC033,
-	0xA24BB5A6, 0x502036A5, 0x4370C551, 0xB11B4652,
-	0x65D122B9, 0x97BAA1BA, 0x84EA524E, 0x7681D14D,
-	0x2892ED69, 0xDAF96E6A, 0xC9A99D9E, 0x3BC21E9D,
-	0xEF087A76, 0x1D63F975, 0x0E330A81, 0xFC588982,
-	0xB21572C9, 0x407EF1CA, 0x532E023E, 0xA145813D,
-	0x758FE5D6, 0x87E466D5, 0x94B49521, 0x66DF1622,
-	0x38CC2A06, 0xCAA7A905, 0xD9F75AF1, 0x2B9CD9F2,
-	0xFF56BD19, 0x0D3D3E1A, 0x1E6DCDEE, 0xEC064EED,
-	0xC38D26C4, 0x31E6A5C7, 0x22B65633, 0xD0DDD530,
-	0x0417B1DB, 0xF67C32D8, 0xE52CC12C, 0x1747422F,
-	0x49547E0B, 0xBB3FFD08, 0xA86F0EFC, 0x5A048DFF,
-	0x8ECEE914, 0x7CA56A17, 0x6FF599E3, 0x9D9E1AE0,
-	0xD3D3E1AB, 0x21B862A

Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-29 Thread Abhijit Menon-Sen
At 2014-12-29 18:44:18 +0530, a...@2ndquadrant.com wrote:
>
> > > +#ifdef __GNUC__
> > > + __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data));
> > 
> > Have you checked which version of gcc introduced named references to
> > input/output parameters?

OK, here we go:

«As of GCC version 3.1, it is also possible to specify input and
output operands using symbolic names which can be referenced within
the assembler code.»

GCC 3.1 was released on May 15, 2002. So it should be quite safe to use
this feature.
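
For anyone unfamiliar with the syntax, here is a minimal illustration of
named operands (not taken from the patch itself; it assumes GCC >= 3.1 and a
CPU with SSE4.2 for the crc32b instruction):

#include <stdint.h>

/* CRC one byte using crc32b; the operands are referenced by the symbolic
 * names [crc] and [data] instead of the positional %0 and %1. */
static inline uint32_t
crc32c_one_byte(uint32_t crc, uint8_t data)
{
	__asm__("crc32b %[data], %[crc]"
			: [crc] "+r" (crc)		/* read/write operand */
			: [data] "rm" (data));	/* input operand */
	return crc;
}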

-- Abhijit




Re: [HACKERS] Additional role attributes && superuser review

2014-12-29 Thread Amit Kapila
On Tue, Dec 30, 2014 at 6:52 AM, Stephen Frost  wrote:
> > >There is one issue that occurs to me, however.  We're talking about
> > >pg_dump, but what about pg_dumpall?  In particular, I don't think the
> > >DUMP privilege should provide access to pg_authid, as that would allow
> > >the user to bypass the privilege system in some environments by using
> > >the hash to log in as a superuser.  Now, I don't encourage using
> > >password based authentication, especially for superuser accounts, but
> > >lots of people do.  The idea with these privileges is to allow certain
> > >operations to be performed by a non-superuser while preventing trivial
> > >access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
> > >achieve that.
> >
> > Ugh, hadn't thought about that. :(
>
> I don't think it kills the whole idea, but we do need to work out how to
> address it.
>

One way to address this might be to allow it only for superusers; another
could be to have a separate privilege (DBA), or a role which is not a
superuser but can be used to perform such tasks.

> > >>* BACKUP - allows role to perform backup operations (as originally
proposed)
> > >>* LOG - allows role to rotate log files - remains broad enough to
consider
> > >>future log related operations
> > >>* DUMP -  allows role to perform pg_dump* backups of whole database
> > >>* MONITOR - allows role to view pg_stat_* details (as originally
proposed)
> > >>* PROCSIGNAL - allows role to signal backend processes (as originally
> > >>proposed)well
> >
> > Given the confusion that can exist with the data reading stuff, perhaps
BINARY BACKUP or XLOG would be a better term, especially since this will
probably have some overlap with streaming replication.
>
> I don't really like 'xlog' as a permission.  BINARY BACKUP is
> interesting but if we're going to go multi-word, I would think we would
> do PITR BACKUP or something FILESYSTEM BACKUP or similar.  I'm not
> really a big fan of the two-word options though.
>

How about PHYSICAL BACKUP (for basebackup) and LOGICAL BACKUP
for pg_dump?

This is normally the terminology I have seen people using for these
backups.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-29 Thread Peter Geoghegan
On Mon, Dec 29, 2014 at 2:29 PM, Jeff Janes  wrote:
> Using the vallock2 version of V1.8, using the test I previously described, I
> get some all-null rows, which my code should never create.  Also, the index
> and table don't agree, in this example I find 3 all-null rows in the table,
> but only 2 in the index.

Just to be clear: You haven't found any such issue with approach #1 to
value locking, right?

I'm curious about how long it took you to see the issue with #2. Were
there any special steps? What were the exact steps involved in turning
off the hard crash mechanism you mention? It looks like the condition
you describe ought to be highlighted by the script automatically. Is
that right? (I don't know any Perl and the script isn't really
documented at a high level).

Thanks
-- 
Peter Geoghegan




Re: ctidscan as an example of custom-scan (Re: [HACKERS] [v9.5] Custom Plan API)

2014-12-29 Thread Jim Nasby
On 12/24/14, 7:54 PM, Kouhei Kaigai wrote:
>>> On Mon, Dec 15, 2014 at 4:55 PM, Kouhei Kaigai 
>> wrote:
 I'm not certain whether we should have this functionality in contrib
 from the perspective of workload that can help, but its major worth
 is for an example of custom-scan interface.
>>> worker_spi is now in src/test/modules. We may add it there as well, no?
>>>
>> Hmm, it makes sense for me. Does it also make sense to add a test-case to
>> the core regression test cases?
>>
> The attached patch adds ctidscan module at test/module instead of contrib.
> Basic portion is not changed from the previous post, but file locations
> and test cases in regression test are changed.

First, I'm glad for this. It will be VERY valuable for anyone trying to clean 
up the end of a majorly bloated table.

Here's a partial review...

+++ b/src/test/modules/ctidscan/ctidscan.c

+/* missing declaration in pg_proc.h */
+#ifndef TIDGreaterOperator
+#define TIDGreaterOperator 2800
...
If we're calling that out here, should we have a corresponding comment in 
pg_proc.h, in case these ever get renumbered?

+CTidQualFromExpr(Node *expr, int varno)
...
+   if (IsCTIDVar(arg1, varno))
+   other = arg2;
+   else if (IsCTIDVar(arg2, varno))
+   other = arg1;
+   else
+   return NULL;
+   if (exprType(other) != TIDOID)
+   return NULL;/* should not happen */
+   /* The other argument must be a pseudoconstant */
+   if (!is_pseudo_constant_clause(other))
+   return NULL;

I think this needs some additional blank lines...

+   if (IsCTIDVar(arg1, varno))
+   other = arg2;
+   else if (IsCTIDVar(arg2, varno))
+   other = arg1;
+   else
+   return NULL;
+
+   if (exprType(other) != TIDOID)
+   return NULL;/* should not happen */
+
+   /* The other argument must be a pseudoconstant */
+   if (!is_pseudo_constant_clause(other))
+   return NULL;

+ * CTidEstimateCosts
+ *
+ * It estimates cost to scan the target relation according to the given
+ * restriction clauses. Its logic to scan relations are almost same as
+ * SeqScan doing, because it uses regular heap_getnext(), except for
+ * the number of tuples to be scanned if restriction clauses work well.

That should read "same as what SeqScan is doing"... however, what 
actual function are you talking about? I couldn't find SeqScanEstimateCosts (or 
anything ending EstimateCosts).

BTW, there are other grammar issues, but it'd be best to handle those all at once 
after all the code stuff is done.

+   opno = get_commutator(op->opno);
What happens if there's no commutator? Perhaps best to Assert(opno != 
InvalidOid) at the end of that if block.

Though, it seems things will fall apart anyway if ctid_quals isn't exactly what 
we're expecting; I don't know if that's OK or not.

+   /* disk costs --- assume each tuple on a different page */
+   run_cost += spc_random_page_cost * ntuples;
Isn't that extremely pessimistic?

I'm not familiar enough with the custom-scan stuff to really comment past this 
point, and I could certainly be missing some details about planning and 
execution.

I do have some concerns about the regression test, but perhaps I'm just being 
paranoid:

- The EXPLAIN tests don't cover >. They do cover <= and >= via BETWEEN.
- Nothing tests the case of '...'::tid op ctid; only lvalue cases are tested.

Also, it seems that we don't handle joining on a ctid qual... is that 
intentional? I know that sounds silly, but you'd probably want to do that if 
you're trying to move tuples off the end of a bloated table. You could work 
around it by constructing a dynamic SQL string, but it'd be easier to do 
something like:

UPDATE table1 SET ...
  WHERE ctid >= (SELECT '(' || relpages || ',0)' FROM pg_class WHERE oid = 
'table1'::regclass)
;

in some kind of loop.

Obviously better to only handle what you already are than not get this in at 
all though... :)
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Jim Nasby  wrote:
> On 12/29/14, 10:53 AM, Kevin Grittner wrote:
>> Merlin Moncure  wrote:
>>
>>> Serialization errors only exist as a concession to concurrency
>>> and performance.  Again, they should be returned as sparsely as
>>> possible because they provide absolutely (as Tom pointed
>>> out) zero detail to the application.
>>
>> That is false.  They provide an *extremely* valuable piece of
>> information which is not currently available when you get a
>> duplicate key error -- whether the error occurred because of a
>> race condition and will not fail for the same cause if retried.
>
> As for missing details like the duplicated key value, is there a
> reasonable way to add that as an errdetail() to a serialization
> failure error?

Sure.  What we've been discussing is the case where a unique index
is checked for a duplicate whose visibility would cause the check to
fail; the question is whether, at the serializable transaction
isolation level, to check whether the committed transaction which
created the existing entry overlaps the transaction checking for
duplicates, and if so use a different SQLSTATE.  Anything available
for display in the detail
currently would still be available.  We give different detail
messages for different types of serialization failures; it would
seem a little silly not to mention the table and the duplicate
values for this type of serialization failure.  My guess is that in
production this would be completely ignored by the software, but
logging it might be helpful in tuning.
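
To make that concrete, here is a purely hypothetical sketch of what such a
report might look like; none of these names come from an actual patch, and
index_name and key_desc are illustrative placeholders:

#include "postgres.h"

/* Hypothetical sketch only: raise a serialization failure, rather than a
 * unique violation, while still carrying the key and index in the detail. */
static void
report_duplicate_as_serialization_failure(const char *index_name,
										  const char *key_desc)
{
	ereport(ERROR,
			(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
			 errmsg("could not serialize access due to a concurrent duplicate key insertion"),
			 errdetail("Key %s conflicts with a row inserted by a concurrent transaction in index \"%s\".",
					   key_desc, index_name),
			 errhint("The transaction might succeed if retried.")));
}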

> We do still have to be careful here though; you could still have
> code using our documented upsert methodology inside a serialized
> transaction. For example, via a 3rd party extension that can't
> assume you're using serialized transactions. Would it still be OK
> to treat that as a serialization failure and bubble that all the
> way back to the application?

If the current upsert function example were used, unmodified, by
serializable transactions with the suggested changes, it would not
loop back for another try within the function, but generate a
serialization failure for the transaction as a whole.  If there was
a framework watching for that and retrying transactions when it is
seen, the transaction would start over and see the new state and
behave correctly.  Without the change I'm proposing, I bet it would
loop endlessly, until interrupted.  [tries it]  Yep, there it is.
All the hand-wringing about how we might break existing
applications is entirely backwards.  Under either repeatable read 
or serializable that technique goes to 100% of one CPU until you 
cancel it -- completely broken.  If it generated a serialization 
failure the code in our docs would throw the serialization failure 
out to the client, which (in most environments using serializable 
transactions, or even repeatable read) would retry the transaction 
from the beginning, with much less pain.  It makes me wonder 
whether it would be better to generate the serialization failure 
for both repeatable read and serializable, since those are the 
cases where the sample code in our docs generates the endless loop.
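
For reference, the client-side behavior I am assuming looks roughly like
this minimal single-statement sketch (a real framework would restart the
whole transaction, not just one statement):

#include <string.h>
#include <libpq-fe.h>

/* Run one statement, retrying from scratch on SQLSTATE 40001
 * (serialization_failure).  Caller must PQclear() the result. */
static PGresult *
exec_with_retry(PGconn *conn, const char *sql)
{
	for (;;)
	{
		PGresult   *res = PQexec(conn, sql);
		const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

		if (sqlstate == NULL || strcmp(sqlstate, "40001") != 0)
			return res;			/* success, or an error we should not retry */

		PQclear(res);			/* serialization failure: try again */
	}
}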

> As part of this, we should probably modify our upsert example so
> it takes transaction_isolation into account...

Probably.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2014-12-29 Thread Andreas Karlsson

Hi,

Here is my review of the feature.

- I have not worked enough with tablespaces to see if this patch would 
be useful. There was some uncertainty about this upstream.


- Would it not be enough to simply allow running ALTER TABLE SET 
TABLESPACE on the TOAST tables? Or is manually modifying those too ugly?


- I like that it adds tab completion for the new commands.

- The feature seems to work as described, other than the ALTER INDEX 
issue noted below and that it broke pg_dump. The broken pg_dump means I 
have not tested how this changes database dumps.


- A test fails in create_view.out. I looked into it a little and did not see 
how this could happen.


*** /home/andreas/dev/postgresql/src/test/regress/expected/create_view.out	2014-08-09 01:35:50.008886776 +0200
--- /home/andreas/dev/postgresql/src/test/regress/results/create_view.out	2014-12-30 00:41:17.966525339 +0100
***************
*** 283,289 ****
     relname   | relkind |        reloptions
  ------------+---------+--------------------------
   mysecview1 | v       | 
!  mysecview2 | v       | 
   mysecview3 | v       | {security_barrier=true}
   mysecview4 | v       | {security_barrier=false}
  (4 rows)
--- 283,289 ----
     relname   | relkind |        reloptions
  ------------+---------+--------------------------
   mysecview1 | v       | 
!  mysecview2 | v       | {security_barrier=true}
   mysecview3 | v       | {security_barrier=true}
   mysecview4 | v       | {security_barrier=false}
  (4 rows)

- pg_dump is broken

pg_dump: [archiver (db)] query failed: ERROR:  syntax error at or near "("
LINE 1: ...nest(tc.reloptions) x), ', ') AS toast_reloptions (SELECT sp...


- "ALTER INDEX foo_idx SET TABLE TABLESPACE ..." should not be allowed, 
currently it changes the tablespace of the index. I do not think "ALTER 
INDEX foo_idx SET (TABLE|TOAST) TABLESPACE ..." should even be allowed 
in the grammar.


- You should add a link to 
http://www.postgresql.org/docs/current/interactive/storage-toast.html to 
the manual page of ALTER TABLE.


- I do not like how \d handles the toast tablespace. Having TOAST in 
pg_default and the table in another space looks the same as if there was 
no TOAST table at all. I think we should always print both tablespaces 
if either of them is not pg_default.


- Would it be interesting to add syntax for moving the toast index to a 
separate tablespace?


- There is no warning if you set the TOAST tablespace of a table 
lacking a TOAST table. I think there should be one.


- No support for materialized views as pointed out by Alex.

- I think the code would be cleaner if ATPrepSetTableSpace and 
ATPrepSetToastTableSpace were either merged into one function or split 
cleanly in two: one setting the tablespace of the table and one setting 
it on the TOAST table.


- I think a couple of tests for the error cases would be nice.

= Minor style issue

- Missing periods on the ALTER TABLE manual page after "See also CREATE 
TABLESPACE" (in two places).


- Missing period last in the new paragraph added to the storage manual page.

- Double spaces in src/backend/catalog/toasting.c after "if (new_toast".

- The comment "and if this is not a TOAST re-creation case (through 
CLUSTER)." incorrectly implies that CLUSTER is the only case where the 
TOAST table is recreated.


- The local variables ToastTableSpace and ToastRel do not follow the 
naming conventions.


- Remove the parentheses around "(is_system_catalog)".

- Why was the "Save info for Phase 3 to do the real work" comment 
changed to "XXX: Save info for Phase 3 to do the real work"?


- Incorrect indentation for new code added last in ATExecSetTableSpace.

- The patch adds commented out code in src/include/commands/tablecmds.h.

--
Andreas Karlsson




Re: [HACKERS] nls and server log

2014-12-29 Thread Jim Nasby

On 12/29/14, 7:40 PM, Craig Ringer wrote:

On 12/30/2014 06:39 AM, Jim Nasby wrote:




How much of this issue is caused by trying to machine-parse log files?
Is a better option to improve that case, possibly doing something like
including a field in each line that tells you the encoding for that entry?


That'd be absolutely ghastly. You couldn't just view the logs with
'less' or a text editor if your logs had mixed encodings, you'd need
some kind of special PostgreSQL log viewer tool.


I was specifically talking about logs intended for machine reading (ie: CSV), 
not human reading.

Similar to how client logging (where encoding is a lot more important) and 
server logging aren't exactly the same use case, human read logs vs something 
for a machine to read aren't the same thing either.

BTW, before someone makes an argument for using tools like cut or grep with 
CSV, that actually falls apart spectacularly at the first multi-line log 
message. I think that's just another example that trying to make one logfile 
serve two different purposes just won't work well.

Perhaps the solution here is to include a tool that makes it easier to deal 
with CSV logs, including encoding. I've certainly wished for such a tool to 
allow me to effectively deal with CSV logs in a way that didn't necessitate 
loading them into a table.


Why would we possibly do that when we could just emit utf-8 instead?


What happens if we get a translation/encoding failure (the case Tom's worried 
about)?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-29 Thread Peter Geoghegan
On Mon, Dec 29, 2014 at 2:29 PM, Jeff Janes  wrote:
> I've attached an example output of querying via index and via full table
> scan, and also the pageinspect output of the blocks which have the 3 rows,
> in case that is helpful.

You might have also included output from pageinspect's bt_page_items()
function. Take a look at the documentation patch I just posted if the
details are unclear.

Thanks
-- 
Peter Geoghegan




[HACKERS] Documentation of bt_page_items()'s ctid field

2014-12-29 Thread Peter Geoghegan
Attached documentation patch describes the purpose of
bt_page_items()'s ctid field. This has come up enough times in
disaster recovery or testing scenarios that I feel it's worth drawing
particular attention to.

-- 
Peter Geoghegan
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 6f51dc6..433bcb2 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -192,6 +192,13 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
   7 | (0,7)   |  12 | f | f| 29 27 00 00
   8 | (0,8)   |  12 | f | f| 2a 27 00 00
 
+  Note that ctid addresses a heap tuple if the
+  page under consideration is a B-Tree leaf page.  Otherwise, for
+  internal B-Tree pages ctid addresses a page in
+  the B-Tree itself (excluding the root page if and only if there
+  has not yet been a root page split, as in the example above).
+  These internally referenced pages are child pages, and may
+  themselves be leaf pages or internal pages.
  
 




Re: [HACKERS] Additional role attributes && superuser review

2014-12-29 Thread Jim Nasby

On 12/29/14, 7:22 PM, Stephen Frost wrote:

One other point is that this shouldn't imply any other privileges, imv.
> >I'm specifically thinking of BYPASSRLS- that's independently grantable
> >and therefore should be explicitly set, if it's intended.  Things
> >should work 'sanely' with any combination of the two options.

>
>That does violate POLA, but it's probably worth doing so...

That really depends on your view of the privilege.  I'm inclined to view
it from the more-tightly-constrained point of view which matches up with
what I anticipate it actually providing.  That doesn't translate into a
short name very well though, unfortunately.


READ MOST? ;)


> >The read-all vs. ability-to-pg_dump distinction doesn't really exist for
> >role attributes, as I see it (see my comments above).  That said, having
> >DUMP or read-all is different from read-*only*, which would probably be
> >good to have independently.  I can imagine a use-case for a read-only
> >account which only has read ability for those tables, schemas, etc,
> >explicitly granted to it.

>
>My specific concern about DUMP vs read all/only is IIRC dump needs to do more 
extensive locking than regular selects do, which can cause production problems.

The locks aren't any more extensive than regular selects, it's just that
the entire pg_dump works inside of one large transaction which lasts a
good long time and simply that can cause issues with high-rate update
systems.


Good, so really DUMP and READ ALL are the same.


> >There is one issue that occurs to me, however.  We're talking about
> >pg_dump, but what about pg_dumpall?  In particular, I don't think the
> >DUMP privilege should provide access to pg_authid, as that would allow
> >the user to bypass the privilege system in some environments by using
> >the hash to log in as a superuser.  Now, I don't encourage using
> >password based authentication, especially for superuser accounts, but
> >lots of people do.  The idea with these privileges is to allow certain
> >operations to be performed by a non-superuser while preventing trivial
> >access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
> >achieve that.

>
>Ugh, hadn't thought about that.:(

I don't think it kills the whole idea, but we do need to work out how to
address it.


Yeah... it does indicate that we shouldn't call this thing DUMP, but perhaps as 
long as we put appropriate HINTS in the error paths that pg_dumpall would hit 
it's OK to call it DUMP. (My specific concern is it's natural to assume that 
DUMP would include pg_dumpall).



>Given the confusion that can exist with the data reading stuff, perhaps BINARY 
BACKUP or XLOG would be a better term, especially since this will probably have 
some overlap with streaming replication.

I don't really like 'xlog' as a permission.  BINARY BACKUP is
interesting but if we're going to go multi-word, I would think we would
do PITR BACKUP or something FILESYSTEM BACKUP or similar.  I'm not
really a big fan of the two-word options though.


BINARY? Though that's pretty generic. STREAM might be better, even though it's 
not exactly accurate for a simple backup.

Perhaps this is just a documentation issue, but there are enough caveats floating 
around here that I'm reluctant to rely on that.


>Likewise, if we segregate "DUMP" and BYPASSRLS then I think we need to call DUMP 
something else. Otherwise, it's a massive foot-gun; you get a "successful" backup only to 
find out it contains only a small part of the database.

That's actually already dealt with.  pg_dump defaults to setting the
row_security GUC to 'off', in which case you'll get an error if you hit
a table that has RLS enabled and you don't have BYPASSRLS.  If you're
not checking for errors from pg_dump, well, there's a lot of ways you
could run into problems.


This also indicates DUMP is better than something like READ ALL, if we can 
handle the pg_dumpall case.

Based on all this, my inclination is DUMP with appropriate hints for 
pg_dumpall, and BINARY.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Detecting backend failures via libpq / DBD::Pg

2014-12-29 Thread Jim Nasby

On 12/29/14, 6:43 PM, Greg Sabino Mullane wrote:

I am working on enhancing the ping() method of DBD::Pg. The goal of that
is for a user to be able to determine if the connection to the database
is still valid.


This is actually a VERY common thing for monitoring frameworks to do. 
Currently, the general method seems to be something akin to

psql -c 'SELECT 1' ...

Perhaps, instead of going through all the effort that you currently are, we could 
add a FEBE ping command and expose it through a special psql option (or 
maybe a special pg_ping command).

This won't be noticeably faster than SELECT 1, but it would prevent a bunch of 
pointless work on the backend side, and should greatly simplify DBD's ping(). 
The only thing I'm not sure of is whether this could be made safe within a COPY... 
:(
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] nls and server log

2014-12-29 Thread Craig Ringer
On 12/30/2014 06:39 AM, Jim Nasby wrote:
>>
> 
> How much of this issue is caused by trying to machine-parse log files?
> Is a better option to improve that case, possibly doing something like
> including a field in each line that tells you the encoding for that entry?

That'd be absolutely ghastly. You couldn't just view the logs with
'less' or a text editor if your logs had mixed encodings, you'd need
some kind of special PostgreSQL log viewer tool.

Why would we possibly do that when we could just emit utf-8 instead?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] CATUPDATE confusion?

2014-12-29 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
> Given this discussion, I have attached a patch that removes CATUPDATE for
> review/discussion.

Thanks!

> One of the interesting behaviors (or perhaps not) is how 'pg_class_aclmask'
> handles an invalid role id when checking permissions against 'rolsuper'
> instead of 'rolcatupdate'.  This is demonstrated by the
> 'has_table_privilege' regression test expected results.  In summary,
> 'has_rolcatupdate' raises an error when an invalid OID is provided,
> however, as documented in the source 'superuser_arg' does not, simply
> returning false.  Therefore, when checking for superuser-ness of a
> non-existent role, the result will be false and not an error.  Perhaps this
> is OK, but my concern would be on the expected behavior around items like
> 'has_table_privilege'.  My natural thought would be that we would want to
> preserve that specific functionality, though short of adding a
> 'has_rolsuper' function that will raise an appropriate error, I'm uncertain
> of an approach.  Thoughts?

This role oid check is only getting called in this case because it's
pg_authid which is getting checked (because it's a system catalog table)
and it's then falling into this one case.  I don't think we need to
preserve that.

If you do the same query with that bogus OID but a normal table, you get
the same 'f' result that the regression test gives with this change.
We're not back-patching this and I seriously doubt anyone is relying on
this special response for system catalog tables.

So, unless anyone wants to object, I'll continue to look over this patch
for eventual application against master.  If there are objections, it'd
be great if they could be voiced sooner rather than later.

Thanks!

Stephen




Re: [HACKERS] Additional role attributes && superuser review

2014-12-29 Thread Stephen Frost
Jim,

* Jim Nasby (jim.na...@bluetreble.com) wrote:
> On 12/29/14, 4:01 PM, Stephen Frost wrote:
> >That said, a 'DUMP' privilege which allows the user to dump the contents
> >of the entire database is entirely reasonable.  We need to be clear in
> >the documentation though- such a 'DUMP' privilege is essentially
> >granting USAGE on all schemas and table-level SELECT on all tables and
> >sequences (anything else..?).  To be clear, a user with this privilege
> >can utilize that access without using pg_dump.
> 
> Yeah... it may be better to call this something other than DUMP (see below).

Did you favor something else below..?  I don't see it.

> >One other point is that this shouldn't imply any other privileges, imv.
> >I'm specifically thinking of BYPASSRLS- that's independently grantable
> >and therefore should be explicitly set, if it's intended.  Things
> >should work 'sanely' with any combination of the two options.
> 
> That does violate POLA, but it's probably worth doing so...

That really depends on your view of the privilege.  I'm inclined to view
it from the more-tightly-constrained point of view which matches up with
what I anticipate it actually providing.  That doesn't translate into a
short name very well though, unfortunately.

> >The read-all vs. ability-to-pg_dump distinction doesn't really exist for
> >role attributes, as I see it (see my comments above).  That said, having
> >DUMP or read-all is different from read-*only*, which would probably be
> >good to have independently.  I can imagine a use-case for a read-only
> >account which only has read ability for those tables, schemas, etc,
> >explicitly granted to it.
> 
> My specific concern about DUMP vs read all/only is IIRC dump needs to do more 
> extensive locking than regular selects do, which can cause production 
> problems.

The locks aren't any more extensive than regular selects, it's just that
the entire pg_dump works inside of one large transaction which lasts a
good long time and simply that can cause issues with high-rate update
systems.

> >There is one issue that occurs to me, however.  We're talking about
> >pg_dump, but what about pg_dumpall?  In particular, I don't think the
> >DUMP privilege should provide access to pg_authid, as that would allow
> >the user to bypass the privilege system in some environments by using
> >the hash to log in as a superuser.  Now, I don't encourage using
> >password based authentication, especially for superuser accounts, but
> >lots of people do.  The idea with these privileges is to allow certain
> >operations to be performed by a non-superuser while preventing trivial
> >access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
> >achieve that.
> 
> Ugh, hadn't thought about that. :(

I don't think it kills the whole idea, but we do need to work out how to
address it.

> >>* BACKUP - allows role to perform backup operations (as originally proposed)
> >>* LOG - allows role to rotate log files - remains broad enough to consider
> >>future log related operations
> >>* DUMP -  allows role to perform pg_dump* backups of whole database
> >>* MONITOR - allows role to view pg_stat_* details (as originally proposed)
> >>* PROCSIGNAL - allows role to signal backend processes (as originally
> >>proposed)well
> 
> Given the confusion that can exist with the data reading stuff, perhaps 
> BINARY BACKUP or XLOG would be a better term, especially since this will 
> probably have some overlap with streaming replication.

I don't really like 'xlog' as a permission.  BINARY BACKUP is
interesting but if we're going to go multi-word, I would think we would
do PITR BACKUP or something FILESYSTEM BACKUP or similar.  I'm not
really a big fan of the two-word options though.

> Likewise, if we segregate "DUMP" and BYPASSRLS then I think we need to call 
> DUMP something else. Otherwise, it's a massive foot-gun; you get a 
> "successful" backup only to find out it contains only a small part of the 
> database.

That's actually already dealt with.  pg_dump defaults to setting the
row_security GUC to 'off', in which case you'll get an error if you hit
a table that has RLS enabled and you don't have BYPASSRLS.  If you're
not checking for errors from pg_dump, well, there's a lot of ways you
could run into problems.

> My how this has become a can of worms...

I'm not sure it's as bad as all that, but it does require some
thinking..

Thanks,

Stephen




Re: [HACKERS] Detecting backend failures via libpq / DBD::Pg

2014-12-29 Thread Andrew Gierth
> "Greg" == Greg Sabino Mullane  writes:

 Greg> I am working on enhancing the ping() method of DBD::Pg. The
 Greg> goal of that is for a user to be able to determine if the
 Greg> connection to the database is still valid. The basic way we do
 Greg> this is to send a simple SELECT via PQexec

Why not PQexec(conn, "") ?
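
(For the record, an empty query string still makes a full round trip to the
server without parsing or executing anything, and on success the result
status is PGRES_EMPTY_QUERY.  A minimal sketch:)

#include <libpq-fe.h>

/* "Ping" with an empty query; returns 1 if the server answered. */
static int
ping_empty_query(PGconn *conn)
{
	PGresult   *res = PQexec(conn, "");
	int			alive = (res != NULL &&
						 PQresultStatus(res) == PGRES_EMPTY_QUERY);

	PQclear(res);
	return alive;
}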

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Additional role attributes && superuser review

2014-12-29 Thread Jim Nasby

On 12/29/14, 4:01 PM, Stephen Frost wrote:

Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:

I'd suggest it's called DUMP if that's what it allows, to keep it separate
from the backup parts.


Makes sense to me.


I'm fine calling it 'DUMP', but for different reasons.

We have no (verifiable) idea what client program is being used to
connect and therefore we shouldn't try to tie the name of the client
program to the permission.

That said, a 'DUMP' privilege which allows the user to dump the contents
of the entire database is entirely reasonable.  We need to be clear in
the documentation though- such a 'DUMP' privilege is essentially
granting USAGE on all schemas and table-level SELECT on all tables and
sequences (anything else..?).  To be clear, a user with this privilege
can utilize that access without using pg_dump.


Yeah... it may be better to call this something other than DUMP (see below).


One other point is that this shouldn't imply any other privileges, imv.
I'm specifically thinking of BYPASSRLS- that's independently grantable
and therefore should be explicitly set, if it's intended.  Things
should work 'sanely' with any combination of the two options.


That does violate POLA, but it's probably worth doing so...


Similairly, DUMP shouldn't imply BACKUP or visa-versa.  In particular,
I'd like to see roles which have only the BACKUP privilege be unable to
directly read any data (modulo things granted to PUBLIC).  This would
allow an unprivileged user to manage backups, kick off ad-hoc ones, etc,
without being able to actually access any of the data (this would
require the actual backup system to have a similar control, but that's
entirely possible with more advanced SANs and enterprise backup
solutions).


Yes, since a binary backup doesn't need to actually "read" data, it shouldn't 
implicitly allow a user granted that role to either.


Given this clarification:

I think #1 could certainly be answered by using DUMP.  I have no strong
opinion in either direction, though I do think that BACKUP does make the
most sense for #2.  Previously, Stephen had mentioned a READONLY capability
that could effectively work for pg_dump, though, Jim's suggestion of
keeping 'read-all' separate from 'ability to pg_dump' seems logical.  In
either case, I certainly wouldn't mind having a wider agreement/consensus
on this approach.


The read-all vs. ability-to-pg_dump distinction doesn't really exist for
role attributes, as I see it (see my comments above).  That said, having
DUMP or read-all is different from read-*only*, which would probably be
good to have independently.  I can imagine a use-case for a read-only
account which only has read ability for those tables, schemas, etc,
explicitly granted to it.


My specific concern about DUMP vs read all/only is IIRC dump needs to do more 
extensive locking than regular selects do, which can cause production problems.


There is one issue that occurs to me, however.  We're talking about
pg_dump, but what about pg_dumpall?  In particular, I don't think the
DUMP privilege should provide access to pg_authid, as that would allow
the user to bypass the privilege system in some environments by using
the hash to log in as a superuser.  Now, I don't encourage using
password based authentication, especially for superuser accounts, but
lots of people do.  The idea with these privileges is to allow certain
operations to be performed by a non-superuser while preventing trivial
access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
achieve that.


Ugh, hadn't thought about that. :(


So, here is a revised list of proposed attributes:

* BACKUP - allows role to perform backup operations (as originally proposed)
* LOG - allows role to rotate log files - remains broad enough to consider
future log related operations
* DUMP -  allows role to perform pg_dump* backups of whole database
* MONITOR - allows role to view pg_stat_* details (as originally proposed)
* PROCSIGNAL - allows role to signal backend processes (as originally
proposed)well


Given the confusion that can exist with the data reading stuff, perhaps BINARY 
BACKUP or XLOG would be a better term, especially since this will probably have 
some overlap with streaming replication.

Likewise, if we segregate "DUMP" and BYPASSRLS then I think we need to call DUMP 
something else. Otherwise, it's a massive foot-gun; you get a "successful" backup only to 
find out it contains only a small part of the database.

My how this has become a can of worms...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




[HACKERS] Detecting backend failures via libpq / DBD::Pg

2014-12-29 Thread Greg Sabino Mullane


I am working on enhancing the ping() method of DBD::Pg. The goal of that 
is for a user to be able to determine if the connection to the database 
is still valid. The basic way we do this is to send a simple SELECT via 
PQexec and then check for a valid return value (and when in doubt, we check 
PQstatus). This works fine for most transaction statuses, including idle, 
active, and idle in transaction. It even works for copy in and copy out, 
although it obviously invalidates the current COPY (caveat emptor!). 
The problem comes when ping() is called and we are in a failed transaction. 
After some experimenting, the best solution I found is to send the PQexec, 
and then check if PQresultErrorField(result, 'C') is '25P02'. If it is, then 
all is "well", in that the server is still alive. If it is not, then we 
can assume the backend is bad (for example, PQerrorMessage gives a 
"could not receive data from server: Bad file descriptor"). Being that we 
cannot do a rollback before calling the PQexec, is this a decent solution? 
Can we depend on really serious errors always trumping the expected 25P02?
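
For concreteness, the shape of that check in rough C (a sketch of the logic
described above, not DBD::Pg's actual code):

#include <string.h>
#include <libpq-fe.h>

static int
backend_alive(PGconn *conn)
{
	PGresult   *res = PQexec(conn, "SELECT 1");
	const char *sqlstate;
	int			alive;

	if (res == NULL)
		return PQstatus(conn) == CONNECTION_OK;

	sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		alive = 1;								/* normal case */
	else if (sqlstate && strcmp(sqlstate, "25P02") == 0)
		alive = 1;								/* failed transaction, but the server answered */
	else
		alive = (PQstatus(conn) == CONNECTION_OK);	/* fall back to connection status */

	PQclear(res);
	return alive;
}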

-- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201412291942
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8






Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Jim Nasby

On 12/29/14, 10:53 AM, Kevin Grittner wrote:

Merlin Moncure  wrote:


Serialization errors only exist as a concession to concurrency
and performance.  Again, they should be returned as sparsely as
possible because they provide absolutely (as Tom pointed
out) zero detail to the application.


That is false.  They provide an *extremely* valuable piece of
information which is not currently available when you get a
duplicate key error -- whether the error occurred because of a race
condition and will not fail for the same cause if retried.


As for missing details like the duplicated key value, is there a reasonable way 
to add that as an errdetail() to a serialization failure error?

We do still have to be careful here though; you could still have code using our 
documented upsert methodology inside a serialized transaction. For example, via 
a 3rd party extension that can't assume you're using serialized transactions. 
Would it still be OK to treat that as a serialization failure and bubble that 
all the way back to the application?

As part of this, we should probably modify our upsert example so it takes 
transaction_isolation into account...


As for the fact that RI violations also don't return a
serialization failure when caused by a race with concurrent
transactions, I view that as another weakness in PostgreSQL.  I
don't think there is a problem curing one without curing the other
at the same time.  I have known of people writing their own
triggers to enforce RI rather than defining FKs precisely so that
they can get a serialization failure return code and do automatic
retry if it is caused by a race condition.  That's less practical
to compensate for when it comes to unique indexes or constraints.


Wow, that's horrible. :(
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] CATUPDATE confusion?

2014-12-29 Thread Adam Brightwell
All,

On Sat, Dec 27, 2014 at 6:31 PM, Noah Misch  wrote:

> On Sat, Dec 27, 2014 at 06:26:02PM -0500, Tom Lane wrote:
> > Stephen Frost  writes:
> > > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > >> Plan C (remove CATUPDATE altogether) also has some merit.  But adding
> a
> > >> superuser override there would be entirely pointless.
> >
> > > This is be my recommendation.  I do not see the point of carrying the
> > > option around just to confuse new users of pg_authid and reviewers of
> > > the code.
> >
> > Yeah ... if no one's found it interesting in the 20 years since the
> > code left Berkeley, it's unlikely that interest will emerge in the
> > future.
>
> No objection here.
>

Given this discussion, I have attached a patch that removes CATUPDATE for
review/discussion.

One of the interesting behaviors (or perhaps not) is how 'pg_class_aclmask'
handles an invalid role id when checking permissions against 'rolsuper'
instead of 'rolcatupdate'.  This is demonstrated by the
'has_table_privilege' regression test expected results.  In summary,
'has_rolcatupdate' raises an error when an invalid OID is provided,
however, as documented in the source 'superuser_arg' does not, simply
returning false.  Therefore, when checking for superuser-ness of a
non-existent role, the result will be false and not an error.  Perhaps this
is OK, but my concern would be on the expected behavior around items like
'has_table_privilege'.  My natural thought would be that we would want to
preserve that specific functionality, though short of adding a
'has_rolsuper' function that will raise an appropriate error, I'm uncertain
of an approach.  Thoughts?

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
new file mode 100644
index 9ceb96b..635032d
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***
*** 1416,1430 
   
  
   
-   rolcatupdate
-   bool
-   
-Role can update system catalogs directly.  (Even a superuser cannot do
-this unless this column is true)
-   
-  
- 
-  
rolcanlogin
bool

--- 1416,1421 
*** SELECT * FROM pg_locks pl LEFT JOIN pg_p
*** 8479,8494 
   
  
   
-   rolcatupdate
-   bool
-   
-   
-Role can update system catalogs directly.  (Even a superuser cannot do
-this unless this column is true)
-   
-  
- 
-  
rolcanlogin
bool

--- 8470,8475 
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
new file mode 100644
index d30612c..18623ef
*** a/src/backend/catalog/aclchk.c
--- b/src/backend/catalog/aclchk.c
*** aclcheck_error_type(AclResult aclerr, Oi
*** 3423,3448 
  }
  
  
- /* Check if given user has rolcatupdate privilege according to pg_authid */
- static bool
- has_rolcatupdate(Oid roleid)
- {
- 	bool		rolcatupdate;
- 	HeapTuple	tuple;
- 
- 	tuple = SearchSysCache1(AUTHOID, ObjectIdGetDatum(roleid));
- 	if (!HeapTupleIsValid(tuple))
- 		ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
-  errmsg("role with OID %u does not exist", roleid)));
- 
- 	rolcatupdate = ((Form_pg_authid) GETSTRUCT(tuple))->rolcatupdate;
- 
- 	ReleaseSysCache(tuple);
- 
- 	return rolcatupdate;
- }
- 
  /*
   * Relay for the various pg_*_mask routines depending on object kind
   */
--- 3423,3428 
*** pg_class_aclmask(Oid table_oid, Oid role
*** 3620,3627 
  
  	/*
  	 * Deny anyone permission to update a system catalog unless
! 	 * pg_authid.rolcatupdate is set.   (This is to let superusers protect
! 	 * themselves from themselves.)  Also allow it if allowSystemTableMods.
  	 *
  	 * As of 7.4 we have some updatable system views; those shouldn't be
  	 * protected in this way.  Assume the view rules can take care of
--- 3600,3606 
  
  	/*
  	 * Deny anyone permission to update a system catalog unless
! 	 * pg_authid.rolsuper is set.  Also allow it if allowSystemTableMods.
  	 *
  	 * As of 7.4 we have some updatable system views; those shouldn't be
  	 * protected in this way.  Assume the view rules can take care of
*** pg_class_aclmask(Oid table_oid, Oid role
*** 3630,3636 
  	if ((mask & (ACL_INSERT | ACL_UPDATE | ACL_DELETE | ACL_TRUNCATE | ACL_USAGE)) &&
  		IsSystemClass(table_oid, classForm) &&
  		classForm->relkind != RELKIND_VIEW &&
! 		!has_rolcatupdate(roleid) &&
  		!allowSystemTableMods)
  	{
  #ifdef ACLDEBUG
--- 3609,3615 
  	if ((mask & (ACL_INSERT | ACL_UPDATE | ACL_DELETE | ACL_TRUNCATE | ACL_USAGE)) &&
  		IsSystemClass(table_oid, classForm) &&
  		classForm->relkind != RELKIND_VIEW &&
! 		!superuser_arg(roleid) &&
  		!allowSystemTableMods)
  	{
  #ifdef ACLDEBUG
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
new file mode 100644
index 22b8c

Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2014-12-29 Thread Andres Freund
On 2014-12-29 16:59:05 -0500, Bruce Momjian wrote:
> I am glad someone else considers this important.

I do consider it important. I just considered the lwlock scalability
more important...

> Andres reported the above 2x pgbench difference in February, but no
> action was taken as everyone felt there needed to be more performance
> testing, but it never happened:

FWIW, I have no idea what exactly should be tested there. Right now I
think what we should do is to remove the BUFFERALIGN from shmem.c and
instead add explicit alignment code in a couple callers
(BufferDescriptors/Blocks, proc.c stuff).

>   $ pgbench --initialize --scale 1 pgbench

Scale 1 isn't particularly helpful in benchmarks, not even read only
ones.

>   $ pgbench --protocol prepared --client 16 --jobs 16 --transactions 100000 --select-only pgbench

I'd suspect you're more likely to see differences with a higher client
count. Also, I seriously doubt 100k xacts is enough to get stable
results - even on my laptop I get 100k+ TPS. I'd suggest using something
like -P 1 -T 100 or so.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-29 Thread Peter Geoghegan
Hi Jeff,

On Mon, Dec 29, 2014 at 2:29 PM, Jeff Janes  wrote:
> Using the vallock2 version of V1.8, using the test I previously described, I
> get some all-null rows, which my code should never create.  Also, the index
> and table don't agree, in this example I find 3 all-null rows in the table,
> but only 2 in the index.  I've attached an example output of querying via
> index and via full table scan, and also the pageinspect output of the blocks
> which have the 3 rows, in case that is helpful.

Interesting. Thanks a lot for your help!

> This was just a straight forward issue of firing queries at the database,
> the crash-inducing part of my test harness was not active during this test.
> I also ran it with my crashing patch reversed out, in case I introduced the
> problem myself, and it still occurs.
>
> Using V1.7 of the vallock2 patch, I saw the same thing with some all-null
> rows.  I also saw some other issues where two rows with the same key value
> would be present twice in the table (violating the unique constraint) but
> only one of them would appear in the index.  I suspect it is caused by the
> same issue as the all-null rows, and maybe I just didn't run v1.8 enough
> times to find that particular manifestation under v1.8.

This is almost certainly a latent bug with approach #2 to value
locking, that has probably been there all along. Semantics and syntax
have been a recent focus, and so the probability that I introduced a
regression of this nature in any recent revision seems low. I am going
to investigate the problem, and hope to have a diagnosis soon.

Once again, thanks!
-- 
Peter Geoghegan




Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-29 Thread Jim Nasby

On 12/28/14, 2:45 PM, Peter Geoghegan wrote:

On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis  wrote:

Do others have similar numbers? I'm quite surprised at how little
work_mem seems to matter for these plans (HashJoin might be a different
story though). I feel like I made a mistake -- can someone please do a
sanity check on my numbers?


I have seen external sorts that were quicker than internal sorts
before. With my abbreviated key patch, under certain circumstances
external sorts are faster, while presumably the same thing is true of
int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
that was marginally faster than a multi-gigabyte one that fit the
entire sort in memory. It probably has something to do with caching
effects dominating over the expense of more comparisons, since higher
work_mem settings that still resulted in an external sort were slower
than the 10MB setting.

I was surprised by this too, but it has been independently reported by
Jeff Janes.


I haven't tested for external faster than internal in a while, but I've 
certainly seen this effect before. Generally, once you get beyond a certain 
size (maybe 100MB?) you run the risk of a tapesort being faster than an 
internal sort.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] nls and server log

2014-12-29 Thread Jim Nasby

On 12/28/14, 2:56 AM, Craig Ringer wrote:

On 12/25/2014 02:35 AM, Euler Taveira wrote:

Hi,

Currently the same message goes to the server log and the client app. Sometimes
it bothers me, since I have to analyze server logs and discover that
lc_messages is set to pt_BR and, worse, that the stup^H^H^H
application parses some error messages in Portuguese.


IMO logging is simply broken for platforms where the postmaster and all
DBs don't share an encoding. We mix different encodings in log messages
and provide no way to separate them out. Nor is there a way to log
different messages to different files.

It's not just an issue with translations. We mix and mangle encodings of
user-supplied text, like RAISE strings in procs, for example.

We really need to be treating encoding for logging and for the client
much more separately than we currently do. I think any consideration of
translations for logging should be done with the underlying encoding
issues in mind.


Agreed.


My personal opinion is that we should require the server log to be
capable of representing all chars in the encodings used by any DB. Which
in practice means that we always just log in utf-8 if the user wants to
permit DBs with different encodings. An alternative would be one file
per database, always in the encoding of that database.


How much of this issue is caused by trying to machine-parse log files? Would a 
better option be to improve that case, possibly by including a 
field in each line that tells you the encoding for that entry?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] replicating DROP commands across servers

2014-12-29 Thread Alvaro Herrera
Here's a patch that tweaks the grammar to use TypeName in COMMENT,
SECURITY LABEL, and DROP for the type and domain cases.  The required
changes in the code are pretty minimal, thankfully.  Note the slight
changes to the new object_address test also.

I think I'm pretty much done with this now, so I intend to push this
first thing tomorrow and then the rebased getObjectIdentityParts patch
sometime later.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 701a2b7cc3cc6dff865db64ecaeb2fd8bb07b396
Author: Alvaro Herrera 
Date:   Mon Dec 29 18:55:59 2014 -0300

Use TypeName to represent type names in certain commands

In COMMENT, DROP and SECURITY LABEL we were representing types as a list
of names, rather than the TypeName struct that's used everywhere; this
practice also recently found its way into the new pg_get_object_address.

Right now it doesn't affect anything (other than some code cleanliness),
but it is more problematic for future changes that operate with object
addresses from the SQL interface; type details such as array-ness were
being forgotten.

diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 7cb46e1..9ca609d 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -646,13 +646,11 @@ get_object_address(ObjectType objtype, List *objname, List *objargs,
 break;
 			case OBJECT_DOMCONSTRAINT:
 {
-	List		   *domname;
 	ObjectAddress	domaddr;
 	char		   *constrname;
 
-	domname = list_truncate(list_copy(objname), list_length(objname) - 1);
-	constrname = strVal(llast(objname));
-	domaddr = get_object_address_type(OBJECT_DOMAIN, domname, missing_ok);
+	domaddr = get_object_address_type(OBJECT_DOMAIN, objname, missing_ok);
+	constrname = strVal(linitial(objargs));
 
 	address.classId = ConstraintRelationId;
 	address.objectId = get_domain_constraint_oid(domaddr.objectId,
@@ -1291,14 +1289,13 @@ get_object_address_attrdef(ObjectType objtype, List *objname,
  * Find the ObjectAddress for a type or domain
  */
 static ObjectAddress
-get_object_address_type(ObjectType objtype,
-		List *objname, bool missing_ok)
+get_object_address_type(ObjectType objtype, List *objname, bool missing_ok)
 {
 	ObjectAddress address;
 	TypeName   *typename;
 	Type		tup;
 
-	typename = makeTypeNameFromNameList(objname);
+	typename = (TypeName *) linitial(objname);
 
 	address.classId = TypeRelationId;
 	address.objectId = InvalidOid;
@@ -1428,27 +1425,8 @@ pg_get_object_address(PG_FUNCTION_ARGS)
 	 * given object type.  Most use a simple string Values list, but there
 	 * are some exceptions.
 	 */
-	if (type == OBJECT_TYPE || type == OBJECT_DOMAIN)
-	{
-		Datum	*elems;
-		bool	*nulls;
-		int		nelems;
-		TypeName *typname;
-
-		deconstruct_array(namearr, TEXTOID, -1, false, 'i',
-		  &elems, &nulls, &nelems);
-		if (nelems != 1)
-			ereport(ERROR,
-	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("name list length must be exactly %d", 1)));
-		if (nulls[0])
-			ereport(ERROR,
-	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("name or argument lists may not contain nulls")));
-		typname = typeStringToTypeName(TextDatumGetCString(elems[0]));
-		name = typname->names;
-	}
-	else if (type == OBJECT_CAST)
+	if (type == OBJECT_TYPE || type == OBJECT_DOMAIN || type == OBJECT_CAST ||
+		type == OBJECT_DOMCONSTRAINT)
 	{
 		Datum	*elems;
 		bool	*nulls;
@@ -1533,18 +1511,13 @@ pg_get_object_address(PG_FUNCTION_ARGS)
 	 */
 	switch (type)
 	{
-		case OBJECT_DOMCONSTRAINT:
-			if (list_length(name) < 2)
-ereport(ERROR,
-		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-		 errmsg("name list length must be at least %d", 2)));
-			break;
 		case OBJECT_LARGEOBJECT:
 			if (list_length(name) != 1)
 ereport(ERROR,
 		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-		 errmsg("name list length must be %d", 1)));
+		 errmsg("name list length must be exactly %d", 1)));
 			break;
+		case OBJECT_DOMCONSTRAINT:
 		case OBJECT_OPCLASS:
 		case OBJECT_OPFAMILY:
 		case OBJECT_CAST:
diff --git a/src/backend/commands/dropcmds.c b/src/backend/commands/dropcmds.c
index 8583581..21a2ae4 100644
--- a/src/backend/commands/dropcmds.c
+++ b/src/backend/commands/dropcmds.c
@@ -264,10 +264,14 @@ does_not_exist_skipping(ObjectType objtype, List *objname, List *objargs)
 	{
 		case OBJECT_TYPE:
 		case OBJECT_DOMAIN:
-			if (!schema_does_not_exist_skipping(objname, &msg, &name))
 			{
-msg = gettext_noop("type \"%s\" does not exist, skipping");
-name = TypeNameToString(makeTypeNameFromNameList(objname));
+TypeName   *typ = linitial(objname);
+
+if (!schema_does_not_exist_skipping(typ->names, &msg, &name))
+{
+	msg = gettext_noop("type \"%s\" does not exist, skipping");
+	name = TypeNameToString(typ);
+}
 			}
 			break;
 		c

Re: [HACKERS] Additional role attributes && superuser review

2014-12-29 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
> > I'd suggest it's called DUMP if that's what it allows, to keep it separate
> > from the backup parts.
> 
> Makes sense to me.

I'm fine calling it 'DUMP', but for different reasons.

We have no (verifiable) idea what client program is being used to
connect and therefore we shouldn't try to tie the name of the client
program to the permission.

That said, a 'DUMP' privilege which allows the user to dump the contents
of the entire database is entirely reasonable.  We need to be clear in
the documentation though- such a 'DUMP' privilege is essentially
granting USAGE on all schemas and table-level SELECT on all tables and
sequences (anything else..?).  To be clear, a user with this privilege
can utilize that access without using pg_dump.

One other point is that this shouldn't imply any other privileges, imv.
I'm specifically thinking of BYPASSRLS- that's independently grantable
and therefore should be explicitly set, if it's intended.  Things
should work 'sanely' with any combination of the two options.
Similarly, DUMP shouldn't imply BACKUP or vice versa.  In particular,
I'd like to see roles which have only the BACKUP privilege be unable to
directly read any data (modulo things granted to PUBLIC).  This would
allow an unprivileged user to manage backups, kick off ad-hoc ones, etc,
without being able to actually access any of the data (this would
require the actual backup system to have a similar control, but that's
entirely possible with more advanced SANs and enterprise backup
solutions).

> > That seems really bad names, IMHO. Why? Because we use WAL and XLOG
> > throughout documentation and parameters and code to mean *the same thing*.
> > And here they'd suddenly mean different things. If we need them as separate
> > privileges, I think we need much better names. (And a better description -
> > what is "xlog operations" really?)
> >
> 
> Fair enough, ultimately what I was trying to address is the following
> concern raised by Alvaro:
> 
> "To me, what this repeated discussion on this particular BACKUP point
> says, is that the ability to run pg_start/stop_backup and the xlog
> related functions should be a different privilege, i.e. something other
> than BACKUP; because later we will want the ability to grant someone the
> ability to run pg_dump on the whole database without being superuser,
> and we will want to use the name BACKUP for that.  So I'm inclined to
> propose something more specific for this like WAL_CONTROL or
> XLOG_OPERATOR, say."

Note that the BACKUP role attribute was never intended to cover the
pg_dump use-case.  Simply the name of it caused confusion though.  I'm
not sure if adding a DUMP role attribute is sufficient to address
that confusion, but I haven't got a better idea either.

> When indeed, what it meant was to have the following separate (effectively
> merging #2 and #3):
> 
> 1) ability to pg_dump
> 2) ability to start/stop backups *and* ability to execute xlog related
> functions.

That sounds reasonable to me (and is what was initially proposed, though
I've come around to the thinking that this BACKUP role attribute should
also allow pg_xlog_replay_pause/resume(), as those can be useful on
replicas).

> Given this clarification:
> 
> I think #1 could certainly be answered by using DUMP.  I have no strong
> opinion in either direction, though I do think that BACKUP does make the
> most sense for #2.  Previously, Stephen had mentioned a READONLY capability
> that could effectively work for pg_dump, though, Jim's suggestion of
> keeping 'read-all' separate from 'ability to pg_dump' seems logical.  In
> either case, I certainly wouldn't mind having a wider agreement/consensus
> on this approach.

The read-all vs. ability-to-pg_dump distinction doesn't really exist for
role attributes, as I see it (see my comments above).  That said, having
DUMP or read-all is different from read-*only*, which would probably be
good to have independently.  I can imagine a use-case for a read-only
account which only has read ability for those tables, schemas, etc,
explicitly granted to it.

There is one issue that occurs to me, however.  We're talking about
pg_dump, but what about pg_dumpall?  In particular, I don't think the
DUMP privilege should provide access to pg_authid, as that would allow
the user to bypass the privilege system in some environments by using
the hash to log in as a superuser.  Now, I don't encourage using
password based authentication, especially for superuser accounts, but
lots of people do.  The idea with these privileges is to allow certain
operations to be performed by a non-superuser while preventing trivial
access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
achieve that.

> So, here is a revised list of proposed attributes:
> 
> * BACKUP - allows role to perform backup operations (as originally proposed)
> * LOG - allows role to rotate log files - remains broad

Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
[errata]


Kevin Grittner  wrote:

> Quoting from the peer-reviewed paper presented in Istanbul[1]:

That should have been [3], not [1].




Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2014-12-29 Thread Bruce Momjian
On Sat, Dec 27, 2014 at 08:05:42PM -0500, Robert Haas wrote:
> On Wed, Dec 24, 2014 at 11:20 AM, Andres Freund  
> wrote:
> > I just verified that I can still reproduce the problem:
> >
> > # aligned case (max_connections=401)
> > afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 
> > -j 96 -T 100 -S
> > progress: 1.0 s, 405170.2 tps, lat 0.195 ms stddev 0.928
> > progress: 2.0 s, 467011.1 tps, lat 0.204 ms stddev 0.140
> > progress: 3.0 s, 462832.1 tps, lat 0.205 ms stddev 0.154
> > progress: 4.0 s, 471035.5 tps, lat 0.202 ms stddev 0.154
> > progress: 5.0 s, 500329.0 tps, lat 0.190 ms stddev 0.132
> >
> > BufferDescriptors is at 0x7f63610a6960 (which is 32byte aligned)
> >
> > # unaligned case (max_connections=400)
> > afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 
> > -j 96 -T 100 -S
> > progress: 1.0 s, 202271.1 tps, lat 0.448 ms stddev 1.232
> > progress: 2.0 s, 223823.4 tps, lat 0.427 ms stddev 3.007
> > progress: 3.0 s, 227584.5 tps, lat 0.414 ms stddev 4.760
> > progress: 4.0 s, 221095.6 tps, lat 0.410 ms stddev 4.390
> > progress: 5.0 s, 217430.6 tps, lat 0.454 ms stddev 7.913
> > progress: 6.0 s, 210275.9 tps, lat 0.411 ms stddev 0.606
> > BufferDescriptors is at 0x7f1718aeb980 (which is 64byte aligned)
> 
> So, should we increase ALIGNOF_BUFFER from 32 to 64?  Seems like
> that's what these results are telling us.

I am glad someone else considers this important.  Andres reported the
above 2x pgbench difference in February, but no action was taken as
everyone felt there needed to be more performance testing, but it never
happened:


http://www.postgresql.org/message-id/20140202151319.gd32...@awork2.anarazel.de

I have now performance tested this by developing the attached two
patches which both increase the Buffer Descriptors allocation by 64
bytes.  The first patch causes each 64-byte Buffer Descriptor struct to
align on a 32-byte boundary but not a 64-byte boundary, while the second
patch aligns it with a 64-byte boundary.

I tried many tests, including this:

$ pgbench --initialize --scale 1 pgbench
$ pgbench --protocol prepared --client 16 --jobs 16 --transactions 
10 --select-only pgbench

I cannot measure any difference on my dual-CPU-socket, 16-vcore server
(http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012).   I
thought this test would cause the most Buffer Descriptor contention
between the two CPUs.  Can anyone else see a difference when testing
these two patches?  (The patch reports alignment in the server logs.)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
new file mode 100644
index ff6c713..c4dce5b
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
*** InitBufferPool(void)
*** 67,72 
--- 67,73 
  	bool		foundBufs,
  foundDescs;
  
+ 	fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
  		NBuffers * sizeof(BufferDesc), &foundDescs);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
new file mode 100644
index 2ea2216..669c07f
*** a/src/backend/storage/ipc/shmem.c
--- b/src/backend/storage/ipc/shmem.c
*** ShmemInitStruct(const char *name, Size s
*** 327,332 
--- 327,335 
  	ShmemIndexEnt *result;
  	void	   *structPtr;
  
+ 	if (strcmp(name, "Buffer Descriptors") == 0)
+ 		size = BUFFERALIGN(size) + 64;
+ 
  	LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
  
  	if (!ShmemIndex)
*** ShmemInitStruct(const char *name, Size s
*** 413,418 
--- 416,432 
  			" \"%s\" (%zu bytes requested)",
  			name, size)));
  		}
+ 		if (strcmp(name, "Buffer Descriptors") == 0)
+ 		{
+ 			/* align on 32 */
+ 			if ((int64)structPtr % 32 != 0)
+ structPtr = (void *)((int64)structPtr + 32 - (int64)structPtr % 32);
+ 			/* align on 32 but not 64 */
+ 			if ((int64)structPtr % 64 == 0)
+ structPtr = (void *)((int64)structPtr + 32);
+ 		}
+ 		fprintf(stderr, "shared memory alignment of %s:  %ld-byte\n", name,
+ 			(int64)structPtr % 64 == 0 ? 64 : (int64)structPtr % 64);
  		result->size = size;
  		result->location = structPtr;
  	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
new file mode 100644
index ff6c713..c4dce5b
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
*** InitBufferPool(void)
*** 67,72 
--- 67,73 
  	bool		foundBufs,
  foundDescs;
  
+ 	fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
  		NBuffers * sizeof(BufferDesc), &foundDescs);
diff --git a/src/bac

Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Merlin Moncure  wrote:
> On Mon, Dec 29, 2014 at 10:53 AM, Kevin Grittner  wrote:
>> The semantics are so imprecise that Tom argued that we should
>> document that transactions should be retried from the start when
>> you get the duplicate key error, since it *might* have been caused
>> by a race condition.

> That sounds off to me also. In terms of a classic uniqueness
> constraint (say, an identifying user name), every violation is
> technically a race condition -- whether or not the transactions
> overlap on time is completely irrelevant.

That is completely bogus. The point is that if you return a
serialization failure the transaction should be immediately retried
from the beginning (including, in many cases, acquiring key
values). If the error was caused by concurrent insertion of a
duplicate key where the transaction does *not* acquire the value
within the transaction, *then* you get the duplicate key error on
the retry.

> If the transactions
> touching off the error happen to overlap or not is an accident of
> timing and irrelevant; a serialization error suggests that the
> transaction should be retried when in fact it shouldn't be,
> particularly just to get the *actual* error. What if the transaction
> is non-trivial? Why do we want to bother our users about those
> details at all?

Where serializable transactions are used to manage race conditions
the users typically do not see them. The application code does not
see them. There is normally some sort of framework (possibly using
dependency injection, although not necessarily) which catches these
and retries the transaction from the start without notifying the
user or letting the application software know that it happened.
There is normally some server-side logging so that high volumes of
rollbacks can be identified and fixed. In a real-world situation
where this was used for 100 production databases running millions
of transactions per day, I saw about 10 or 20 serialization
failures per day across the whole set of database servers.

While I have certainly heard of workloads where it didn't work out
that well, Dan Ports found that many common benchmarks perform
about that well. Quoting from the peer-reviewed paper presented in
Istanbul[1]:

| SSI outperforms S2PL for all transaction mixes, and does so by a
| significant margin when the fraction of read-only transactions is
| high. On these workloads, there are more rw-conflicts between
| concurrent transactions, so locking imposes a larger performance
| penalty. (The 100%-read-only workload is a special case; there are
| no lock conflicts under S2PL, and SSI has no overhead because all
| snapshots are safe.) The 150-warehouse configuration (Figure 5b )
| behaves similarly, but the differences are less pronounced: on this
| disk-bound benchmark, CPU overhead is not a factor, and improved
| concurrency has a limited benefit. Here, the performance of SSI
| is indistinguishable from that of SI. Transactions rarely need to
| be retried; in all cases, the serialization failure rate was under
| 0.25%.

> Consider the 'idiomatic upsert' as it exists in the documentation (!):

That documentation should probably be updated to indicate which
isolation levels need that code. If you are relying on
serializable transactions that dance is unnecessary and pointless.
The rule when relying on serializable transactions is that you
write the code to behave correctly if the transaction executing it
is running by itself. Period. No special handling for race
conditions. Detecting race conditions is the job of the database
engine and retrying affected transactions is the job of the
framework. Absolutely nobody who understands serializable
transactions would use that idiom inside of serializable
transactions.

> By changing the error code, for decades worth of dealing with this
> problem, you've just converted a server side loop to a full round
> trip, and, if the user does not automatically retry serialization
> failures, broken his/her code.

If they are not automatically retrying on serialization failures,
they should probably not be using serializable transactions. That
is rather the point. No need for table locks. No need for SELECT
FOR UPDATE. No need to special-case for concurrent transactions.
"Just do it." Let the framework retry as needed.

I have no problem with there being people who choose not to use
this approach. What I'm asking is that they not insist on
PostgreSQL being needlessly crippled for those who *do* use it this
way.

> It's impossible to fix the round trip
> issue, at least provably, because there is no way to know for sure
> that the serialization failure is coming from this exact insertion, or
> say, a dependent trigger (aside: the idiomatic example ought to be
> checking the table name!) such that your loop (either here or from
> application) would execute a bazillion times until some other
> transaction clears.

Until the inserting transaction commits, it would block the other
insertion atte

Re: [HACKERS] Additional role attributes && superuser review

2014-12-29 Thread Adam Brightwell
Magnus,

Thanks for the feedback.


>> BACKUP - allows role to perform pg_dump* backups of whole database.
>>
>
> I'd suggest it's called DUMP if that's what it allows, to keep it separate
> from the backup parts.
>

Makes sense to me.

That seems really bad names, IMHO. Why? Because we use WAL and XLOG
> throughout documentation and parameters and code to mean *the same thing*.
> And here they'd suddenly mean different things. If we need them as separate
> privileges, I think we need much better names. (And a better description -
> what is "xlog operations" really?)
>

Fair enough, ultimately what I was trying to address is the following
concern raised by Alvaro:

"To me, what this repeated discussion on this particular BACKUP point
says, is that the ability to run pg_start/stop_backup and the xlog
related functions should be a different privilege, i.e. something other
than BACKUP; because later we will want the ability to grant someone the
ability to run pg_dump on the whole database without being superuser,
and we will want to use the name BACKUP for that.  So I'm inclined to
propose something more specific for this like WAL_CONTROL or
XLOG_OPERATOR, say."

Upon re-reading it (and other discussions around it) I believe that I must
have misinterpreted.  Initially, I read it to mean that we needed the
following separate permissions.

1) ability to pg_dump
2) ability to start/stop backups
3) ability to execute xlog related functions

When indeed, what it meant was to have the following separate (effectively
merging #2 and #3):

1) ability to pg_dump
2) ability to start/stop backups *and* ability to execute xlog related
functions.

My apologies on that confusion.

Given this clarification:

I think #1 could certainly be answered by using DUMP.  I have no strong
opinion in either direction, though I do think that BACKUP does make the
most sense for #2.  Previously, Stephen had mentioned a READONLY capability
that could effectively work for pg_dump, though, Jim's suggestion of
keeping 'read-all' separate from 'ability to pg_dump' seems logical.  In
either case, I certainly wouldn't mind having a wider agreement/consensus
on this approach.

So, here is a revised list of proposed attributes:

* BACKUP - allows role to perform backup operations (as originally proposed)
* LOG - allows role to rotate log files - remains broad enough to consider
future log related operations
* DUMP -  allows role to perform pg_dump* backups of whole database
* MONITOR - allows role to view pg_stat_* details (as originally proposed)
* PROCSIGNAL - allows role to signal backend processes (as originally
proposed)

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK

2014-12-29 Thread Adrian Klaver

On 12/29/2014 09:55 AM, Tom Lane wrote:

Adrian Klaver  writes:

So it seems you can turn ON_ERROR_ROLLBACK on with either 1 or 'on', but you 
can only turn it off with 'off'.
With ON_ERROR_STOP 1/on and 0/off both seem to work.



Is this expected?






Given the lack of previous complaints, this probably isn't backpatching
material, but it sure seems like a bit of attention to consistency
would be warranted here.


I would appreciate it, thanks.



regards, tom lane




--
Adrian Klaver
adrian.kla...@aklaver.com




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 10:47 AM, Nikita Volkov  wrote:
>> [combining replies -- nikita, better not to top-post (FYI)]

[combining replied again]

> I'm sorry. I don't know what you mean. I just replied to an email.

http://www.idallen.com/topposting.html

>> To prove your statement, you need to demonstrate how a transaction left
>> the database in a bad state given concurrent activity without counting
>> failures.
>
> 1. Transaction A looks up a row by ID 1 and gets an empty result.
> 2. Concurrent transaction B inserts a row with ID 1.
> 3. Transaction A goes on with the presumption that a row with ID 1 does not
> exist, because a transaction is supposed to be isolated and because it has
> made sure that the row does not exist. With this presumption it confidently
> inserts a row with ID 1 only to get Postgres report a duplicate key. Wat?

Your understanding of isolation is incorrect.   Transaction A does not
go on with anything -- it's guaranteed to fail in this case.  The only
debatable point here is how exactly it fails.  Again, isolation's job
is to protect the data.

On Mon, Dec 29, 2014 at 10:53 AM, Kevin Grittner  wrote:
> The semantics are so imprecise that Tom argued that we should
> document that transactions should be retried from the start when
> you get the duplicate key error, since it *might* have been caused
> by a race condition.

That sounds off to me also.  In terms of a classic uniqueness
constraint (say, an identifying user name), every violation is
technically a race condition -- whether or not the transactions
overlap on time is completely irrelevant.  Whether or not the transactions
touching off the error happen to overlap is an accident of
timing and irrelevant; a serialization error suggests that the
transaction should be retried when in fact it shouldn't be,
particularly just to get the *actual* error.  What if the transaction
is non-trivial?  Why do we want to bother our users about those
details at all?

Consider the 'idiomatic upsert' as it exists in the documentation (!):

LOOP
-- first try to update the key
UPDATE db SET b = data WHERE a = key;
IF found THEN
RETURN;
END IF;
--   XXX merlin's note: if any dependent table throws a UV,
--   say, via a trigger, this code will loop endlessly
-- not there, so try to insert the key
-- if someone else inserts the same key concurrently,
-- we could get a unique-key failure
BEGIN
INSERT INTO db(a,b) VALUES (key, data);
RETURN;
EXCEPTION WHEN unique_violation THEN
-- do nothing, and loop to try the UPDATE again
END;
END LOOP;

By changing the error code, for decades worth of dealing with this
problem, you've just converted a server side loop to a full round
trip, and, if the user does not automatically retry serialization
failures, broken his/her code.  It's impossible to fix the round trip
issue, at least provably, because there is no way to know for sure
that the serialization failure is coming from this exact insertion, or
say, a dependent trigger (aside: the idiomatic example ought to be
checking the table name!) such that your loop (either here or from
application) would execute a bazillion times until some other
transaction clears.  OK, this is a mostly academic detail, but the
picture is not so clear as you're saying, I think; you're travelling
at high speed in uncertain waters.

The key point here is that OP issued a SELECT first, and he's chaining
DML decisions to the output of that select. He's expecting that SELECT
to be protected via ACID, but it isn't and can't be unless you're
prepared to predicate lock every row selected.  What he wants is for
the database to bounce his transaction because the select lied to him,
but that can't be done obviously.

> I'm curious how heavily you use serializable transactions, because
> I have trouble believing that those who rely on them as their
> primary (or only) strategy for dealing with race conditions under
> high concurrency would take that position.

I don't use them much, admittedly.  That said, I don't use them as
race condition guards.  I use locks or other techniques to manage the
problem.   I tend to build out applications on top of functions, and
the inability to set the isolation mode inside a function keeps me
from using anything but 'read committed'.

merlin




Re: [HACKERS] recovery_min_apply_delay with a negative value

2014-12-29 Thread Fabrízio de Royes Mello
On Sun, Dec 28, 2014 at 12:31 PM, Michael Paquier 
wrote:
>
> Hi all,
>
> While reviewing another patch, I have noticed that
recovery_min_apply_delay can have a negative value. And the funny part is
that we actually attempt to apply a delay even in this case, per se this
condition recoveryApplyDelay@xlog.c:
> /* nothing to do if no delay configured */
> if (recovery_min_apply_delay == 0)
> return false;
> Shouldn't we simply leave if recovery_min_apply_delay is lower than 0, and not
only equal to 0?
>

Seems reasonable.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK

2014-12-29 Thread Tom Lane
Adrian Klaver  writes:
> So it seems you can turn ON_ERROR_ROLLBACK on with either 1 or 'on', but you 
> can only turn it off with 'off'.
> With ON_ERROR_STOP 1/on and 0/off both seem to work. 

> Is this expected?

on_error_stop_hook() uses ParseVariableBool, while
on_error_rollback_hook() uses some hard-coded logic that checks for
"interactive" and "off" and otherwise assumes "on".  This is inconsistent
first in not accepting as many spellings as ParseVariableBool, and second
in not complaining about invalid input --- ParseVariableBool does, so
why not here?

echo_hook, echo_hidden_hook, histcontrol_hook, and verbosity_hook are
equally cavalierly written.

For on_error_rollback_hook and echo_hidden_hook, where "on" and "off"
are documented values, I think it would make sense to allow all spellings
accepted by ParseVariableBool, say like this:

 if (newval == NULL)
 pset.on_error_rollback = PSQL_ERROR_ROLLBACK_OFF;
 else if (pg_strcasecmp(newval, "interactive") == 0)
 pset.on_error_rollback = PSQL_ERROR_ROLLBACK_INTERACTIVE;
-else if (pg_strcasecmp(newval, "off") == 0)
-pset.on_error_rollback = PSQL_ERROR_ROLLBACK_OFF;
-else
-pset.on_error_rollback = PSQL_ERROR_ROLLBACK_ON;
+else if (ParseVariableBool(newval))
+pset.on_error_rollback = PSQL_ERROR_ROLLBACK_ON;
+else
+pset.on_error_rollback = PSQL_ERROR_ROLLBACK_OFF;

The other 3 functions don't need to accept on/off I think, but they should
print a warning for unrecognized input like ParseVariableBool does.
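
For illustration only (this is not the existing psql code), the shape of
a hook that warns the way ParseVariableBool does, using a made-up
variable and made-up enum values:

static void
widget_mode_hook(const char *newval)
{
	if (newval == NULL)
		pset.widget_mode = WIDGET_MODE_DEFAULT;		/* hypothetical member */
	else if (pg_strcasecmp(newval, "fancy") == 0)
		pset.widget_mode = WIDGET_MODE_FANCY;
	else if (pg_strcasecmp(newval, "plain") == 0)
		pset.widget_mode = WIDGET_MODE_PLAIN;
	else
	{
		/* warn instead of silently assuming a value */
		psql_error("unrecognized value \"%s\" for \"WIDGET_MODE\"; assuming \"plain\"\n",
				   newval);
		pset.widget_mode = WIDGET_MODE_PLAIN;
	}
}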

And while we're at it, we should consistently allow case-insensitive
input; right now it looks like somebody rolled dice to decide whether
to use "strcmp" or "pg_strcasecmp" in these functions.  Per line, not
even per function.

Given the lack of previous complaints, this probably isn't backpatching
material, but it sure seems like a bit of attention to consistency
would be warranted here.

regards, tom lane




Re: [HACKERS] replicating DROP commands across servers

2014-12-29 Thread Alvaro Herrera
(The changes in the regression test are bogus, BTW; I didn't care enough
to get them fixed before sorting out the rest of the mess.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 89c8cbed0072ad4d921128b834fcb4f9e2eb4c33
Author: Alvaro Herrera 
Date:   Mon Dec 22 18:32:43 2014 -0300

array objname/objargs stuff

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 24c64b7..112b6a0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17772,6 +17772,23 @@ FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
  identifier present in the identity is quoted if necessary.
 

+   
+address_names
+text[]
+
+ An array that, together with address_args,
+ can be used by the pg_get_object_address() to
+ recreate the object address in a remote server containing an
+ identically named object of the same kind.
+
+   
+   
+address_args
+text[]
+
+ Complement for address_names above.
+
+   
   
  
 
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 85079d6..789af5f 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -74,6 +74,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
@@ -556,8 +557,9 @@ static void getRelationTypeDescription(StringInfo buffer, Oid relid,
 		   int32 objectSubId);
 static void getProcedureTypeDescription(StringInfo buffer, Oid procid);
 static void getConstraintTypeDescription(StringInfo buffer, Oid constroid);
-static void getOpFamilyIdentity(StringInfo buffer, Oid opfid);
-static void getRelationIdentity(StringInfo buffer, Oid relid);
+static void getOpFamilyIdentity(StringInfo buffer, Oid opfid, List **objname,
+	List **objargs);
+static void getRelationIdentity(StringInfo buffer, Oid relid, List **objname);
 
 /*
  * Translate an object name and arguments (as passed by the parser) to an
@@ -2960,6 +2962,66 @@ pg_identify_object(PG_FUNCTION_ARGS)
 }
 
 /*
+ * SQL-level callable function to obtain object type + identity
+ */
+Datum
+pg_identify_object_as_address(PG_FUNCTION_ARGS)
+{
+	Oid			classid = PG_GETARG_OID(0);
+	Oid			objid = PG_GETARG_OID(1);
+	int32		subobjid = PG_GETARG_INT32(2);
+	ObjectAddress address;
+	char	   *identity;
+	List	   *names;
+	List	   *args;
+	Datum		values[3];
+	bool		nulls[3];
+	TupleDesc	tupdesc;
+	HeapTuple	htup;
+
+	address.classId = classid;
+	address.objectId = objid;
+	address.objectSubId = subobjid;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(3, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "type",
+	   TEXTOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "object_names",
+	   TEXTARRAYOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "object_args",
+	   TEXTARRAYOID, -1, 0);
+
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* object type */
+	values[0] = CStringGetTextDatum(getObjectTypeDescription(&address));
+	nulls[0] = false;
+
+	/* object identity */
+	identity = getObjectIdentityParts(&address, &names, &args);
+	pfree(identity);
+
+	/* object_names */
+	values[1] = PointerGetDatum(strlist_to_textarray(names));
+	nulls[1] = false;
+
+	/* object_args */
+	if (args)
+		values[2] = PointerGetDatum(strlist_to_textarray(args));
+	else
+		values[2] = PointerGetDatum(construct_empty_array(TEXTOID));
+	nulls[2] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
  * Return a palloc'ed string that describes the type of object that the
  * passed address is for.
  *
@@ -3215,7 +3277,7 @@ getProcedureTypeDescription(StringInfo buffer, Oid procid)
 }
 
 /*
- * Return a palloc'ed string that identifies an object.
+ * Obtain a given object's identity, as a palloc'ed string.
  *
  * This is for machine consumption, so it's not translated.  All elements are
  * schema-qualified when appropriate.
@@ -3223,14 +3285,42 @@ getProcedureTypeDescription(StringInfo buffer, Oid procid)
 char *
 getObjectIdentity(const ObjectAddress *object)
 {
+	return getObjectIdentityParts(object, NULL, NULL);
+}
+
+/*
+ * As above, but more detailed.
+ *
+ * There are two sets of return values: the identity itself as a palloc'd
+ * string is returned.  objname and objargs, if not NULL, are output parameters
+ * that receive lists of C-strings that are useful to give back to
+ * get_object_address() to reconstruct the ObjectAddress.
+ */
+char *
+getObjectIdentityParts(const ObjectAddress *object,
+	   List **objname, List **objargs)
+{
 	StringInfoData buffer;
 
 	initStrin

Re: [HACKERS] replicating DROP commands across servers

2014-12-29 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Patch 0005 adds getObjectIdentityParts(), which returns the object
> identity in arrays that can be passed to pg_get_object_address.  This
> part needs slight revisions so I'm not sure I will be able to push
> tomorrow.

Here's a fresh version of this patch.  I chose to add a SQL-accessible
version, pg_identify_object_as_address, to make it easier to test.  In
doing so I noticed a couple of bugs, and most interestingly I noticed
that it was essentially impossible to cleanly address an array type;
doing a roundtrip through the new functions would get me the base type
when I used "integer[]" but the array type when I used "_int4".  This
looked like a problem, so I traced through it and noticed that we're
using the type name *list* as a list, rather than as a TypeName, to
refer to OBJECT_TYPE and OBJECT_DOMAIN; I hadn't understood the
significance of this until I realized that domains would be represented
with arrayBounds set to a non-empty list for the integer[] syntax, but
the name list would have "pg_catalog integer" only; when the rest of
TypeName was discarded, the fact that we were talking about an array was
completely forgotten.  Before the dawn of time we had this:

-static void
-CommentType(List *typename, char *comment)
-{
-   TypeName   *tname;
-   Oid oid;
-
-   /* XXX a bit of a crock; should accept TypeName in COMMENT syntax */
-   tname = makeTypeNameFromNameList(typename);

where the XXX comment was removed by commit c10575ff005c330d047534562
without a corresponding comment in the new function.

I'm going to see about changing the grammar to get this fixed; this
patch is important because it will enable us to complete^Wcontinue
working on the DDL deparse testing framework.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Nikita Volkov
I believe, the objections expressed in this thread miss a very important
point of all this: the isolation property (the "I" in ACID) is violated.

Here’s a quote from the Wikipedia article on ACID:

The isolation property ensures that the concurrent execution of
transactions results in a system state that would be obtained if
transactions were executed serially, i.e., one after the other.

The original example proves that it's violated. Such behaviour can neither
be expected by a user, nor is it even mentioned anywhere. Instead, in the first
paragraph of the “About” section of the Postgres site, it states:

It is fully ACID compliant

Which is basically just a lie, until issues like this one get dealt with.
​

2014-12-29 18:31 GMT+03:00 Merlin Moncure :

> On Mon, Dec 29, 2014 at 9:09 AM, Kevin Grittner  wrote:
> > Merlin Moncure  wrote:
> >> In other words, the current behavior is:
> >> txn A,B begin
> >> txn A inserts
> >> txn B inserts over A, locks, waits
> >> txn A commits.  B aborts with duplicate key error
> >
> > What I'm proposing is that for serializable transactions B would
> > get a serialization failure; otherwise B would get a duplicate key
> > error.  If the retry of B looks at something in the database to
> > determine what its primary key should be, it will get a new value
> > on the retry, since it will be starting after the commit of A.  If
> > it is using a literal key, not based on something changed by A, it
> > will get a duplicate key error on the retry, since it will be
> > starting after the commit of A.
> >
> > It will either succeed on retry or get an error for a different
> > reason.
>
> In that case: we don't agree.  How come duplicate key errors would be
> reported as serialization failures but not RI errors (for example,
> inserting a record pointing to another record which a concurrent
> transaction deleted)?
>
> IMO, serialization errors are an implementation artifact and should
> not mask well defined errors in SQL under any circumstances (or if
> they must, those cases should absolutely be as narrow as possible).
>
> merlin
>


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Nikita Volkov
> [combining replies -- nikita, better not to top-post (FYI)]

I'm sorry. I don't know what you mean. I just replied to an email.

> To prove your statement, you need to demonstrate how a transaction left
the database in a bad state given concurrent activity without counting
failures.

1. Transaction A looks up a row by ID 1 and gets an empty result.
2. Concurrent transaction B inserts a row with ID 1.
3. Transaction A goes on with the presumption that a row with ID 1 does not
exist, because a transaction is supposed to be isolated and because it has
made sure that the row does not exist. With this presumption it confidently
inserts a row with ID 1 only to get Postgres report a duplicate key. Wat?


2014-12-29 19:17 GMT+03:00 Merlin Moncure :

> On Mon, Dec 29, 2014 at 9:44 AM, Greg Stark  wrote:
> > On Mon, Dec 29, 2014 at 3:31 PM, Merlin Moncure 
> wrote:
> >> In that case: we don't agree.  How come duplicate key errors would be
> >> reported as serialization failures but not RI errors (for example,
> >> inserting a record pointing to another record which a concurrent
> >> transaction deleted)?
> >
> > The key question is not the type of error but whether the transaction
> > saw another state previously.
>
> [combining replies -- nikita, better not to top-post (FYI)]
>
> How is that relevant?  Serialization errors only exist as a concession
> to concurrency and performance.  Again, they should be returned as
> sparsely as possible because they provide absolutely (as Tom pointed
> out) zero detail to the application.   The precise definition of the
> error is up to us, but I'd rather keep it to its current rather
> specific semantics.
>
> On Mon, Dec 29, 2014 at 9:48 AM, Nikita Volkov 
> wrote:
> > I believe, the objections expressed in this thread miss a very important
> > point of all this: the isolation property (the "I" in ACID) is violated.
> >
> > Here’s a quote from the Wikipedia article on ACID:
> >
> > The isolation property ensures that the concurrent execution of
> transactions
> > results in a system state that would be obtained if transactions were
> > executed serially, i.e., one after the other.
> >
> > The original example proves that it's violated. Such behaviour can
> neither
> > be expected by a user, nor is even mentioned anywhere. Instead in the
> first
> > paragraph of the “About” section of the Postgres site it states:
> >
> > It is fully ACID compliant
> >
> > Which is basically just a lie, until issues like this one get dealt with.
>
> That's simply untrue: inconvenience != acid violation
>
> Transaction levels provide certain guarantees regarding the state of
> the data in the presence of concurrent overlapping operations.   They
> do not define the mechanism of failure or how/when those failures
> should occur.  To prove your statement, you need to demonstrate how a
> transaction left the database in a bad state given concurrent activity
> without counting failures.  Postgres can, and does, for example,
> return concurrency type errors more aggressively than it needs to for
> the 'repeatable read', level.  Point being, this is completely ok as
> database implementation is free to understand that, just as it's free
> to define precisely how and when it fails given concurrency as long as
> those guarantees are provided.
>
> merlin
>


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Merlin Moncure  wrote:

> Serialization errors only exist as a concession to concurrency
> and performance.  Again, they should be returned as sparsely as
> possible because they provide absolutely (as Tom pointed
> out) zero detail to the application.

That is false.  They provide an *extremely* valuable piece of
information which is not currently available when you get a
duplicate key error -- whether the error occurred because of a race
condition and will not fail for the same cause if retried.

> The precise definition of the error is up to us, but I'd rather
> keep it to its current rather specific semantics.

The semantics are so imprecise that Tom argued that we should
document that transactions should be retried from the start when
you get the duplicate key error, since it *might* have been caused
by a race condition.

I'm curious how heavily you use serializable transactions, because
I have trouble believing that those who rely on them as their 
primary (or only) strategy for dealing with race conditions under 
high concurrency would take that position.

As for the fact that RI violations also don't return a
serialization failure when caused by a race with concurrent
transactions, I view that as another weakness in PostgreSQL.  I
don't think there is a problem curing one without curing the other
at the same time.  I have known of people writing their own
triggers to enforce RI rather than defining FKs precisely so that
they can get a serialization failure return code and do automatic
retry if it is caused by a race condition.  That's less practical
to compensate for when it comes to unique indexes or constraints.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 9:44 AM, Greg Stark  wrote:
> On Mon, Dec 29, 2014 at 3:31 PM, Merlin Moncure  wrote:
>> In that case: we don't agree.  How come duplicate key errors would be
>> reported as serialization failures but not RI errors (for example,
>> inserting a record pointing to another record which a concurrent
>> transaction deleted)?
>
> The key question is not the type of error but whether the transaction
> saw another state previously.

[combining replies -- nikita, better not to top-post (FYI)]

How is that relevant?  Serialization errors only exist as a concession
to concurrency and performance.  Again, they should be returned as
sparsely as possible because they provide absolutely (as Tom pointed
out) zero detail to the application.   The precise definition of the
error is up to us, but I'd rather keep it to its current rather
specific semantics.

On Mon, Dec 29, 2014 at 9:48 AM, Nikita Volkov  wrote:
> I believe, the objections expressed in this thread miss a very important
> point of all this: the isolation property (the "I" in ACID) is violated.
>
> Here’s a quote from the Wikipedia article on ACID:
>
> The isolation property ensures that the concurrent execution of transactions
> results in a system state that would be obtained if transactions were
> executed serially, i.e., one after the other.
>
> The original example proves that it's violated. Such behaviour can neither
> be expected by a user, nor is even mentioned anywhere. Instead in the first
> paragraph of the “About” section of the Postgres site it states:
>
> It is fully ACID compliant
>
> Which is basically just a lie, until issues like this one get dealt with.

That's simply untrue: inconvenience != acid violation

Transaction levels provide certain guarantees regarding the state of
the data in the presence of concurrent overlapping operations.   They
do not define the mechanism of failure or how/when those failures
should occur.  To prove your statement, you need to demonstrate how a
transaction left the database in a bad state given concurrent activity
without counting failures.  Postgres can, and does, for example,
return concurrency type errors more aggressively than it needs to for
the 'repeatable read' level.  Point being, this is completely ok, as the
database implementation is free to understand that, just as it's free
to define precisely how and when it fails given concurrency as long as
those guarantees are provided.

merlin




Re: [HACKERS] Publish autovacuum informations

2014-12-29 Thread Guillaume Lelarge
2014-12-29 17:03 GMT+01:00 Tom Lane :

> Guillaume Lelarge  writes:
> > All in all, I want to get informations that are typically stored in
> shared
> > memory, handled by the autovacuum launcher and autovacuum workers. I
> first
> > thought I could get that by writing some C functions embedded in an
> > extension. But it doesn't seem to me I can access this part of the shared
> > memory from a C function. If I'm wrong, I'd love to get a pointer on how
> to
> > do this.
>
> > Otherwise, I wonder what would be more welcome: making the shared memory
> > structs handles available outside of the autovacuum processes (and then
> > build an extension to decode the informations I need), or adding
> functions
> > in core to get access to this information (in that case, no need for an
> > extension)?
>
> Either one of those approaches would cripple our freedom to change those
> data structures; which we've done repeatedly in the past and will surely
> want to do again.  So I'm pretty much -1 on exposing them.
>
>
I don't see how that's going to deny us the right to change any structs. If
they are in-core functions, we'll just have to update them.  If they are
extension functions, then the developer of those functions would simply
need to update his code.


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


Re: [HACKERS] Publish autovacuum informations

2014-12-29 Thread Tom Lane
Guillaume Lelarge  writes:
> All in all, I want to get informations that are typically stored in shared
> memory, handled by the autovacuum launcher and autovacuum workers. I first
> thought I could get that by writing some C functions embedded in an
> extension. But it doesn't seem to me I can access this part of the shared
> memory from a C function. If I'm wrong, I'd love to get a pointer on how to
> do this.

> Otherwise, I wonder what would be more welcome: making the shared memory
> structs handles available outside of the autovacuum processes (and then
> build an extension to decode the informations I need), or adding functions
> in core to get access to this information (in that case, no need for an
> extension)?

Either one of those approaches would cripple our freedom to change those
data structures; which we've done repeatedly in the past and will surely
want to do again.  So I'm pretty much -1 on exposing them.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Publish autovacuum informations

2014-12-29 Thread Guillaume Lelarge
Hey,

There are times when I would need more information about the autovacuum
processes.

I'd love to know what each worker is currently doing. I can get something
like this from the pg_stat_activity view, but it doesn't give me as much
information as the WorkerInfoData struct.
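
For illustration only, this is roughly the level of detail that is already
visible today -- a sketch against the pg_stat_activity view as of 9.4, where
autovacuum workers are expected to show up with a query of the form
'autovacuum: VACUUM schema.table':

-- what the statistics views already expose about running autovacuum workers
SELECT pid, datname, query_start, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';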

I'd also love to have more information on the contents of the tables list
(how many tables are still to be processed, which table comes next, what
kind of processing they'll get, etc. -- kinda what you have in the
autovac_table struct).

All in all, I want to get information that is typically stored in shared
memory and handled by the autovacuum launcher and autovacuum workers. I
first thought I could get that by writing some C functions embedded in an
extension, but it doesn't seem to me that I can access this part of shared
memory from a C function. If I'm wrong, I'd love to get a pointer on how to
do this.

Otherwise, I wonder what would be more welcome: making the shared memory
struct handles available outside of the autovacuum processes (and then
building an extension to decode the information I need), or adding functions
in core to get access to this information (in that case, no need for an
extension)?

Thanks.

Regards.


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Greg Stark
On Mon, Dec 29, 2014 at 3:31 PM, Merlin Moncure  wrote:
> In that case: we don't agree.  How come duplicate key errors would be
> reported as serialization failures but not RI errors (for example,
> inserting a record pointing to another record which a concurrent
> transaction deleted)?

The key question is not the type of error but whether the transaction
saw another state previously. That is, if you select to check for
duplicate keys, don't see any, and then try to insert and get a
duplicate key error, it would be easy to argue that that is a
serialization error. The same could be true for an RI check -- if you
select some data, then insert a record that refers to the data you
already verified existed, and get an RI failure, you could consider
that a serialization failure.
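
For illustration only, a minimal sketch of the duplicate key case just
described, assuming a table foo with a unique column k (names and values are
made up):

-- session A, at SERIALIZABLE
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM foo WHERE k = 42;   -- sees no matching row
-- (meanwhile, session B inserts k = 42 and commits)
INSERT INTO foo (k) VALUES (42);
-- today this raises SQLSTATE 23505 (unique_violation); the argument above
-- is that SQLSTATE 40001 (serialization_failure) would fit better, since A
-- acted on a snapshot in which the key did not exist
COMMIT;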


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 9:09 AM, Kevin Grittner  wrote:
> Merlin Moncure  wrote:
>> In other words, the current behavior is:
>> txn A,B begin
>> txn A inserts
>> txn B inserts over A, locks, waits
>> txn A commits.  B aborts with duplicate key error
>
> What I'm proposing is that for serializable transactions B would
> get a serialization failure; otherwise B would get a duplicate key
> error.  If the retry of B looks at something in the database to
> determine what its primary key should be, it will get a new value
> on the retry, since it will be starting after the commit of A.  If
> it is using a literal key, not based on something changed by A, it
> will get a duplicate key error on the retry, since it will be
> starting after the commit of A.
>
> It will either succeed on retry or get an error for a different
> reason.

In that case: we don't agree.  Why would duplicate key errors be
reported as serialization failures but not RI errors (for example,
inserting a record pointing to another record which a concurrent
transaction deleted)?

IMO, serialization errors are an implementation artifact and should
not mask well defined errors in SQL under any circumstances (or if
they must, those cases should absolutely be as narrow as possible).

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Serialization exception : Who else was involved?

2014-12-29 Thread Kevin Grittner
Craig Ringer  wrote:

> I don't see how that'd necessarily correctly identify the
> query/queries in the other tx that're involved, though.
>
> Perhaps I'm thinking in terms of more complicated serialization
> failures?

Yeah, it might be possible to provide useful information about
specific conflicting queries in some cases, but I'm skeptical that
it would be available in the majority of cases.  In most cases you
can probably identify one or two other transactions that are
involved.  In at least some cases you won't even have that.

For one fun case to think about, see this example where a read-only
transaction fails due to a conflict with two already-committed
transactions:

https://wiki.postgresql.org/wiki/SSI#Rollover

Also consider the case where there is a long-running transaction and SSI
falls back to SLRU summarization.

If we can find a way to provide some useful information in some
cases without harming performance, that's fine as long as nobody
expects that it will be available in all cases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Merlin Moncure  wrote:

> Well, I'm arguing that duplicate key errors are not serialization
> failures unless it's likely the insertion would succeed upon a retry;
> a proper insert, not an upsert.  If that's the case with what you're
> proposing, then it makes sense to me.  But that's not what it sounds
> like...your language suggests AIUI that having the error simply be
> caused by another transaction being concurrent would be sufficient to
> switch to a serialization error (feel free to correct me if I'm
> wrong!).
>
> In other words, the current behavior is:
> txn A,B begin
> txn A inserts
> txn B inserts over A, locks, waits
> txn A commits.  B aborts with duplicate key error
>
> Assuming that case is untouched, then we're good!  My long winded
> point above is that case must fail with duplicate key error; a
> serialization error is suggesting the transaction should be retried
> and it shouldn't be...it would simply fail a second time.

What I'm proposing is that for serializable transactions B would
get a serialization failure; otherwise B would get a duplicate key
error.  If the retry of B looks at something in the database to
determine what its primary key should be, it will get a new value
on the retry, since it will be starting after the commit of A.  If
it is using a literal key, not based on something changed by A, it
will get a duplicate key error on the retry, since it will be
starting after the commit of A.

It will either succeed on retry or get an error for a different
reason.
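
For illustration only, a sketch of those two retry outcomes, assuming a
table foo with a unique column k (names and values are made up):

-- Outcome 1: the key is derived from current data, so after A's commit the
-- retry picks a new value and succeeds.
INSERT INTO foo (k) SELECT coalesce(max(k), 0) + 1 FROM foo;

-- Outcome 2: the key is a literal that A also inserted, so the retry fails
-- again, this time with an ordinary duplicate key error (SQLSTATE 23505).
INSERT INTO foo (k) VALUES (42);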

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-12-29 Thread Ali Akbar
2014-12-29 14:38 GMT+07:00 Jeff Davis :

> Just jumping into this patch now. Do we think this is worth changing the
> signature of functions in array.h, which might be used from a lot of
> third-party code? We might want to provide new functions to avoid a
> breaking change.
>

The V6 patch from Tomas only changes the initArrayResult* functions.
initArrayResult is a new API in 9.5 (commit bac2739); the old API still
works as-is.

-- 
Ali Akbar


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 8:03 AM, Kevin Grittner  wrote:
> Merlin Moncure  wrote:
>> On Fri, Dec 26, 2014 at 12:38 PM, Kevin Grittner 
>> wrote:
>>> Tom Lane  wrote:
>>>
 Just for starters, a 40XXX error report will fail to provide the
 duplicated key's value.  This will be a functional regression,
>>>
>>> Not if, as is normally the case, the transaction is retried from
>>> the beginning on a serialization failure.  Either the code will
>>> check for a duplicate (as in the case of the OP on this thread) and
>>> they won't see the error, *or* the transaction which created
>>> the duplicate key will have committed before the start of the retry
>>> and you will get the duplicate key error.
>>
>> I'm not buying that; that argument assumes duplicate key errors are
>> always 'upsert' driven.  Although OP's code may have checked for
>> duplicates it's perfectly reasonable (and in many cases preferable) to
>> force the transaction to fail and report the error directly back to
>> the application.  The application will then switch on the error code
>> and decide what to do: retry for deadlock/serialization or abort for
>> data integrity error.  IOW, the error handling semantics are
>> fundamentally different and should not be mixed.
>
> I think you might be agreeing with me without realizing it.  Right
> now you get "duplicate key error" even if the duplication is caused
> by a concurrent transaction -- it is not possible to check the
> error code (well, SQLSTATE, technically) to determine whether this
> is fundamentally a serialization problem.  What we're talking about
> is returning the serialization failure return code for the cases
> where it is a concurrent transaction causing the failure and
> continuing to return the duplicate key error for all other cases.
>
> Either I'm not understanding what you wrote above, or you seem to
> be arguing for being able to distinguish between errors caused by
> concurrent transactions and those which aren't.

Well, I'm arguing that duplicate key errors are not serialization
failures unless it's likely the insertion would succeed upon a retry;
a proper insert, not an upsert.  If that's the case with what you're
proposing, then it makes sense to me.  But that's not what it sounds
like...your language suggests AIUI that having the error simply be
caused by another transaction being concurrent would be sufficient to
switch to a serialization error (feel free to correct me if I'm
wrong!).

In other words, the current behavior is:
txn A,B begin
txn A inserts
txn B inserts over A, locks, waits
txn A commits.  B aborts with duplicate key error

Assuming that case is untouched, then we're good!  My long-winded
point above is that that case must fail with a duplicate key error; a
serialization error suggests the transaction should be retried,
and it shouldn't be...it would simply fail a second time.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Merlin Moncure  wrote:
> On Fri, Dec 26, 2014 at 12:38 PM, Kevin Grittner  wrote:
>> Tom Lane  wrote:
>>
>>> Just for starters, a 40XXX error report will fail to provide the
>>> duplicated key's value.  This will be a functional regression,
>>
>> Not if, as is normally the case, the transaction is retried from
>> the beginning on a serialization failure.  Either the code will
>> check for a duplicate (as in the case of the OP on this thread) and
>> they won't see the error, *or* the transaction which created
>> the duplicate key will have committed before the start of the retry
>> and you will get the duplicate key error.
>
> I'm not buying that; that argument assumes duplicate key errors are
> always 'upsert' driven.  Although OP's code may have checked for
> duplicates it's perfectly reasonable (and in many cases preferable) to
> force the transaction to fail and report the error directly back to
> the application.  The application will then switch on the error code
> and decide what to do: retry for deadlock/serialization or abort for
> data integrity error.  IOW, the error handling semantics are
> fundamentally different and should not be mixed.

I think you might be agreeing with me without realizing it.  Right 
now you get "duplicate key error" even if the duplication is caused 
by a concurrent transaction -- it is not possible to check the 
error code (well, SQLSTATE, technically) to determine whether this 
is fundamentally a serialization problem.  What we're talking about 
is returning the serialization failure return code for the cases 
where it is a concurrent transaction causing the failure and 
continuing to return the duplicate key error for all other cases.

Either I'm not understanding what you wrote above, or you seem to
be arguing for being able to distinguish between errors caused by
concurrent transactions and those which aren't.
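
For illustration only, a sketch (in PL/pgSQL, against a made-up table foo
with a unique column k) of the kind of SQLSTATE-based dispatch that the
proposed change would enable.  Note that a real serialization-failure retry
has to re-issue the whole transaction from the client; this only shows how
the two conditions can be told apart:

DO $$
BEGIN
    INSERT INTO foo (k) VALUES (42);
EXCEPTION
    WHEN serialization_failure THEN
        -- SQLSTATE 40001: transient, caused by concurrency; retry the
        -- whole transaction from the application
        RAISE NOTICE 'serialization failure, retry';
    WHEN unique_violation THEN
        -- SQLSTATE 23505: a data integrity error; report it, do not retry
        RAISE NOTICE 'duplicate key, report to the user';
END;
$$;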


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-29 Thread Abhijit Menon-Sen
At 2014-12-29 13:22:28 +0100, and...@2ndquadrant.com wrote:
>
> How about pg_choose_crc_impl() or something?

Done.

> _sb8? Unless I miss something it's not slice by 8 but rather bytewise?

This is meant to apply on top of the earlier patches I posted to
implement slice-by-8. I'll attach both here.

> Should be marked inline.

Done (and the other one too).

> > +#ifdef __GNUC__
> > +   __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" 
> > (data));
> 
> Have you checked which version of gcc introduced named references to
> input/output parameters?

No. The documentation calls them "asmSymbolicName"s, but I can't find
the term (or various likely alternatives, e.g. symbolic_operand may be
related) either in the release notes, the changelog, or the source (on
Github). Does anyone know, or know how to find out?

Meanwhile, I have attached the two patches with the other modifications.

-- Abhijit
diff --git a/src/include/utils/pg_crc.h b/src/include/utils/pg_crc.h
index f871cba..55934e5 100644
--- a/src/include/utils/pg_crc.h
+++ b/src/include/utils/pg_crc.h
@@ -41,6 +41,8 @@
 
 typedef uint32 pg_crc32;
 
+extern pg_crc32 pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len);
+
 /*
  * CRC calculation using the CRC-32C (Castagnoli) polynomial.
  *
@@ -51,7 +53,7 @@ typedef uint32 pg_crc32;
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define FIN_CRC32C(crc)	((crc) ^= 0xFFFFFFFF)
 #define COMP_CRC32C(crc, data, len)	\
-	COMP_CRC32_NORMAL_TABLE(crc, data, len, pg_crc32c_table)
+	((crc) = pg_comp_crc32c((crc), (char *) (data), (len)))
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
 
 /*
@@ -115,7 +117,7 @@ do {			  \
 } while (0)
 
 /* Constant tables for CRC-32C and CRC-32 polynomials */
-extern CRCDLLIMPORT const uint32 pg_crc32c_table[];
+extern CRCDLLIMPORT const uint32 pg_crc32c_table[8][256];
 extern CRCDLLIMPORT const uint32 pg_crc32_table[];
 
 #endif   /* PG_CRC_H */
diff --git a/src/include/utils/pg_crc_tables.h b/src/include/utils/pg_crc_tables.h
index cb6b470..f814c91 100644
--- a/src/include/utils/pg_crc_tables.h
+++ b/src/include/utils/pg_crc_tables.h
@@ -28,71 +28,519 @@
  * This table is based on the so-called Castagnoli polynomial (the same
  * that is used e.g. in iSCSI).
  */
-const uint32 pg_crc32c_table[256] = {
-	0x, 0xF26B8303, 0xE13B70F7, 0x1350F3F4,
-	0xC79A971F, 0x35F1141C, 0x26A1E7E8, 0xD4CA64EB,
-	0x8AD958CF, 0x78B2DBCC, 0x6BE22838, 0x9989AB3B,
-	0x4D43CFD0, 0xBF284CD3, 0xAC78BF27, 0x5E133C24,
-	0x105EC76F, 0xE235446C, 0xF165B798, 0x030E349B,
-	0xD7C45070, 0x25AFD373, 0x36FF2087, 0xC494A384,
-	0x9A879FA0, 0x68EC1CA3, 0x7BBCEF57, 0x89D76C54,
-	0x5D1D08BF, 0xAF768BBC, 0xBC267848, 0x4E4DFB4B,
-	0x20BD8EDE, 0xD2D60DDD, 0xC186FE29, 0x33ED7D2A,
-	0xE72719C1, 0x154C9AC2, 0x061C6936, 0xF477EA35,
-	0xAA64D611, 0x580F5512, 0x4B5FA6E6, 0xB93425E5,
-	0x6DFE410E, 0x9F95C20D, 0x8CC531F9, 0x7EAEB2FA,
-	0x30E349B1, 0xC288CAB2, 0xD1D83946, 0x23B3BA45,
-	0xF779DEAE, 0x05125DAD, 0x1642AE59, 0xE4292D5A,
-	0xBA3A117E, 0x4851927D, 0x5B016189, 0xA96AE28A,
-	0x7DA08661, 0x8FCB0562, 0x9C9BF696, 0x6EF07595,
-	0x417B1DBC, 0xB3109EBF, 0xA0406D4B, 0x522BEE48,
-	0x86E18AA3, 0x748A09A0, 0x67DAFA54, 0x95B17957,
-	0xCBA24573, 0x39C9C670, 0x2A993584, 0xD8F2B687,
-	0x0C38D26C, 0xFE53516F, 0xED03A29B, 0x1F682198,
-	0x5125DAD3, 0xA34E59D0, 0xB01EAA24, 0x42752927,
-	0x96BF4DCC, 0x64D4CECF, 0x77843D3B, 0x85EFBE38,
-	0xDBFC821C, 0x2997011F, 0x3AC7F2EB, 0xC8AC71E8,
-	0x1C661503, 0xEE0D9600, 0xFD5D65F4, 0x0F36E6F7,
-	0x61C69362, 0x93AD1061, 0x80FDE395, 0x72966096,
-	0xA65C047D, 0x5437877E, 0x4767748A, 0xB50CF789,
-	0xEB1FCBAD, 0x197448AE, 0x0A24BB5A, 0xF84F3859,
-	0x2C855CB2, 0xDEEEDFB1, 0xCDBE2C45, 0x3FD5AF46,
-	0x7198540D, 0x83F3D70E, 0x90A324FA, 0x62C8A7F9,
-	0xB602C312, 0x44694011, 0x5739B3E5, 0xA55230E6,
-	0xFB410CC2, 0x092A8FC1, 0x1A7A7C35, 0xE811FF36,
-	0x3CDB9BDD, 0xCEB018DE, 0xDDE0EB2A, 0x2F8B6829,
-	0x82F63B78, 0x709DB87B, 0x63CD4B8F, 0x91A6C88C,
-	0x456CAC67, 0xB7072F64, 0xA457DC90, 0x563C5F93,
-	0x082F63B7, 0xFA44E0B4, 0xE9141340, 0x1B7F9043,
-	0xCFB5F4A8, 0x3DDE77AB, 0x2E8E845F, 0xDCE5075C,
-	0x92A8FC17, 0x60C37F14, 0x73938CE0, 0x81F80FE3,
-	0x55326B08, 0xA759E80B, 0xB4091BFF, 0x466298FC,
-	0x1871A4D8, 0xEA1A27DB, 0xF94AD42F, 0x0B21572C,
-	0xDFEB33C7, 0x2D80B0C4, 0x3ED04330, 0xCCBBC033,
-	0xA24BB5A6, 0x502036A5, 0x4370C551, 0xB11B4652,
-	0x65D122B9, 0x97BAA1BA, 0x84EA524E, 0x7681D14D,
-	0x2892ED69, 0xDAF96E6A, 0xC9A99D9E, 0x3BC21E9D,
-	0xEF087A76, 0x1D63F975, 0x0E330A81, 0xFC588982,
-	0xB21572C9, 0x407EF1CA, 0x532E023E, 0xA145813D,
-	0x758FE5D6, 0x87E466D5, 0x94B49521, 0x66DF1622,
-	0x38CC2A06, 0xCAA7A905, 0xD9F75AF1, 0x2B9CD9F2,
-	0xFF56BD19, 0x0D3D3E1A, 0x1E6DCDEE, 0xEC064EED,
-	0xC38D26C4, 0x31E6A5C7, 0x22B65633, 0xD0DDD530,
-	0x0417B1DB, 0xF67C32D8, 0xE52CC12C, 0x1747422F,
-	0x49547E0B, 0xBB3FFD08, 0xA86F0EFC, 0x5A048DFF,
-	0x8ECEE914, 0x7CA56A17, 0x6FF599E3, 0x9D9E1AE0,
-	0xD3D3E1AB, 0x21B862A8, 0x32E8915C, 0xC083125F,
-	0x144976B4, 0xE622F5B7, 0xF5720643, 0x07198540,

Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-29 Thread Andres Freund
Hi,

On 2014-12-25 11:57:29 +0530, Abhijit Menon-Sen wrote:
> -extern pg_crc32 pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len);
> +extern void pg_init_comp_crc32c(void);

How about pg_choose_crc_impl() or something?

> +extern pg_crc32 (*pg_comp_crc32c)(pg_crc32 crc, const void *data, size_t 
> len);
>  
>  /*
>   * CRC calculation using the CRC-32C (Castagnoli) polynomial.
> 
> diff --git a/src/port/pg_crc.c b/src/port/pg_crc.c
> index 2f9857b..6be17b0 100644
> --- a/src/port/pg_crc.c
> +++ b/src/port/pg_crc.c
> @@ -21,6 +21,13 @@
>  #include "utils/pg_crc.h"
>  #include "utils/pg_crc_tables.h"
>  
> +#if defined(HAVE_CPUID_H)
> +#include 
> +#elif defined(_MSC_VER)
> +#include 
> +#include 
> +#endif
> +
>  static inline uint32 bswap32(uint32 x)
>  {
>  #if defined(__GNUC__) || defined(__clang__)
> @@ -39,8 +46,8 @@ static inline uint32 bswap32(uint32 x)
>  #define cpu_to_le32(x) x
>  #endif
>  
> -pg_crc32
> -pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len)
> +static pg_crc32
> +pg_comp_crc32c_sb8(pg_crc32 crc, const void *data, size_t len)

_sb8? Unless I miss something it's not slice by 8 but rather bytewise?

>  {
>   const unsigned char *p = data;
>   const uint32 *p8;
> @@ -61,7 +68,6 @@ pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len)
>*/
>  
>   p8 = (const uint32 *) p;
> -
>   while (len >= 8)
>   {
>   uint32 a = *p8++ ^ cpu_to_le32(crc);
> @@ -101,8 +107,102 @@ pg_comp_crc32c(pg_crc32 crc, const void *data, size_t 
> len)
>*/
>  
>   p = (const unsigned char *) p8;
> - while (len-- > 0)
> + while (len > 0)
> + {
>   crc = pg_crc32c_table[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);
> + len--;
> + }
> +
> + return crc;
> +}
>  
> +static pg_crc32
> +pg_asm_crc32b(pg_crc32 crc, unsigned char data)
> +{

Should be marked inline.

> +#ifdef __GNUC__
> + __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" 
> (data));

Have you checked which version of gcc introduced named references to
input/output parameters?

>   return crc;
> +#elif defined(_MSC_VER)
> + return _mm_crc32_u8(crc, data);
> +#else
> +#error "Don't know how to generate crc32b instruction"
> +#endif
>  }
> +
> +static pg_crc32
> +pg_asm_crc32q(uint64 crc, unsigned long long data)
> +{

inline.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Andres Freund
On 2014-12-29 12:50:23 +0200, Heikki Linnakangas wrote:
> On 12/29/2014 12:39 PM, Petr Jelinek wrote:
> >On 29/12/14 11:16, Andres Freund wrote:
> >>On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote:
> >>>To be honest, I think this patch should be reverted. Instead, we should
> >>>design a system where extensions can define their own SLRUs to store
> >>>additional per-transaction information. That way, each extension can have 
> >>>as
> >>>much space per transaction as needed, and support functions that make most
> >>>sense with the information. Commit timestamp tracking would be one such
> >>>extension, and for this node ID stuff, you could have another one (or
> >>>include it in the replication extension).
> >>
> >>If somebody wants that they should develop it. But given that we, based
> >>on previous discussions, don't want to run user defined code in the
> >>relevant phase during transaction commit *and* replay I don't think it'd
> >>be all that easy to do it fast and flexible.
> >
> >Right, I would love to have custom SLRUs but I don't see it happening
> >given those two restrictions, otherwise I would write the CommitTs patch
> >that way in the first place...
> 
> Transaction commit and replay can treat the per-transaction information as
> an opaque blob. It just needs to be included in the commit record, and
> replay needs to write it to the SLRU. That way you don't need to run any
> user-defined code in those phases.

Meh. Only if you want to duplicate the timestamps from the commits.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-12-29 Thread Abhijit Menon-Sen
Hi.

I've changed pgaudit to work as you suggested.

A quick note on the implementation: pgaudit was already installing an
ExecutorCheckPerms_hook anyway; I adapted code from ExecRTECheckPerms
to check if the audit role has been granted any of the permissions
required for the operation.

This means there are three ways to configure auditing:

1. GRANT … ON … TO audit, which logs any operations that correspond to
   the granted permissions.

2. Set pgaudit.roles = 'r1, r2, …', which logs everything done by r1,
   r2, and any of their descendants.

3. Set pgaudit.log = 'read, write, …', which logs any events in any of
   the listed classes.

(This is a small change from the earlier behaviour where, if a role was
listed in .roles, it was still subject to the .log setting. I find that
more useful in practice, but since we're discussing Stephen's proposal,
I implemented what he said.)
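
For illustration only, the three ways above might be set up like this (the
role, table, and setting values are made up; the pgaudit.* settings would go
in postgresql.conf):

-- 1. Log reads and writes on one table by granting those rights to the
--    hardcoded "audit" role:
GRANT SELECT, INSERT, UPDATE ON accounts TO audit;

-- 2. Log everything done by specific roles:
--    pgaudit.roles = 'app_user, batch_user'

-- 3. Log whole classes of events:
--    pgaudit.log = 'read, write'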

The new pgaudit.c is attached here for review. Nothing else has changed
from the earlier submission, and everything is in the GitHub repository
(github.com/2ndQuadrant/pgaudit).

-- Abhijit
/*
 * pgaudit/pgaudit.c
 *
 * Copyright © 2014, PostgreSQL Global Development Group
 *
 * Permission to use, copy, modify, and distribute this software and
 * its documentation for any purpose, without fee, and without a
 * written agreement is hereby granted, provided that the above
 * copyright notice and this paragraph and the following two
 * paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
 * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
 * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
 * OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
 * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
 * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 */

#include "postgres.h"

#include "access/htup_details.h"
#include "access/sysattr.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_class.h"
#include "commands/dbcommands.h"
#include "catalog/pg_proc.h"
#include "commands/event_trigger.h"
#include "executor/executor.h"
#include "executor/spi.h"
#include "miscadmin.h"
#include "libpq/auth.h"
#include "nodes/nodes.h"
#include "tcop/utility.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/syscache.h"
#include "utils/timestamp.h"

PG_MODULE_MAGIC;

void _PG_init(void);

Datum pgaudit_func_ddl_command_end(PG_FUNCTION_ARGS);
Datum pgaudit_func_sql_drop(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(pgaudit_func_ddl_command_end);
PG_FUNCTION_INFO_V1(pgaudit_func_sql_drop);

/*
 * pgaudit_roles_str is the string value of the pgaudit.roles
 * configuration variable, which is a list of role names.
 */

char *pgaudit_roles_str = NULL;

/*
 * pgaudit_log_str is the string value of the pgaudit.log configuration
 * variable, e.g. "read, write, user". Each token corresponds to a flag
 * in enum LogClass below. We convert the list of tokens into a bitmap
 * in pgaudit_log for internal use.
 */

char *pgaudit_log_str = NULL;
static uint64 pgaudit_log = 0;

enum LogClass {
	LOG_NONE = 0,

	/* SELECT */
	LOG_READ = (1 << 0),

	/* INSERT, UPDATE, DELETE, TRUNCATE */
	LOG_WRITE = (1 << 1),

	/* GRANT, REVOKE, ALTER … */
	LOG_PRIVILEGE = (1 << 2),

	/* CREATE/DROP/ALTER ROLE */
	LOG_USER = (1 << 3),

	/* DDL: CREATE/DROP/ALTER */
	LOG_DEFINITION = (1 << 4),

	/* DDL: CREATE OPERATOR etc. */
	LOG_CONFIG = (1 << 5),

	/* VACUUM, REINDEX, ANALYZE */
	LOG_ADMIN = (1 << 6),

	/* Function execution */
	LOG_FUNCTION = (1 << 7),

	/* Absolutely everything; not available via pgaudit.log */
	LOG_ALL = ~(uint64)0
};

/*
 * This module collects AuditEvents from various sources (event
 * triggers, and executor/utility hooks) and passes them to the
 * log_audit_event() function.
 *
 * An AuditEvent represents an operation that potentially affects a
 * single object. If an underlying command affects multiple objects,
 * multiple AuditEvents must be created to represent it.
 */

typedef struct {
	NodeTag type;
	const char *object_id;
	const char *object_type;
	const char *command_tag;
	const char *command_text;
	bool granted;
} AuditEvent;

/*
 * Returns the oid of the hardcoded "audit" role.
 */

static Oid
audit_role_oid()
{
	HeapTuple roleTup;
	Oid oid = InvalidOid;

	roleTup = SearchSysCache1(AUTHNAME, PointerGetDatum("audit"));
	if (HeapTupleIsValid(roleTup)) {
		oid = HeapTupleGetOid(roleTup);
		ReleaseSysCache(roleTup);
	}

	return oid;
}

/* Returns true if either pgaudit.roles or pgaudit.log is set. */

static inline bool

Re: [HACKERS] The return value of allocate_recordbuf()

2014-12-29 Thread Heikki Linnakangas

On 12/26/2014 09:31 AM, Fujii Masao wrote:

Hi,

While reviewing the FPW compression patch, I found that allocate_recordbuf()
always returns TRUE even though its source code comment says that FALSE is
returned if out of memory. Its return value is checked in two places, but
those checks are clearly useless.

allocate_recordbuf() was introduced by 7fcbf6a, and then changed by
2c03216 so that palloc() is used instead of malloc() and FALSE is never
returned even if out of memory. So this seems to be an oversight in 2c03216.
Maybe we should change it so that it checks whether the requested size can
be allocated before actually allocating the memory?


Hmm. There is no way to check beforehand if a palloc() will fail because 
of OOM. We could check for MaxAllocSize, though.


Actually, before 2c03216, when we used malloc() here, the maximum record 
size was 4GB. Now it's only 1GB, because of MaxAllocSize. Are we OK with 
that, or should we use palloc_huge?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Heikki Linnakangas

On 12/29/2014 12:39 PM, Petr Jelinek wrote:

On 29/12/14 11:16, Andres Freund wrote:

On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote:

To be honest, I think this patch should be reverted. Instead, we should
design a system where extensions can define their own SLRUs to store
additional per-transaction information. That way, each extension can have as
much space per transaction as needed, and support functions that make most
sense with the information. Commit timestamp tracking would be one such
extension, and for this node ID stuff, you could have another one (or
include it in the replication extension).


If somebody wants that they should develop it. But given that we, based
on previous discussions, don't want to run user defined code in the
relevant phase during transaction commit *and* replay I don't think it'd
be all that easy to do it fast and flexible.


Right, I would love to have custom SLRUs but I don't see it happening
given those two restrictions, otherwise I would write the CommitTs patch
that way in the first place...


Transaction commit and replay can treat the per-transaction information 
as an opaque blob. It just needs to be included in the commit record, 
and replay needs to write it to the SLRU. That way you don't need to run 
any user-defined code in those phases.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Petr Jelinek

On 29/12/14 11:16, Andres Freund wrote:

On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote:

That's a little bit better, but I have to say I'm still not impressed. There
are so many implicit assumptions in the system. The first assumption is that
a 32-bit node id is sufficient.


Seriously? Are we going to build facilities for replication systems with
that many nodes? It seems absolutely unrealistic that a) somebody does
this b) that we'll blindly meet the demands of such a super hypothetical
scenario.



+1. Honestly, if somebody ever has a setup with more nodes than fit into
32 bits, they will run into bigger problems than the node id being too
small.



Then there's the assumption that the node id should be "sticky",
i.e. it's set for the whole session. Again, I'm sure that's useful for
many systems, but I could just as easily imagine that you'd want it to
reset after every commit.


It's trivial to add that/reset it manually. So what?


Yes you can reset in the extension after commit, or you can actually 
override both commit timestamp and nodeid after commit if you so wish.





To be honest, I think this patch should be reverted. Instead, we should
design a system where extensions can define their own SLRUs to store
additional per-transaction information. That way, each extension can have as
much space per transaction as needed, and support functions that make most
sense with the information. Commit timestamp tracking would be one such
extension, and for this node ID stuff, you could have another one (or
include it in the replication extension).


If somebody wants that they should develop it. But given that we, based
on previous discussions, don't want to run user defined code in the
relevant phase during transaction commit *and* replay I don't think it'd
be all that easy to do it fast and flexible.



Right, I would love to have custom SLRUs but I don't see it happening 
given those two restrictions, otherwise I would write the CommitTs patch 
that way in the first place...



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Andres Freund
On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote:
> That's a little bit better, but I have to say I'm still not impressed. There
> are so many implicit assumptions in the system. The first assumption is that
> a 32-bit node id is sufficient.

Seriously? Are we going to build facilities for replication systems with
that many nodes? It seems absolutely unrealistic that a) somebody does
this b) that we'll blindly meet the demands of such a super hypothetical
scenario.

> Then there's the assumption that the node id should be "sticky",
> i.e. it's set for the whole session. Again, I'm sure that's useful for
> many systems, but I could just as easily imagine that you'd want it to
> reset after every commit.

It's trivial to add that/reset it manually. So what?

> To be honest, I think this patch should be reverted. Instead, we should
> design a system where extensions can define their own SLRUs to store
> additional per-transaction information. That way, each extension can have as
> much space per transaction as needed, and support functions that make most
> sense with the information. Commit timestamp tracking would be one such
> extension, and for this node ID stuff, you could have another one (or
> include it in the replication extension).

If somebody wants that they should develop it. But given that we, based
on previous discussions, don't want to run user defined code in the
relevant phase during transaction commit *and* replay I don't think it'd
be all that easy to do it fast and flexible.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Heikki Linnakangas

On 12/19/2014 11:30 AM, Petr Jelinek wrote:

as promised I am sending code-comment patch that explains the use of
commit timestamps + nodeid C api for the conflict resolution, comments
welcome.


That's a little bit better, but I have to say I'm still not impressed. 
There are so many implicit assumptions in the system. The first 
assumption is that a 32-bit node id is sufficient. I'm sure it is for 
many replication systems, but might not be for all. Then there's the 
assumption that the node id should be "sticky", i.e. it's set for the 
whole session. Again, I'm sure that's useful for many systems, but I 
could just as easily imagine that you'd want it to reset after every commit.


To be honest, I think this patch should be reverted. Instead, we should 
design a system where extensions can define their own SLRUs to store 
additional per-transaction information. That way, each extension can 
have as much space per transaction as needed, and support functions that 
make most sense with the information. Commit timestamp tracking would be 
one such extension, and for this node ID stuff, you could have another 
one (or include it in the replication extension).


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers