Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-05 Thread Peter Geoghegan
On Tue, Feb 4, 2014 at 11:56 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Since, as I mentioned, _bt_finish_split() ultimately unlocks *and
 unpins*, it may not be the same buffer as before, so even with the
 refactoring there are race conditions.

 Care to elaborate? Or are you just referring to the missing buf =  ?

Yes, that's all I meant.

 Attached is a new version of the patch, with those issues fixed.
 btree-incomplete-split-4.patch is a complete patch against the latest
 fix-btree-page-deletion patch, and moveright-assign-fix.patch is just the
 changes to _bt_moveright, if you want to review just the changes since the
 previous patch I posted.

Cool. I'll take a good look at it tomorrow morning PST.


-- 
Peter Geoghegan




Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Andres Freund
On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
 On Mon, Feb 3, 2014 at 3:38 PM, Andres Freund and...@2ndquadrant.com wrote:
   A quick hack (attached) making BufferDescriptor 64byte aligned indeed
   restored performance across all max_connections settings. It's not
   surprising that a misaligned buffer descriptor causes problems -
   there'll be plenty of false sharing of the spinlocks otherwise. Curious
   that the intel machine isn't hurt much by this.
 
  What fiddling are you thinking of?
 
  Basically always doing a TYPEALIGN(CACHELINE_SIZE, addr) before
  returning from ShmemAlloc() (and thereby ShmemInitStruct).
 
 There is something you have not drawn explicit attention to that is
 very interesting. If we take REL9_3_STABLE tip to be representative
 (built with full -O2 optimization, no assertions just debugging
 symbols), setting max_connections to 91 from 90 does not have the
 effect of making the BufferDescriptors array aligned; it has the
 effect of making it *misaligned*. You reported that 91 was much better
 than 90. I think that the problem actually occurs when the array *is*
 aligned!

I don't think you can learn much from the alignment in 9.3
vs. HEAD. A lot has changed since, most prominently and recently
Robert's LWLock work. That certainly has changed allocation patterns.
It will also depend on some other parameters, e.g. changing
max_wal_senders or max_background_workers will also change the
offset. It's not that 91 is intrinsically better, it just happened to
give an aligned BufferDescriptors array when the other parameters weren't
changed at the same time.
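
For reference, the kind of cacheline rounding being discussed (a TYPEALIGN(CACHELINE_SIZE, addr) applied to what ShmemAlloc() hands out) might look like the following minimal sketch; the 64-byte constant, macro, and function names are illustrative assumptions, not the actual change:

#include <stddef.h>
#include <stdint.h>

/* Assumed cacheline size for this sketch; real code would detect or configure it. */
#define CACHELINE_SIZE	64

/* Round a size or address up to the next cacheline boundary (TYPEALIGN-style). */
#define CACHELINE_ALIGN(LEN) \
	(((uintptr_t) (LEN) + (CACHELINE_SIZE - 1)) & ~((uintptr_t) (CACHELINE_SIZE - 1)))

/*
 * Hypothetical allocator wrapper: hand out a cacheline-aligned chunk from a
 * bump allocator and keep the free pointer aligned for the next request.
 */
static char *
alloc_cacheline_aligned(char **freep, size_t size)
{
	char	   *result = (char *) CACHELINE_ALIGN(*freep);

	*freep = result + CACHELINE_ALIGN(size);
	return result;
}

With an aligned start and a 64-byte stride, consecutive descriptor-sized entries never straddle cachelines.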

 I suspect that the scenario described in this article accounts for the
 quite noticeable effect reported: http://danluu.com/3c-conflict

I don't think that's applicable here. What's described there is relevant
for access patterns that are a larger multiple of the cacheline size - but
ours is exactly cacheline-sized. What can happen in such scenarios is
that all your accesses map to the same set of cachelines, so instead of
using most of the cache, you end up using only 8 or so lines (8 ways is a
common associativity for set-associative caches these days).
Theoretically we could see something like that for shared_buffers
itself, but I *think* our accesses are too far spread around in them for
that to be a significant issue.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] CacheInvalidateRelcache in btree is a crummy idea

2014-02-05 Thread Heikki Linnakangas

On 02/02/2014 11:45 PM, Tom Lane wrote:

So I'm thinking my commit d2896a9ed, which introduced this mechanism,
was poorly thought out and we should just remove the relcache invals
as per the attached patch.  Letting _bt_getroot() update the cached
metapage at next use should be a lot cheaper than a full relcache
rebuild for the index.


Looks good to me.

- Heikki




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Heikki Linnakangas

On 02/04/2014 11:58 PM, Andres Freund wrote:

On February 4, 2014 10:50:10 PM CET, Peter Geoghegan p...@heroku.com wrote:

On Tue, Feb 4, 2014 at 11:11 AM, Andres Freund and...@2ndquadrant.com
wrote:

Does this feature relate to compression of WAL page images at all?


No.


So the obvious question is: where, if anywhere, do the two efforts
(this patch, and Fujii's patch) overlap? Does Fujii have any concerns
about this patch as it relates to his effort to compress FPIs?


I think there's zero overlap. They're completely complementary features. It's 
not like normal WAL records have an irrelevant volume.


Correct. Compressing a full-page image happens on the first update after 
a checkpoint, and the diff between old and new tuple is not used in that 
case.


Compressing full page images makes a difference if you're doing random 
updates across a large table, so that you only update each buffer 1-2 
times. This patch will have no effect in that case. And when you update 
the same page many times between checkpoints, the full-page image is 
insignificant, and this patch has a big effect.


- Heikki




Re: [HACKERS] GIN improvements part2: fast scan

2014-02-05 Thread Alexander Korotkov
On Wed, Feb 5, 2014 at 1:23 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Mon, Feb 3, 2014 at 6:31 PM, Alexander Korotkov 
 aekorot...@gmail.com wrote:

 On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov aekorot...@gmail.com
  wrote:

 On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov 
 aekorot...@gmail.com wrote:

 Every single change you did in fast scan seems to be reasonable, but
 testing shows that something went wrong. Simple test with 3 words of
 different selectivities.

 After applying your patches:

 # select count(*) from fts_test where fti @@ plainto_tsquery('english',
 'gin index select');
  count
 ───
627
 (1 row)

 Time: 21,252 ms

 In original fast-scan:

 # select count(*) from fts_test where fti @@ plainto_tsquery('english',
 'gin index select');
  count
 ───
627
 (1 row)

 Time: 3,382 ms

 I'm trying to get deeper into it.


 I had two guesses about why it's become so much slower than my original
 fast-scan:
 1) Not using the native consistent function
 2) Not sorting entries

 I attach two patches which roll back these two features (sorry for the awful
 quality of the second). The native consistent function accelerates things
 significantly, as expected. It seems that sorting entries has almost no
 effect. However, it's still not as fast as the initial fast-scan:

 # select count(*) from fts_test where fti @@ plainto_tsquery('english',
 'gin index select');
  count
 ───
627
 (1 row)

 Time: 5,381 ms

 Tomas, could you rerun your tests with the first patch and with both of these patches
 applied on top of Heikki's patches?


 I found my patch 0005-Ternary-consistent-implementation.patch to be
 completely wrong. It introduces a ternary consistent function to the opclass, but
 doesn't use it, because I forgot to include the ginlogic.c change in the patch.
 So, it shouldn't have any impact on performance. However, the testing results
 with that patch differ significantly. That makes me very uneasy. Can we
 now reproduce the exact same results?

 The right version of these two patches, combined into one against current head, is
 attached. I've rerun the tests with it; the results are in
 /mnt/sas-raid10/gin-testing/queries/9.4-fast-scan-10. Could you rerun
 the postprocessing, including graph drawing?

 Sometimes test cases are not what we expect. For example:

 =# explain SELECT id FROM messages WHERE body_tsvector @@
 to_tsquery('english','(5alpha1-initdb''d)');
QUERY PLAN


 ─
  Bitmap Heap Scan on messages  (cost=84.00..88.01 rows=1 width=4)
   Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' &
 ''initdb'' & ''d'''::tsquery)
   ->  Bitmap Index Scan on messages_body_tsvector_idx  (cost=0.00..84.00
 rows=1 width=0)
         Index Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' &
 ''initdb'' & ''d'''::tsquery)
  Planning time: 0.257 ms
 (5 rows)

 5alpha1-initdb'd is 3 gin entries with different frequencies.

 Also, these patches are not intended to change relevance ordering speed.
 When the number of results is high, most of the time goes to relevance calculation and
 sorting. I propose removing the ORDER BY clause from the test cases to see the scan
 speed more clearly.

 I have a dump of postgresql.org search queries from Magnus. We can add them
 to our test cases.


  It turns out that version 10 contained a bug in the ternary consistent function
 implementation for tsvector. Fixed in the attached version.


The attached patch is a light version of fast scan. It does extra consistent
function calls only in startScanKey, with no extra calls during the scan of the
index.
It finds the subset of rarest entries whose absence guarantees a false
consistent function result.
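
A rough sketch of that idea, purely as an illustration (the callback shape, the names, and the assumption that entries are already ordered from rarest to most frequent are all mine, not the patch's actual GIN interfaces): probe the consistent function with the i rarest entries marked absent and everything else present, and stop at the first i where the result is false.

#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical callback: does the query still match, given which entries are present? */
typedef bool (*consistent_fn) (const bool *entry_present, int nentries, void *arg);

/*
 * Entries are assumed to be ordered from rarest to most frequent.  Returns how
 * many of the rarest entries form a "required" set: if all of them are absent,
 * the consistent function is false no matter what the remaining entries say.
 */
static int
choose_required_entries(int nentries, consistent_fn consistent, void *arg)
{
	bool	   *present = malloc(nentries * sizeof(bool));
	int			nrequired = nentries;	/* fallback: treat every entry as required */

	for (int i = 0; i < nentries; i++)
	{
		/* rarest i+1 entries absent, every other entry present */
		for (int j = 0; j < nentries; j++)
			present[j] = (j > i);

		if (!consistent(present, nentries, arg))
		{
			nrequired = i + 1;
			break;
		}
	}

	free(present);
	return nrequired;
}

During the scan, an item that lacks all of the required entries can then be skipped without calling the consistent function at all.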

I've run real-life tests based on postgresql.org logs (33318 queries). Here
is a table with the total time of running the whole test case.

=# select method, sum(time) from test_result group by method order by
method;
       method        |     sum
---------------------+--------------
 fast-scan-11        | 126916.11199
 fast-scan-light     |   137321.211
 heikki              |   466751.693
 heikki-true-ternary | 454113.62397
 master              |   446976.288
(6 rows)

where 'heikki' is gin-ternary-logic binary-heap
preconsistent-only-on-new-page.patch and 'heikki-true-ternary' is version
with my catalog changes promoting ternary consistent function to opclass.

We can see that fast-scan-light gives almost the same performance benefit on
these queries. However, I tested only the CPU effect, not the I/O effect.
Any thoughts?

--
With best regards,
Alexander Korotkov.




gin-fast-scan-light.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Fix picksplit with nan values

2014-02-05 Thread Alexander Korotkov
On Sat, Feb 1, 2014 at 7:50 AM, Bruce Momjian br...@momjian.us wrote:


 Where are we on this?


I found that I have an empty draft letter from November with a new version of
the patch attached. I'll return to this once we have some solution to the GIN
fast scan challenge.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Heikki Linnakangas

On 02/05/2014 07:54 AM, Amit Kapila wrote:

On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas robertmh...@gmail.com wrote:

On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com wrote:

Now there is approximately 1.4~5% CPU gain for
hundred tiny fields, half nulled case



Assuming that the logic isn't buggy, a point in need of further study,
I'm starting to feel like we want to have this.  And I might even be
tempted to remove the table-level off switch.


I have tried to stress on worst case more, as you are thinking to
remove table-level switch and found that even if we increase the
data by approx. 8 times (ten long fields, all changed, each field contains
80 byte data), the CPU overhead is still < 5% which clearly shows that
the overhead doesn't increase much even if the length of unmatched data
is increased by much larger factor.
So the data for worst case adds more weight to your statement
(remove table-level switch), however there is no harm in keeping
table-level option with default as 'true' and if some users are really sure
the updates in their system will have nothing in common, then they can
make this new option as 'false'.

Below is data for the new case "ten long fields, all changed" added
in attached script file:


That's not the worst case, by far.

First, note that the skipping while scanning new tuple is only performed 
in the first loop. That means that as soon as you have a single match, 
you fall back to hashing every byte. So for the worst case, put one 
4-byte field as the first column, and don't update it.


Also, I suspect the runtimes in your test were dominated by I/O. When I 
scale down the number of rows involved so that the whole test fits in 
RAM, I get much bigger differences with and without the patch. You might 
also want to turn off full_page_writes, to make the effect clear with 
less data.


So, I came up with the attached worst case test, modified from your 
latest test suite.


unpatched:


   testname   | wal_generated | duration
--+---+--
 ten long fields, all but one changed | 343385312 | 2.20806908607483
 ten long fields, all but one changed | 336263592 | 2.18997097015381
 ten long fields, all but one changed | 336264504 | 2.17843413352966
(3 rows)


pgrb_delta_encoding_v8.patch:

   testname   | wal_generated | duration
--+---+--
 ten long fields, all but one changed | 338356944 | 3.33501315116882
 ten long fields, all but one changed | 344059272 | 3.37364101409912
 ten long fields, all but one changed | 336257840 | 3.36244201660156
(3 rows)

So with this test, the overhead is very significant.

With the skipping logic, another kind of worst case is that you
have a lot of similarity between the old and new tuple, but you miss it 
because you skip. For example, if you change the first few columns, but 
leave a large text column at the end of the tuple unchanged.


- Heikki


wal-update-testsuite.sh
Description: Bourne shell script



Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Heikki Linnakangas

On 01/30/2014 08:53 AM, Amit Kapila wrote:

On Wed, Jan 29, 2014 at 8:13 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

On 01/29/2014 02:21 PM, Amit Kapila wrote:

The main reason to process in chunks as much as possible is to save
CPU cycles. For example, if we build the hash table byte-by-byte, then even
in the best case where most of the tuple has a match, there will be noticeable
overhead due to the formation of the hash table.


Hmm. One very simple optimization we could do is to just compare the two
strings byte by byte, before doing anything else, to find any common prefix
they might have. Then output a tag for the common prefix, and run the normal
algorithm on the rest of the strings. In many real-world tables, the 1-2
first columns are a key that never changes, so that might work pretty well
in practice. Maybe it would also be worthwhile to do the same for any common
suffix the tuples might have.


Is it possible to do this for both prefix and suffix together? Basically,
the question I
have in mind is: what will be the deciding factor for switching from the hash table
mechanism to string comparison mode for the suffix? Do we switch when we find
a long enough match?


I think you got it backwards. You don't switch from hash table mechanism 
to string comparison. You do the prefix/suffix comparison *first*, and 
run the hash table algorithm only on the middle part, between the 
common prefix and suffix.



Can we do this optimization after the basic version is acceptable?


I would actually suggest doing that first. Perhaps even ditch the whole 
history table approach and do *only* the scan for prefix and suffix. 
That's very cheap, and already covers a large fraction of UPDATEs that 
real applications do. In particular, it's optimal for the case that you 
update only a single column, something like UPDATE foo SET bar = bar + 1.


I'm pretty sure the overhead of that would be negligible, so we could 
always enable it. There are certainly a lot of scenarios where 
prefix/suffix detection alone wouldn't help, but so what.
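
As a rough illustration of that prefix/suffix idea (a minimal sketch with made-up names, not the attached patch's code), the detection itself is just two byte-wise loops over the old and new tuple images:

#include <stddef.h>

/*
 * Find how many leading and trailing bytes two tuple images have in common.
 * Any more expensive delta algorithm (or none at all) then only has to look
 * at the middle part between the common prefix and suffix.
 */
static void
find_common_prefix_suffix(const char *old_data, size_t old_len,
						  const char *new_data, size_t new_len,
						  size_t *prefix_len, size_t *suffix_len)
{
	size_t		maxcommon = (old_len < new_len) ? old_len : new_len;
	size_t		prefix = 0;
	size_t		suffix = 0;

	/* common prefix: scan forward until the first mismatching byte */
	while (prefix < maxcommon && old_data[prefix] == new_data[prefix])
		prefix++;

	/* common suffix: scan backward, never overlapping the prefix already claimed */
	while (suffix < maxcommon - prefix &&
		   old_data[old_len - 1 - suffix] == new_data[new_len - 1 - suffix])
		suffix++;

	*prefix_len = prefix;
	*suffix_len = suffix;
}

The WAL record then only needs to carry the two lengths plus the middle part of the new tuple.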


Attached is a quick patch for that, if you want to test it.

- Heikki
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e0b8a4e..c4ac2bd 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1014,6 +1014,22 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>wal_compress_update</> (<type>boolean</>)</term>
+    <listitem>
+     <para>
+      Enables or disables the WAL tuple compression for <command>UPDATE</>
+      on this table.  Default value of this option is false to maintain
+      backward compatibility for the command. If true, all the update
+      operations on this table which will place the new tuple on the same page
+      as its original tuple will compress the WAL for the new tuple and
+      subsequently reduce the WAL volume.  It is recommended to enable
+      this option for tables where <command>UPDATE</> changes less than
+      50 percent of the tuple data.
+     </para>
+    </listitem>
+   </varlistentry>
+
    </variablelist>
 
   </refsect2>
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index aea9d40..3bf5728 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,6 +60,7 @@
 #include "access/sysattr.h"
 #include "access/tuptoaster.h"
 #include "executor/tuptable.h"
+#include "utils/pg_rbcompress.h"
 
 
 /* Does att's datatype allow packing into the 1-byte-header varlena format? */
@@ -617,6 +618,44 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
 	memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
 }
 
+/* ----------------
+ * heap_delta_encode
+ *
+ *		Calculate the delta between two tuples and generate
+ *  encoded wal tuple (EWT), using pgrb. The result is stored
+ *  in *encdata.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+				  char *encdata, uint32 *enclen)
+{
+	return pgrb_delta_encode(
+		(char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+		(char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+		encdata, enclen, NULL
+		);
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ *		Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+	pgrb_delta_decode(encdata, enclen,
+			 (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+			 MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+			 &newtup->t_len,
+			 (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+			 oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
 /*
  * heap_form_tuple
  *		construct a tuple from the 

Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Amit Kapila
On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 01/30/2014 08:53 AM, Amit Kapila wrote:

 Is it possible to do for both prefix and suffix together, basically
 the question I
 have in mind is what will be deciding factor for switching from hash table
 mechanism to string comparison mode for suffix. Do we switch when we find
 long enough match?


 I think you got it backwards. You don't switch from hash table mechanism to
 string comparison. You do the prefix/suffix comparison *first*, and run the
 hash table algorithm only on the middle part, between the common prefix
 and suffix.


 Can we do this optimization after the basic version is acceptable?


 I would actually suggest doing that first. Perhaps even ditch the whole
 history table approach and do *only* the scan for prefix and suffix. That's
 very cheap, and already covers a large fraction of UPDATEs that real
 applications do. In particular, it's optimal for the case that you update
 only a single column, something like UPDATE foo SET bar = bar + 1.

 I'm pretty sure the overhead of that would be negligible, so we could always
 enable it. There are certainly a lot of scenarios where prefix/suffix
 detection alone wouldn't help, but so what.

 Attached is a quick patch for that, if you want to test it.

I have done one test where there is a large suffix match, but
not large enough that it can compress more than 75% of the string.
The CPU overhead with wal-update-prefix-suffix-encode-1.patch is
not much, but there is no I/O reduction either. However, for the same
case there is both a significant WAL reduction and a CPU gain with
pgrb_delta_encoding_v8.patch.

I have updated "ten long fields, all changed" such that there is a large
suffix match. The updated script is attached.

Unpatched
   testname   | wal_generated | duration
--+---+--
 ten long fields, all changed |1760986528 | 28.3700430393219
 ten long fields, all changed |1760981320 |   28.53244805336
 ten long fields, all changed |1764294992 | 28.6722140312195
(3 rows)


wal-update-prefix-suffix-encode-1.patch
   testname   | wal_generated | duration
--+---+--
 ten long fields, all changed |1760986016 | 29.4183659553528
 ten long fields, all changed |1760981904 | 29.7636449337006
 ten long fields, all changed |1762436104 |  29.508908033371
(3 rows)

pgrb_delta_encoding_v8.patch

   testname   | wal_generated | duration
--+---+--
 ten long fields, all changed | 733969304 |  23.916286945343
 ten long fields, all changed | 733977040 | 23.6019561290741
 ten long fields, all changed | 737384632 | 24.2645490169525


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


wal-update-testsuite.sh
Description: Bourne shell script



Re: [HACKERS] CacheInvalidateRelcache in btree is a crummy idea

2014-02-05 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 02/02/2014 11:45 PM, Tom Lane wrote:
 So I'm thinking my commit d2896a9ed, which introduced this mechanism,
 was poorly thought out and we should just remove the relcache invals
 as per the attached patch.  Letting _bt_getroot() update the cached
 metapage at next use should be a lot cheaper than a full relcache
 rebuild for the index.

 Looks good to me.

I did think of one possible objection to this idea.  Although
_bt_getroot() checks that the page it arrives at is a usable fast root,
it could still fail if the page number is off the end of the relation;
that is, we have a cache entry that predates an index truncation.
Currently, the only way for a btree to get physically shorter is a
REINDEX, which implies a relfilenode change and hence a (transactional)
relcache flush, so this couldn't happen.  And if we tried to allow
VACUUM to truncate an index without a full lock, we'd probably have
the same type of issue for any concurrent process that's following
a stale cross-page link, not just a link to the root.  So I'm not
particularly impressed by this objection, but it could be made.

If we ever did need to make that work, a possible solution would be
to refactor things so that the metapage cache lives at the smgr level
not the relcache level, where it'd get blown away by a cross-backend
smgr inval (CacheInvalidateSmgr) --- which is nontransactional,
thereby fixing the basic problem with the way it's being done now.
There would still be some issues about locking, but at least the
cache per se wouldn't be a hazard anymore.

Since I'm not aware of any plans to make on-the-fly btree truncation
work, I won't bother to implement that now, but I thought I'd mention
it for the archives.

regards, tom lane




Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
 I suspect that the scenario described in this article accounts for the
 quite noticeable effect reported: http://danluu.com/3c-conflict

 I don't think that's applicable here.

Maybe, or maybe not, but I think it does say that we should be very wary
of proposals to force data structure alignment without any testing of the
consequences.

regards, tom lane




Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Andres Freund
On 2014-02-05 09:57:11 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-02-04 16:24:02 -0800, Peter Geoghegan wrote:
  I suspect that the scenario described in this article accounts for the
  quite noticeable effect reported: http://danluu.com/3c-conflict
 
  I don't think that's applicable here.
 
 Maybe, or maybe not, but I think it does say that we should be very wary
 of proposals to force data structure alignment without any testing of the
 consequences.

I agree it needs testing, but what the page is talking about really,
really doesn't have *anything* to do with this. What the author is
talking about is not storing things *page*-aligned (i.e. 4k or a
multiple). The problem is that the beginning of each such page falls
into the same cacheline set and thus doesn't utilize the entire L1/L2/L3
but only the single set they map into.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Greg Stark
On Wed, Feb 5, 2014 at 3:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Maybe, or maybe not, but I think it does say that we should be very wary
 of proposals to force data structure alignment without any testing of the
 consequences.


Sure. See for instance
http://igoro.com/archive/gallery-of-processor-cache-effects/

From what I understand what Andres is suggesting is ensuring that each
BufferDescriptor is sitting entirely in one cache line. That seems
very unlikely to be worse than having a BufferDescriptor spanning two
cache lines and being on the same cache line as the adjacent
BufferDescriptors. But this all depends on knowing the length of the
cache lines. I see a lot of confusion online over whether cache lines
are 64 bytes, 128 bytes, or other length even just on Intel
architectures, let alone others.
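
One common way to get that guarantee, sketched here with made-up names and an assumed 64-byte line size (this is not current PostgreSQL code), is to pad each descriptor out to the cacheline size, so that once the array starts on a boundary no element can straddle two lines:

#define CACHELINE_SIZE 64

/* Toy stand-in for the real buffer descriptor fields. */
typedef struct ToyBufferDesc
{
	int			buf_id;
	unsigned	flags;
	unsigned	refcount;
	/* ... buffer tag, spinlock, etc. ... */
} ToyBufferDesc;

/* The union padding forces sizeof(ToyBufferDescPadded) == CACHELINE_SIZE. */
typedef union ToyBufferDescPadded
{
	ToyBufferDesc desc;
	char		pad[CACHELINE_SIZE];
} ToyBufferDescPadded;

The array of padded descriptors still has to be allocated on a 64-byte boundary for this to help, which is where the ShmemAlloc() alignment discussed upthread comes in.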

I wonder if there are any generic tools to optimize array/structure
layouts based on cachegrind profiling or something like that. Then we
wouldn't need to know the oddities ourselves and optimize manually. We
could maybe even do it on the build farm and select the right profile
at build time by matching build target information.

-- 
greg




[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Andres Freund
On 2014-02-05 16:14:01 +0100, Greg Stark wrote:
 I see a lot of confusion online over whether cache lines
 are 64 bytes, 128 bytes, or other length even just on Intel
 architectures, let alone others.

All current x86 processors use 64 bytes. But even if it were bigger or smaller,
it would be either 32 or 128. Neither benefits from touching more
cachelines than necessary; e.g. in the 128-byte case, we could still touch
two with the current code.
The effects referred to upthread only affect code working with larger multiples
of the cacheline size, not what we have here.

 I wonder if there are any generic tools to optimize array/structure
 layouts based on cachegrind profiling or something like that. Then we
 wouldn't need to know the oddities ourselves and optimize manually. We
 could maybe even do it on the build farm and select the right profile
 at build time by matching build target information.

There are profiling tools (e.g. perf's -e
stalled-cycles-(frontend|backend)), but I don't think there's more than
that.
And I think somebody already thought about it (cf. ALIGNOF_BUFFER), it
just wasn't updated in the last 10 years.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Amit Kapila
On Wed, Feb 5, 2014 at 5:13 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 02/05/2014 07:54 AM, Amit Kapila wrote:

 On Tue, Feb 4, 2014 at 11:58 PM, Robert Haas robertmh...@gmail.com
 wrote:

 On Tue, Feb 4, 2014 at 12:39 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:

 Now there is approximately 1.4~5% CPU gain for
 hundred tiny fields, half nulled case


 Assuming that the logic isn't buggy, a point in need of further study,
 I'm starting to feel like we want to have this.  And I might even be
 tempted to remove the table-level off switch.


 I have tried to stress on worst case more, as you are thinking to
 remove table-level switch and found that even if we increase the
 data by approx. 8 times (ten long fields, all changed, each field
 contains
 80 byte data), the CPU overhead is still < 5% which clearly shows that
 the overhead doesn't increase much even if the length of unmatched data
 is increased by much larger factor.
 So the data for worst case adds more weight to your statement
 (remove table-level switch), however there is no harm in keeping
 table-level option with default as 'true' and if some users are really
 sure
 the updates in their system will have nothing in common, then they can
 make this new option as 'false'.

 Below is data for the new case "ten long fields, all changed" added
 in attached script file:


 That's not the worst case, by far.

 First, note that the skipping while scanning new tuple is only performed in
 the first loop. That means that as soon as you have a single match, you fall
 back to hashing every byte. So for the worst case, put one 4-byte field as
 the first column, and don't update it.

 Also, I suspect the runtimes in your test were dominated by I/O. When I
 scale down the number of rows involved so that the whole test fits in RAM, I
 get much bigger differences with and without the patch. You might also want
 to turn off full_page_writes, to make the effect clear with less data.

 So with this test, the overhead is very significant.

 With the skipping logic, another kind of worst case is that you have
 a lot of similarity between the old and new tuple, but you miss it because
 you skip.

This is exactly the reason why I have not kept the skipping logic in the second
pass (loop), but I think maybe it would have been better to keep it, just not
as aggressive as in the first pass. The basic idea I had in mind is that if we
get a match, then there is a high chance that we get matches in consecutive
positions.

I think we should see this patch as an I/O reduction feature rather than a CPU
gain/overhead trade-off, because the I/O reduction in WAL has some other
benefits, such as transfer for replication, archiving, and recovery:
basically, wherever there is a disk read operation, the I/O reduction will
amount to less data read, which can be beneficial in many ways.

Some time back, I was reading an article on the benefits of compression
in databases where the benefits shown are something like what
I said above (at least that is what I understood from it). Link to that
article:
http://db2guys.wordpress.com/2013/08/23/compression/

Another thing is that I think it might be difficult to get negligible
overhead for data which is barely compressible or non-compressible; that's
why it is preferable to have compression for a table enabled with a
switch.

Is it viable here to find the best way to get I/O reduction
for most cases and provide a switch so that for the worst cases
the user can turn it off?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Heikki Linnakangas

On 02/05/2014 04:48 PM, Amit Kapila wrote:

I have done one test where there is a large suffix match, but
not large enough that it can compress more than 75% of the string.
The CPU overhead with wal-update-prefix-suffix-encode-1.patch is
not much, but there is no I/O reduction either.


Hmm, it's supposed to compress if you save at least 25%, not 75%. 
Apparently I got that backwards in the patch...
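
In other words, a minimal illustration of the intended check versus the backwards one (variable and function names assumed; this is not the patch's actual code):

#include <stdbool.h>
#include <stdint.h>

/* Intended rule: keep the encoded tuple only if it saves at least 25%,
 * i.e. the result is at most 75% of the original length. */
static bool
delta_saves_enough(uint32_t encoded_len, uint32_t orig_len)
{
	return encoded_len <= (orig_len * 3) / 4;
}

/* The accidental "backwards" rule would instead demand a 75% saving,
 * i.e. encoded_len <= orig_len / 4, which almost never triggers. */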


- Heikki




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Amit Kapila
On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 02/05/2014 04:48 PM, Amit Kapila wrote:

 I have done one test where there is a large suffix match, but
 not large enough that it can compress more than 75% of string,
 the CPU overhead with wal-update-prefix-suffix-encode-1.patch is
 not much, but there is no I/O reduction as well.


 Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently
 I got that backwards in the patch...

Okay, I think that is right; maybe I can change that check to see the
difference, but in general isn't it going to lose compression in many more
cases, like if there is less than a 25% match in the prefix/suffix but
more than a 50% match in the middle of the string?

While debugging, I noticed that it compresses less than the history table
approach in general cases where the update is done internally, like for
Truncate table.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Teodor Sigaev

+static void
+recvJsonbValue(StringInfo buf, JsonbValue *v, uint32 level, int c)
+   v->size = sizeof(JEntry) * 2 + VARSIZE_ANY(v->numeric);


What's the *2 here?
Reservation for alignment. It's allowed for v->size to be greater than what's actually
needed. Fixed.




This function and recvJsonbValue call each other recursively, afaics
without any limit, shouldn't they check for the stack depth?

added a check_stack_depth()



*3?


JEntry + header + reservation for alignment


+   v->hash.pairs = palloc(sizeof(*v->hash.pairs) *
v->hash.npairs);
+

if (v->hash.npairs > (buf->len - buf->cursor) / (2 * sizeof(uint32)))
ereport(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE)
2 * sizeof(uint32) - minimal size of object element (key plus its value)



Shouldn't that be an ereport(ERRCODE_DATATYPE_MISMATCH)? Similar in a
few other places.

fixed


+char *
+JsonbToCString(StringInfo out, char *in, int estimated_len)

Such a behaviour certainly deserves a documentary comment. Generally
some more functions could use that.

added a comment




+   while ((type = JsonbIteratorGet(&it, &v, false)) != 0)
+reout:
+   goto reout;


Hrmpf.


:) commented




+Datum
+jsonb_typeof(PG_FUNCTION_ARGS)
+{

...

+}


Hm, shouldn't that be in jsonfuncs.c?

No idea, I don't have an objection

send/recv for hstore is fixed too. Should I make a new version of the patch? Right now
it's on GitHub. Maybe Andrew wants to change something?




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Merlin Moncure
On Wed, Feb 5, 2014 at 12:44 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 send/recv functions are also needed for binary-format COPY. IMHO jsonb must
 have send/recv functions. All other built-in types have them, except for
 types like 'smgr', 'aclitem' and 'any*' that no-one should be using as
 column types.

Yes -- completely agree.  I also consider the hstore functionality (in
particular, searching and access operators) to be essential
functionality.

I'm actually surprised we have an alternate binary wire format for
jsonb at all; json is explicitly text and I'm not sure what the use
case of sending the internal structure is.  Meaning, maybe jsonb
send/recv should be a thin wrapper to sending the json string.   The
hstore send/recv I think properly covers the case where client side
binary wire format actors would want to manage performance critical
cases that want to avoid parsing.

On Wed, Feb 5, 2014 at 1:21 AM, Oleg Bartunov obartu...@gmail.com wrote:
 Andrew provided us more information and we'll work on recv. What do
 people think about testing this stuff?  BTW, we don't have any
 regression tests for this.

I'm intensely interested in this work; I consider it to be transformative.

I've *lightly* tested the jsonb/hstore functionality and so far
everything is working.

I still have concerns about the API.  Aside from the stuff I mentioned
upthread I find the API split between jsonb and hstore to be a little
odd; a lot of useful bits (for example, the @ operator) come via the
hstore type only.  So these types are joined at the hip for real work
which makes the diverging incomplete behaviors in functions like
populate_record() disconcerting.  Another point I'm struggling with is
what jsonb brings to the table that isn't covered either hstore or
json; working through a couple of cases I find myself not using the
jsonb functionality except as a 'hstore json formatter' which the json
type covers.  I'm probably being obtuse, but we have to be cautious
before plonking a couple of dozen extra functions in the public
schema.

merlin




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 10:48 AM, Merlin Moncure wrote:

On Wed, Feb 5, 2014 at 12:44 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

send/recv functions are also needed for binary-format COPY. IMHO jsonb must
have send/recv functions. All other built-in types have them, except for
types like 'smgr', 'aclitem' and 'any*' that no-one should be using as
column types.

Yes -- completely agree.  I also consider the hstore functionality (in
particular, searching and access operators) to be essential
functionality.

I'm actually surprised we have an alternate binary wire format for
jsonb at all; json is explicitly text and I'm not sure what the use
case of sending the internal structure is.  Meaning, maybe jsonb
send/recv should be a thin wrapper to sending the json string.   The
hstore send/recv I think properly covers the case where client side
binary wire format actors would want to manage performance critical
cases that want to avoid parsing.




The whole reason we have jsonb is to avoid reparsing where possible. In 
fact, I'd rather have the send and recv functions in the jsonb code and 
have hstore's functions call them, so we don't duplicate code.


cheers

andrew





Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 10:36 AM, Teodor Sigaev wrote:





+Datum
+jsonb_typeof(PG_FUNCTION_ARGS)
+{

...

+}


Hm, shouldn't that be in jsonfuncs.c?

No idea, I don't have an objection


No it shouldn't. The json equivalent function is in json.c, and needs to 
be because it uses the parser internals that aren't exposed outside that 
code.




send/recv for hstore is fixed too. Should I make new version of patch? 
Right now it's placed on github. May be Andrew wants to change something?





I'll take a look, but I think we need to unify this so we use one set of 
send/recv code for the two types if possible, as I just said to Merlin.


cheers

andrew





Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it
 just wasn't updated in the last 10 years.

No, ALIGNOF_BUFFER is there because we read something that said that I/O
transfers between userspace and kernel disk cache would be faster with
aligned buffers.  There's been no particular thought given to alignment
of other data structures, AFAIR.

It may well be that your proposal is spot on.  But I'd like to see some
data-structure-by-data-structure measurements, rather than assuming that
alignment must be a good thing.

regards, tom lane




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Merlin Moncure
On Wed, Feb 5, 2014 at 10:22 AM, Andrew Dunstan and...@dunslane.net wrote:
 I'm actually surprised we have an alternate binary wire format for
 jsonb at all; json is explicitly text and I'm not sure what the use
 case of sending the internal structure is.  Meaning, maybe jsonb
 send/recv should be a thin wrapper to sending the json string.   The
 hstore send/recv I think properly covers the case where client side
 binary wire format actors would want to manage performance critical
 cases that want to avoid parsing.

 The whole reason we have jsonb is to avoid reparsing where possible

Sure; but on the server side.  The wire format is for handling client
concerns.  For example, the case you're arguing for would be for libpq
client to extract jsonb as binary, manipulate it on a binary level,
then send it back as binary.  I find this case to be something of a
stretch.

That being said, for binary dump/restore perhaps there's a performance
case to be made.

 In fact, I'd rather have the send and recv functions in the jsonb code and 
 have
 hstore's functions call them, so we don't duplicate code.

Yeah.  Agreed that there need to be two sets of routines, not three.
I think a case could be made for the jsonb type taking either
json's or hstore's, depending on the above. FWIW, either way is
fine.

merlin




Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Andres Freund
On 2014-02-05 11:23:29 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it
  just wasn't updated in the last 10 years.
 
 No, ALIGNOF_BUFFER is there because we read something that said that I/O
 transfers between userspace and kernel disk cache would be faster with
 aligned buffers.  There's been no particular thought given to alignment
 of other data structures, AFAIR.

But it's not aligned anymore on at least half a decade's hardware, and
it's what we already align *all* bigger ShmemAlloc() values with. And
BufferDescriptors surely counts as larger in its entirety.

 It may well be that your proposal is spot on.  But I'd like to see some
 data-structure-by-data-structure measurements, rather than assuming that
 alignment must be a good thing.

I am fine with just aligning BufferDescriptors properly. That has
clearly shown massive improvements.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Wed, Feb 5, 2014 at 10:22 AM, Andrew Dunstan and...@dunslane.net wrote:
 The whole reason we have jsonb is to avoid reparsing where possible

 Sure; but on the server side.  The wire format is for handling client
 concerns.  For example, the case you're arguing for would be for libpq
 client to extract jsonb as binary, manipulate it on a binary level,
 then send it back as binary.  I find this case to be something of a
 stretch.

I'm with Merlin in thinking that the case for exposing a binary format
to clients is pretty weak, or at least a convincing use-case has not
been shown.  Given the concerns upthread about security hazards in the
patch's existing recv code, and the fact that it's already February,
switching to "binary is the same as text" may well be the most prudent
path here.
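
For what it's worth, a minimal sketch of what the "binary is the same as text" route could look like on the recv side (the function name, the version byte, and the assumption that the patch's text-input function is called jsonb_in are all illustrative, not a committed design):

#include "postgres.h"
#include "fmgr.h"
#include "libpq/pqformat.h"

extern Datum jsonb_in(PG_FUNCTION_ARGS);	/* assumed name of the text-input function */

PG_FUNCTION_INFO_V1(jsonb_recv_sketch);

Datum
jsonb_recv_sketch(PG_FUNCTION_ARGS)
{
	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
	int			version = pq_getmsgint(buf, 1);
	char	   *str;
	int			nbytes;

	if (version != 1)
		elog(ERROR, "unsupported jsonb version number %d", version);

	/* The payload is simply the text representation... */
	str = pq_getmsgtext(buf, buf->len - buf->cursor, &nbytes);

	/* ...so reuse the ordinary text-input path to parse and validate it. */
	return DirectFunctionCall1(jsonb_in, CStringGetDatum(str));
}

The matching send function would just prepend the version byte to the type's text output, leaving the door open for a real binary format later.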

regards, tom lane




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 11:40 AM, Tom Lane wrote:

Merlin Moncure mmonc...@gmail.com writes:

On Wed, Feb 5, 2014 at 10:22 AM, Andrew Dunstan and...@dunslane.net wrote:

The whole reason we have jsonb is to avoid reparsing where possible

Sure; but on the server side.  The wire format is for handling client
concerns.  For example, the case you're arguing for would be for libpq
client to extract jsonb as binary, manipulate it on a binary level,
then send it back as binary.  I find this case to be something of a
stretch.

I'm with Merlin in thinking that the case for exposing a binary format
to clients is pretty weak, or at least a convincing use-case has not
been shown.  Given the concerns upthread about security hazards in the
patch's existing recv code, and the fact that it's already February,
switching to "binary is the same as text" may well be the most prudent
path here.




If we do that we're going to have to live with that forever, aren't we? 
I don't see why there should be a convincing case for binary format for 
nested hstore but not for jsonb.


If it were only for arbitrary libpq clients I wouldn't bother so much. 
To me the main case for binary format is that some people use COPY 
BINARY for efficiency reasons, and I heard tell recently of someone 
working on that as an option for pg_dump, which seems to me worth 
considering.



cheers

andrew





Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication

2014-02-05 Thread Robert Haas
On Mon, Feb 3, 2014 at 6:47 AM, Christian Kruse
christ...@2ndquadrant.com wrote:
 [ new patch ]

Is there some compelling reason not to write the documentation link as
<xref linkend="guc-hot-standby-feedback"> rather than using <link>?

It feels weird to me that the new columns are called transactionid and
xmin.  Why not xid and xmin?

If I understand correctly, modifying PgBackendStatus adds additional
fields to the shared memory data structure that are never used and
will be returned by functions like pgstat_fetch_stat_beentry()
uninitialized.  That seems both inefficient and a pitfall for the
unwary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 02/05/2014 11:40 AM, Tom Lane wrote:
 switching to "binary is the same as text" may well be the most prudent
 path here.

 If we do that we're going to have to live with that forever, aren't we? 

Yeah, but the other side of that coin is that we'll have to live forever
with whatever binary format we pick, too.  If it turns out to be badly
designed, that could be much worse than eating some parsing costs during
dump/restore.

If we had infinite time/manpower, this wouldn't really be an issue.
We don't, though, and so I suggest that this may be one of the better
things to toss overboard.

regards, tom lane




Re: [HACKERS] Prohibit row-security + inheritance in 9.4?

2014-02-05 Thread Robert Haas
On Mon, Feb 3, 2014 at 10:23 PM, Craig Ringer cr...@2ndquadrant.com wrote:
 Then during expansion of the range table, no code is needed to
 ignore child rls quals and copy parent rels to child rels.
 This is what's already implemented and isn't a huge amount of code to
 begin with, so I don't see this as being an argument against having the
 flexibility.

 It would seem to me that any additional logic that can be avoided during
 planning is a good thing. Also, the argument that something is already
 implemented, does not itself make it good to commit.

 The implementation with the minimum of required logic and complexity is
 apply the parent's row-security quals to the children, don't check
 children for quals.

 Any special handling of child rels creates exciting challenges because
 inheritance expansion happens _after_ a bunch of the query planner and
 optimizer has run. Injecting new expression trees at this point is
 messy, especially if you want those expression trees to in turn contain
 row-security qualified tables, inheritance, etc.

 As previously discussed, applying the parent qual to children ensures
 that what's visible when querying a relation that has children is
 consistent with its row-security qual.

 If the parent qual is applied only on the parent rel directly, not
 children, then querying the parent could emit rows not permitted by the
 parent's row-security qual. I'm not happy with that, and as Stephen
 poined out upthread, it isn't really consistent with how permissions
 work with inheritance otherwise.

 If instead the parent qual is applied to the parent and all children,
 and you *also* add the quals of each child, you get a complex, hard to
 debug, hard to optimize mess. You also run back into the problem
 mentioned above, that adding quals after inhertiance expansion is messy
 and problematic. It's also really ugly in what's going to be the most
 common case, where the child quals are the same as the parent quals, as
 you'll get nested identical quals.

 Despite the challenges with it, I think that's the least insane way to
 respect child quals on parents. It has pretty much zero chance of
 happening in 9.4, though; the discussed approach of building
 row-security on top of updatable security barrier views doesn't play
 well with adding inheritance on inheritance-expanded children.

 One answer for that would be to keep inheritance as it is for 9.4 (if I
 can get the remaining issues sorted out) and in 9.5, if possible, allow
 the addition of a row-security qual that, if it appears on a child rel
 during inheritance expansion, _is_ expanded. At least, if it proves
 necessary, which I'm not entirely convinced of.

Me, neither.  When you first described the scheme you're currently
pursuing, I thought it sounded kind of nuts.  But as I've thought
about it more, it's grown on me.  Today, the way to do row-level
security is:

1. Create a security barrier view over the table.
2. Grant rights to the view instead of the table, and tell people to
go through that.

Now, if you did that, as far as I can see, it would have exactly the
same semantics as what you're proposing to do here, with the sole
exception that you wouldn't need to tell people to access the view
rather than the table.  So yeah, it's kind of funky, but I have a feeling any
rule we come up with here will seem odd in some scenarios, and this
one at least has the virtue of being relatively easy to implement and
consistent with how similar things work today.  I can't knock that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 12:48 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 02/05/2014 11:40 AM, Tom Lane wrote:

switching to "binary is the same as text" may well be the most prudent
path here.

If we do that we're going to have to live with that forever, aren't we?

Yeah, but the other side of that coin is that we'll have to live forever
with whatever binary format we pick, too.  If it turns out to be badly
designed, that could be much worse than eating some parsing costs during
dump/restore.

If we had infinite time/manpower, this wouldn't really be an issue.
We don't, though, and so I suggest that this may be one of the better
things to toss overboard.





The main reason I'm prepared to consider this is that the JSON parser seems
to be fairly efficient (see Oleg's recent stats) and in fact we'd more
or less be parsing the binary format on input anyway, so there's no
proof that a binary format is going to be hugely faster (or possibly
even that it will be faster at all).


If anyone else has opinions on this, sing out pretty darn soon (like in the
next 24 hours or so, before I begin).


cheers

andrew




[HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Peter Geoghegan
On Wed, Feb 5, 2014 at 7:21 AM, Andres Freund and...@2ndquadrant.com wrote:
 All current x86 processors use 64. But even if it were bigger/smaller,
 they will be either 32, or 128. Neither benefits from touching more
 cachelines than necessary. E.g. in the 128 case, we could still touch
 two with the current code.

That's true, but I believe that a hardware optimization called
Adjacent Cache Line Prefetch is widely supported by recent
microarchitectures, and is sometimes enabled by default. I'm not
suggesting that that would necessarily radically alter the outcome,
but it is a consideration.



-- 
Peter Geoghegan




Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Merlin Moncure
On Wed, Feb 5, 2014 at 11:48 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 If we had infinite time/manpower, this wouldn't really be an issue.
 We don't, though, and so I suggest that this may be one of the better
 things to toss overboard.

The hstore send/recv functions have basically the same
(copy/pasted/name adjusted) implementation.  Since hstore will
presumably remain (as the current hstore is) 'deep binary' and all of
Andres's gripes apply to the hstore as well, this change buys us
precisely zap from a time perspective; it comes down to which is
intrinsically the better choice.

merlin




Re: [HACKERS] Add CREATE support to event triggers

2014-02-05 Thread Robert Haas
On Tue, Feb 4, 2014 at 12:11 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 I have run into some issues, though:

 1. certain types, particularly timestamp/timestamptz but really this
 could happen for any type, have unusual typmod output behavior.  For
 those one cannot just use the schema-qualified catalog names and then
 append the typmod at the end; because what you end up is something like
pg_catalog.timestamptz(4) with time zone
 because, for whatever reason, the "with time zone" is part of typmod
 output.  But this doesn't work at all for input.  I'm not sure how to
 solve this.

How about doing whatever pg_dump does?

 2. I have been having object definitions be emitted complete; in
 particular, sequences have OWNED BY clauses when they have an owning
 column.  But this doesn't work with a SERIAL column, because we get
 output like this:

 alvherre=# CREATE  TABLE public.hijo  (b serial);
 NOTICE:  expanded: CREATE  SEQUENCE public.hijo_b_seq INCREMENT BY 1 MINVALUE 
 1 MAXVALUE 9223372036854775807 START WITH 1 CACHE 1 NO CYCLE OWNED BY 
 public.hijo.b
 NOTICE:  expanded: CREATE  TABLE public.hijo  (b pg_catalog.int4 DEFAULT 
 nextval('hijo_b_seq'::regclass) NOT NULL )

 which is all nice, except that the sequence is using the column name as
 owner before the column has been created in the first place.  Both these
 commands will, of course, fail, because each depends on the other to have
 been executed first.  The tie is easy to break in this case: just don't
 emit the OWNED BY clause .. but it makes me a bit nervous to be
 hardcoding the decision of parts that might depend on others.  OTOH
 pg_dump already knows how to split objects in constituent parts as
 necessary; maybe it's not so bad.

Well, the sequence can't depend on a table column that doesn't exist
yet, so if it's in effect doing what you've shown there, it's
cheating by virtue of knowing that nobody can observe the
intermediate state.  Strictly speaking, there's nothing wrong with
emitting those commands just as you have them there; they won't run,
but if what you want to do is log what's happened rather than replay
it, that's OK.  Producing output that is actually executable is a
strictly harder problem than producing output that accurately
describes what happened.  As you say, pg_dump already splits things
and getting executable output out of this facility will require the
same kinds of tricks here.  This gets back to my worry about
maintaining two or three copies of the code that solve many of the
same problems in quite different ways...

 3. It is possible to coerce ruleutils.c to emit always-qualified names
 by using PushOverrideSearchPath() facility; but actually this doesn't
 always work, because some places in namespace.c believe that
 PG_CATALOG_NAMESPACE is always visible and so certain objects are not
 qualified.  In particular, text columns using default collation will be
 emitted as having collation default and not pg_catalog.default as I
 would have initially expected.  Right now it doesn't seem like this is a
 problem, but it is unusual.

We have a quote_all_identifiers flag.  We could have a
schema_qualify_all_identifiers flag, too.  Then again, why is the
behavior of schema-qualifying absolutely everything even desirable?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Viability of text HISTORY/INSTALL/regression README files (was Re: [COMMITTERS] pgsql: Document a few more regression test hazards.)

2014-02-05 Thread Robert Haas
On Tue, Feb 4, 2014 at 11:43 PM, Noah Misch n...@leadboat.com wrote:
 Right.  I mean, a lot of the links say things like Section 26.2
 which obviously makes no sense in a standalone text file.

 For xrefs normally displayed that way, text output could emit a URL, either
 inline or in the form of a footnote.  For link targets (e.g. SQL commands)
 having a friendly text fragment for xref sites, use the normal fragment.

True.  If we're going to keep these things around, something like that
would avoid some annoyances for documentation authors.  But I still
think we ought to just nuke them, because who cares?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Robert Haas
On Wed, Feb 5, 2014 at 11:37 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-02-05 11:23:29 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  And I think somebody already thought about it (c.f. ALIGNOF_BUFFER), it
  just wasn't updated in the last 10 years.

 No, ALIGNOF_BUFFER is there because we read something that said that I/O
 transfers between userspace and kernel disk cache would be faster with
 aligned buffers.  There's been no particular thought given to alignment
 of other data structures, AFAIR.

 But it's not aligned anymore on at least half a decade's hardware, and
 it's what we already align *all* bigger ShmemAlloc() values with. And
 BufferDescriptors surely counts as larger in its entirety.

 It may well be that your proposal is spot on.  But I'd like to see some
 data-structure-by-data-structure measurements, rather than assuming that
 alignment must be a good thing.

 I am fine with just aligning BufferDescriptors properly. That has
 clearly shown massive improvements.

I thought your previous idea of increasing BUFFERALIGN to 64 bytes had
a lot to recommend it.  But that doesn't mean it doesn't need testing.
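
For anyone who wants the mechanics spelled out, the alignment trick itself is
just round-up-to-a-boundary arithmetic.  A minimal, self-contained sketch of
the idea follows (the real TYPEALIGN/BUFFERALIGN macros live in the tree and
may differ in detail, and the 64-byte line size is an assumption about the
target hardware):

#include <stdint.h>
#include <stdio.h>

#define CACHELINE_SIZE 64   /* assumed line size; real hardware may differ */

/* Round an address (or length) up to the next multiple of ALIGNVAL. */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

int
main(void)
{
    static char arena[1024];
    uintptr_t   raw = (uintptr_t) arena + 13;   /* deliberately misaligned */
    uintptr_t   aligned = TYPEALIGN(CACHELINE_SIZE, raw);

    printf("raw = %p, aligned = %p\n", (void *) raw, (void *) aligned);
    return 0;
}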

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication

2014-02-05 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 It feels weird to me that the new columns are called transactionid and
 xmin.  Why not xid and xmin?

Actually the part of that that bothers me is xmin, which conflicts
with a reserved system column name.  While you can legally pick such
conflicting names for view columns, it's not going to be so much fun
when you try to join that view against some regular table.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2014-02-05 Thread Robert Haas
On Tue, Feb 4, 2014 at 3:26 PM, Peter Eisentraut pete...@gmx.net wrote:
 Perhaps this type should be called pglsn, since it's an
 implementation-specific detail and not a universal concept like int,
 point, or uuid.

If we're going to do that, I suggest pg_lsn rather than pglsn.  We
already have pg_node_tree, so using underscores for separation would
be more consistent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication

2014-02-05 Thread Robert Haas
On Wed, Feb 5, 2014 at 1:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 It feels weird to me that the new columns are called transactionid and
 xmin.  Why not xid and xmin?

 Actually the part of that that bothers me is xmin, which conflicts
 with a reserved system column name.  While you can legally pick such
 conflicting names for view columns, it's not going to be so much fun
 when you try to join that view against some regular table.

That's a fair point, too.  So maybe we should go with something like
backend_xid and backend_xmin or some other prefix that we come up
with.  My concern is more that I think they should be consistent
somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-05 Thread Joshua D. Drake


On 02/03/2014 07:04 AM, Robert Haas wrote:


On Mon, Feb 3, 2014 at 9:57 AM, Andrew Dunstan and...@dunslane.net wrote:

It's not so long ago that they were saying they would no longer publish
free-as-in-beer command line compilers at all. The outrage made them change
their minds, but we really can't rely on only Microsoft compilers for
Windows.


+1.



It appears that LLVM supports Windows:

http://llvm.org/docs/GettingStartedVS.html

Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-05 Thread Peter Geoghegan
On Tue, Feb 4, 2014 at 4:24 PM, Peter Geoghegan p...@heroku.com wrote:
 There is something you have not drawn explicit attention to that is
 very interesting. If we take REL9_3_STABLE tip to be representative
 (built with full -O2 optimization, no assertions just debugging
 symbols), setting max_connections to 91 from 90 does not have the
 effect of making the BufferDescriptors array aligned; it has the
 effect of making it *misaligned*.

I spoke too soon; the effect is indeed reversed on master (i.e. bad
max_connections settings are misaligned, and vice-versa).


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2014-02-05 Thread Peter Eisentraut
On 2/5/14, 1:31 PM, Robert Haas wrote:
 On Tue, Feb 4, 2014 at 3:26 PM, Peter Eisentraut pete...@gmx.net wrote:
 Perhaps this type should be called pglsn, since it's an
 implementation-specific detail and not a universal concept like int,
 point, or uuid.
 
 If we're going to do that, I suggest pg_lsn rather than pglsn.  We
 already have pg_node_tree, so using underscores for separation would
 be more consistent.

Yes, that's a good precedent in multiple ways.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [doc patch] extra_float_digits and casting from real to numeric

2014-02-05 Thread Robert Haas
On Tue, Feb 4, 2014 at 11:25 AM, Christoph Berg
christoph.b...@credativ.de wrote:
 Re: To Tom Lane 2014-01-08 20140108094017.ga20...@msgid.df7cb.de
 What about this patch to mention this gotcha more explicitely in the
 documentation?

 diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
 new file mode 100644
 index 0386330..968f4a7
 *** a/doc/src/sgml/datatype.sgml
 --- b/doc/src/sgml/datatype.sgml
 *** NUMERIC
 *** 689,694 
 --- 689,697 
<literal>0</literal>, the output is the same on every platform
 supported by PostgreSQL.  Increasing it will produce output that
 more accurately represents the stored value, but may be unportable.
+   Casts to other numeric datatypes and the <literal>to_char</literal>
 +   function are not affected by this setting, it affects only the text
 +   representation.
/para
   /note


 Anyone for that patch?

Well, the new text kinda recapitulates what the existing text already
says.  If we're going to clarify, I'd do it like this:

 The <xref linkend="guc-extra-float-digits"> setting controls the
  number of extra significant digits included when a floating point
  value is converted to text for output.  It does not affect the results
  when a floating point number is converted to some other data type
  or formatted using <literal>to_char</literal>.

But frankly I'm inclined to just leave it alone.  It says that it
affects what happens when the value is converted to text for output.
 That's specific and accurate.  Granted, someone could misunderstand,
but that's true of almost anything we might write, and being too
long-winded has costs of its own.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 01:41 PM, Joshua D. Drake wrote:


On 02/03/2014 07:04 AM, Robert Haas wrote:


On Mon, Feb 3, 2014 at 9:57 AM, Andrew Dunstan and...@dunslane.net 
wrote:

It's not so long ago that they were saying they would no longer publish
free-as-in-beer command line compilers at all. The outrage made them 
change

their minds, but we really can't rely on only Microsoft compilers for
Windows.


+1.



It appears that LLVM supports Windows:

http://llvm.org/docs/GettingStartedVS.html





If someone wants to work on getting a Windows build working with llvm 
then go for it. But until then the Mingw tools are the only thing we 
have that we know works other than Microsoft tools. Having a compiler 
that's alleged to work is only one step of many in getting there.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Peter Geoghegan
On Wed, Feb 5, 2014 at 12:50 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I think there's zero overlap. They're completely complementary features.
 It's not like normal WAL records have an irrelevant volume.


 Correct. Compressing a full-page image happens on the first update after a
 checkpoint, and the diff between old and new tuple is not used in that case.

Uh, I really just meant that one thing that might overlap is
considerations around the choice of compression algorithm. I think
that there was some useful discussion of that on the other thread as
well.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Josh Berkus
On 02/05/2014 07:48 AM, Merlin Moncure wrote:
 Another point I'm struggling with is
 what jsonb brings to the table that isn't covered by either hstore or
 json; working through a couple of cases I find myself not using the
 jsonb functionality except as a 'hstore json formatter' which the json
 type covers.  I'm probably being obtuse, but we have to be cautious
 before plonking a couple of dozen extra functions in the public
 schema.

There's three reasons why it's worthwhile:

1) user-friendliness: telling users they need to do ::JSON and
::HSTORE2 all the time is sufficiently annoying -- and prone to
causing errors -- to be a blocker to adoption by a certain, very
numerous, class of user.

2) performance: to the extent that we can operate entirely in JSONB and
not transform back and forth to JSON and HSTORE, function calls (and
index lookups) will be much faster.  And given the competition, speed is
important.

3) growth: 9.4's JSONB functions are a prerequisite to developing richer
JSON querying capabilities in 9.5 and later, which will go beyond JSON
formatting for HSTORE.

Frankly, if it were entirely up to me HSTORE2 would be part of core and
its only interface would be JSONB.  But it's not.  So this is a compromise.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-05 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 The following test case reliably hits this new assertion:

 create table t (c int);
 begin;
 create index on t (c);
 savepoint q;
 insert into t values (1);
 select 1/0;

As of commit ac8bc3b6e4, this test case no longer triggers the assertion,
because it depends on the INSERT issuing a relcache invalidation request
against t's index.  btree used to do that when changing the index
metapage, but no longer does.  The underlying problem is certainly still
there, though, and can be triggered with this slightly more complex test:

create or replace function inverse(int) returns float8 as
$$
begin
  analyze t1;
  return 1::float8/$1;
exception
  when division_by_zero then return 0;
end$$ language plpgsql volatile;

drop table if exists t1;

create table t1 (c float8 unique);
insert into t1 values (1);
insert into t1 values (inverse(0));

Here, the ANALYZE triggers a relcache inval within the subtransaction
established by the function's BEGIN/EXCEPTION block, and then we abort
that subtransaction with a zero-divide, so you end up at the same place
as with the older example.

Of note is that we still have to have an index on t1; remove that,
no assert.  This is a bit surprising since the ANALYZE certainly causes
a relcache flush on the table not just the index.  The reason turns out
to be that for a simple table like this, relcache entry rebuild does not
involve consulting any syscache: we load the pg_class row with a systable
scan on pg_class, and build the tuple descriptor using a systable scan on
pg_attribute, and we're done.  IIRC this was done this way intentionally
to avoid duplicative caching of the pg_class and pg_attribute rows.
However, RelationReloadIndexInfo uses the syscaches with enthusiasm, so
you will hit the Assert in question if you're trying to rebuild an index;
and you possibly could hit it for a table if you have a more complicated
table definition, such as one with rules.

Of course, a direct system catalog scan is certainly no safer in an
aborted transaction than the one that catcache.c is refusing to do.
Therefore, in my opinion, relcache.c ought also to be doing an
Assert(IsTransactionState()), at least in ScanPgRelation and perhaps
everywhere that it does catalog scans.

I stuck such an Assert in ScanPgRelation, and verified that it doesn't
break any existing regression tests --- although of course the above
test case still fails, and now even without declaring an index.

Barring objections I'll go commit that.  It's a bit pointless to be
Asserting that catcache.c does nothing unsafe when relcache.c does
the same things without any such test.

(Alternatively, maybe we should centralize the asserting in
systable_beginscan or some such place?)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication

2014-02-05 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 5, 2014 at 1:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Actually the part of that that bothers me is xmin, which conflicts
 with a reserved system column name.  While you can legally pick such
 conflicting names for view columns, it's not going to be so much fun
 when you try to join that view against some regular table.

 That's a fair point, too.  So maybe we should go with something like
 backend_xid and backend_xmin or some other prefix that we come up
 with.  My concern is more that I think they should be consistent
 somehow.

Works for me.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 02:03 PM, Josh Berkus wrote:


Frankly, if it were entirely up to me HSTORE2 would be part of core and
its only interface would be JSONB.  But it's not.  So this is a compromise.



You could only do that by inventing a new type. But hstore2 isn't a new 
type, it's meant to be the existing hstore type with new capabilities.


Incidentally, some work is being done by one of my colleagues on an 
extension of gin/gist operators for indexing jsonb similarly to hstore2. 
Now that will possibly be something we can bring into 9.4, although 
we'll have to check how we go about pg_upgrade for that case.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-02-05 Thread Alvaro Herrera
Robert Haas escribió:
 On Tue, Feb 4, 2014 at 12:11 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  I have run into some issues, though:
 
  1. certain types, particularly timestamp/timestamptz but really this
  could happen for any type, have unusual typmod output behavior.  For
  those one cannot just use the schema-qualified catalog names and then
  append the typmod at the end; because what you end up with is something like
 pg_catalog.timestamptz(4) with time zone
  because, for whatever reason, the "with time zone" is part of typmod
  output.  But this doesn't work at all for input.  I'm not sure how to
  solve this.
 
 How about doing whatever pg_dump does?

We use format_type() for that as far as I know.  What it does
differently is use undecorated names defined by the standard for some
types, which are never schema qualified and are never ambiguous because
they are reserved words that would require quoting if used by
user-defined type names.  We can't use that here: somewhere upthread we
noticed issues when using those which is why we're now trying to use
catalog names instead of those special names.  (I don't think it's
impossible to use such names: we just need to ensure we handle quoting
correctly for the funny cases such as char and bit.)

One idea is to chop the typmod output string at the closing parens.
That seems to work well for the types that come with core code ... and
the rule with the funny string at the end, after the parenthesized part of
the typmod, applies only to type names with hardcoded productions in
gram.y, so there is no danger that outside code will have a problem with
that strategy.

(I verified PostGIS types with typmods just to see an example of
out-of-core code at work, and unsurprisingly it only emits stuff inside
parens.)
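
To make that concrete, the chopping step amounts to something like the
standalone sketch below (the helper and its names are made up purely for
illustration; this is not proposed patch code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustration only: given typmod output such as "(4) with time zone",
 * keep everything up to and including the first closing paren, so the
 * result can be appended to a schema-qualified catalog type name.
 */
static char *
chop_typmod_output(const char *typmodstr)
{
    const char *close = strchr(typmodstr, ')');
    size_t      len = close ? (size_t) (close - typmodstr + 1) : strlen(typmodstr);
    char       *result = malloc(len + 1);

    memcpy(result, typmodstr, len);
    result[len] = '\0';
    return result;
}

int
main(void)
{
    char *modstr = chop_typmod_output("(4) with time zone");

    printf("pg_catalog.timestamptz%s\n", modstr);  /* pg_catalog.timestamptz(4) */
    free(modstr);
    return 0;
}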

  2. I have been having object definitions be emitted complete; in
  particular, sequences have OWNED BY clauses when they have an owning
  column.  But this doesn't work with a SERIAL column, because we get
  output like this:

 Well, the sequence can't depend on a table column that doesn't exist
 yet, so if it's in effect doing what you've shown there, it's
 cheating by virtue of knowing that nobody can observe the
 intermediate state.  Strictly speaking, there's nothing wrong with
 emitting those commands just as you have them there; they won't run,
 but if what you want to do is log what's happened rather than replay
 it, that's OK.  Producing output that is actually executable is a
 strictly harder problem than producing output that accurately
 describes what happened.  As you say, pg_dump already splits things
 and getting executable output out of this facility will require the
 same kinds of tricks here.

Yeah.  Moreover this will require that this new event trigger code is
able to handle (at least certain kinds of) ALTER, which expands this
patch in scope by a wide margin.

Producing executable commands is an important goal.

 This gets back to my worry about maintaining two or three copies of
 the code that solve many of the same problems in quite different
 ways...

Agreed.  It would be real good to be able to use this code for pg_dump
too, but it seems dubious.  The two environments seem just too different
to be able to reuse anything.  But if you see a way, by all means shout.

  3. It is possible to coerce ruleutils.c to emit always-qualified names
  by using PushOverrideSearchPath() facility; but actually this doesn't
  always work, because some places in namespace.c believe that
  PG_CATALOG_NAMESPACE is always visible and so certain objects are not
  qualified.  In particular, text columns using default collation will be
  emitted as having collation default and not pg_catalog.default as I
  would have initially expected.  Right now it doesn't seem like this is a
  problem, but it is unusual.
 
 We have a quote_all_identifiers flag.  We could have a
 schema_qualify_all_identifiers flag, too.

Actually the bit about collations was just a garden variety bug: turns
out I was using a %{}I specifier (and correspondingly only the collation
name as a string) instead of %{}D and get_catalog_object_by_oid() to
match.  I'm not yet sure if this new flag you suggest will really be
needed, but thanks for the idea.

 Then again, why is the behavior of schema-qualifying absolutely
 everything even desirable?

Well, someone could create a collation in another schema with the same
name as a system collation and the command would become ambiguous.  For
example, this command
  create table f (a text collate "POSIX");
creates a column whose collation is either public.POSIX or the system
POSIX collation, depending on whether public appears before pg_catalog
in search_path.  Having someone create a POSIX collation in a schema
that appears earlier than pg_catalog is pretty far-fetched, but ISTM
that we really need to cover that scenario.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-05 Thread Joshua D. Drake


On 02/05/2014 10:56 AM, Andrew Dunstan wrote:



It appears that LLVM supports Windows:

http://llvm.org/docs/GettingStartedVS.html





If someone wants to work on getting a Windows build working with llvm
then go for it. But until then the Mingw tools are the only thing we
have that we know works other than Microsoft tools. Having a compiler
that's alleged to work is only one step of many in getting there.


I was just trying to provide another resource in case people wanted to 
take a look.


JD



cheers

andrew




--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-02-05 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Robert Haas escribió:
 How about doing whatever pg_dump does?

 We use format_type() for that as far as I know.  What it does
 differently is use undecorated names defined by the standard for some
 types, which are never schema qualified and are never ambiguous because
 they are reserved words that would require quoting if used by
 user-defined type names.  We can't use that here: somewhere upthread we
 noticed issues when using those which is why we're now trying to use
 catalog names instead of those special names.  (I don't think it's
 impossible to use such names: we just need to ensure we handle quoting
 correctly for the funny cases such as char and bit.)

Yeah, but wouldn't that complexity also bubble into user code within the
event triggers?  Since there's no real need for SQL standard compliance
in this context, I think minimizing the number of weird formats is a
win.

 One idea is to chop the typmod output string at the closing parens.

+1.  The only reason timestamptypmodout works like that is that we're
trying to match the SQL standard's spelling of the type names, and
that committee apparently considers it an off day whenever they can't
invent some randomly-incompatible-with-everything syntax.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Merlin Moncure
On Wed, Feb 5, 2014 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote:
 On 02/05/2014 07:48 AM, Merlin Moncure wrote:
 Another point I'm struggling with is
 what jsonb brings to the table that isn't covered by either hstore or
 json; working through a couple of cases I find myself not using the
 jsonb functionality except as a 'hstore json formatter' which the json
 type covers.  I'm probably being obtuse, but we have to be cautious
 before plonking a couple of dozen extra functions in the public
 schema.

 There's three reasons why it's worthwhile:

 1) user-friendliness: telling users they need to do ::JSON and
 ::HSTORE2 all the time is sufficiently annoying -- and prone to
 causing errors -- to be a blocker to adoption by a certain, very
 numerous, class of user.

That's a legitimate point of concern.  But in and of itself I'm not
sure it warrants exposing a separate API.

 2) performance: to the extent that we can operate entirely in JSONB and
 not transform back and forth to JSON and HSTORE, function calls (and
 index lookups) will be much faster.  And given the competition, speed is
 important.

Not following this.  I do not see how the presence of jsonb helps at
all. Client to server communication will be text->binary (and vice
versa) and handling within the server itself will be in binary.  This
is the crux of my point.

 3) growth: 9.4's JSONB functions are a prerequisite to developing richer
 JSON querying capabilities in 9.5 and later, which will go beyond JSON
 formatting for HSTORE.

I kind of get this point.   But in lieu of a practical use case today,
what's the rush to implement?   I fully anticipate I'm out in left
field on this one (I have a cot and mini fridge there).  The question
on the table is: what use cases (performance included) does jsonb
solve that can't be solved without it?  With the possible
limited exception of Andrew's yet-to-be-delivered enhanced
deserialization routines, I can't think of any.  If presented with
reasonable evidence I'll shut my yap, pronto.

 Frankly, if it were entirely up to me HSTORE2 would be part of core and
 its only interface would be JSONB.  But it's not.  So this is a compromise.

I don't.  To be pedantic: hstore is in core, but packaged as an
extension.  That's a very important distinction.

In fact, I'll go further and say it seems wise for all SQL standard
type work to happen in extensions.  As long as it's an in-core contrib
extension, I see no stigma to that whatsoever.  It's not clear at all
to me why the json type was put in the public schema and now we're
about to double down with jsonb.  Having things extension-packaged
greatly eases concerns about future API changes, because if problems
emerge it's not impossible to imagine compatibility extensions
appearing to bridge the gap if certain critical functions change.  That's
exactly the sort of thing that we may want to happen here, I think.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 03:15 PM, Merlin Moncure wrote:

On Wed, Feb 5, 2014 at 1:03 PM, Josh Berkus j...@agliodbs.com wrote:

On 02/05/2014 07:48 AM, Merlin Moncure wrote:

Another point I'm struggling with is
what jsonb brings to the table that isn't covered by either hstore or
json; working through a couple of cases I find myself not using the
jsonb functionality except as a 'hstore json formatter' which the json
type covers.  I'm probably being obtuse, but we have to be cautious
before plonking a couple of dozen extra functions in the public
schema.

There's three reasons why it's worthwhile:

1) user-friendliness: telling users they need to do ::JSON and
::HSTORE2 all the time is sufficiently annoying -- and prone to
causing errors -- to be a blocker to adoption by a certain, very
numerous, class of user.

That's a legitimate point of concern.  But in and of itself I'm not
sure it warrants exposing a separate API.


2) performance: to the extent that we can operate entirely in JSONB and
not transform back and forth to JSON and HSTORE, function calls (and
index lookups) will be much faster.  And given the competition, speed is
important.

Not following this.  I do not see how the presence of jsonb helps at
all. Client to server communication will be text->binary (and vice
versa) and handling within the server itself will be in binary.  This
is the crux of my point.


3) growth: 9.4's JSONB functions are a prerequisite to developing richer
JSON querying capabilities in 9.5 and later, which will go beyond JSON
formatting for HSTORE.

I kind of get this point.   But in lieu of a practical use case today,
what's the rush to implement?   I fully anticipate I'm out in left
field on this one (I have a cot and mini fridge there).  The question
on the table is: what use cases (performance included) does jsonb
solve that can't be solved without it?  With the possible
limited exception of Andrew's yet-to-be-delivered enhanced
deserialization routines, I can't think of any.  If presented with
reasonable evidence I'll shut my yap, pronto.


Frankly, if it were entirely up to me HSTORE2 would be part of core and
its only interface would be JSONB.  But it's not.  So this is a compromise.

I don't.  To be pedantic: hstore is in core, but packaged as an
extension.  That's a very important distinction.

In fact, I'll go further and say it seems wise for all SQL standard
type work to happen in extensions.  As long as it's an in-core contrib
extension, I see no stigma to that whatsoever.  It's not clear at all
to me why the json type was put in the public schema and now we're
about to double down with jsonb.  Having things extension-packaged
greatly eases concerns about future API changes, because if problems
emerge it's not impossible to imagine compatibility extensions
appearing to bridge the gap if certain critical functions change.  That's
exactly the sort of thing that we may want to happen here, I think.




The time for this discussion was months ago. I would not have spent many 
many hours of my time if I thought it was going to be thrown away. I 
find this attitude puzzling, to say the least. You were a major part of 
the discussion when we said OK, we'll leave json as it is (text based) 
and add jsonb. That's exactly what we're doing.


And no, hstore is NOT in core. In core for a type means to me it's 
builtin, with a fixed Oid.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Merlin Moncure
On Wed, Feb 5, 2014 at 2:37 PM, Andrew Dunstan and...@dunslane.net wrote:
 The time for this discussion was months ago. I would not have spent many
 many hours of my time if I thought it was going to be thrown away. I find
 this attitude puzzling, to say the least. You were a major part of the
 discussion when we said OK, we'll leave json as it is (text based) and add
 jsonb. That's exactly what we're doing.

certainly. I'll shut my yap; I understand your puzzlement.  At the
time though, I had assumed the API was going to incorporate more of
the hstore feature set than it did.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 03:45 PM, Merlin Moncure wrote:

On Wed, Feb 5, 2014 at 2:37 PM, Andrew Dunstan and...@dunslane.net wrote:

The time for this discussion was months ago. I would not have spent many
many hours of my time if I thought it was going to be thrown away. I find
this attitude puzzling, to say the least. You were a major part of the
discussion when we said OK, we'll leave json as it is (text based) and add
jsonb. That's exactly what we're doing.

certainly. I'll shut my yap; I understand your puzzlement.  At the
time though, I had assumed the API was going to incorporate more of
the hstore feature set than it did.



And we will. Specifically the indexing ops I mentioned upthread. We've 
got done as much as could be done this cycle. That's how Postgres 
development works.


One of the major complaints about json in 9.3 is that almost all the 
functions and operators involve reparsing the json. The equivalent 
operations for jsonb do not, and should accordingly be significantly 
faster. That's what I have been spending my time on. I don't think 
that's an inconsiderable advance.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Josh Berkus
Merlin,

 Not following this.  I do not see how the presence of jsonb helps at
 all. Client to server communication will be text->binary (and vice
 versa) and handling within the server itself will be in binary.  This
 is the crux of my point.

Except that handling it on the server, in binary, would require using
the HSTORE syntax.  Otherwise you're converting from text JSON and back
whenever you want to nest functions.

 I kind of get this point.   But in lieu of a practical use case today,
 what's the rush to implement?   I fully anticipate I'm out in left
 field on this one (I have a cot and mini fridge there).  The question
 on the table is: what use cases (performance included) does jsonb
 solve that can't be solved without it?

Indexed element extraction.  JSON path queries.  JSON manipulation.

If JSONB is in 9.4, then these are things we can build as extensions and
have available long before September 2015 -- in fact, we've already
started on a couple.  If JSONB isn't in core as a data type, then we
have to wait for the 9.5 dev cycle to do anything.

 In fact, I'll go further and say it seem wise for all SQL standard
 type work to happen in extensions.  As long as it's an in core contrib
 extension, I see no stigma to that whatsoever.  It's not clear at all
 to me why the json type was put to the public schema and now we're
 about to double down with jsonb.

I'll agree that having hstore in contrib and json in core has been a
significant source of issues.

On 02/05/2014 12:45 PM, Merlin Moncure wrote:
 certainly. I'll shut my yap; I understand your puzzlement.  At the
 time though, I had assumed the API was going to incorporate more of
 the hstore feature set than it did.

That was the original goal.  However, Oleg and Teodor's late delivery of
Hstore2 limited what Andrew could do for JSONB before CF4 started.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Merlin Moncure
On Wed, Feb 5, 2014 at 3:03 PM, Josh Berkus j...@agliodbs.com wrote:
 That was the original goal.  However, Oleg and Teodor's late delivery of
 Hstore2 limited what Andrew could do for JSONB before CF4 started.

yeah. anyways, I'm good on this point.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] encouraging index-only scans

2014-02-05 Thread Robert Haas
First, thanks for this thoughtful email.

On Tue, Feb 4, 2014 at 7:14 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Mon, Feb 3, 2014 at 8:55 AM, Robert Haas robertmh...@gmail.com wrote:
 I've also had some further thoughts about the right way to drive
 vacuum scheduling.  I think what we need to do is tightly couple the
 rate at which we're willing to do vacuuming to the rate at which we're
 incurring vacuum debt.  That is, if we're creating 100kB/s of pages
 needing vacuum, we vacuum at 2-3MB/s (with default settings).

 If we can tolerate 2-3MB/s without adverse impact on other work, then we can
 tolerate it.  Do we gain anything substantial by sand-bagging it?

No.  The problem is the other direction.

 If
 we're creating 10MB/s of pages needing vacuum, we *still* vacuum at
 2-3MB/s.  Not shockingly, vacuum gets behind, the database bloats, and
 everything goes to heck.

 (Your reference to bloat made me think your comments here are about
 vacuuming in general, not specific to IOS.  If that isn't the case, then
 please ignore.)

 If we can only vacuum at 2-3MB/s without adversely impacting other activity,
 but we are creating 10MB/s of future vacuum need, then there are basically
 two possibilities I can think of.  Either the 10MB/s represents a spike, and
 vacuum should tolerate it and hope to catch up on the debt later.  Or it
 represents a new permanent condition, in which case I bought too few hard
 drives for the work load, and no scheduling decision that autovacuum can
 make will save me from my folly. Perhaps there is some middle ground between
 those possibilities, but I don't see room for much middle ground.

 I guess there might be entirely different possibilities not between those
 two; for example, I don't realize I'm doing something that is generating
 10MB/s of vacuum debt, and would like to have this thing I'm doing be
 automatically throttled to the point it doesn't interfere with other
 processes (either directly, or indirectly by bloat)

The underlying issue here is that, in order for there not to be a
problem, a user needs to configure their autovacuum processes to
vacuum at a rate which is greater than or equal to the average rate at
which vacuum debt is being created.  If they don't, they get runaway
bloat.  But to do that, they need to know at what rate they are
creating vacuum debt, which is almost impossible to figure out right
now; and even if they did know it, they'd then need to figure out what
vacuum cost delay settings would allow vacuuming at a rate sufficient
to keep up, which isn't quite as hard to estimate but certainly
involves nontrivial math.  So a lot of people have this set wrong, and
it's not easy to get it right except by frobbing the settings until
you find something that works well in practice.
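
Just to illustrate the arithmetic: with what I believe are the current
defaults (vacuum_cost_limit = 200, autovacuum_vacuum_cost_delay = 20ms,
vacuum_cost_page_miss = 10, vacuum_cost_page_dirty = 20), the worst case of
having to read and dirty every page works out to roughly the 2-3MB/s figure
mentioned above.  A back-of-the-envelope sketch, not actual server code:

#include <stdio.h>

int
main(void)
{
    /* Assumed defaults -- check your own configuration. */
    double cost_limit   = 200;     /* vacuum_cost_limit */
    double cost_delay_s = 0.020;   /* autovacuum_vacuum_cost_delay = 20ms */
    double page_miss    = 10;      /* vacuum_cost_page_miss */
    double page_dirty   = 20;      /* vacuum_cost_page_dirty */
    double block_mb     = 8192.0 / (1024.0 * 1024.0);

    double budget_per_s = cost_limit / cost_delay_s;   /* cost units per second */
    double worst_page   = page_miss + page_dirty;      /* page read and dirtied */
    double pages_per_s  = budget_per_s / worst_page;

    printf("roughly %.1f MB/s when every page must be read and dirtied\n",
           pages_per_s * block_mb);                    /* about 2.6 MB/s */
    return 0;
}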

Also, a whole *lot* of problems in this area are caused by cases where
the rate at which vacuum debt is being created *changes*.  Autovacuum
is keeping up, but then you have either a load spike or just a gradual
increase in activity and it doesn't keep up any more.  You don't
necessarily notice right away, and by the time you do there's no easy
way to recover.  If you've got a table with lots of dead tuples in it,
but it's also got enough internal freespace to satisfy as many inserts
and updates as are happening, then it's possibly reasonable to put off
vacuuming in the hopes that system load will be lower at some time in
the future.  But if you've got a table with lots of dead tuples in it,
and you're extending it to create internal freespace instead of
vacuuming it, it is highly likely that you are not doing what will make
the user most happy.  Even if vacuuming that table slows down
foreground activity quite badly, it is probably better than
accumulating an arbitrary amount of bloat.

 The rate of vacuuming needs to be tied
 somehow to the rate at which we're creating stuff that needs to be
 vacuumed.  Right now we don't even have a way to measure that, let
 alone auto-regulate the aggressiveness of autovacuum on that basis.

 There is the formula used to decide when a table gets vacuumed.  Isn't the
 time delta in this formula a measure of how fast we are creating stuff that
 needs to be vacuumed for bloat reasons?  Is your objection that it doesn't
 include other reasons we might want to vacuum, or that it just doesn't work
 very well, or that is not explicitly exposed?

AFAICT, the problem isn't when the table gets vacuumed so much as *how
fast* it gets vacuumed.  The autovacuum algorithm does a fine job
selecting tables for vacuuming, for the most part.  There are problems
with insert-only tables and sometimes for large tables the default
threshold (0.20) is too high, but it's not terrible.  However, the
limit on the overall rate of vacuuming activity to 2-3MB/s regardless
of how fast we're creating vacuum debt is a big problem.

 Similarly, for marking of pages as all-visible, we currently make the
 same decision whether the relation is getting index-scanned 

Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 04:06 PM, Merlin Moncure wrote:

On Wed, Feb 5, 2014 at 3:03 PM, Josh Berkus j...@agliodbs.com wrote:

That was the original goal.  However, Oleg and Teodor's late delivery of
Hstore2 limited what Andrew could do for JSONB before CF4 started.


I also had issues. But this is the sort of thing that happens. We get 
done as much as we can.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor performance improvement in transition to external sort

2014-02-05 Thread Robert Haas
On Tue, Feb 4, 2014 at 5:22 PM, Jeremy Harris j...@wizmail.org wrote:
 The attached patch replaces the existing siftup method for heapify with
 a siftdown method. Tested with random integers it does 18% fewer
 compares and takes 10% less time for the heapify, over the work_mem
 range 1024 to 1048576.

 Both algorithms appear to be O(n) (contradicting Wikipedia's claim
 that a siftup heapify is O(n log n)).

I think Wikipedia is right.  Inserting a tuple into a heap is O(lg
n); doing that n times is therefore O(n lg n).  Your patch likewise
implements an outer loop which runs O(n) times and an inner loop which
runs at most O(lg n) times, so a naive analysis of that algorithm also
comes out to O(n lg n).  Wikipedia's article on
http://en.wikipedia.org/wiki/Heapsort explains why a tighter bound is
possible for the siftdown case.
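
For reference, the siftdown approach looks roughly like the generic sketch
below (an illustration of the technique on a plain int array, not the
tuplesort code).  The intuition for the tighter bound is that most nodes sit
near the leaves, so they only sift down a short distance:

#include <stdio.h>

/* Bottom-up ("siftdown") heapify for an int max-heap; illustration only. */
static void
sift_down(int *a, int n, int i)
{
    for (;;)
    {
        int left = 2 * i + 1;
        int right = left + 1;
        int largest = i;
        int tmp;

        if (left < n && a[left] > a[largest])
            largest = left;
        if (right < n && a[right] > a[largest])
            largest = right;
        if (largest == i)
            break;              /* heap property holds below this node */
        tmp = a[i];
        a[i] = a[largest];
        a[largest] = tmp;
        i = largest;            /* keep sifting the displaced element down */
    }
}

static void
heapify(int *a, int n)
{
    int i;

    /* Last internal node first; the leaves are already trivial heaps. */
    for (i = n / 2 - 1; i >= 0; i--)
        sift_down(a, n, i);
}

int
main(void)
{
    int a[] = {3, 9, 2, 7, 5, 1, 8};
    int n = (int) (sizeof(a) / sizeof(a[0]));
    int i;

    heapify(a, n);
    for (i = 0; i < n; i++)
        printf("%d ", a[i]);    /* a[0] now holds the maximum */
    printf("\n");
    return 0;
}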

I think I tested something like this at some point and concluded that
it didn't really help much, because building the initial heap was a
pretty small part of the runtime of a large sort.  It may still be
worth doing, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-05 Thread Noah Misch
On Wed, Feb 05, 2014 at 02:07:29PM -0500, Tom Lane wrote:
 Of course, a direct system catalog scan is certainly no safer in an
 aborted transaction than the one that catcache.c is refusing to do.
 Therefore, in my opinion, relcache.c ought also to be doing an
 Assert(IsTransactionState()), at least in ScanPgRelation and perhaps
 everywhere that it does catalog scans.
 
 I stuck such an Assert in ScanPgRelation, and verified that it doesn't
 break any existing regression tests --- although of course the above
 test case still fails, and now even without declaring an index.
 
 Barring objections I'll go commit that.  It's a bit pointless to be
 Asserting that catcache.c does nothing unsafe when relcache.c does
 the same things without any such test.
 
 (Alternatively, maybe we should centralize the asserting in
 systable_beginscan or some such place?)

I'd put it in LockAcquireExtended() and perhaps also heap_beginscan_internal()
and/or index_beginscan_internal().  ScanPgRelation() isn't special.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-05 Thread Andres Freund
On 2014-02-05 14:07:29 -0500, Tom Lane wrote:
 I stuck such an Assert in ScanPgRelation, and verified that it doesn't
 break any existing regression tests --- although of course the above
 test case still fails, and now even without declaring an index.
 
 Barring objections I'll go commit that.  It's a bit pointless to be
 Asserting that catcache.c does nothing unsafe when relcache.c does
 the same things without any such test.
 
 (Alternatively, maybe we should centralize the asserting in
 systable_beginscan or some such place?)

I don't have a problem with sticking an additional assert elsewhere, but
I think ScanPgRelation, systable_beginscan are a bit late, because they
frequently won't be hit during testing because the lookups will be
cached...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-05 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-02-05 14:07:29 -0500, Tom Lane wrote:
 I stuck such an Assert in ScanPgRelation, and verified that it doesn't
 break any existing regression tests --- although of course the above
 test case still fails, and now even without declaring an index.
 
 Barring objections I'll go commit that.  It's a bit pointless to be
 Asserting that catcache.c does nothing unsafe when relcache.c does
 the same things without any such test.
 
 (Alternatively, maybe we should centralize the asserting in
 systable_beginscan or some such place?)

 I don't have a problem with sticking an additional assert elsewhere, but
 I think ScanPgRelation, systable_beginscan are a bit late, because they
 frequently won't be hit during testing because the lookups will be
 cached...

Oh, good point.  By analogy to the placement of the existing Assert in
SearchCatCache, the one for relcache should be in RelationIdGetRelation.

[ experiments... ]  OK, that passes regression tests too.  Good...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-05 Thread Peter Geoghegan
On Thu, Jan 23, 2014 at 1:36 PM, Peter Geoghegan p...@heroku.com wrote:
 So while post-recovery callbacks no longer exist for any
 rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
 concern the simple management of memory of AM-specific recovery
 contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
 a better abstraction than that, such as a generic recovery memory
 context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
 just calls those callbacks for each resource at exactly the same time
 anyway, just as it initializes resource managers in precisely the same
 manner earlier on. Plus if you look at what those AM-local memory
 management routines do, it all seems very simple.

What are your thoughts on this, as someone that has a broader
perspective here? Are you inclined to keep the startup and cleanup
callbacks in anticipation of a day when that degree of generality is
useful? That would be pretty well-precedented of course, but I would
like to hear your opinion.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb and nested hstore

2014-02-05 Thread Andrew Dunstan


On 02/05/2014 01:10 PM, Andrew Dunstan wrote:


On 02/05/2014 12:48 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 02/05/2014 11:40 AM, Tom Lane wrote:

switching to "binary is the same as text" may well be the most prudent
path here.

If we do that we're going to have to live with that forever, aren't we?

Yeah, but the other side of that coin is that we'll have to live forever
with whatever binary format we pick, too.  If it turns out to be badly
designed, that could be much worse than eating some parsing costs during
dump/restore.

If we had infinite time/manpower, this wouldn't really be an issue.
We don't, though, and so I suggest that this may be one of the better
things to toss overboard.





The main reason I'm prepared to consider this is that the JSON parser seems 
to be fairly efficient (see Oleg's recent stats), and in fact we'd more 
or less be parsing the binary format on input anyway, so there's no 
proof that a binary format is going to be hugely faster (or possibly 
even that it will be faster at all).


If anyone else has opinions on this, sing out pretty darn soon (like 
the next 24 hours or so), before I begin.



I got a slightly earlier start ;-) For people wanting to play along, 
here's what this change looks like: 
https://github.com/feodor/postgres/commit/3fe899b3d7e8f806b14878da4a4e2331b0eb58e8
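
In other words, if "binary is the same as text" carries the day, the on-the-wire
format presumably amounts to little more than the existing text representation,
perhaps framed behind a version byte for future flexibility.  A generic,
standalone sketch of that framing (the one-byte version prefix here is my
assumption for illustration, not a description of the actual send/recv code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustration: wrap a JSON text payload behind a hypothetical version byte. */
static unsigned char *
wrap_text_as_binary(const char *json_text, size_t *out_len)
{
    size_t          textlen = strlen(json_text);
    unsigned char  *buf = malloc(1 + textlen);

    buf[0] = 1;                          /* hypothetical format version */
    memcpy(buf + 1, json_text, textlen); /* payload is just the text form */
    *out_len = 1 + textlen;
    return buf;
}

int
main(void)
{
    size_t          len;
    unsigned char  *wire = wrap_text_as_binary("{\"a\": [1, 2, 3]}", &len);

    printf("version %d, %zu payload bytes\n", wire[0], len - 1);
    free(wire);
    return 0;
}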


I have a bit more cleanup to do and then I'll try to make new patches.

cheers

andrew




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-02-05 Thread Marti Raudsepp
On Tue, Jan 28, 2014 at 7:51 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp ma...@juffo.org wrote:

 But some benchmarks of planning performance are certainly warranted.


 I didn't test it, but I worry that overhead might be high.
  If it's true then it could be like the constraint_exclusion option, which is
 off by default because of planning overhead.


Sorry I didn't get around to this before.

I ran some synthetic benchmarks with single-column inner joins between 5
tables, with indexes on both joined columns, using only EXPLAIN (so
measuring planning time, not execution) in 9 scenarios to excercise
different code paths. According to these measurements, the overhead ranges
between 1.0 and 4.5% depending on the scenario.


Merge join with partial sort children seems like a fairly obscure use case
(though I'm sure it can help a lot in those cases). The default should
definitely allow partial sort in normal ORDER BY queries. What's under
question here is whether to enable partial sort for mergejoin.

So I see 3 possible resolutions:
1. The overhead is deemed acceptable to enable by default, in which case
we're done here.
2. Add a three-value runtime setting like: enable_partialsort = [ off |
no_mergejoin | on ], defaulting to no_mergejoin (just to get the point
across, clearly we need better naming). This is how constraint_exclusion
works.
3. Remove the partialsort mergejoin code entirely, keeping the rest of the
cases.

What do you think?
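
For concreteness, option 2 could look something like the sketch below
(standalone illustration only; the names, the default, and where the check
would actually live in the planner are all placeholders):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical three-valued setting, mirroring how constraint_exclusion
 * works; none of these names are a proposal for the final spelling. */
typedef enum
{
    PARTIALSORT_OFF,
    PARTIALSORT_NO_MERGEJOIN,
    PARTIALSORT_ON
} PartialSortOption;

static PartialSortOption enable_partialsort = PARTIALSORT_NO_MERGEJOIN;

/* Would the planner be allowed to generate a partial sort here? */
static bool
partial_sort_allowed(bool for_mergejoin_child)
{
    if (enable_partialsort == PARTIALSORT_OFF)
        return false;
    if (enable_partialsort == PARTIALSORT_NO_MERGEJOIN && for_mergejoin_child)
        return false;
    return true;
}

int
main(void)
{
    printf("ORDER BY path allowed: %d, mergejoin child allowed: %d\n",
           partial_sort_allowed(false), partial_sort_allowed(true));
    return 0;
}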


All the tests are available here:
https://github.com/intgr/benchjunk/tree/master/partial_sort (using script
run2.sh)

Overhead by test (partial-sort-7.patch.gz):
join5.sql 2.9% (all joins on the same column)
star5.sql 1.7% (star schema kind of join)
line5.sql 1.9% (joins chained to each other)
lim_join5.sql 4.5% (same as above, with LIMIT 1)
lim_star5.sql 2.8%
lim_line5.sql 1.8%
limord_join5.sql 4.3% (same as above, with ORDER BY  LIMIT 1)
limord_star5.sql 3.9%
limord_line5.sql 1.0%

Full data:
PostgreSQL @ git ac8bc3b
join5.sql tps = 499.490173 (excluding connections establishing)
join5.sql tps = 503.756335 (excluding connections establishing)
join5.sql tps = 504.814072 (excluding connections establishing)
star5.sql tps = 492.799230 (excluding connections establishing)
star5.sql tps = 492.570615 (excluding connections establishing)
star5.sql tps = 491.949985 (excluding connections establishing)
line5.sql tps = 773.945050 (excluding connections establishing)
line5.sql tps = 773.858068 (excluding connections establishing)
line5.sql tps = 774.551240 (excluding connections establishing)
lim_join5.sql tps = 392.539745 (excluding connections establishing)
lim_join5.sql tps = 391.867549 (excluding connections establishing)
lim_join5.sql tps = 393.361655 (excluding connections establishing)
lim_star5.sql tps = 418.431804 (excluding connections establishing)
lim_star5.sql tps = 419.258985 (excluding connections establishing)
lim_star5.sql tps = 419.434697 (excluding connections establishing)
lim_line5.sql tps = 713.852506 (excluding connections establishing)
lim_line5.sql tps = 713.636694 (excluding connections establishing)
lim_line5.sql tps = 712.971719 (excluding connections establishing)
limord_join5.sql tps = 381.068465 (excluding connections establishing)
limord_join5.sql tps = 380.379359 (excluding connections establishing)
limord_join5.sql tps = 381.182385 (excluding connections establishing)
limord_star5.sql tps = 412.997935 (excluding connections establishing)
limord_star5.sql tps = 411.401352 (excluding connections establishing)
limord_star5.sql tps = 413.209784 (excluding connections establishing)
limord_line5.sql tps = 688.906406 (excluding connections establishing)
limord_line5.sql tps = 689.445483 (excluding connections establishing)
limord_line5.sql tps = 688.758042 (excluding connections establishing)

partial-sort-7.patch.gz
join5.sql tps = 479.508034 (excluding connections establishing)
join5.sql tps = 488.263674 (excluding connections establishing)
join5.sql tps = 490.127433 (excluding connections establishing)
star5.sql tps = 482.106063 (excluding connections establishing)
star5.sql tps = 484.179687 (excluding connections establishing)
star5.sql tps = 483.027372 (excluding connections establishing)
line5.sql tps = 758.092993 (excluding connections establishing)
line5.sql tps = 759.697814 (excluding connections establishing)
line5.sql tps = 759.792792 (excluding connections establishing)
lim_join5.sql tps = 375.517211 (excluding connections establishing)
lim_join5.sql tps = 375.539109 (excluding connections establishing)
lim_join5.sql tps = 375.841645 (excluding connections establishing)
lim_star5.sql tps = 407.683110 (excluding connections establishing)
lim_star5.sql tps = 407.414409 (excluding connections establishing)
lim_star5.sql tps = 407.526613 (excluding connections establishing)
lim_line5.sql tps = 699.905101 (excluding connections establishing)
lim_line5.sql tps = 700.349675 (excluding connections establishing)

Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-05 Thread Peter Geoghegan
On Tue, Feb 4, 2014 at 11:56 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I also changed _bt_moveright to never return a write-locked buffer, when the
 caller asked for a read-lock (an issue you pointed out earlier in this
 thread).

I think that _bt_moveright() looks good now.

There is now bitrot, caused by commit ac8bc3. I rebased it myself, but
that was surprisingly non-trivial; I had to tweak things by hand.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-05 Thread Craig Ringer
On 02/05/2014 01:52 PM, Tom Lane wrote:
 Craig Ringer cr...@2ndquadrant.com writes:
 On 02/05/2014 06:29 AM, Tom Lane wrote:
 I had been okay with the manual PGDLLIMPORT-sprinkling approach
 (not happy with it, of course, but prepared to tolerate it) as long
 as I believed the buildfarm would reliably tell us of the need for
 it.  That assumption has now been conclusively disproven, though.
 
 I'm kind of horrified that the dynamic linker doesn't throw its toys
 when it sees this.
 
 Indeed :-(.
 
 The truly strange part of this is that it seems that the one Windows
 buildfarm member that's telling the truth (or most nearly so, anyway)
 is narwhal, which appears to have the oldest and cruftiest toolchain
 of the lot.  I'd really like to come out the other end of this
 investigation with a clear understanding of why the newer toolchains
 are failing to report a link problem, and yet not building working
 executables.

I've done some digging and asked for some input from people with
appropriate knowledge. Getting a hard answer's going to require some
quality time with a debugger that I won't have until after the CF.
Here's what I've found so far:


In a .DEF file, symbol exports default to code exports if unspecified.

http://msdn.microsoft.com/en-us/library/hyx1zcd3.aspx

(I'm told that .DEF files take precedence over annotations in code, but
haven't found a reference to confirm that yet, and I'm not convinced
it's true since data annotated correctly with __declspec(dllexport)
seems to work. You'd think places like this would mention it, but they
don't: http://msdn.microsoft.com/en-us/library/7k30y2k5.aspx)


So we're probably generating code thunk functions for our global
variables in the import library.

There seem to be two possibilities for what's happening when you access
an extern variable defined in another module without __declspec(dllimport):

1. Our extern variable points to a thunk function's definition; or

2. Our extern variable points to the indirected pointer to the real
variable, which is what you get if you link to a CONSTANT export without
using __declspec(dllimport).

I'd need to go poking in disassemblies with the debugger to confirm
which it is, and don't have time right now. After the CF, maybe. I
suspect #1, since we're probably generating thunk functions for our data
exports.



Either way, the key thing is:

http://msdn.microsoft.com/en-us/library/hyx1zcd3.aspx

Note that when you export a variable from a DLL with a .def file, you
do not need to specify __declspec(dllexport) on the variable. However,
in any file that uses the DLL, you must still use __declspec(dllimport)
on the declaration of data.

See also rules and limitations for dllimport/dllexport:

http://msdn.microsoft.com/en-us/library/da6zd0a4.aspx

This page notes that you should annotate .DEF entries for data with DATA
to reduce the likelihood that incorrect coding will cause a problem.

http://msdn.microsoft.com/en-us/library/54xsd65y.aspx

That will prevent generation of the linker symbol that's a pointer to
the real data, so we should get linker errors when we fail to specify
__declspec(dllimport).

To do this, src/tools/msvc/gendefs.pl needs to be modified to correctly
annotate globals with DATA based on the dumpbin output. I'll have a
crack at that after the CF is closed.

However, *this modification will not get rid of the need for PGDLLIMPORT
macros in headers*, because it's still necessary to explicitly tag them
with __declspec(dllimport) for correct results.
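
To illustrate the point, here is a minimal sketch, using work_mem as a
stand-in for any exported backend global; the .def excerpt is
hypothetical and just shows the DATA keyword gendefs.pl would have to
emit:

/*
 * Hypothetical postgres.def excerpt:
 *
 *     EXPORTS
 *         PostmasterMain
 *         work_mem            DATA
 *
 * The DATA keyword makes the linker reject a plain, non-imported
 * reference to the variable instead of silently binding it to a thunk.
 */

/* The backend header must still tag the variable; on Windows PGDLLIMPORT
 * expands to __declspec(dllimport) when the header is included from a
 * loadable module (and to the export form when building the backend). */
extern PGDLLIMPORT int work_mem;

/* In an extension's .c file no decoration is needed at the use site; the
 * import attribute comes from the declaration above, so the reference
 * goes through the DLL import table rather than a thunk. */
static int
peek_work_mem(void)
{
    return work_mem;
}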

In fact, I'm now wondering if we're using inefficient thunk-based
function calls from modules because we're not specifying
__declspec(dllimport) on external functions, too. Again, that's a
post-CF thing to check out using the MSVC debugger.





BTW, there's a general reference for __declspec(dllexport) here:

http://msdn.microsoft.com/en-us/library/a90k134d.aspx

and higher level concepts here:

http://msdn.microsoft.com/en-us/library/z4zxe9k8.aspx
http://msdn.microsoft.com/en-us/library/900axts6.aspx




Other references:

http://msdn.microsoft.com/en-us/library/da6zd0a4.aspx
http://msdn.microsoft.com/en-us/library/zw3za17w.aspx
http://msdn.microsoft.com/en-us/library/619w14ds.aspx
http://msdn.microsoft.com/en-us/library/54xsd65y.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2006/07/20/672695.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2006/07/18/669668.aspx

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-05 Thread Craig Ringer
On 02/06/2014 10:14 AM, Craig Ringer wrote:
 On 02/05/2014 01:52 PM, Tom Lane wrote:
 Craig Ringer cr...@2ndquadrant.com writes:
 On 02/05/2014 06:29 AM, Tom Lane wrote:
 I had been okay with the manual PGDLLIMPORT-sprinkling approach
 (not happy with it, of course, but prepared to tolerate it) as long
 as I believed the buildfarm would reliably tell us of the need for
 it.  That assumption has now been conclusively disproven, though.

 I'm kind of horrified that the dynamic linker doesn't throw its toys
 when it sees this.

 Indeed :-(.

 The truly strange part of this is that it seems that the one Windows
 buildfarm member that's telling the truth (or most nearly so, anyway)
 is narwhal, which appears to have the oldest and cruftiest toolchain
 of the lot.  I'd really like to come out the other end of this
 investigation with a clear understanding of why the newer toolchains
 are failing to report a link problem, and yet not building working
 executables.
 
 I've done some digging and asked for some input from people with
 appropriate knowledge. Getting a hard answer's going to require some
 quality time with a debugger that I won't have until after the CF.

... though this link may shed some light, it turns out:

http://msdn.microsoft.com/en-us/library/aa271769(v=vs.60).aspx

So, yeah. It looks like we're probably linking to thunk functions as
data (ouch). To produce the desired error if __declspec(dllimport) is
missing we need to change gendefs.pl to emit DATA for exported globals.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-02-05 Thread Robert Haas
On Wed, Feb 5, 2014 at 2:26 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Then again, why is the behavior of schema-qualifying absolutely
 everything even desirable?

 Well, someone could create a collation in another schema with the same
 name as a system collation and the command would become ambiguous.  For
 example, this command
   create table f (a text collate POSIX);
 creates a column whose collation is either public.POSIX or the system
 POSIX collation, depending on whether public appears before pg_catalog
 in search_path.  Having someone create a POSIX collation in a schema
 that appears earlier than pg_catalog is pretty far-fetched, but ISTM
 that we really need to cover that scenario.

Hmm, good point.  I guess we don't worry much about this with pg_dump
because we assume that we're restoring into an empty database (and if
not, the user gets to keep both pieces).  You're applying a higher
standard here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Partial sort

2014-02-05 Thread Robert Haas
On Wed, Feb 5, 2014 at 6:58 PM, Marti Raudsepp ma...@juffo.org wrote:
 I ran some synthetic benchmarks with single-column inner joins between 5
 tables, with indexes on both joined columns, using only EXPLAIN (so
 measuring planning time, not execution) in 9 scenarios to exercise
 different code paths. According to these measurements, the overhead ranges
 between 1.0 and 4.5% depending on the scenario.

Hmm, sounds a little steep.  Why is it so expensive?  I'm probably
missing something here, because I would have thought that planner
support for partial sorts would consist mostly of considering the same
sorts we consider today, but with the costs reduced by the batching.
Changing the cost estimation that way can't be that much more
expensive than what we're already doing, so the overhead should be
minimal.  What the patch is actually doing seems to be something quite
a bit more invasive than that, but I'm not sure what it is exactly, or
why.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Robert Haas
On Wed, Feb 5, 2014 at 6:59 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Attached is a quick patch for that, if you want to test it.

But if we really just want to do prefix/suffix compression, this is a
crappy and expensive way to do it.  We needn't force everything
through the pglz tag format just because we elide a common prefix or
suffix.
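
For instance, a dedicated record layout for pure prefix/suffix elision
could be as simple as this sketch (field names are hypothetical, not
from any posted patch):

#include <stdint.h>

typedef struct DeltaEncodedUpdate
{
    uint16_t    prefix_len;     /* leading bytes shared with the old tuple */
    uint16_t    suffix_len;     /* trailing bytes shared with the old tuple */
    /* the changed middle portion of the new tuple follows, unencoded */
} DeltaEncodedUpdate;

On replay you would rebuild the new tuple by copying prefix_len bytes
from the old tuple, then the literal middle from the record, then
suffix_len bytes from the old tuple's tail, with no pglz decoding
involved at all.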

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Amit Kapila
On Wed, Feb 5, 2014 at 8:50 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 02/05/2014 04:48 PM, Amit Kapila wrote:

 I have done one test where there is a large suffix match, but
 not large enough that it can compress more than 75% of string,
 the CPU overhead with wal-update-prefix-suffix-encode-1.patch is
 not much, but there is no I/O reduction as well.


 Hmm, it's supposed to compress if you save at least 25%, not 75%. Apparently
 I got that backwards in the patch...

So if I understand the code correctly, the new check should be

if (prefixlen + suffixlen < (slen * need_rate) / 100)
    return false;

rather than

if (slen - prefixlen - suffixlen > (slen * need_rate) / 100)
    return false;

Please confirm, else any validation of this might not be useful.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Robert Haas
On Wed, Feb 5, 2014 at 6:43 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 So, I came up with the attached worst case test, modified from your latest
 test suite.

 unpatched:


               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     343385312 | 2.20806908607483
 ten long fields, all but one changed |     336263592 | 2.18997097015381
 ten long fields, all but one changed |     336264504 | 2.17843413352966
(3 rows)


 pgrb_delta_encoding_v8.patch:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     338356944 | 3.33501315116882
 ten long fields, all but one changed |     344059272 | 3.37364101409912
 ten long fields, all but one changed |     336257840 | 3.36244201660156
(3 rows)

 So with this test, the overhead is very significant.

Yuck.  Well that sucks.

 With the skipping logic, another kind of worst case case is that you have
 a lot of similarity between the old and new tuple, but you miss it because
 you skip. For example, if you change the first few columns, but leave a
 large text column at the end of the tuple unchanged.

I suspect there's no way to have our cake and eat it, too.  Most of
the work that Amit has done on this patch in the last few revs is to
cut back CPU overhead in the cases where the patch can't help because
the tuple has been radically modified.  If we're trying to get maximum
compression, we need to go the other way: for example, we could just
feed both the old and new tuples through pglz (or snappy, or
whatever).  That would allow us to take advantage not only of
similarity between the old and new tuples but also internal
duplication within either the old or the new tuple, but it would also
cost more CPU.  The concern with minimizing overhead in cases where
the compression doesn't help has thus far pushed us in the opposite
direction, namely passing over compression opportunities that a more
aggressive algorithm could find in order to keep the overhead low.

Off-hand, I'm wondering why we shouldn't apply the same skipping
algorithm that Amit is using at the beginning of the string for the
rest of it as well.  It might be a little too aggressive (maybe the
skip distance shouldn't increase by quite as much as doubling every
time, or not beyond 16/32 bytes?) but I don't see why the general
principle isn't sound wherever we are in the tuple.
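
Just to illustrate the shape of that idea (this is only a sketch of
"double the stride on a miss, cap it, reset on a hit", not the actual
skip logic in the patch):

#include <stdbool.h>

#define MAX_SKIP 32             /* illustrative cap on the stride */

static void
scan_with_capped_skipping(int len, bool (*match_at) (int pos))
{
    int     pos = 0;
    int     skip = 1;

    while (pos < len)
    {
        if (match_at(pos))
            skip = 1;           /* hit: drop back to byte-by-byte scanning */
        else if (skip < MAX_SKIP)
            skip *= 2;          /* miss: widen the stride, up to the cap */
        pos += skip;
    }
}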

Unfortunately, despite changing things to make a history entry only
every 4th character, building the history is still pretty expensive.
By the time we even begin looking at the tuple we're gonna compress,
we've already spent something like half the total effort, and of
course we have to go further than that before we know whether our
attempt to compress is actually going anywhere.  I think that's the
central problem here.  pglz has several safeguards to ensure that it
doesn't do too much work in vain: we abort if we find nothing
compressible within first_success_by bytes, or if we emit enough total
output to be certain that we won't meet the need_rate threshold.
Those safeguards are a lot less effective here because they can't be
applied until *after* we've already paid the cost of building the
history.  If we could figure out some way to apply those guards, or
other guards, earlier in the algorithm, we could do a better job
mitigating the worst-case scenarios, but I don't have a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-05 Thread Peter Geoghegan
Some more thoughts:

Please add comments above _bt_mark_page_halfdead(), a new routine from
the dependency patch. I realize that this is substantially similar to
part of how _bt_pagedel() used to work, but it's still incongruous.

 ! Our approach is to create any missing downlinks on-they-fly, when
 ! searching the tree for a new insertion. It could be done during searches,
 ! too, but it seems best not to put any extra updates in what would otherwise

s/on-they-fly/on-the-fly/

I'm not sure about this:

 *** _bt_findinsertloc(Relation rel,
 *** 675,680 
 --- 701,707 
 static void
 _bt_insertonpg(Relation rel,
  Buffer buf,
 +Buffer cbuf,
  BTStack stack,
  IndexTuple itup,
  OffsetNumber newitemoff,

Wouldn't lcbuf be a better name for the new argument? After all, in
relevant contexts 'buf' is the parent of both of the pages post-split,
but there are two children in play here. You say:

+  *   When inserting to a non-leaf page, 'cbuf' is the left-sibling 
of the
+  *   page we're inserting the downlink for.  This function will 
clear the
+  *   INCOMPLETE_SPLIT flag on it, and release the buffer.

I suggest also noting in comments at this point that cbuf/the
left-sibling is the old half from the split.

Even having vented about cbuf's name, I can kind of see why you didn't
call it lcbuf: we already have a BTPageOpaque lpageop variable for
the special 'buf' metadata within the _bt_insertonpg() function; so
there might be an ambiguity between the possibly-soon-to-be-left page
(the target page) and the *child* left page that needs to have its
flag cleared. Although I still think you should change the name of
lpageop (further details follow).

Consider this:

   /* release buffers */
+  if (!P_ISLEAF(lpageop))
+  _bt_relbuf(rel, cbuf);

(Rebased, so this looks a bit different to your original version IIRC).

This is checking that the _bt_insertonpg() target page (whose special
data lpageop points to) is not a leaf page, and releasing the *child*
left page if it isn't. This is pretty confusing. So I suggest naming
lpageop tpageop ('t' for target, a terminology already used in the
comments above the function).

Also, I suggest you change this existing code comment:

 *  On entry, we must have the right buffer in which to do the
 *  insertion, and the buffer must be pinned and write-locked.  
On return,
 *  we will have dropped both the pin and the lock on the buffer.

Change "right" to "correct" here. Minor point, but "right" is a loaded word.

But why are you doing new things while checking P_ISLEAF(lpageop) in
_bt_insertonpg() to begin with? Would you not be better off doing
things when there is a child buffer passed (e.g. if
(BufferIsValid(cbuf))...), and only then asserting
!P_ISLEAF(lpageop)? That seems to me to more naturally establish the
context that _bt_insertonpg() is called to insert a downlink after a
split. Although maybe that conflates things within the XLOG stuff. Hmm.
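
In code form, something like this (illustrative only, not lifted from
the patch):

if (BufferIsValid(cbuf))
{
    /* only non-leaf insertions carry a finished left child to release */
    Assert(!P_ISLEAF(lpageop));
    _bt_relbuf(rel, cbuf);
}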

Perhaps some of these observations could equally well apply to
_bt_split(), which is similarly charged with releasing a left child
page on inserting a downlink. Particularly around reconsidering the
left vs left child vs target terminology.

Consider what happens when _bt_split() is passed a (left) child buffer
(a 'cbuf' argument). We are:

1. Splitting a page (first phase, which may only be finished lazily
some time later).

2. Iff this is a non-leaf page split, simultaneously unmark the flag
from some *other* split which we have a child buffer to unmark. Needs
to be part of same critical section. Unlock + release cbuf, again only
iff target/right split page contains a leaf page.

So, at the risk of belaboring the point:

* Some callers to _bt_split() and _bt_insertonpg() pass a 'cbuf' that
is not invalid.

* If either of those 2 functions don't unlock + release those buffers,
no one ever will.

* Therefore, at the very least we should unlock + release those
buffers **just because they're not invalid**. That's a sufficient
condition to do so, and attaching that to level is unnecessarily
unclear. As I said, assert non-leafness.

Incidentally, I think that in general we could be clearer on the fact
that a root page may also be a leaf page.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Row-security on updatable s.b. views

2014-02-05 Thread Craig Ringer
On 02/04/2014 02:43 PM, Craig Ringer wrote:
 On 01/30/2014 04:05 PM, Craig Ringer wrote:
 On 01/30/2014 01:25 PM, Craig Ringer wrote:
 On 01/29/2014 09:47 PM, Craig Ringer wrote:
 https://github.com/ringerc/postgres/compare/rls-9.4-upd-sb-views

 i.e. https://github.com/ringerc/postgres.git ,
  branch rls-9.4-upd-sb-views

 (subject to rebasing) or the non-rebased tag rls-9.4-upd-sb-views-v2

 Pushed an update to the branch. New update tagged
 rls-9.4-upd-sb-views-v3 . Fixes an issue with rowmarking that stems from
 the underlying updatable s.b. views patch.

 Other tests continue to fail, this isn't ready yet.

 Specifically:
 
 - row-security rule recursion detection isn't solved yet, it just
 overflows the stack.
 
 This is now fixed in the newly tagged rls-9.4-upd-sb-views-v4 in
 g...@github.com:ringerc/postgres.git .

Based on Tom's objections, another approach is presented in
rls-9.4-upd-sb-views-v5 on  g...@github.com:ringerc/postgres.git . The
Query node is used to record the recursive expansion parent list
instead, and copying is avoided.

However, I've separately tracked down the cause of the test failures like:

ERROR:  could not open file base/16384/30135: No such file or directory

This occurs where a row-security qual is declared to use a view.
Row-security quals get stored without rewriting (which is necessary, see
below). The qual is injected into securityQuals and expanded, but *not
rewritten*. So the executor sees an unexpected view in the tree.

Because a view RTE has its relid field set to the view's oid, this
doesn't get picked up until we try to actually scan the view relation in
the executor.

(I'd like to add Asserts to make the executor fail a bit more
informatively when you try to scan a view, but that's separate.)


So, that's clearly broken. There are really two possible solutions:

1. Try (again) to do row-security in the rewriter. This was previously
impossible because of the definition of row-security behaviour around
inheritance, but with the simplified inheritance model now proposed I
think it's possible.

2. Invoke RewriteQuery from within expand_security_quals, rewriting the
query after security qual expansion. This is only needed for
row-security; for updatable s.b. views rewriting has happened with
recursion into securityQuals during the original rewrite pass.


I suspect that (2) will result in a resounding yuck.

So I have to see if I can now turn around *again* and plug row-security
into the rewriter after all.  That's a pretty frustrating thing to
discover in mid-late CF4.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-02-05 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 5, 2014 at 2:26 PM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:
 Then again, why is the behavior of schema-qualifying absolutely
 everything even desirable?

 Well, someone could create a collation in another schema with the same
 name as a system collation and the command would become ambiguous.

 Hmm, good point.  I guess we don't worry much about this with pg_dump
 because we assume that we're restoring into an empty database (and if
 not, the user gets to keep both pieces).  You're applying a higher
 standard here.

Robert, that's just horsepucky.  pg_dump is very careful about schemas.
It's also careful to not schema-qualify names unnecessarily, which is an
intentional tradeoff to improve readability of the dump --- at the cost
that the dump might break if restored into a nonempty database with
conflicting objects.  In the case of data passed to event triggers,
there's a different tradeoff to be made: people will probably value
consistency over readability, so always-qualify is probably the right
choice here.  But in neither case are we being sloppy.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-02-05 Thread Gavin Flower

On 06/02/14 16:59, Robert Haas wrote:

On Wed, Feb 5, 2014 at 6:43 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

So, I came up with the attached worst case test, modified from your latest
test suite.

unpatched:


               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     343385312 | 2.20806908607483
 ten long fields, all but one changed |     336263592 | 2.18997097015381
 ten long fields, all but one changed |     336264504 | 2.17843413352966
(3 rows)


pgrb_delta_encoding_v8.patch:

               testname               | wal_generated |     duration
--------------------------------------+---------------+------------------
 ten long fields, all but one changed |     338356944 | 3.33501315116882
 ten long fields, all but one changed |     344059272 | 3.37364101409912
 ten long fields, all but one changed |     336257840 | 3.36244201660156
(3 rows)

So with this test, the overhead is very significant.

Yuck.  Well that sucks.


With the skipping logic, another kind of worst case case is that you have
a lot of similarity between the old and new tuple, but you miss it because
you skip. For example, if you change the first few columns, but leave a
large text column at the end of the tuple unchanged.

I suspect there's no way to have our cake and eat it, too.  Most of
the work that Amit has done on this patch in the last few revs is to
cut back CPU overhead in the cases where the patch can't help because
the tuple has been radically modified.  If we're trying to get maximum
compression, we need to go the other way: for example, we could just
feed both the old and new tuples through pglz (or snappy, or
whatever).  That would allow us to take advantage not only of
similarity between the old and new tuples but also internal
duplication within either the old or the new tuple, but it would also
cost more CPU.  The concern with minimizing overhead in cases where
the compression doesn't help has thus far pushed us in the opposite
direction, namely passing over compression opportunities that a more
aggressive algorithm could find in order to keep the overhead low.

Off-hand, I'm wondering why we shouldn't apply the same skipping
algorithm that Amit is using at the beginning of the string for the
rest of it as well.  It might be a little too aggressive (maybe the
skip distance shouldn't increase by quite as much as doubling every
time, or not beyond 16/32 bytes?) but I don't see why the general
principle isn't sound wherever we are in the tuple.

Unfortunately, despite changing things to make a history entry only
every 4th character, building the history is still pretty expensive.
By the time we even begin looking at the tuple we're gonna compress,
we've already spent something like half the total effort, and of
course we have to go further than that before we know whether our
attempt to compress is actually going anywhere.  I think that's the
central problem here.  pglz has several safeguards to ensure that it
doesn't do too much work in vain: we abort if we find nothing
compressible within first_success_by bytes, or if we emit enough total
output to be certain that we won't meet the need_rate threshold.
Those safeguards are a lot less effective here because they can't be
applied until *after* we've already paid the cost of building the
history.  If we could figure out some way to apply those guards, or
other guards, earlier in the algorithm, we could do a better job
mitigating the worst-case scenarios, but I don't have a good idea.

Surely the weighting should be done according to the relative scarcity 
of processing power vs I/O bandwidth? I get the impression that 
different workloads and hardware configurations may favour conserving 
either processor or I/O resources.  Would it be feasible to have 
different logic, depending on the trade-offs identified?



Cheers,
Gavin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers