Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-15 Thread Mark Kirkwood

On 14/09/14 20:43, Mark Kirkwood wrote:

On 14/09/14 20:24, Atri Sharma wrote:


How do you plan to do all that VACUUM does for this table, then?

It seems to me that you are saying VACUUM need not be concerned with
table 'A', and that you are assuming ownership of all the tasks VACUUM
performs for this table. That seems pretty broken to me, not to mention
the performance degradation.



I think the whole point of such a modification is that nothing is done
to such tables, as you want to see all the previous versions.

Clearly this is less performant for standard workloads...but we are
talking about non-standard workloads, surely...


To be fair with respect to what Atri is saying, I should have said 
something like:


Clearly this is *horribly* less performant for standard workloads...etc :-)

Also, there is the good point he raised about transaction ID (xid)
wraparound, so some messing about with that part of VACUUM would be
required too (it's the little complications that all add up)!



The TRIGGER-based approach is clearly a lot simpler! However, as a project
for understanding Postgres internals, the other approach is worthwhile.


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] orangutan seizes up during isolation-check

2014-09-15 Thread Heikki Linnakangas

On 09/15/2014 07:51 AM, Noah Misch wrote:

libintl replaces setlocale().  Its setlocale(LC_x, "") uses OS-specific APIs
to determine the default locale when $LANG and similar environment variables
are empty, as they are during make check NO_LOCALE=1.  On OS X, it calls[1]
CFLocaleCopyCurrent(), which in turn spins up a thread.  See the end of this
message for the postmaster thread stacks active upon hitting a breakpoint set
at _dispatch_mgr_thread.


Ugh. I'd call that a bug in libintl. setlocale() has no business making
the process multi-threaded.


Do we have the same problem in backends? At a quick glance, aside from 
postmaster we only use PG_SETMASK(BlockSig) in signal handlers, to 
prevent another signal handler from running concurrently.



I see two options for fixing this in pg_perm_setlocale(LC_x, ""):

1. Fork, call setlocale(LC_x, "") in the child, pass back the effective locale
name through a pipe, and pass that name to setlocale() in the original
process.  The short-lived child will get the extra threads, and the
postmaster will remain clean.

2. On OS X, check for relevant environment variables.  Finding none, set
LC_x=C before calling setlocale(LC_x, "").  A variation is to raise
ereport(FATAL) if sufficient environment variables aren't in place.  Either
way ensures the libintl setlocale() will never call CFLocaleCopyCurrent().
This is simpler than (1), but it entails a behavior change: "LANG= initdb"
will use LANG=C or fail rather than use the OS X user account locale.

I'm skeptical of the value of looking up locale information using other OS X
facilities when the usual environment variables are inconclusive, but I see no
clear cause to reverse that decision now.  I lean toward (1).


Both of those are horrible hacks. And who's to say that calling 
setlocale(LC_x, "foo") won't also call some function that makes the 
process multi-threaded. If not in any current OS X release, it might 
still happen in a future one.


One idea would be to use an extra pthread mutex or similar, in addition 
to PG_SETMASK(). Whenever you do PG_SETMASK(BlockSig), also grab the 
mutex, and release it when you do PG_SETMASK(UnBlockSig).


It would be nice to stop doing non-trivial things in the signal handler 
in the first place. It's pretty scary, even though it works when the 
process is single-threaded. I believe the reason it's currently 
implemented like that is the same set of problems that the latch code 
solves with the self-pipe trick: select() is not interrupted by a signal 
on all platforms, and even if it were, you would need pselect(), which is 
not available (or does not work correctly even if it exists) on all 
platforms. I think we could use a latch in postmaster too.


- Heikki





Re: [HACKERS] implement subject alternative names support for SSL connections

2014-09-15 Thread Alexey Klyukin
On Fri, Sep 12, 2014 at 4:20 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 Hmm. If that's what the browsers do, I think we should also err on the
 side of caution here. Ignoring the CN is highly unlikely to cause anyone
 a problem; a CA worth its salt should not issue a certificate with a CN
 that's not also listed in the SAN section. But if you have such a
 certificate anyway for some reason, it shouldn't be too difficult to get
 a new one. Certificates expire every 1-3 years, so there must be a
 procedure to renew them anyway.


 Committed, with that change, ie. the CN is not checked if SANs are present.

 Thanks for bearing through all these iterations!

Great news! Thank you very much for devoting your time and energy to
the review and providing such useful feedback!
On the CN thing, I don't have particularly strong arguments for either
of the possible behaviors, so sticking to the RFC makes sense here.

Sincerely,
-- 
Alexey Klyukin




Re: [HACKERS] pgbench throttling latency limit

2014-09-15 Thread Heikki Linnakangas

On 09/13/2014 11:25 AM, Fabien COELHO wrote:



[about logging...]


Here is an attempt at updating the log features, including the aggregate
and sampling stuff, with skipped transactions under throttling.

I moved the logging stuff into a function which is called when a
transaction is skipped or finished.


Makes sense.

I spent some time on this, and this is what I ended up with. Notable 
changes:


* I split this into two patches for easier review. The first patch 
contains the refactoring of the logging stuff, and the second patch 
contains the new functionality.


* I renamed many of the variables, to be more consistent with existing 
similar variables


* progress reporting was broken with !PTHREAD_FORK_EMULATION. Need to 
collect the number of skipped xacts from all threads.


* I renamed the long option to --latency-limit. --limit is too generic.

Please have a look. I have not looked at the docs changes yet.

One thing that needs some thinking and changing is the progress 
reporting. It currently looks like this:


progress: 1.0 s, 4863.0 tps, lat 3.491 ms stddev 2.487, lag 1.809 ms, 99 
skipped
progress: 2.0 s, 5042.8 tps, lat 3.265 ms stddev 2.264, lag 1.584 ms, 16 
skipped
progress: 3.0 s, 4926.1 tps, lat 2.731 ms stddev 2.371, lag 1.196 ms, 45 
skipped
progress: 4.0 s, 4963.9 tps, lat 1.904 ms stddev 1.212, lag 0.429 ms, 0 
skipped
progress: 5.0 s, 4971.2 tps, lat 2.592 ms stddev 1.722, lag 0.975 ms, 0 
skipped


The absolute number of skipped transactions doesn't make much sense when 
all the other numbers are averages, and tps is a 1/s value. If you don't 
know the total number of transactions executed, the absolute number is 
meaningless. (Although you can calculate the absolute number of 
transactions executed by multiplying the TPS value with the interval). I 
think a percentage would be better here.


Should we also print the number of late transactions here? I think that 
would be an even more important detail than the number of skipped 
transactions. It might be better to print only the percentage of late 
transactions, including skipped ones. Or both, but it's difficult to 
cram everything onto a single line. This needs some further thinking...


- Heikki

commit 6f1158e9a7e77710164d503b40d77fd9dcde7a08
Author: Heikki Linnakangas heikki.linnakan...@iki.fi
Date:   Mon Sep 15 11:09:29 2014 +0300

Refactor log-printing in pgbench to a separate function.

The doCustom function was getting really unwieldy. Upcoming patches will add
even more code there.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 087e0d3..0daf92f 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -347,6 +347,9 @@ static char *select_only = {
 static void setalarm(int seconds);
 static void *threadRun(void *arg);
 
+static void doLog(TState *thread, CState *st, FILE *logfile, instr_time *now,
+  AggVals *agg);
+
 static void
 usage(void)
 {
@@ -1016,6 +1019,16 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
 	PGresult   *res;
 	Command   **commands;
 	bool		trans_needs_throttle = false;
+	instr_time	now;
+
+	/*
+	 * gettimeofday() isn't free, so we get the current timestamp lazily the
+	 * first time it's needed, we reuse the same value throughout this
+	 * function after that. This also ensures that e.g. the calculated latency
+	 * reported in the log file and in the totals are the same. Zero means
+	 * not set yet.
+	 */
+	INSTR_TIME_SET_ZERO(now);
 
 top:
 	commands = sql_files[st->use_file];
@@ -1049,10 +1062,10 @@ top:
 
 	if (st->sleeping)
 	{			/* are we sleeping? */
-		instr_time	now;
 		int64		now_us;
 
-		INSTR_TIME_SET_CURRENT(now);
+		if (INSTR_TIME_IS_ZERO(now))
+			INSTR_TIME_SET_CURRENT(now);
 		now_us = INSTR_TIME_GET_MICROSEC(now);
 		if (st->txn_scheduled <= now_us)
 		{
@@ -1074,11 +1087,6 @@ top:
 
 	if (st->listen)
 	{			/* are we receiver? */
-		instr_time	now;
-		bool		now_valid = false;
-
-		INSTR_TIME_SET_ZERO(now); /* initialize to keep compiler quiet */
-
 		if (commands[st->state]->type == SQL_COMMAND)
 		{
 			if (debug)
@@ -1100,181 +1108,40 @@ top:
 		{
 			int			cnum = commands[st->state]->command_num;
 
-			if (!now_valid)
-			{
+			if (INSTR_TIME_IS_ZERO(now))
 				INSTR_TIME_SET_CURRENT(now);
-				now_valid = true;
-			}
 			INSTR_TIME_ACCUM_DIFF(thread->exec_elapsed[cnum],
 								  now, st->stmt_begin);
 			thread->exec_count[cnum]++;
 		}
 
-		/* transaction finished: record latency under progress or throttling */
-		if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+		/* transaction finished: calculate latency and log the transaction */
+		if (commands[st->state + 1] == NULL)
 		{
-			int64		latency;
-
-			if (!now_valid)
+			/* only calculate latency if an option is used that needs it. */
+			if (progress || throttle_delay)
 			{
-				INSTR_TIME_SET_CURRENT(now);
-				now_valid = true;
-			}
-
-			latency = INSTR_TIME_GET_MICROSEC(now) - st->txn_scheduled;
+				int64

Re: [HACKERS] implement subject alternative names support for SSL connections

2014-09-15 Thread Alexey Klyukin
On Mon, Sep 15, 2014 at 10:23 AM, Alexey Klyukin al...@hintbits.com wrote:
 On Fri, Sep 12, 2014 at 4:20 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 Hmm. If that's what the browsers do, I think we should also err on the
 side of caution here. Ignoring the CN is highly unlikely to cause anyone
 a problem; a CA worth its salt should not issue a certificate with a CN
 that's not also listed in the SAN section. But if you have such a
 certificate anyway for some reason, it shouldn't be too difficult to get
 a new certificate. Certificates expire every 1-3 years anyway, so there
 must be a procedure to renew them anyway.


 Committed, with that change, ie. the CN is not checked if SANs are present.

Actually, I disagree with the way the patch ignores the CN. Currently, it
skips the CN unconditionally if the SubjectAltName section is present. But
what RFC 6125 says is:

If a subjectAltName extension of type dNSName is present, that MUST
   be used as the identity.  Otherwise, the (most specific) Common Name
   field in the Subject field of the certificate MUST be used.

This means that we have to check that at least one dNSName resource is
present before declining to examine the CN. Attached is a one-liner
(excluding comments) that fixes this.

Regards,
Alexey
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
new file mode 100644
index 98d02b6..dd4fab8
*** a/src/interfaces/libpq/fe-secure-openssl.c
--- b/src/interfaces/libpq/fe-secure-openssl.c
*** verify_peer_name_matches_certificate(PGc
*** 626,637 ****
  		sk_GENERAL_NAME_free(peer_san);
  	}
  	/*
! 	 * If there is no subjectAltName extension, check the Common Name.
  	 *
! 	 * (Per RFC 2818 and RFC 6125, if the subjectAltName extension is present,
  	 * the CN must be ignored.)
  	 */
! 	else
  	{
  		X509_NAME  *subject_name;
  
--- 626,637 ----
  		sk_GENERAL_NAME_free(peer_san);
  	}
  	/*
! 	 * If there is no subjectAltName extension of type dNSName, check the Common Name.
  	 *
! 	 * (Per RFC 2818 and RFC 6125, if the subjectAltName extension of type dNSName is present,
  	 * the CN must be ignored.)
  	 */
! 	if (names_examined == 0)
  	{
  		X509_NAME  *subject_name;
  



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-15 Thread Heikki Linnakangas

On 09/15/2014 02:42 AM, Arthur Silva wrote:

On 14/09/2014 12:21, Andres Freund and...@2ndquadrant.com wrote:

On 2014-09-13 20:27:51 -0500, k...@rice.edu wrote:

What we are looking for here is uniqueness, thus better error detection.
Not avalanche effect, nor cryptographically secure, nor bit distribution.
As far as I'm aware CRC32C is unbeaten collision-wise and time proven.

I couldn't find tests with xxhash and crc32 on the same hardware, so I
spent some time putting together a benchmark (see attachment; to run it,
just start run.sh).

I included a crc32 implementation using SSE4.2 instructions (which works
on pretty much any Intel processor built after 2008 and AMD built after
2012), a portable slice-by-8 software implementation, and xxhash, since
it's the fastest software 32-bit hash I know of.

Here're the results running the test program on my i5-4200M

crc sb8: 90444623
elapsed: 0.513688s
speed: 1.485220 GB/s

crc hw: 90444623
elapsed: 0.048327s
speed: 15.786877 GB/s

xxhash: 7f4a8d5
elapsed: 0.182100s
speed: 4.189663 GB/s

The hardware version is insanely fast and works on the majority of
Postgres setups, and the fallback software implementation is 2.8x slower
than the fastest 32-bit hash around.

Hopefully it'll be useful in the discussion.


Note that all these numbers aren't fully relevant to the use case
here. For the WAL - which is what we're talking about and the only place
where CRC32 is used with high throughput - the individual parts of a
record are pretty darn small on average. So performance of checksumming
small amounts of data is more relevant. Mind, that's not likely to go
for CRC32, especially not slice-by-8. The cache footprint of the large
tables is likely going to be noticeable in non-micro benchmarks.


Indeed, the small input sizes are something I was missing. Something more
cache-friendly would be better; it's just a matter of finding a better
candidate.


It's worth noting that the extra tables that slicing-by-4 requires are 
*in addition to* the lookup table we already have, and slicing-by-8 
builds on the slicing-by-4 lookup tables. Our current algorithm uses a 
1kB lookup table, slicing-by-4 a 4kB one, and slicing-by-8 an 8kB one. 
But the first 1kB of the slicing-by-4 lookup table is identical to the 
current 1kB lookup table, and the first 4kB of the slicing-by-8 tables 
are identical to the slicing-by-4 tables.


It would be pretty straightforward to use the current algorithm when the 
WAL record is very small, and slicing-by-4 or slicing-by-8 for larger 
records (like FPWs), where the larger table is more likely to pay off. I 
have no idea where the break-even point is with the current algorithm 
vs. slicing-by-4 and a cold cache, but maybe we can get a handle on that 
with some micro-benchmarking.


Although this is complicated by the fact that slicing-by-4 or -8 might 
well be a win even with very small records, if you generate a lot of them.


- Heikki





Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Heikki Linnakangas

On 09/14/2014 11:34 PM, Peter Geoghegan wrote:

On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless
memcmp() is free with these parameters.

Is this the worst case scenario?


Other than pushing the differences much much later in the strings
(which you surely thought of already), yes.


Please test the absolute worst case scenario you can think of. As I said 
earlier, if you can demonstrate that the slowdown of that is acceptable, 
we don't even need to discuss how likely or realistic the case is.


- Heikki





Re: [HACKERS] [bug fix or improvement?] Correctly place DLLs for ECPG apps in bin folder

2014-09-15 Thread Heikki Linnakangas
This patch has been sitting in the commitfest with no activity since 
this email from July. What's the real status, who's got the ball on this 
one? Please update the commitfest app accordingly. If nothing happens in 
the next few days, I'll mark this as Returned with Feedback.


On 07/09/2014 03:53 PM, MauMau wrote:

From: Alvaro Herrera alvhe...@2ndquadrant.com

I think the suggestion by Peter Eisentraut upthread was pretty
reasonable -- the Makefiles are already aware that they are building a
shared library, so scrape the info off them.  The less hardcoded stuff
in the msvc framework, the better.


I overlooked his mail.  Do you mean something like this? (I don't know yet
how to express this in Perl)

for each Makefile under top_dir_in_source_tree or src/interfaces
{
    if (Makefile contains SO_MAJOR_VERSION)
    {
        extract NAME value from Makefile
        move lib/lib${NAME}.dll to bin/
    }
}

But the source tree contains as many as 206 Makefiles.  I'm worried that
this will slow down installation.  Should all these Makefiles be examined,
or just the 16 Makefiles in src/interfaces/?

Regards
MauMau

--
- Heikki




Re: [HACKERS] PoC: Partial sort

2014-09-15 Thread Alexander Korotkov
On Sun, Sep 14, 2014 at 7:39 AM, Peter Geoghegan p...@heroku.com wrote:

 On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  Actually, higher cardinality skip columns are better. Sorting of smaller
  groups is faster than sorting larger groups of the same total size. Also,
  with smaller groups you satisfy the limit more accurately (on average),
  i.e. you sort a smaller number of rows in total.

 Higher cardinality leading key columns are better for what? Do you
 mean they're better in that those cases are more sympathetic to your
 patch, or better in the general sense that they'll perform better for
 the user? The first example query, that you chose to demonstrate your
 patch had a leading, indexed column (column v1) with much lower
 cardinality/n_distinct than the column that had to be sorted on
 (column v2). That was what my comments were based on.

 When this feature is used, there will often be a significantly lower
 cardinality in the leading, indexed column (as in your example).
 Otherwise, the user might well have been happy to just order on the
 indexed/leading column alone. That isn't definitely true, but it's
 often true.


I just meant that higher cardinality is cheaper for partial sort. We may
have misunderstood each other because the notions "high" and "low" are
relative. When the cardinality is 1, partial sort seems to be useless. When
the cardinality is low, it could still be cheaper to do sequential scan +
sort rather than index scan + partial sort. When the cardinality is close
to the number of rows then, as you mentioned, the user probably doesn't
need to sort by the rest of the columns. At least one exception is when the
user needs to force uniqueness of the order. So, we need to target
something in the middle of these two corner cases.


  I'm not sure if that's worth it to more or less
  duplicate heap_tuple_attr_equals() to save a mere n expensive
  comparisons, but it's something to think about (actually, there are
  probably less than even n comparisons in practice because there'll be
  a limit).
 
  Not correct. Smaller groups are not OK. Imagine that two representations
  of the same skip column value exist. The index may return them in any
  order, even alternate them one by one. In this case, sorting on the other
  columns never takes place, while it should.

 I think I explained this badly - it wouldn't be okay to make the
 grouping smaller, but as you say we could tie-break with a proper
 B-Tree support function 1 comparison to make sure we really had
 reached the end of our grouping.

 FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
 first, so the use of the equality operator with text in mind that you
 mention may soon not be useful at all. The evidence suggests that
 memcmp() is so much cheaper than special preparatory NUL-termination +
 strcoll() that we should always try it first when sorting text, even
 when we have no good reason to think it will work. The memcmp() is
 virtually free. And so, you see why it might be worth thinking about
 this when we already have reasonable confidence that many comparisons
 will indicate that datums are equal. Other datatypes will have
 expensive real comparators, not just text or numeric.


When strings are not equal, bttextcmp still needs to use strcoll, while
texteq doesn't. So there would still be an advantage to using the equality
operator over the comparison function: the equality operator doesn't have
to order unequal values. However, the real cost of this advantage depends
on the cardinality of the presorted columns as well.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] PoC: Partial sort

2014-09-15 Thread Alexander Korotkov
On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan p...@heroku.com wrote:

 I think we might be better off if a tuplesort function was called
 shortly after tuplesort_begin_heap() is called. How top-n heap sorts
 work is something that largely lives in tuplesort's head. Today, we
 call tuplesort_set_bound() to hint to tuplesort By the way, this is a
 top-n heap sort applicable sort. I think that with this patch, we
 should then hint (where applicable) by the way, you won't actually be
 required to sort those first n indexed attributes; rather, you can
 expect to scan those in logical order. You may work the rest out
 yourself, and may be clever about exploiting the sorted-ness of the
 first few columns. The idea of managing a bunch of tiny sorts from
 with ExecSort(), and calling the new function tuplesort_reset() seems
 questionable. tuplesortstate is supposed to be private/opaque to
 nodeSort.c, and the current design strains that.

 I'd like to keep nodeSort.c simple. I think it's pretty clear that the
 guts of this do not belong within ExecSort(), in any case. Also, the
 additions there should be much better commented, wherever they finally
 end up.


As I understand it, you propose to encapsulate the partial sort algorithm
into tuplesort. However, in that case we would still need a significant
change to its interface: letting tuplesort decide when it's able to return
a tuple. Otherwise, we would miss a significant part of the LIMIT clause
optimization. tuplesort_set_bound() can't handle all the cases: there could
be other planner nodes between the partial sort and the LIMIT.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] [v9.5] Custom Plan API

2014-09-15 Thread Kouhei Kaigai
 On Thu, Sep 11, 2014 at 8:40 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  On Thu, Sep 11, 2014 at 11:24 AM, Kouhei Kaigai
  kai...@ak.jp.nec.com
  wrote:
   Don't the changes to src/backend/optimizer/plan/createplan.c
   belong in patch #2?
  
   The borderline between #1 and #2 is little bit bogus. So, I moved
   most of portion into #1, however, invocation of InitCustomScan
   (that is a callback in CustomPlanMethod) in create_custom_plan() is
 still in #2.
 
  Eh, create_custom_scan() certainly looks like it is in #1 from here,
  or at least part of it is.  It calculates tlist and clauses and then
  does nothing with them.  That clearly can't be the right division.
 
  I think it would make sense to have create_custom_scan() compute
  tlist and clauses first, and then pass those to CreateCustomPlan().
  Then you don't need a separate InitCustomScan() - which is misnamed
  anyway, since it has nothing to do with ExecInitCustomScan().
 
  The only reason why I put separate hooks here is, create_custom_scan()
  needs to know exact size of the CustomScan node (including private
  fields), however, it is helpful for extensions to kick its callback to
  initialize the node next to the common initialization stuff.
 
 Why does it need to know that?  I don't see that it's doing anything that
 requires knowing the size of that node, and if it is, I think it shouldn't
 be.  That should get delegated to the callback provided by the custom plan
 provider.
 
Sorry, my explanation might have been confusing. create_custom_scan() does
not need to know the exact size of the CustomScan (or a node derived from
it) because of the two separate hooks; one at node allocation time, the
other at the tail of the series of initialization steps.
If we had only one hook here, we would need a mechanism that informs
create_custom_scan() of the exact size of the CustomScan node, including
private fields managed by the provider, instead of the first hook at node
allocation time. In that case, node allocation would be performed by
create_custom_scan(), and it would have to know the exact size of the node
to be allocated.

How should I implement the feature here? Is the combination of a static
node size and a callback at the tail simpler than the existing design that
takes two individual hooks in create_custom_scan()?

  Regarding the naming, how about GetCustomScan() instead of InitCustomScan()?
  It follows the manner of create_foreignscan_plan().
 
 I guess that's a bit better, but come to think of it, I'd really like to
 avoid baking in the assumption that the custom path provider has to return
 any particular type of plan node.  A good start would be to give it a name
 that doesn't imply that - e.g. PlanCustomPath().
 
OK, I'll use this naming.

   OK, I revised. Now custom-scan assumes it has a particular valid
   relation to be scanned, so no code path with scanrelid == 0 at this
 moment.
  
   Let us revisit this scenario when custom-scan replaces relation-joins.
   In this case, custom-scan will not be associated with a particular
   base- relation, thus it needs to admit a custom-scan node with
   scanrelid
  == 0.
 
  Yeah, I guess the question there is whether we'll want let CustomScan
  have scanrelid == 0 or require that CustomJoin be used there instead.
 
  Right now, I cannot imagine a use case that requires an individual
  CustomJoin node, because CustomScan with scanrelid==0 (which actually
  performs like a custom-plan rather than a custom-scan) is sufficient.

  If a CustomScan gets chosen instead of the built-in join logic, it will
  look like a relation scan on a virtual relation that consists of the two
  underlying relations. Callbacks of the CustomScan have the responsibility
  to join the underlying relations; that is invisible to the core executor.

  It seems to me CustomScan with scanrelid==0 is sufficient to implement
  alternative logic for relation joins; we don't need an individual node
  from the standpoint of the executor.
 
 That's valid logic, but it's not the only way to do it.  If we have CustomScan
 and CustomJoin, either of them will require some adaption to handle this
 case.  We can either allow a custom scan that isn't scanning any particular
 relation (i.e. scanrelid == 0), or we can allow a custom join that has no
 children.  I don't know which way will come out cleaner, and I think it's
 good to leave that decision to one side for now.
 
Yep, I agree with you. It may not be productive to try to settle this
design topic right now. Let's assume CustomScan scans a particular relation
(scanrelid != 0) in the first revision.

   Why can't the Custom(GpuHashJoin) node build the hash table
   internally instead of using a separate node?
  
   It's possible; however, it prevents checking sub-plans using EXPLAIN if
   we manage inner plans internally. So, I'd like to have a separate node
   connected to the inner plan.
 
  Isn't that just a matter of letting the EXPLAIN code print more stuff?
  Why can't it?
 
  My GpuHashJoin 

Re: [HACKERS] implement subject alternative names support for SSL connections

2014-09-15 Thread Heikki Linnakangas

On 09/15/2014 01:44 PM, Alexey Klyukin wrote:

Committed, with that change, ie. the CN is not checked if SANs are present.


Actually, I disagree with the way the patch ignores the CN. Currently, it
skips the CN unconditionally if the SubjectAltName section is present. But
what RFC 6125 says is:

If a subjectAltName extension of type dNSName is present, that MUST
be used as the identity.  Otherwise, the (most specific) Common Name
field in the Subject field of the certificate MUST be used.

This means that we have to check that at least one dNSName resource is
present before declining to examine the CN. Attached is a one-liner
(excluding comments) that fixes this.


Ok, good catch. Fixed.

- Heikki




[HACKERS] Triconsistent catalog declaration

2014-09-15 Thread Alexander Korotkov
Hi!

9.4 GIN introduces the new triconsistent method, which can return one of
three values, unlike the plain consistent method. But in the catalog
declaration triconsistent still returns the bool type. It doesn't affect
anything because nobody calls triconsistent from SQL, but I think it would
be correct to declare it as returning int2.

--
With best regards,
Alexander Korotkov.


tri-consistent-catalog.patch
Description: Binary data



Re: [HACKERS] Triconsistent catalog declaration

2014-09-15 Thread Heikki Linnakangas

On 09/15/2014 04:58 PM, Alexander Korotkov wrote:

Hi!

9.4 GIN introduces the new triconsistent method, which can return one of
three values, unlike the plain consistent method. But in the catalog
declaration triconsistent still returns the bool type. It doesn't affect
anything because nobody calls triconsistent from SQL.


Good point.


But I think it would be correct to declare them returns int2.


No, int2 is two bytes wide, while the return value is a single byte. I 
guess char would be the right type.


That makes for somewhat awkward input and output from psql, since the
values used are 0, 1, 2 rather than printable ASCII characters. But that's
OK because, as you said, these functions are not callable from psql anyway,
as they have internal arguments.


This requires a catalog change to fix. Are we still planning to do a 
catversion bump for 9.4 because of the jsonb changes?


- Heikki





Re: [HACKERS] orangutan seizes up during isolation-check

2014-09-15 Thread Noah Misch
On Mon, Sep 15, 2014 at 10:11:57AM +0300, Heikki Linnakangas wrote:
 On 09/15/2014 07:51 AM, Noah Misch wrote:
 libintl replaces setlocale().  Its setlocale(LC_x, "") uses OS-specific APIs
 to determine the default locale when $LANG and similar environment variables
 are empty, as they are during make check NO_LOCALE=1.  On OS X, it calls[1]
 CFLocaleCopyCurrent(), which in turn spins up a thread.  See the end of this
 message for the postmaster thread stacks active upon hitting a breakpoint set
 at _dispatch_mgr_thread.
 
 Ugh. I'd call that a bug in libintl. setlocale() has no business to
 make the process multi-threaded.

Fair interpretation.  libintl's options for mitigating the problem are even
more constrained, unfortunately.

 Do we have the same problem in backends? At a quick glance, aside
 from postmaster we only use PG_SETMASK(BlockSig) in signal
 handlers, to prevent another signal handler from running
 concurrently.

In backends, we pass a specific value, not "", to pg_perm_setlocale().  (I
expect the problem in EXEC_BACKEND builds, but I doubt anyone uses that as a
production configuration on OS X.)  Every frontend program does call
setlocale(LC_ALL, "") via set_pglocale_pgservice().  None use sigprocmask(),
but a few do use fork().

 1. Fork, call setlocale(LC_x, "") in the child, pass back the effective 
 locale
 name through a pipe, and pass that name to setlocale() in the original
 process.  The short-lived child will get the extra threads, and the
 postmaster will remain clean.

 I'm skeptical of the value of looking up locale information using other OS X
 facilities when the usual environment variables are inconclusive, but I see 
 no
 clear cause to reverse that decision now.  I lean toward (1).
 
 Both of those are horrible hacks. And who's to say that calling
 setlocale(LC_x, foo) won't also call some function that makes the
 process multi-threaded. If not in any current OS X release, it might
 still happen in a future one.

Right.  Not just setlocale(); a large body of APIs could change that way.  The
CAVEATS section[1] of the OS X fork() manual page suggests Apple had that in
mind at some point.  To be 100% safe, we would write the postmaster as though
it's always multithreaded, including making[2] EXEC_BACKEND the only supported
configuration on OS X.

Assuming we don't do that anytime soon, I did have in mind to make the
postmaster raise an error if it becomes multithreaded.  (Apple's
pthread_is_threaded_np() is usable for this check.)  Then, at least, we'll
more-easily hear about new problems of this nature.

 One idea would be to use an extra pthread mutex or similar, in
 addition to PG_SETMASK(). Whenever you do PG_SETMASK(BlockSig),
 also grab the mutex, and release it when you do
 PG_SETMASK(UnBlockSig).

I had considered and tested such a thing, and it sufficed to clear orangutan's
present troubles.  You still face undefined behavior after fork().  It's okay
in practice if libdispatch threads are careful to register pthread_atfork()
handlers for any relevant resources.  It's a fair option, but running
setlocale(LC_x, "") in a short-lived child assumes less about the system
library implementation, seems no more of a hack to me, and isolates the
overhead at postmaster start.
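Sketching option (1) in portable C, for concreteness. This is only an illustration, not PostgreSQL code: the function name and fixed-size buffer are mine, and error handling is abbreviated.

```c
#include <locale.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Resolve the default locale in a short-lived child, so that any
 * threads spun up inside setlocale(LC_ALL, "") die with the child
 * and the parent process stays single-threaded.
 */
static int
resolve_default_locale(char *buf, size_t buflen)
{
    int         fds[2];
    pid_t       pid;
    ssize_t     n = 0;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* child: let setlocale() use whatever OS facilities it wants */
        const char *name = setlocale(LC_ALL, "");

        close(fds[0]);
        if (name)
            n = write(fds[1], name, strlen(name));
        (void) n;
        _exit(0);
    }
    /* parent: read the effective locale name back over the pipe */
    close(fds[1]);
    n = read(fds[0], buf, buflen - 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    /* apply the locale by name; no OS lookup happens on this path */
    return setlocale(LC_ALL, buf) ? 0 : -1;
}
```

With LANG and friends unset this just reports "C"; the point is that any CFLocaleCopyCurrent() machinery runs only in the disposable child.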

[1] 
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fork.2.html
[2] http://www.postgresql.org/message-id/1349280184-sup-...@alvh.no-ip.org




Re: [HACKERS] delta relations in AFTER triggers

2014-09-15 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:

 On 08/30/2014 12:15 AM, Kevin Grittner wrote:
 If we were to go with the hooks as you propose, we would still need
 to take the information from TriggerData and put it somewhere else
 for the hook to reference.

 Sure.

 FWIW, I agree with Heikki on this point.  It makes a lot more sense for
 the parser to provide hooks comparable to the existing hooks for resolving
 column refs, and it's not apparent that loading such functionality into
 SPI is sane at all.

 OTOH, I agree with Kevin that the things we're talking about are
 lightweight relations not variables.

Try as I might, I was unable to find any sensible way to use hooks.
If the issue was only the parsing, the route was fairly obvious,
but the execution side needs to access the correct tuplestore(s)
for each run, too -- so the sort of information provided by
relcache needed to be passed in based on the context within the
process (i.e., if you have nested triggers firing, each needs to
use a different tuplestore with the same name in the same
function, even though it's using the same plan).  On both sides it
seemed easier to pass things through the same sort of techniques as
normal parameters; I couldn't find a way to use hooks that didn't
just make things uglier.

I see the down side of making this a feature which can only be used
from SPI, so I've updated the patch to allow it from other
contexts.  On the other hand, I see many uses for it where SPI *is*
used, and the SPI interface needs to change to support that.  The
functions I had added are one way to do that on the
parsing/planning side without breaking any existing code.  The same
information (i.e., the metadata) needs to be passed to the executor
along with references to the actual tuplestores, and that needs to
go through a different path -- at least for the plpgsql usage.

I broke out the changes from the previous patch in multiple commits
in my repository on github:

commit 2520db8fbb41c68a82c2c750c8543154c6d85eb9
Author: Kevin Grittner kgri...@postgresql.org
Date:   Mon Sep 15 01:17:14 2014 -0500

Use executor state rather than SPI to get named tuplestores.

spi.c and trigger.c will need a little more clean-up to match this,
and it's likely that not all places that need to pass the baton are
doing so, but at least this passes regression tests.

commit de9067258125226d1625f160c3eee9aff90ca598
Author: Kevin Grittner kgri...@postgresql.org
Date:   Sun Sep 14 22:01:04 2014 -0500

Pass the Tsrcache through ParserState to parse analysis.

commit e94779c1e22ec587446a7aa2593ba5f102b6a28b
Author: Kevin Grittner kgri...@postgresql.org
Date:   Sun Sep 14 19:23:45 2014 -0500

Modify SPI to use Tsrcache instead of List.

commit 7af841881d9113eb4c8dca8e82dc1867883bf75d
Author: Kevin Grittner kgri...@postgresql.org
Date:   Sun Sep 14 18:45:54 2014 -0500

Create context-specific Tsrcache.

Not used yet.

commit 35790a4b6c236d09e0be261be9b0017d34eaf9c9
Author: Kevin Grittner kgri...@postgresql.org
Date:   Sun Sep 14 16:49:37 2014 -0500

Create Tsrmd structure so tuplestore.h isn't needed in parser.

commit 93d57c580da095b71d9214f69fede71d2f8ed840
Author: Kevin Grittner kgri...@postgresql.org
Date:   Sun Sep 14 15:59:45 2014 -0500

Improve some confusing naming for TuplestoreRelation node.

Anyone who has already reviewed the earlier patch may want to look
at these individually:

https://github.com/kgrittn/postgres/compare/transition

Attached is a new patch including all of that.  Hopefully I haven't
misunderstood what Heikki and Tom wanted; if I have, just let me
know where I went wrong.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

trigger-transition-tables-v4.patch.gz
Description: application/tgz



[HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-15 Thread Alexander Korotkov
Hackers,

some GIN opclasses use collation-aware comparisons even though they don't
actually need collation-aware comparison. Examples are the text[] and hstore
opclasses. Depending on the collation, this may make them much slower.

See example.

# show lc_collate ;
 lc_collate
─
 ru_RU.UTF-8
(1 row)

# create table test as (select array_agg(i::text) from
generate_series(1,100) i group by (i-1)/10);
SELECT 10

# create index test_idx on test using gin(array_agg);
CREATE INDEX
Time: *26930,423 ms*

# create index test_idx2 on test using gin(array_agg collate C);
CREATE INDEX
Time: *5143,682 ms*

Index creation with collation ru_RU.UTF-8 is 5 times slower while
collation has absolutely no effect on index functionality.

However, we can't just replace the comparison function for those opclasses,
because that would break binary compatibility for pg_upgrade. I see the
following solution:

   1. Rename such opclasses and make them not default.
   2. Create new default opclasses with bitwise comparison functions.
   3. Write recommendation to re-create indexes with default opclasses into
   documentation.
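The cost gap above can be reproduced in isolation. Below is a rough microbenchmark skeleton (the names are mine, not from any patch); the volatile sink keeps the compiler from optimizing the comparison calls away. With a UTF-8 locale installed, the strcoll() path is expected to dominate, though exact ratios depend on the platform's locale implementation.

```c
#include <locale.h>
#include <string.h>
#include <time.h>

/*
 * Time repeated comparisons of the same pair of strings, either
 * collation-aware (strcoll) or bitwise (memcmp).  Assigning to a
 * volatile sink prevents the loop from being optimized out.
 */
static volatile int sink;

double
time_comparisons(int use_strcoll, const char *a, const char *b, int iters)
{
    clock_t start = clock();
    int     i;

    for (i = 0; i < iters; i++)
        sink = use_strcoll ? strcoll(a, b) : memcmp(a, b, strlen(a) + 1);
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Remember to call setlocale(LC_ALL, "") first; otherwise strcoll() runs in the C locale and the difference disappears.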

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] .ready files appearing on slaves

2014-09-15 Thread Jehan-Guillaume de Rorthais
Hi hackers,

An issue that seems related to this has been posted on pgsql-admin. See:

  
http://www.postgresql.org/message-id/caas3tylnxaydz0+zhxlpdvtovhqovr+jsphp30o8kvwqqs0...@mail.gmail.com

How can we help on this issue?

Cheers,

On Thu, 4 Sep 2014 17:50:36 +0200
Jehan-Guillaume de Rorthais j...@dalibo.com wrote:

 Hi hackers,
 
 For a few months, we have occasionally seen .ready files appearing on some slave
  instances in various contexts. The two I have in mind are under 9.2.x.
 
 I tried to investigate a bit. These .ready files are created when a WAL file
 from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
 easily experience this by deleting such a file: it is created again at the
 next restartpoint or checkpoint received from the master.
 
 Looking at the WALs in the pg_xlog folder corresponding to these .ready files, they
 are all much older than the current WAL cycle, in both mtime and name
 sequence. For instance, on one of these boxes we currently have 6 of these ghost
 WALs:
 
   00021E5300FF
   00021F1800FF
   0002204700FF
   000220BF00FF
   0002214000FF
   0002237000FF
   0002255D00A8
   0002255D00A9
   [...normal WAL sequence...]
   0002255E009D
 
 And on another box:
 
   0001040E00FF
   0001041400DA
   0001046E00FF
   0001047000FF
   00010485000F
   000104850010
   [...normal WAL sequence...]
   000104850052
 
 So it seems that for some reason these old WALs were forgotten by the
 restartpoint mechanism when they should have been recycled/deleted. 
 
 For one of these servers, I could correlate this with some brutal
 disconnection of the streaming replication appearing in its logs. But there
 was no known SR disconnection on the second one.
 
 Any idea about this weird behaviour? What can we do to help you investigate
 further?
 
 Regards,



-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-15 Thread Robert Haas
On Sun, Sep 14, 2014 at 4:37 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 12 September 2014 18:19, Simon Riggs si...@2ndquadrant.com wrote:
 On 12 September 2014 15:30, Tom Lane t...@sss.pgh.pa.us wrote:

 After a little bit I remembered there was already a function for this.
 So specifically, I'd suggest using ExecRelationIsTargetRelation()
 to decide whether to mark the scan as requiring pruning.

 Sounds cool. Thanks both, this is sounding like a viable route now.

 Yes, this is viable.

 Patch attached, using Alvaro's idea of use-case specific pruning and
 Tom's idea of aiming at target relations. Patch uses or extends
 existing infrastructure, so its shorter than it might have been, yet
 with all that bufmgr yuck removed.

 This is very, very good because while going through this I noticed the
 dozen or more places where we were pruning blocks in annoying places I
 didn't even know about, such as 4-5 constraint checks. In more
 than a few DDL commands like ALTER TABLE and CLUSTER we were even
 pruning the old relation prior to rewrite.

Do we really want to disable HOT for all catalog scans?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Postgres code for a query intermediate dataset

2014-09-15 Thread Robert Haas
On Sun, Sep 14, 2014 at 5:18 AM, Rohit Goyal rhtgyl...@gmail.com wrote:
 Thanks for the reply. But I think I confused you. I am talking about access
 using indexes. So, I assume that the B+ tree stores key-value pairs where rohit is
 the key and all the versions are its value.

 Another way to think of it is that I have a secondary index on emp.name and there
 are 4 rohit entries in the DB. So, the B+ tree gives me 4 different tuple pointers,
 one for each Rohit. I want to know the code portion where I can see all 4
 tuple pointers before the I/O access to fetch each tuple.

You may want to look at index_getnext(), index_getnext_tid(), and/or
heap_hot_search_buffer().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgbench throttling latency limit

2014-09-15 Thread Robert Haas
On Mon, Sep 15, 2014 at 6:34 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Please have a look. I have not looked at the docs changes yet.

 One thing that needs some thinking and changing is the progress reporting.
 It currently looks like this:

 progress: 1.0 s, 4863.0 tps, lat 3.491 ms stddev 2.487, lag 1.809 ms, 99
 skipped
 progress: 2.0 s, 5042.8 tps, lat 3.265 ms stddev 2.264, lag 1.584 ms, 16
 skipped
 progress: 3.0 s, 4926.1 tps, lat 2.731 ms stddev 2.371, lag 1.196 ms, 45
 skipped
 progress: 4.0 s, 4963.9 tps, lat 1.904 ms stddev 1.212, lag 0.429 ms, 0
 skipped
 progress: 5.0 s, 4971.2 tps, lat 2.592 ms stddev 1.722, lag 0.975 ms, 0
 skipped

 The absolute number of skipped transactions doesn't make much sense when all
 the other numbers are averages, and tps is a 1/s value. If you don't know
 the total number of transactions executed, the absolute number is
 meaningless. (Although you can calculate the absolute number of transactions
 executed by multiplying the TPS value with the interval). I think a
 percentage would be better here.

 Should we also print the number of late transactions here? I think that
 would be an even more important detail than the number of skipped
 transactions. It might be better to print only the percentage of late
 transactions, including skipped ones. Or both, but it's difficult to cram
 everything on a single line. This needs some further thinking..

I'm not sure I like the idea of printing a percentage.  It might be
unclear what the denominator was if somebody feels the urge to work
back to the actual number of skipped transactions.  I mean, I guess
it's probably just the value you passed to -R, so maybe that's easy
enough, but then why bother dividing in the first place?  The user can
do that easily enough if they want the data that way.

I agree with you that it would be good to get some statistics on
late/skipped transactions, but it's not obvious what people will want.
Late transactions, straight up?  Late by more than a threshold value?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-15 Thread Josh Berkus
On 09/12/2014 01:33 PM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 However, this better become a FAQ item, because it's not necessarily the
 behavior that folks used to JSON but not Postgres will expect.
 
 No, it's a bug, not a documentation deficiency.

Hmmm?  Are you proposing that we should change how ARRAY @> ARRAY works
for non-JSON data?

Or are you proposing that JSONARRAY @> JSONARRAY should work differently
from ARRAY @> ARRAY?

Or are you agreeing with Peter that it's the first case that's a bug,
and the 2nd two tests are correct, and just being unclear?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Josh Berkus
On 09/12/2014 01:30 PM, Heikki Linnakangas wrote:
 
 Performance was one argument for sure. It's not hard to come up with a
 case where the all-lengths approach is much slower: take a huge array
 with, say, million elements, and fetch the last element in a tight loop.
 And do that in a PL/pgSQL function without storing the datum to disk, so
 that it doesn't get toasted. Not a very common thing to do in real life,
 although something like that might come up if you do a lot of json
 processing in PL/pgSQL. But storing offsets makes that faster.

While I didn't post the results (because they were uninteresting), I did
specifically test the last element in a set of 200 elements for
all-lengths vs. original offsets for JSONB, and the results were not
statistically different.

I did not test against your patch; is there some reason why your patch
would be faster for the last element case than the original offsets
version?

If not, I think the corner case is so obscure as to be not worth
optimizing for.  I can't imagine that more than a tiny minority of our
users are going to have thousands of keys per datum.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Robert Haas
On Sun, Sep 14, 2014 at 10:37 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 09/13/2014 11:28 PM, Peter Geoghegan wrote:
 Anyway, attached rough test program implements what you outline. This
 is for 30,000 32 byte strings (where just the final two bytes differ).
 On my laptop, output looks like this (edited to only show median
 duration in each case):

 Got to be careful to not let the compiler optimize away microbenchmarks like
 this. At least with my version of gcc, the strcoll calls get optimized away,
 as do the memcmp calls, if you don't use the result for anything. Clang was
 even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it
 optimizes away the loops altogether.

 Also, there should be a setlocale(LC_ALL, "") call somewhere. Otherwise it
 runs in C locale, and we don't use strcoll() at all for C locale.

 After fixing those, it runs much slower, so I had to reduce the number of
 strings. Here's a fixed program. I'm now getting numbers like this:

 (baseline) duration of comparisons without useless memcmp()s: 6.007368
 seconds

 duration of comparisons with useless memcmp()s: 6.079826 seconds

 Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless
 memcmp() is free with these parameters.

 Is this the worst case scenario?

I can't see a worse one, and I replicated your results here on my
MacBook Pro.  I also tried with 1MB strings and, surprisingly, it was
basically free there, too.

It strikes me that perhaps we should make this change (rearranging
things so that the memcmp tiebreak is run before strcoll) first,
before dealing with the rest of the abbreviated keys infrastructure.
It appears to be a separate improvement which is worthwhile
independently of what we do about that patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Claudio Freire
On Mon, Sep 15, 2014 at 2:12 PM, Josh Berkus j...@agliodbs.com wrote:
 If not, I think the corner case is so obscure as to be not worth
 optimizing for.  I can't imagine that more than a tiny minority of our
 users are going to have thousands of keys per datum.

Worst case is cost scaling linearly with the number of keys; how
expensive it gets depends on the number of keys.

It would have an effect only on uncompressed jsonb, since compressed
jsonb already pays a linear cost for decompression.

I'd suggest testing performance with a large number of small keys in
uncompressed form. It's bound to show a noticeable regression there.

Now, a large number of small keys could be 200 or 2000, or even 20k. I'd guess
several sizes should be tested to find the shape of the curve.
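The access-cost difference between the two layouts reduces to the following (illustrative arrays, not the actual jsonb on-disk structures):

```c
#include <stddef.h>

/*
 * With only element lengths stored, finding element k's start
 * requires summing the k preceding lengths -- O(k) per access.
 */
size_t
start_from_lengths(const size_t *lengths, int k)
{
    size_t  off = 0;
    int     i;

    for (i = 0; i < k; i++)
        off += lengths[i];
    return off;
}

/*
 * With absolute offsets stored, it is a single array lookup -- O(1).
 */
size_t
start_from_offsets(const size_t *offsets, int k)
{
    return offsets[k];      /* offsets[0] == 0 by convention */
}
```

Hence a tight loop fetching the last of n elements costs O(n) per fetch under all-lengths, which is where the regression should show.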




Re: [HACKERS] [v9.5] Custom Plan API

2014-09-15 Thread Robert Haas
On Mon, Sep 15, 2014 at 8:38 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  The only reason why I put separate hooks here is, create_custom_scan()
  needs to know exact size of the CustomScan node (including private
  fields), however, it is helpful for extensions to kick its callback to
  initialize the node next to the common initialization stuff.

 Why does it need to know that?  I don't see that it's doing anything that
 requires knowing the size of that node, and if it is, I think it shouldn't
 be.  That should get delegated to the callback provided by the custom plan
 provider.

 Sorry, my explanation might have been confusing. The create_custom_scan() does not
 need to know the exact size of the CustomScan (or its inheritance) because of
 the two separated hooks; one is node allocation time, the other is the tail
 of the series of initialization.
 If we have only one hook here, we need a mechanism to inform
 create_custom_scan() of the exact size of the CustomScan node; including private
 fields managed by the provider, instead of the first hook on node allocation
 time. In this case, node allocation shall be processed by create_custom_scan()
 and it has to know exact size of the node to be allocated.

 How do I implement the feature here? Is the combination of static node size
 and callback on the tail simpler than the existing design that takes two
 individual hooks on create_custom_scan()?

I still don't get it.  Right now, the logic in create_custom_scan(),
which I think should really be create_custom_plan() or
create_plan_from_custom_path(), basically looks like this:

1. call hook function CreateCustomPlan
2. compute values for tlist and clauses
3. pass those values to hook function InitCustomScan()
4. call copy_path_costsize

What I think we should do is:

1. compute values for tlist and clauses
2. pass those values to hook function PlanCustomPath(), which will return a Plan
3. call copy_path_costsize

create_custom_scan() does not need to allocate the node.  You don't
need the node to be allocated before computing tlist and clauses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas robertmh...@gmail.com wrote:
 It strikes me that perhaps we should make this change (rearranging
 things so that the memcmp tiebreak is run before strcoll) first,
 before dealing with the rest of the abbreviated keys infrastructure.
 It appears to be a separate improvement which is worthwhile
 independently of what we do about that patch.

I guess we could do that, but AFAICT the only open item blocking the
commit of a basic version of abbreviated keys (the informally agreed
to basic version lacking support for single-attribute aggregates) is
what to do about the current need to create a separate sortsupport
state. I've talked about my thoughts on that question in detail now
[1].

BTW, you probably realize this, but we still need a second memcmp()
after strcoll() too. hu_HU will care about that [2].

[1] 
http://www.postgresql.org/message-id/cam3swzqcdcnfwd3qzoo4qmy4g8okhuqyrd26bbla7fl2x-n...@mail.gmail.com
[2] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad
-- 
Peter Geoghegan




Re: [HACKERS] pgbench throttling latency limit

2014-09-15 Thread Fabien COELHO



I'm not sure I like the idea of printing a percentage.  It might be
unclear what the denominator was if somebody feels the urge to work
back to the actual number of skipped transactions.  I mean, I guess
it's probably just the value you passed to -R, so maybe that's easy
enough, but then why bother dividing in the first place?  The user can
do that easily enough if they want the data that way.


Indeed skipped and late per second may have an unclear denominator. If 
you divide by the time, the unit would be tps, but 120 tps performance 
including 20 late tps, plus 10 skipped tps... I do not think it is that 
clear. Reporting tps for transaction *not* performed looks strange.


Maybe late transactions could be given as a percentage of all processed 
transactions in the interval. But for skipped the percentage of what? The 
only number that would make sense is the total number of transactions 
schedule in the interval, but that would mean that the denominator for 
late would be different than the denominator for skipped, which is 
basically uncomprehensible.



I agree with you that it would be good to get some statistics on
late/skipped transactions, but it's not obvious what people will want.
Late transactions, straight up?  Late by more than a threshold value?


Yes.

Under throttling, transactions are given a scheduled start time. When a 
transaction can actually start:


  (1) if it is already later (before even starting) than the latency
  limit (a threshold), it is *NOT* started, but counted as skipped

  (2) otherwise it is started. When it finishes, it may be
(2a) out of the latency limit (scheduled time + limit)
 = it is counted as late
(2b) within the latency limit
 = all is well
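A literal transcription of those rules (the names and the millisecond units are mine, not pgbench's):

```c
/* Classification of one throttled transaction, per the rules above. */
typedef enum { TX_OK, TX_LATE, TX_SKIPPED } tx_status;

/*
 * scheduled: scheduled start time, now: time it could actually start,
 * finished: completion time, limit: the latency limit.  All in ms.
 */
tx_status
classify(double scheduled, double now, double finished, double limit)
{
    if (now - scheduled > limit)
        return TX_SKIPPED;      /* (1): never started */
    if (finished - scheduled > limit)
        return TX_LATE;         /* (2a): finished beyond scheduled + limit */
    return TX_OK;               /* (2b): within the limit */
}
```

Note that lateness is measured from the *scheduled* time, not the actual start, so a transaction can be late even if its execution itself was fast.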

--
Fabien.




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Robert Haas
On Mon, Sep 15, 2014 at 1:34 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas robertmh...@gmail.com wrote:
 It strikes me that perhaps we should make this change (rearranging
 things so that the memcmp tiebreak is run before strcoll) first,
 before dealing with the rest of the abbreviated keys infrastructure.
 It appears to be a separate improvement which is worthwhile
 independently of what we do about that patch.

 I guess we could do that, but AFAICT the only open item blocking the
 commit of a basic version of abbreviated keys (the informally agreed
 to basic version lacking support for single-attribute aggregates) is
 what to do about the current need to create a separate sortsupport
 state. I've talked about my thoughts on that question in detail now
 [1].

I think there's probably more than that to work out, but in any case
there's no harm in getting a simple optimization done first before
moving on to a complicated one.

 BTW, you probably realize this, but we still need a second memcmp()
 after strcoll() too. hu_HU will care about that [2].

 [1] 
 http://www.postgresql.org/message-id/cam3swzqcdcnfwd3qzoo4qmy4g8okhuqyrd26bbla7fl2x-n...@mail.gmail.com
 [2] 
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

I rather assume we could reuse the results of the first memcmp()
instead of doing it again.

x = memcmp();
if (x == 0)
   return x;
y = strcoll();
if (y == 0)
   return x;
return y;
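For concreteness, here is that logic as a self-contained comparator sketch (the function name is hypothetical; the real code in varstr_cmp() must also handle unequal lengths and non-NUL-terminated text):

```c
#include <string.h>

/*
 * Try the cheap bitwise comparison first; fall back to strcoll() only
 * on inequality, and break strcoll() ties with the memcmp() result so
 * that locales in which distinct strings compare equal (e.g. hu_HU)
 * still yield a deterministic total order.  Assumes both strings are
 * NUL-terminated and len covers the terminator.
 */
int
bttext_cmp_sketch(const char *a, const char *b, size_t len)
{
    int     x = memcmp(a, b, len);
    int     y;

    if (x == 0)
        return 0;               /* bitwise equal: strcoll() unnecessary */
    y = strcoll(a, b);
    if (y == 0)
        return x;               /* collation tie: fall back to memcmp order */
    return y;
}
```

As Robert notes, a compiler will typically keep the first memcmp() result in a register, so the tiebreak costs nothing extra.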

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Triconsistent catalog declaration

2014-09-15 Thread Robert Haas
On Mon, Sep 15, 2014 at 10:13 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 That makes for a bit awkward input and output from psql, when the values
 used are 0, 1, 2, rather than ascii characters. But that's OK, because as
 you said these functions are not callable from psql anyway, as they have
 internal arguments.

Maybe we should change them to something a bit more understandable.

 This requires a catalog change to fix. Are we still planning to do a
 catversion bump for 9.4 because of the jsonb changes?

That was my understanding, although we seem to be proceeding at an
inexplicably glacial pace.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas robertmh...@gmail.com wrote:
 I think there's probably more than that to work out, but in any case
 there's no harm in getting a simple optimization done first before
 moving on to a complicated one.

I guess we never talked about the abort logic in all that much detail.
I suppose there's that, too.

 I rather assume we could reuse the results of the first memcmp()
 instead of doing it again.

 x = memcmp();
 if (x == 0)
return x;
 y = strcoll();
 if (y == 0)
return x;
 return y;

Of course, but you know what I mean. (I'm sure the compiler will
realize this if the programmer doesn't)

-- 
Peter Geoghegan




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-15 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 On 09/12/2014 01:33 PM, Tom Lane wrote:
 No, it's a bug, not a documentation deficiency.

 Hmmm?  Are you proposing that we should change how ARRAY @> ARRAY works
 for non-JSON data?

No.

 Or are you proposing that JSONARRAY @> JSONARRAY should work differently
 from ARRAY @> ARRAY?

And no.  It's a bug that jsonb array containment works differently from
regular array containment.  We understand the source of the bug, ie a
mistaken optimization.  I don't see why there's much need for discussion
about anything except whether removing the optimization altogether
(as Peter proposed) is the best fix, or whether we want to retain
some weaker form of it.

Personally I'd think that we should retain it for objects; Peter's
main argument against that was that the comment would be too complicated,
but that seems a bit silly from here.

regards, tom lane




Re: [HACKERS] Triconsistent catalog declaration

2014-09-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Sep 15, 2014 at 10:13 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 This requires a catalog change to fix. Are we still planning to do a
 catversion bump for 9.4 because of the jsonb changes?

 That was my understanding, although we seem to be proceeding at an
 inexplicably glacial pace.

We already had a post-beta2 catversion bump for 30c05261a.  As long
as this is a catalog fix and not an on-disk-data change, there's no
reason not to make it in 9.4.

regards, tom lane




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd think that we should retain it for objects; Peter's
 main argument against that was that the comment would be too complicated,
 but that seems a bit silly from here.

I just don't see any point to it. My argument against the complexity
of explaining why the optimization is only used with objects is based
on the costs and the benefits. I think the benefits are very close to
nil.

-- 
Peter Geoghegan




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Robert Haas
On Mon, Sep 15, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas robertmh...@gmail.com wrote:
 I think there's probably more than that to work out, but in any case
 there's no harm in getting a simple optimization done first before
 moving on to a complicated one.

 I guess we never talked about the abort logic in all that much detail.
 I suppose there's that, too.

Well, the real point is that from where I'm sitting, this...

 x = memcmp();
 if (x == 0)
return x;
 y = strcoll();
 if (y == 0)
return x;
 return y;

...looks like about a 10-line patch.  We have the data to show that
the loss is trivial even in the worst case, and we have or should be
able to get data showing that the best-case win is significant even
without the abbreviated key stuff.  If you'd care to draft a patch for
just that, I assume we could get it committed in a day or two, whereas
I'm quite sure that considerably more work than that remains for the
main patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 11:20 AM, Robert Haas robertmh...@gmail.com wrote:
 ...looks like about a 10-line patch.  We have the data to show that
 the loss is trivial even in the worst case, and we have or should be
 able to get data showing that the best-case win is significant even
 without the abbreviated key stuff.  If you'd care to draft a patch for
 just that, I assume we could get it committed in a day or two, whereas
 I'm quite sure that considerably more work than that remains for the
 main patch.

OK, I'll draft a patch for that today, including similar alterations
to varstr_cmp() for the benefit of Windows and so on.

-- 
Peter Geoghegan




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-15 Thread Robert Haas
On Mon, Sep 15, 2014 at 2:18 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Sep 15, 2014 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd think that we should retain it for objects; Peter's
 main argument against that was that the comment would be too complicated,
 but that seems a bit silly from here.

 I just don't see any point to it. My argument against the complexity
 of explaining why the optimization is only used with objects is based
 on the costs and the benefits. I think the benefits are very close to
 nil.

That seems pessimistic to me; I'm with Tom.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-15 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Mon, Sep 15, 2014 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd think that we should retain it for objects; Peter's
 main argument against that was that the comment would be too complicated,
 but that seems a bit silly from here.

 I just don't see any point to it. My argument against the complexity
 of explaining why the optimization is only used with objects is based
 on the costs and the benefits. I think the benefits are very close to
 nil.

It might be that the benefit is very close to nil; that would depend a lot
on workload, so it's hard to be sure.  I'd say though that the cost is
also very close to nil, in the sense that we're considering two additional
compare-and-branch instructions in a function that will surely expend
hundreds or thousands of instructions if there's no such short-circuit.

I've certainly been on the side of "that optimization isn't worth its
keep" many times before, but I don't think the case is terribly clear cut
here.  Since somebody (possibly you) thought it was worth having to begin
with, I'm inclined to follow that lead.

regards, tom lane




Re: [HACKERS] jsonb contains behaviour weirdness

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 11:29 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 It might be that the benefit is very close to nil; that would depend a lot
 on workload, so it's hard to be sure.  I'd say though that the cost is
 also very close to nil, in the sense that we're considering two additional
 compare-and-branch instructions in a function that will surely expend
 hundreds or thousands of instructions if there's no such short-circuit.

Fair enough.

-- 
Peter Geoghegan




Re: [HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 some GIN opclasses use collation-aware comparisons even though they don't need
 especially collation-aware comparison. Examples are the text[] and hstore
 opclasses. Depending on the collation, this may make them much slower.

I'm glad that I saw how pointless this was with the jsonb GIN default
opclass during development.

 Rename such opclasses and make them not default.
 Create new default opclasses with bitwise comparison functions.
 Write recommendation to re-create indexes with default opclasses into
 documentation.

I certainly think this should be fixed if at all possible, but I'm not
sure about this plan. Can we really rename an opclass without
consequence, including having that respected across pg_upgrade?


-- 
Peter Geoghegan




Re: [HACKERS] Support for N synchronous standby servers

2014-09-15 Thread Michael Paquier
On Sat, Sep 13, 2014 at 3:56 AM, Robert Haas robertmh...@gmail.com wrote:
 I think so.  If we make it a function, then it's either the kind of
 function that you access via pg_proc, or it's the kind that's written
 in C and installed by storing a function pointer in a hook variable
 from _PG_init().  The first approach is a non-starter because it would
 require walsender to be connected to the database where that function
 lives, which is a non-starter at least for logical replication where
 we need walsender to be connected to the database being replicated.
 Even if we found some way around that problem, and I'm not sure there
 is one, I suspect the overhead would be pretty high.  The second
 approach - a hook that can be accessed directly by loadable modules -
 seems like it would work fine; the only problem that is that you've
 got to write your policy function in C.  But I have no issue with
 exposing it that way if someone wants to write a patch.  There is no
 joy in getting between the advanced user and his nutty-complicated
 sync rep configuration.  However, I suspect that most users will
 prefer a more user-friendly interface.

Reading both your answers, I'd tend to think that having a set of
hooks to satisfy all the potential user requests would be enough. We
could let the server code decide what is the priority of the standbys
using the information in synchronous_standby_names, then have the
hooks interact with SyncRepReleaseWaiters and pg_stat_get_wal_senders.

We would need two hooks:
- one able to get an array of WAL sender positions defining all the
nodes considered as sync nodes. This is enough for
pg_stat_get_wal_senders. SyncRepReleaseWaiters would need it as
well...
- a second able to define the update policy of the write and flush
positions in walsndctl (SYNC_REP_WAIT_FLUSH and SYNC_REP_WAIT_WRITE),
as well as the next failover policy. This would be needed for
SyncRepReleaseWaiters when a WAL sender calls SyncRepReleaseWaiters.

Perhaps that's overthinking, but I am getting the impression that
whatever the decision taken, it would involve modifications of the
sync-standby parametrization at GUC level, and whatever the path
chosen (dedicated language, set of integer params), there will be
complaints about what things can or cannot be done.

At least a set of hooks has the merit to say: do what you like with
your synchronous node policy.
-- 
Michael




Re: [HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-15 Thread Alexander Korotkov
On Mon, Sep 15, 2014 at 10:51 PM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  some GIN opclasses use collation-aware comparisons even though they don't
  need especially collation-aware comparison. Examples are the text[] and hstore
  opclasses. Depending on the collation, this may make them much slower.

 I'm glad that I saw how pointless this was with the jsonb GIN default
 opclass during development.

  Rename such opclasses and make them not default.
  Create new default opclasses with bitwise comparison functions.
  Write recommendation to re-create indexes with default opclasses into
  documentation.

 I certainly think this should be fixed if at all possible, but I'm not
 sure about this plan. Can we really rename an opclass without
 consequence, including having that respected across pg_upgrade?


Just renaming doesn't seem to be safe. Since pg_upgrade uses pg_dump, all
indexes are linked to opclasses by name, so existing indexes would become
linked to the new opclasses. It's likely we'd need to run an SQL script
renaming the opclasses before pg_upgrade. Another option is to not rename
the old opclasses and just create new default opclasses with new names.
Bruce, what is your opinion about pg_upgrade?
Contrib opclasses would be safe to rename using migration script.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Josh Berkus
On 09/15/2014 10:23 AM, Claudio Freire wrote:
 Now, large small keys could be 200 or 2000, or even 20k. I'd guess
 several should be tested to find the shape of the curve.

Well, we know that it's not noticeable with 200, and that it is
noticeable with 100K.  It's only worth testing further if we think that
having more than 200 top-level keys in one JSONB value is going to be a
use case for more than 0.1% of our users.  I personally do not.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Claudio Freire
On Mon, Sep 15, 2014 at 4:09 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/15/2014 10:23 AM, Claudio Freire wrote:
 Now, large small keys could be 200 or 2000, or even 20k. I'd guess
 several should be tested to find the shape of the curve.

 Well, we know that it's not noticeable with 200, and that it is
 noticeable with 100K.  It's only worth testing further if we think that
 having more than 200 top-level keys in one JSONB value is going to be a
 use case for more than 0.1% of our users.  I personally do not.

Yes, but bear in mind that the worst case is exactly at the use case
jsonb was designed to speed up: element access within relatively big
json documents.

Having them uncompressed is to be expected, because people using jsonb will
often favor speed over compactness if it's a tradeoff (otherwise
they'd use plain json).

So while you're right that it's perhaps above what would be a common
use case, the range somewhere between 200 and 100K for the tipping
point seems overly imprecise to me.




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Josh Berkus
On 09/15/2014 12:15 PM, Claudio Freire wrote:
 So while you're right that it's perhaps above what would be a common
 use case, the range somewhere between 200 and 100K for the tipping
 point seems overly imprecise to me.

Well, then, you know how to solve that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Claudio Freire
On Mon, Sep 15, 2014 at 4:17 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/15/2014 12:15 PM, Claudio Freire wrote:
 So while you're right that it's perhaps above what would be a common
 use case, the range somewhere between 200 and 100K for the tipping
 point seems overly imprecise to me.

 Well, then, you know how to solve that.


I was hoping testing with other numbers was a simple hitting a key for
someone else.

But sure. I'll set something up.




Re: [HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-15 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
 Rename such opclasses and make them not default.
 Create new default opclasses with bitwise comparison functions.
 Write recommendation to re-create indexes with default opclasses into
 documentation.

 I certainly think this should be fixed if at all possible, but I'm not
 sure about this plan. Can we really rename an opclass without
 consequence, including having that respected across pg_upgrade?

No.  And we don't know how to change the default opclass without
breaking things, either.  See previous discussions about how we
might fix the totally-broken default gist opclass that btree_gist
creates for the inet type [1].  The motivation for getting rid of that
is *way* stronger than it might be slow, but there's no apparent
way to make something else be the default without creating havoc.

regards, tom lane

[1] 
http://www.postgresql.org/message-id/flat/cae2gyzyuesd188j0b290gf16502h9b-lwnrs3rfi1swdb9q...@mail.gmail.com




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Josh Berkus
On 09/15/2014 12:25 PM, Claudio Freire wrote:
 On Mon, Sep 15, 2014 at 4:17 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/15/2014 12:15 PM, Claudio Freire wrote:
 So while you're right that it's perhaps above what would be a common
 use case, the range somewhere between 200 and 100K for the tipping
 point seems overly imprecise to me.

 Well, then, you know how to solve that.
 
 
 I was hoping testing with other numbers was a simple hitting a key for
 someone else.

Nope.  My test case has a fixed size.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-15 Thread Simon Riggs
On 15 September 2014 17:09, Robert Haas robertmh...@gmail.com wrote:

 Do we really want to disable HOT for all catalog scans?

The intention of the patch is that catalog scans are treated
identically to non-catalog scans. The idea here is that HOT cleanup
only occurs on scans on target relations, so only INSERT, UPDATE and
DELETE do HOT cleanup.

It's possible that many catalog scans don't follow the normal target
relation logic, so we might argue we should use HOT every time. OTOH,
since we now have separate catalog xmins we may find that using HOT on
catalogs is no longer effective. So I could go either way on how to
proceed; it's an easy change either way.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Robert Haas
On Mon, Sep 15, 2014 at 3:09 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/15/2014 10:23 AM, Claudio Freire wrote:
 Now, large small keys could be 200 or 2000, or even 20k. I'd guess
 several should be tested to find the shape of the curve.

 Well, we know that it's not noticeable with 200, and that it is
 noticeable with 100K.  It's only worth testing further if we think that
 having more than 200 top-level keys in one JSONB value is going to be a
 use case for more than 0.1% of our users.  I personally do not.

FWIW, I have written one (1) application that uses JSONB and it has
one sub-object (not the top-level object) that in the most typical
configuration contains precisely 270 keys.  Now, granted, that is not
the top-level object, if that distinction is actually relevant here,
but color me just a bit skeptical of this claim anyway.  This was just
a casual thing I did for my own use, not anything industrial strength,
so it's hard to believe I'm stressing the system more than 99.9% of
users will.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Collation-aware comparisons in GIN opclasses

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 12:45 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 No.  And we don't know how to change the default opclass without
 breaking things, either.

Is there a page on the Wiki along the lines of "things that we would
like to change if ever there is a substantial change in on-disk format
that will break pg_upgrade"? ISTM that we should be intelligently
saving those up someplace, just as Red Hat presumably saves up
ABI breakage over many years for the next major release of RHEL.
Alexander's complaint is a good example of such a change, IMV. Isn't
it more or less expected that the day will come when we'll make a
clean break?


-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Josh Berkus
On 09/15/2014 02:16 PM, Robert Haas wrote:
 On Mon, Sep 15, 2014 at 3:09 PM, Josh Berkus j...@agliodbs.com wrote:
 On 09/15/2014 10:23 AM, Claudio Freire wrote:
 Now, large small keys could be 200 or 2000, or even 20k. I'd guess
 several should be tested to find the shape of the curve.

 Well, we know that it's not noticeable with 200, and that it is
 noticeable with 100K.  It's only worth testing further if we think that
 having more than 200 top-level keys in one JSONB value is going to be a
 use case for more than 0.1% of our users.  I personally do not.
 
 FWIW, I have written one (1) application that uses JSONB and it has
 one sub-object (not the top-level object) that in the most typical
 configuration contains precisely 270 keys.  Now, granted, that is not
 the top-level object, if that distinction is actually relevant here,
 but color me just a bit skeptical of this claim anyway.  This was just
 a casual thing I did for my own use, not anything industrial strength,
 so it's hard to believe I'm stressing the system more than 99.9% of
 users will.

Actually, having the keys all at the same level *is* relevant for the
issue we're discussing.  If those 270 keys are organized in a tree, it's
not the same as having them all on one level (and not as problematic).

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Suspicious check (src/backend/access/gin/gindatapage.c)

2014-09-15 Thread Michael Paquier
On Fri, Sep 12, 2014 at 5:42 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 09/12/2014 11:38 AM, Heikki Linnakangas wrote:

 Now that the logic is fixed, I hope we
 won't get complaints that the indexes are bigger, if you fill a table by
 appending to the end. I wonder if we should aim at an even more uneven
 split; the default fillfactor for B-trees is 90%, for example. I didn't
 go that high when I wrote that, because the code in previous versions
 always did a 50/50 split. But it could be argued that a higher
 fillfactor makes sense for a GIN index - they typically don't get as
 much random updates as a B-tree.


 Actually, we should add a fillfactor reloption to GIN. But that's 9.5
 material.

Actually, GIN is the only index access method that does not have this
parameter, no? Then the following extra processing would be enough, I imagine:
1) Tune freespace using fillfactor when placing keys to leaf data page
(dataPlaceToPageLeaf)
2) On split, instead of (BLCKSZ * 3) / 4, have the left leaf full at
(BLCKSZ * fillfactor) / 100
Regards,
-- 
Michael




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 11:25 AM, Peter Geoghegan p...@heroku.com wrote:
 OK, I'll draft a patch for that today, including similar alterations
 to varstr_cmp() for the benefit of Windows and so on.

I attach a much simpler patch, that only adds an opportunistic
memcmp() == 0 before a possible strcoll().  Both
bttextfastcmp_locale() and varstr_cmp() have the optimization added,
since there is no point in leaving anyone out for this part.

When this is committed, and I hear back from you on the question of
what to do about having an independent sortsupport state for
abbreviated tie-breakers (and possibly other issues of concern), I'll
produce a rebased patch with a single commit.

-- 
Peter Geoghegan
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
new file mode 100644
index 7549542..53be8ec
*** a/src/backend/utils/adt/varlena.c
--- b/src/backend/utils/adt/varlena.c
*** varstr_cmp(char *arg1, int len1, char *a
*** 1405,1410 
--- 1405,1438 
  #endif
  		}
  
+ 		/*
+ 		 * Fast path:  Attempt an entirely opportunistic memcmp() == 0.
+ 		 *
+ 		 * In general there is no reason to be optimistic about being able to
+ 		 * resolve the string comparison in this way.  There are many usage
+ 		 * patterns in which this fast path will never be taken, and some in
+ 		 * which it will frequently be taken.  When it is frequently taken,
+ 		 * clearly the optimization is worthwhile, since strcoll() is known to
+ 		 * be very expensive.  When it is not taken, the overhead is completely
+ 		 * negligible, and perhaps even immeasurably small, even in the worst
+ 		 * case.  We have nothing to lose and everything to gain.
+ 		 *
+ 		 * These counter-intuitive performance characteristics are likely
+ 		 * explained by the dominant memory latency cost when performing a
+ 		 * strcoll();  in practice, this may give the compiler and CPU
+ 		 * sufficient leeway to order things in a way that hides the overhead
+ 		 * of useless memcmp() calls.  Sorting is very probably bottlenecked on
+ 		 * memory bandwidth, and so it's worth it to use a few arithmetic
+ 		 * instructions on the off chance that it results in a saving in memory
+ 		 * bandwidth.  In effect, this makes use of a capacity that would
+ 		 * otherwise go unused.  The same memory needed to be read into CPU
+ 		 * cache at this juncture anyway.  Modern CPUs can execute hundreds of
+ 		 * instructions in the time taken to fetch a single cache line from
+ 		 * main memory.
+ 		 */
+ 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
+ 			return 0;
+ 
  #ifdef WIN32
  		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
  		if (GetDatabaseEncoding() == PG_UTF8)
*** bttextfastcmp_locale(Datum x, Datum y, S
*** 1842,1847 
--- 1870,1885 
  	len1 = VARSIZE_ANY_EXHDR(arg1);
  	len2 = VARSIZE_ANY_EXHDR(arg2);
  
+ 	/*
+ 	 * Fast path:  Attempt an entirely opportunistic memcmp() == 0, as
+ 	 * explained within varstr_cmp()
+ 	 */
+ 	if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
+ 	{
+ 		result = 0;
+ 		goto done;
+ 	}
+ 
  	if (len1 >= tss->buflen1)
  	{
  		pfree(tss->buf1);
*** bttextfastcmp_locale(Datum x, Datum y, S
*** 1875,1880 
--- 1913,1919 
  	if (result == 0)
  		result = strcmp(tss->buf1, tss->buf2);
  
+ done:
  	/* We can't afford to leak memory here. */
  	if (PointerGetDatum(arg1) != x)
  		pfree(arg1);



Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-09-15 Thread Emanuel Calvo

On 14/09/14 17:37, Simon Riggs wrote:
 On 12 September 2014 18:19, Simon Riggs si...@2ndquadrant.com wrote:
 On 12 September 2014 15:30, Tom Lane t...@sss.pgh.pa.us wrote:
 After a little bit I remembered there was already a function for this.
 So specifically, I'd suggest using ExecRelationIsTargetRelation()
 to decide whether to mark the scan as requiring pruning.
 Sounds cool. Thanks both, this is sounding like a viable route now.
 Yes, this is viable.

 Patch attached, using Alvaro's idea of use-case specific pruning and
 Tom's idea of aiming at target relations. Patch uses or extends
 existing infrastructure, so it's shorter than it might have been, yet
 with all that bufmgr yuck removed.

 This is very, very good, because while going through this I noticed the
 dozen or more places where we were pruning blocks in annoying places I
 didn't even know about, such as about 4-5 constraint checks. In more
 than a few DDL commands like ALTER TABLE and CLUSTER we were even
 pruning the old relation prior to rewrite.


A simple performance test with the following variables:
LOOP=50
CONN=60
TXSS=500
SCALE=30


Select only:
WITH PATCH
Average:  20716.1 tps

NO PATCH
Average:  19141.7 tps


With writes:
WITH PATCH
Average:  2602.65

NO PATCH
Average:  2565.32


TODO:
- Consistency check.
- ALTER and CLUSTER test.








Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus j...@agliodbs.com wrote:
 Actually, having the keys all at the same level *is* relevant for the
 issue we're discussing.  If those 270 keys are organized in a tree, it's
 not the same as having them all on one level (and not as problematic).

I believe Robert meant that the 270 keys are not at the top level, but
are at some level (in other words, some object has 270 pairs). That is
equivalent to having them at the top level for the purposes of this
discussion.

FWIW, I am slightly concerned about weighing use cases around very
large JSON documents too heavily. Having enormous jsonb documents just
isn't going to work out that well, but neither will equivalent designs
in popular document database systems for similar reasons. For example,
the maximum BSON document size supported by MongoDB is 16 megabytes,
and that seems to be something that their users don't care too much
about. Having 270 pairs in an object isn't unreasonable, but it isn't
going to be all that common either.

-- 
Peter Geoghegan




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan
On Mon, Sep 15, 2014 at 4:21 PM, Peter Geoghegan p...@heroku.com wrote:
 I attach a much simpler patch, that only adds an opportunistic
 memcmp() == 0 before a possible strcoll().  Both
 bttextfastcmp_locale() and varstr_cmp() have the optimization added,
 since there is no point in leaving anyone out for this part.

FWIW, it occurs to me that this could be a big win for cases like
ExecMergeJoin(). Obviously, abbreviated keys will usually make your
merge join on text attributes a lot faster in the common case where a
sort is involved (if we consider that the sort is integral to the cost
of the join). However, when making sure that inner and outer tuples
match, the MJCompare() call will *very* frequently get away with a
cheap memcmp().  That could make a very significant additional
difference, I think.

-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Arthur Silva
I couldn't get my hands on the twitter data but I'm generating my own. The
json template is http://paste2.org/wJ1dfcjw and data was generated with
http://www.json-generator.com/. It has 35 top level keys, just in case
someone is wondering.
I generated 1 random objects and I'm inserting them repeatedly until I
got 320k rows.

Test query: SELECT data->'name', data->'email' FROM t_json
Test storage: EXTERNAL
Test jsonb lengths quartiles: {1278,1587,1731,1871,2231}
Tom's lengths+cache aware: 455ms
HEAD: 440ms

This is a realistic-ish workload in my opinion and Tom's patch performs
within 4% of HEAD.

Due to the overall lengths I couldn't really test compressibility, so I
re-ran the test. This time I inserted an array of 2 objects in each row, as
in: [obj, obj];
The objects were taken in sequence from the 1 pool, so contents match
in both tests.

Test query: SELECT data #> '{0, name}', data #> '{0, email}', data #> '{1,
name}', data #> '{1, email}' FROM t_json
Test storage: EXTENDED
HEAD: 17mb table + 878mb toast
HEAD size quartiles: {2015,2500,2591,2711,3483}
HEAD query runtime: 15s
Tom's: 220mb table + 580mb toast
Tom's size quartiles: {1665,1984,2061,2142.25,2384}
Tom's query runtime: 13s

This is an intriguing edge case in which Tom's patch actually outperforms the
base implementation for 3~4 kB jsons.


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-15 Thread Craig Ringer
On 09/16/2014 07:44 AM, Peter Geoghegan wrote:
 FWIW, I am slightly concerned about weighing use cases around very
 large JSON documents too heavily. Having enormous jsonb documents just
 isn't going to work out that well, but neither will equivalent designs
 in popular document database systems for similar reasons. For example,
 the maximum BSON document size supported by MongoDB is 16 megabytes,
 and that seems to be something that their users don't care too much
 about. Having 270 pairs in an object isn't unreasonable, but it isn't
 going to be all that common either.

Also, at a certain size, the fact that Pg must rewrite the whole document
for any change to it starts to introduce other practical challenges.

Anyway - this is looking like the change will go in, and with it a
catversion bump. Introduction of a jsonb version/flags byte might be
worthwhile at the same time. It seems likely that there'll be more room
for improvement in jsonb, possibly even down to using different formats
for different data.

Is it worth paying a byte per value to save on possible upgrade pain?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WAL format and API changes (9.5)

2014-09-15 Thread Alvaro Herrera
Heikki Linnakangas wrote:

 Here we go. I've split this again into two patches. The first patch
 is just refactoring the current code. It moves XLogInsert into a new
 file, xloginsert.c, and the definition of XLogRecord to new
 xlogrecord.h header file. As a result, there is a lot of churn in
 the #includes in C files that generate WAL records, or contain redo
 routines.  The number of files that pull in xlog.h - directly or
 indirectly through other headers - is greatly reduced.

I think you should push the first patch for now and continue to
investigate the issues in the second patch.  Some minor suggestions I
have for the first patch are that since xlog.h is no longer used by
other headers (but only by .c files), you should just #include
xloginsert.h instead of doing the forward declaration of struct
XLogRecData; and in xlog_internal.h, instead of 

- * XLogRecord is defined in xlog.h, but we avoid #including that to keep
+ * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep

I would just suggest to #include xlogrecord.h, since it should just
work to include that file from frontend programs.  It didn't work for
xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
to appear, which makes compilation blow up.  Shouldn't be the case here.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-09-15 Thread Etsuro Fujita
(2014/09/13 0:13), Tom Lane wrote:
 Albe Laurenz laurenz.a...@wien.gv.at writes:
 Tom Lane wrote:
 I'm not sure offhand what the new plan tree ought to look like.  We could
 just generate a ForeignScan node, but that seems like rather a misnomer.
 Is it worth inventing a new ForeignUpdate node type?  Or maybe it'd still
 be ForeignScan but with a flag field saying hey this is really an update
 (or a delete).  The main benefit I can think of right now is that the
 EXPLAIN output would be less strange-looking --- but EXPLAIN is hardly
 the only thing that ever looks at plan trees, so having an outright
 misleading plan structure is likely to burn us down the line.

 I was envisioning that the EXPLAIN output would look like
 
  Foreign Scan on tab1
Remote SQL: SELECT ...
 
 for the normal case, versus
 
  Foreign Update on tab1
Remote SQL: UPDATE ...
 
 for the pushed-down-update case (and similarly for DELETE).  For a
 non-optimized update it'd still be a ForeignScan underneath a ModifyTable.

 As for the internal representation, I was thinking of adding a CmdType
 field to struct ForeignScan, with currently only CMD_SELECT, CMD_UPDATE,
 CMD_DELETE as allowed values, though possibly in future we'd think of a
 reason to allow CMD_INSERT there.

+1

 The only thing that's bothering me about this concept is that I'm not
 seeing how to scale it up to handling a pushed-down update on a join,
 ie, UPDATE foo ... FROM bar ... where both foo and bar are remote.
 Maybe it's silly to worry about that until join push-down is done;
 but in that case I'd vote for postponing this whole patch until we
 have join push-down.

OK

Thanks,

PS: I'll help Hanada-san do the work if there is anything I can do.

Best regards,
Etsuro Fujita




Re: [HACKERS] WAL format and API changes (9.5)

2014-09-15 Thread Michael Paquier
On Mon, Sep 15, 2014 at 7:03 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Heikki Linnakangas wrote:

 Here we go. I've split this again into two patches. The first patch
 is just refactoring the current code. It moves XLogInsert into a new
 file, xloginsert.c, and the definition of XLogRecord to new
 xlogrecord.h header file. As a result, there is a lot of churn in
 the #includes in C files that generate WAL records, or contain redo
 routines.  The number of files that pull in xlog.h - directly or
 indirectly through other headers - is greatly reduced.

 I think you should push the first patch for now and continue to
 investigate the issues in the second patch. Some minor suggestions I
 have for the first patch are that since xlog.h is no longer used by
 other headers (but only by .c files), you should just #include
 xloginsert.h instead of doing the forward declaration of struct
 XLogRecData; and in xlog_internal.h, instead of

 - * XLogRecord is defined in xlog.h, but we avoid #including that to keep
 + * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep

 I would just suggest to #include xlogrecord.h, since it should just
 work to include that file from frontend programs.  It didn't work for
 xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
 to appear, which makes compilation blow up.  Shouldn't be the case here.

Alvaro was faster than me... I was just looking at the first patch and
+1 on those comments. It is worth noting that the first patch, as it
does only a large refactoring, does not impact performance or size of
WAL records. Btw, a declaration of access/xlog.h in fd.c was forgotten.
Also, if somebody is interested in running the test suite, there are
comments on how to do it at the top of compare.sql.
-- 
Michael




Re: [HACKERS] proposal: plpgsql - Assert statement

2014-09-15 Thread Alvaro Herrera
Jan Wieck wrote:
 On 09/14/2014 02:25 PM, Pavel Stehule wrote:
 a) It isn't possible to handle an assertion exception anywhere .. it enforces a
 ROLLBACK in 100% of cases
 
 b) Assertions should be disabled globally .. I am not sure whether it is a
 good idea, but I can understand that some tests based on queries against data
 can be a performance issue.
 
 An important question is the relation between assertions and exceptions. Are
 assertions only a shortcut for exceptions, or something different?
 
 I think that most data integrity issues can be handled by a well
 designed database schema that uses UNIQUE, NOT NULL, REFERENCES and
 CHECK constraints. Assertions are usually found inside of complex
 code constructs to check values of local variables. I don't think it
 is even a good idea to implement assertions that can query arbitrary
 data.

Actually Peter Eisentraut posted a patch for SQL assertions:
http://www.postgresql.org/message-id/1384486216.5008.17.ca...@vanquo.pezone.net

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WAL format and API changes (9.5)

2014-09-15 Thread Michael Paquier
On Mon, Sep 15, 2014 at 8:00 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Alvaro was faster than me... I was just looking at the first patch and
 +1 on those comments. It is worth noting that the first patch, as it
 does only a large refactoring, does not impact performance or size of
 WAL records.

Some comments after a second and closer read of the first patch:
1) Wouldn't it be better to call GetFullPageWriteInfo directly in
XLogRecordAssemble, removing RedoRecPtr and doPageWrites from its
arguments?
2) XLogCheckBufferNeedsBackup is not used. It can be removed.
3) If I am following correctly, there are two code paths where
XLogInsertRecData can return InvalidXLogRecPtr, and then there is this
process in XLogInsert
+   {
+       /*
+        * Undo the changes we made to the rdata chain.
+        *
+        * XXX: This doesn't undo *all* the changes; the XLogRecData
+        * entries for buffers that we had already decided to back up have
+        * had their data-pointers cleared. That's OK, as long as we
+        * decide to back them up on the next iteration as well. Hence,
+        * don't allow doPageWrites value to go from true to false after
+        * we've modified the rdata chain.
+        */
+       bool        newDoPageWrites;
+
+       GetFullPageWriteInfo(&RedoRecPtr, &newDoPageWrites);
+       if (!doPageWrites && newDoPageWrites)
+           doPageWrites = true;
+       rdt_lastnormal->next = NULL;
+   }
Wouldn't it be better to keep that in XLogInsertRecData? This way
GetFullPageWriteInfo could be removed (I actually see little point in
keeping it as part of this patch, but does this change make sense
in patch 2?).
4) This patch is in DOS encoding (?!)

That's all I have for now...
Regards,
-- 
Michael

